gigl.distributed.utils.init_neighbor_loader_worker#
Functions#
| Function | Description |
| --- | --- |
| `get_process_group_name`(process_rank) | Returns the name of the process group for the given process rank. |
| `init_neighbor_loader_worker`(master_ip_address, local_process_rank, ...) | Sets up processes and torch device for initializing the GLT DistNeighborLoader, setting up RPC and worker groups to minimize the memory overhead and CPU contention. |
Module Contents#
- gigl.distributed.utils.init_neighbor_loader_worker.get_process_group_name(process_rank)[source]#
Returns the name of the process group for the given process rank.

- Parameters:
process_rank (int) – The rank of the process.
- Returns:
The name of the process group.
- Return type:
str
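A minimal usage sketch, assuming `gigl` is installed and importable; the naming scheme of the returned string is an implementation detail of the library, so the printed value here is only illustrative.

```python
from gigl.distributed.utils.init_neighbor_loader_worker import get_process_group_name

# Derive a per-rank process group name, e.g. when wiring up a
# torch.distributed subgroup for each dataloader worker process.
group_name = get_process_group_name(process_rank=0)
print(group_name)  # exact format is determined by the library
```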
- gigl.distributed.utils.init_neighbor_loader_worker.init_neighbor_loader_worker(master_ip_address, local_process_rank, local_process_world_size, rank, world_size, master_worker_port, device, should_use_cpu_workers=False, num_cpu_threads=None, process_start_gap_seconds=60.0)[source]#
Sets up processes and the torch device for initializing the GLT DistNeighborLoader, setting up RPC and worker groups to minimize memory overhead and CPU contention. Returns the torch device which the current worker is assigned to.

- Parameters:
master_ip_address (str) – Master IP address used to manage processes
local_process_rank (int) – Process number on the current machine
local_process_world_size (int) – Total number of processes on the current machine
rank (int) – Rank of the current machine
world_size (int) – Total number of machines
master_worker_port (int) – Master port to use for communicating between workers during training or inference
device (torch.device) – The device where you want to load the data onto, i.e. where your model is
should_use_cpu_workers (bool) – Whether we should do CPU training or inference
num_cpu_threads (Optional[int]) – Number of CPU threads PyTorch should use for CPU training or inference. Must be set if should_use_cpu_workers is True.
process_start_gap_seconds (float) – Delay between each process for initializing the neighbor loader. At large scales, it is recommended to set this value to between 60 and 120 seconds; otherwise multiple processes may attempt to initialize dataloaders at overlapping times, which can cause CPU memory OOM.
- Returns:
Device which the current worker is assigned to
- Return type:
torch.device
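A hedged sketch of how a per-process entry point might call this function. All concrete values below (addresses, ports, ranks, world sizes, thread counts) are illustrative placeholders, not library defaults; in a real job they would come from your launcher or scheduler.

```python
import torch

from gigl.distributed.utils.init_neighbor_loader_worker import init_neighbor_loader_worker

# Fall back to CPU workers when no GPU is available on this machine.
use_cpu = not torch.cuda.is_available()

device = init_neighbor_loader_worker(
    master_ip_address="10.0.0.1",        # hypothetical master node address
    local_process_rank=0,                # this process's index on the machine
    local_process_world_size=2,          # processes per machine
    rank=0,                              # this machine's rank
    world_size=4,                        # total number of machines
    master_worker_port=29500,            # any free, agreed-upon port
    device=torch.device("cpu") if use_cpu else torch.device("cuda", 0),
    should_use_cpu_workers=use_cpu,
    num_cpu_threads=4 if use_cpu else None,  # required when using CPU workers
    process_start_gap_seconds=60.0,      # stagger startup at scale to avoid OOM
)
# `device` is where this worker should place its model and sampled batches.
```

Staggering process startup via process_start_gap_seconds trades a slower launch for lower peak CPU memory, since each process finishes its dataloader initialization before the next one begins.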