gigl.distributed.utils#
Utility functions for distributed computing.
Submodules#
Functions#
- get_available_device – Returns the available device for the current process.
- get_free_port – Get a free port number.
- get_free_ports_from_master_node – Get free ports from the master node that can be used for communication between workers.
- get_internal_ip_from_all_ranks – Get the internal IP addresses of all ranks in a distributed setup.
- get_internal_ip_from_master_node – Get the internal IP address of the master node in a distributed setup.
- get_process_group_name – Returns the name of the process group for the given process rank.
- init_neighbor_loader_worker – Sets up processes and torch device for initializing the GLT DistNeighborLoader.
Package Contents#
- gigl.distributed.utils.get_available_device(local_process_rank)[source]#
Returns the available device for the current process.
- Parameters:
local_process_rank (int) – The local rank of the current process within a node.
- Returns:
The device to use.
- Return type:
torch.device
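A minimal usage sketch (the LOCAL_RANK environment variable and the commented-out model move are illustrative assumptions; only get_available_device itself comes from this module):

```python
import os

import torch

from gigl.distributed.utils import get_available_device

# Local rank as provided by a typical launcher such as torchrun (assumption).
local_rank = int(os.environ.get("LOCAL_RANK", "0"))

device: torch.device = get_available_device(local_process_rank=local_rank)

# The returned device can then be used to place models/tensors, e.g.:
# model = model.to(device)
```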
- gigl.distributed.utils.get_free_port()[source]#
Get a free port number. Note: if you call get_free_port multiple times, it can return the same port number if the port is still free. If you want multiple free ports before you init/use them, leverage get_free_ports instead.
- Returns:
A free port number on the current machine.
- Return type:
int
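A short sketch of the caveat above; get_free_ports is only referenced by name here, and its exact signature is not shown in this section:

```python
from gigl.distributed.utils import get_free_port

port_a = get_free_port()
port_b = get_free_port()
# port_b may equal port_a, since neither port has been bound yet.
# If several distinct ports are needed before binding, prefer get_free_ports
# (see the note above).
```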
- gigl.distributed.utils.get_free_ports_from_master_node(num_ports=1, _global_rank_override=None)[source]#
Get free ports from the master node that can be used for communication between workers.
- Parameters:
num_ports (int) – Number of free ports to find.
_global_rank_override (Optional[int]) – Override for the global rank; useful for testing or if the global rank is not accurately available.
- Returns:
A list of free port numbers on the master node.
- Return type:
List[int]
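A minimal sketch, assuming the default torch.distributed process group has already been initialized so the master node can be reached:

```python
from gigl.distributed.utils import get_free_ports_from_master_node

# Ask the master node for two free ports to use for worker communication.
rpc_port, data_port = get_free_ports_from_master_node(num_ports=2)
```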
- gigl.distributed.utils.get_internal_ip_from_all_ranks()[source]#
Get the internal IP addresses of all ranks in a distributed setup. Internal IPs are usually not accessible from the web, i.e. the machines must be on the same network or VPN for each rank to reach the others at the returned addresses. This is useful for setting up RPC communication between ranks where the default torch.distributed env:// setup is not enough, or for running validation checks, getting the local world size for a specific node, etc.
- Returns:
A list of internal IP addresses of all ranks.
- Return type:
List[str]
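A sketch of the "local world size" use case mentioned above, assuming the returned list is ordered by global rank and that torch.distributed is already initialized:

```python
import torch.distributed as dist

from gigl.distributed.utils import get_internal_ip_from_all_ranks

ips = get_internal_ip_from_all_ranks()  # one internal IP per global rank
my_ip = ips[dist.get_rank()]            # this rank's IP (assumes rank ordering)
local_world_size = ips.count(my_ip)     # ranks co-located on this machine
```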
- gigl.distributed.utils.get_internal_ip_from_master_node(_global_rank_override=None)[source]#
Get the internal IP address of the master node in a distributed setup. This is useful for setting up RPC communication between workers where the default torch.distributed env:// setup is not enough, e.g. when using gigl.distributed.dataset_factory.
- Returns:
The internal IP address of the master node.
- Return type:
str
- Parameters:
_global_rank_override (Optional[int]) – Override for the global rank; useful for testing or if the global rank is not accurately available.
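A minimal sketch combining this helper with get_free_ports_from_master_node to build an init-method string for an out-of-band RPC-style setup (the tcp:// URL format is an illustrative assumption):

```python
from gigl.distributed.utils import (
    get_free_ports_from_master_node,
    get_internal_ip_from_master_node,
)

master_ip = get_internal_ip_from_master_node()
master_port = get_free_ports_from_master_node(num_ports=1)[0]

# Hypothetical init-method string for RPC setup beyond the default env:// group.
init_method = f"tcp://{master_ip}:{master_port}"
```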
- gigl.distributed.utils.get_process_group_name(process_rank)[source]#
Returns the name of the process group for the given process rank.
- Parameters:
process_rank (int) – The rank of the process.
- Returns:
The name of the process group.
- Return type:
str
- gigl.distributed.utils.init_neighbor_loader_worker(master_ip_address, local_process_rank, local_process_world_size, rank, world_size, master_worker_port, device, should_use_cpu_workers=False, num_cpu_threads=None, process_start_gap_seconds=60.0)[source]#
Sets up processes and the torch device for initializing the GLT DistNeighborLoader, setting up RPC and worker groups to minimize memory overhead and CPU contention. Returns the torch device which the current worker is assigned to.
- Parameters:
master_ip_address (str) – Master IP address used to manage processes.
local_process_rank (int) – Process number on the current machine.
local_process_world_size (int) – Total number of processes on the current machine.
rank (int) – Rank of the current machine.
world_size (int) – Total number of machines.
master_worker_port (int) – Master port to use for communication between workers during training or inference.
device (torch.device) – The device to load the data onto, i.e. where your model is.
should_use_cpu_workers (bool) – Whether we should do CPU training or inference.
num_cpu_threads (Optional[int]) – Number of CPU threads PyTorch should use for CPU training or inference. Must be set if should_use_cpu_workers is True.
process_start_gap_seconds (float) – Delay between each process for initializing the neighbor loader. At large scales, it is recommended to set this value between 60 and 120 seconds; otherwise multiple processes may attempt to initialize dataloaders at overlapping times, which can cause CPU memory OOM.
- Returns:
Device which the current worker is assigned to.
- Return type:
torch.device
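A minimal sketch of wiring this into a per-process entry point. The environment variables, the derived machine rank/count, and the way the device and port are obtained are illustrative assumptions; only the keyword arguments themselves come from the signature above:

```python
import os

from gigl.distributed.utils import (
    get_available_device,
    get_free_ports_from_master_node,
    get_internal_ip_from_master_node,
    init_neighbor_loader_worker,
)

# Assumed torchrun-style environment variables.
local_rank = int(os.environ["LOCAL_RANK"])
local_world_size = int(os.environ["LOCAL_WORLD_SIZE"])
machine_rank = int(os.environ["GROUP_RANK"])                     # rank of this machine
num_machines = int(os.environ["WORLD_SIZE"]) // local_world_size  # total machines

device = init_neighbor_loader_worker(
    master_ip_address=get_internal_ip_from_master_node(),
    local_process_rank=local_rank,
    local_process_world_size=local_world_size,
    rank=machine_rank,
    world_size=num_machines,
    master_worker_port=get_free_ports_from_master_node(num_ports=1)[0],
    device=get_available_device(local_process_rank=local_rank),
    should_use_cpu_workers=False,
    process_start_gap_seconds=60.0,
)
# `device` is the torch.device this worker was assigned to.
```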