gigl.common.utils.torch_training#
Attributes#
Functions#
| `get_distributed_backend` | Returns the distributed backend based on whether distributed training is enabled and whether CUDA is used. |
| `get_rank` | Returns the index of the process involved in distributed training; this is automatically set by the Kubeflow PyTorchJob launcher. |
| `get_world_size` | Returns the total number of processes involved in distributed training; this is automatically set by the Kubeflow PyTorchJob launcher. |
| `is_distributed_available_and_initialized` | Returns True if distributed training is available and initialized, False otherwise. |
| `is_distributed_local_debug` | For local debugging purposes only. |
| `should_distribute` | Determines whether the process should be configured for distributed training. |
Module Contents#
- gigl.common.utils.torch_training.get_distributed_backend(use_cuda)[source]#
- Returns the distributed backend based on whether distributed training is enabled and whether CUDA is used. 
- Parameters:
- use_cuda (bool) – Whether CUDA is used for training 
- Returns:
- The distributed backend (NCCL or GLOO) if distributed training is enabled, None otherwise 
- Return type:
- Optional[str] 
 
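A minimal usage sketch: the returned backend string can be passed to `torch.distributed.init_process_group`. The rendezvous environment variables are assumed to be set by the launcher and are not handled here.

```python
import torch
import torch.distributed as dist

from gigl.common.utils.torch_training import get_distributed_backend

# NCCL is used when CUDA is available, GLOO otherwise; the helper returns
# None when distributed training is not enabled for this process.
backend = get_distributed_backend(use_cuda=torch.cuda.is_available())
if backend is not None:
    # Assumes MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE are already
    # exported, e.g. by the Kubeflow PyTorchJob launcher.
    dist.init_process_group(backend=backend)
```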
- gigl.common.utils.torch_training.get_rank()[source]#
- This is automatically set by the Kubeflow PyTorchJob launcher. 
- Returns:
- The index of the process involved in distributed training 
- Return type:
- int 
 
- gigl.common.utils.torch_training.get_world_size()[source]#
- This is automatically set by the Kubeflow PyTorchJob launcher. 
- Returns:
- Total number of processes involved in distributed training 
- Return type:
- int 
 
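A short sketch of how `get_rank()` and `get_world_size()` might be used together; the dataset sharding shown here is a hypothetical illustration, not part of the library.

```python
from gigl.common.utils.torch_training import get_rank, get_world_size

rank = get_rank()              # index of this process within the job
world_size = get_world_size()  # total number of processes in the job

# Hypothetical sharding: each process handles every world_size-th example.
examples = list(range(1000))
local_examples = examples[rank::world_size]
```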
- gigl.common.utils.torch_training.is_distributed_available_and_initialized()[source]#
- Returns:
- True if distributed training is available and initialized, False otherwise 
- Return type:
- bool 
 
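A minimal sketch showing this check used to guard collective calls so the same code path also works in single-process runs; the loss value is a placeholder.

```python
import torch
import torch.distributed as dist

from gigl.common.utils.torch_training import is_distributed_available_and_initialized

loss = torch.tensor(0.25)
# Only issue collective ops once the process group is actually initialized.
if is_distributed_available_and_initialized():
    dist.all_reduce(loss, op=dist.ReduceOp.SUM)
    loss /= dist.get_world_size()
```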
- gigl.common.utils.torch_training.is_distributed_local_debug()[source]#
- For local debugging purposes only. This sets the necessary environment variables for distributed training on a local machine. 
- Returns:
- If True, should_distribute exits early and distributed training is enabled 
- Return type:
- bool 
 
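A hedged sketch of how this helper might be combined with `get_distributed_backend` when debugging on a single machine; how debug mode is toggled is not covered on this page and is assumed to be configured elsewhere.

```python
import torch.distributed as dist

from gigl.common.utils.torch_training import (
    get_distributed_backend,
    is_distributed_local_debug,
)

# If local-debug mode is active, the helper has already set the environment
# variables needed for a single-machine process group.
if is_distributed_local_debug():
    backend = get_distributed_backend(use_cuda=False)
    if backend is not None:
        dist.init_process_group(backend=backend)
```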
