gigl.common.utils.torch_training#
Attributes#
Functions#
| Function | Summary |
| --- | --- |
| get_distributed_backend(use_cuda) | Returns the distributed backend based on whether distributed training is enabled and whether CUDA is used. |
| get_rank() | Returns the index (rank) of the current process in distributed training; set automatically by the Kubeflow PyTorchJob launcher. |
| get_world_size() | Returns the total number of processes involved in distributed training; set automatically by the Kubeflow PyTorchJob launcher. |
| is_distributed_available_and_initialized() | Returns True if distributed training is available and initialized, False otherwise. |
| is_distributed_local_debug() | For local debugging purposes only; sets the environment variables needed for distributed training on a local machine. |
| should_distribute() | Determines whether the process should be configured for distributed training. |
Module Contents#
- gigl.common.utils.torch_training.get_distributed_backend(use_cuda)[source]#
Returns the distributed backend based on whether distributed training is enabled and whether CUDA is used.
- Parameters:
use_cuda (bool): Whether CUDA is used for training
- Returns:
The distributed backend (NCCL or GLOO) if distributed training is enabled, None otherwise
- Return type:
Optional[str]
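A minimal usage sketch (not from the library itself): it assumes the returned backend string can be passed directly to torch.distributed.init_process_group, consistent with the NCCL/GLOO values and Optional[str] return type described above.

```python
import torch
import torch.distributed as dist

from gigl.common.utils.torch_training import get_distributed_backend

# Pick NCCL when training on GPUs, GLOO otherwise; None means
# distributed training is not enabled for this run.
backend = get_distributed_backend(use_cuda=torch.cuda.is_available())

if backend is not None:
    # With init_method="env://", torch reads MASTER_ADDR, MASTER_PORT,
    # RANK, and WORLD_SIZE from the environment.
    dist.init_process_group(backend=backend, init_method="env://")
```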
- gigl.common.utils.torch_training.get_rank()[source]#
Returns the index (rank) of the current process involved in distributed training. The rank is set automatically by the Kubeflow PyTorchJob launcher.
- Returns:
The index of the process involved in distributed training
- Return type:
int
- gigl.common.utils.torch_training.get_world_size()[source]#
Returns the total number of processes involved in distributed training. The world size is set automatically by the Kubeflow PyTorchJob launcher.
- Returns:
Total number of processes involved in distributed training
- Return type:
int
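A hypothetical sketch of how the rank and world size might be used to shard a toy dataset across processes; the TensorDataset here is illustrative, and the call assumes the returned values match what torch.utils.data.DistributedSampler expects for rank and num_replicas.

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

from gigl.common.utils.torch_training import get_rank, get_world_size

dataset = TensorDataset(torch.arange(1000, dtype=torch.float32))  # toy data

# RANK and WORLD_SIZE are injected by the Kubeflow PyTorchJob launcher;
# these helpers expose them so each process reads a disjoint shard.
sampler = DistributedSampler(
    dataset,
    num_replicas=get_world_size(),
    rank=get_rank(),
    shuffle=True,
)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
```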
- gigl.common.utils.torch_training.is_distributed_available_and_initialized()[source]#
- Returns:
True if distributed training is available and initialized, False otherwise
- Return type:
bool
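A hedged example of using this check to guard rank-0-only work so the same code runs in both single-process and distributed modes; log_on_rank_zero is a hypothetical helper, not part of gigl.

```python
import torch.distributed as dist

from gigl.common.utils.torch_training import (
    get_rank,
    is_distributed_available_and_initialized,
)

def log_on_rank_zero(message: str) -> None:
    # Hypothetical helper: in a single-process run the guard is False
    # and every call prints; in a distributed run only rank 0 prints,
    # and a barrier keeps the processes loosely in sync.
    if is_distributed_available_and_initialized():
        if get_rank() == 0:
            print(message)
        dist.barrier()
    else:
        print(message)

log_on_rank_zero("epoch finished")
```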
- gigl.common.utils.torch_training.is_distributed_local_debug()[source]#
For local debugging purposes only. Sets the environment variables necessary for distributed training on a local machine.
- Returns:
If True, should_distribute() exits early and enables distributed training
- Return type:
bool
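An illustrative check of local debug mode. The environment variables printed (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) are the ones torch.distributed conventionally reads; the docstring does not say which of them this helper actually sets, and should_distribute() is called with no arguments here as an assumption about its signature.

```python
import os

from gigl.common.utils.torch_training import (
    is_distributed_local_debug,
    should_distribute,
)

if is_distributed_local_debug():
    # Per the docstring above, should_distribute() exits early and
    # enables distributed training when local debug mode is on.
    print("Local distributed-debug mode:", should_distribute())

    # Assumption: these are the variables torch.distributed usually
    # reads from the environment for a single-machine run.
    for var in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE"):
        print(var, "=", os.environ.get(var))
```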