gigl.common.utils.torch_training#

Attributes#

logger

Functions#

get_distributed_backend(use_cuda)

Returns the distributed backend based on whether distributed training is enabled and whether CUDA is used.

get_rank()

Returns the rank of the current process in distributed training; set automatically by the Kubeflow PyTorchJob launcher.

get_world_size()

Returns the total number of processes involved in distributed training; set automatically by the Kubeflow PyTorchJob launcher.

is_distributed_available_and_initialized()

Returns True if distributed training is available and initialized, False otherwise.

is_distributed_local_debug()

For local debugging purposes only; sets the environment variables required for distributed training on a local machine.

should_distribute()

Determines whether the process should be configured for distributed training.

Module Contents#

gigl.common.utils.torch_training.get_distributed_backend(use_cuda)[source]#

Returns the distributed backend based on whether distributed training is enabled and whether CUDA is used.

Returns:

The distributed backend (NCCL or GLOO) if distributed training is enabled, None otherwise

Return type:

Optional[str]

Parameters:

use_cuda (bool) – Whether CUDA is used for training
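
A minimal usage sketch (not part of this module) showing how the returned backend could be passed to torch.distributed.init_process_group. Lower-casing the returned name ("NCCL"/"GLOO") to the form PyTorch expects is an assumption about the return value.

```python
import torch
import torch.distributed as dist

from gigl.common.utils.torch_training import get_distributed_backend

# Pick a backend based on CUDA availability; returns None when distributed
# training is not enabled, in which case no process group is created.
backend = get_distributed_backend(use_cuda=torch.cuda.is_available())
if backend is not None:
    # Assumes env:// initialization variables (RANK, WORLD_SIZE, MASTER_ADDR,
    # MASTER_PORT) are already set, e.g. by the Kubeflow PyTorchJob launcher.
    dist.init_process_group(backend=backend.lower())
```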

gigl.common.utils.torch_training.get_rank()[source]#

Returns the rank (index) of the current process involved in distributed training. This is set automatically by the Kubeflow PyTorchJob launcher.

Returns:

The index of the process involved in distributed training

Return type:

int

gigl.common.utils.torch_training.get_world_size()[source]#

Returns the total number of processes involved in distributed training. This is set automatically by the Kubeflow PyTorchJob launcher.

Returns:

Total number of processes involved in distributed training

Return type:

int
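
A hedged example of how the reported rank and world size might drive data sharding with torch.utils.data.DistributedSampler; the toy dataset and batch size below are placeholders.

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

from gigl.common.utils.torch_training import get_rank, get_world_size

# Placeholder dataset; in practice this would be the training dataset.
dataset = TensorDataset(torch.arange(1000).float())

# Shard the dataset across processes using the rank and world size reported
# by the module (both derive from values set by the PyTorchJob launcher).
sampler = DistributedSampler(
    dataset,
    num_replicas=get_world_size(),
    rank=get_rank(),
)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
```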

gigl.common.utils.torch_training.is_distributed_available_and_initialized()[source]#

Returns:

True if distributed training is available and initialized, False otherwise

Return type:

bool
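
A sketch of a common guard pattern built on this check, assuming a scalar loss tensor named loss: collective operations run only when the default process group is initialized, so the same code also runs single-process.

```python
import torch
import torch.distributed as dist

from gigl.common.utils.torch_training import (
    is_distributed_available_and_initialized,
)

loss = torch.tensor(1.0)  # placeholder for a computed loss

# Average the loss across workers only when distributed training is active;
# otherwise the tensor is left untouched.
if is_distributed_available_and_initialized():
    dist.all_reduce(loss, op=dist.ReduceOp.SUM)
    loss /= dist.get_world_size()
```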

gigl.common.utils.torch_training.is_distributed_local_debug()[source]#

For local debugging purposes only. Sets the environment variables required for distributed training on a local machine.

Returns:

If True, should_distribute() exits early and distributed training is enabled

Return type:

bool
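
A minimal sketch of the documented interaction with should_distribute(); the specific environment variables this helper sets are intentionally not spelled out here.

```python
from gigl.common.utils.torch_training import (
    is_distributed_local_debug,
    should_distribute,
)

# Per the docstring: when the local-debug helper has populated the distributed
# environment variables, should_distribute() short-circuits and reports that
# distributed training should be enabled, even on a single local machine.
if is_distributed_local_debug():
    assert should_distribute()
```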

gigl.common.utils.torch_training.should_distribute()[source]#

Determines whether the process should be configured for distributed training.

Returns:

True if the process should be configured for distributed training, False otherwise

Return type:

bool
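
A hedged end-to-end sketch that gates process-group setup and teardown on this check; the fallback to a "gloo" backend when no backend is returned is a defensive assumption, not documented module behavior.

```python
import torch
import torch.distributed as dist

from gigl.common.utils.torch_training import (
    get_distributed_backend,
    should_distribute,
)


def train() -> None:
    # Only set up (and later tear down) the default process group when the
    # environment is configured for distributed training.
    is_distributed = should_distribute()
    if is_distributed:
        backend = get_distributed_backend(use_cuda=torch.cuda.is_available())
        dist.init_process_group(backend=backend.lower() if backend else "gloo")
    try:
        pass  # training loop would go here
    finally:
        if is_distributed and dist.is_initialized():
            dist.destroy_process_group()
```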

gigl.common.utils.torch_training.logger[source]#