gigl.common.utils.torch_training#

Attributes#

logger

Functions#

get_distributed_backend(use_cuda)

Returns the distributed backend based on whether distributed training is enabled and whether CUDA is used.

get_rank()

Returns the rank of the current process in distributed training; set automatically by the Kubeflow PyTorchJob launcher.

get_world_size()

Returns the total number of processes involved in distributed training; set automatically by the Kubeflow PyTorchJob launcher.

is_distributed_available_and_initialized()

Returns True if distributed training is available and initialized, False otherwise.

is_distributed_local_debug()

For local debugging purposes only; sets the environment variables required for distributed training on a local machine.

should_distribute()

Determines whether the process should be configured for distributed training.

Module Contents#

gigl.common.utils.torch_training.get_distributed_backend(use_cuda)[source]#

Returns the distributed backend based on whether distributed training is enabled and whether CUDA is used.

Returns:

The distributed backend (NCCL or GLOO) if distributed training is enabled, None otherwise

Return type:

Optional[str]

Parameters:

use_cuda (bool) – Whether CUDA is used for training
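
A minimal usage sketch (not part of this module) showing how the returned backend could be passed to torch.distributed.init_process_group. Lower-casing the returned name ("NCCL"/"GLOO") to the form PyTorch expects is an assumption about the return value.

```python
import torch
import torch.distributed as dist

from gigl.common.utils.torch_training import get_distributed_backend

# Pick a backend based on CUDA availability; returns None when distributed
# training is not enabled, in which case no process group is created.
backend = get_distributed_backend(use_cuda=torch.cuda.is_available())
if backend is not None:
    # Assumes env:// initialization variables (RANK, WORLD_SIZE, MASTER_ADDR,
    # MASTER_PORT) are already set, e.g. by the Kubeflow PyTorchJob launcher.
    dist.init_process_group(backend=backend.lower())
```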

gigl.common.utils.torch_training.get_rank()[source]#

Returns the rank (index) of the current process involved in distributed training. This is set automatically by the Kubeflow PyTorchJob launcher.

Returns:

The index of the process involved in distributed training

Return type:

int

gigl.common.utils.torch_training.get_world_size()[source]#

Returns the total number of processes involved in distributed training. This is set automatically by the Kubeflow PyTorchJob launcher.

Returns:

Total number of processes involved in distributed training

Return type:

int
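
A hedged example of how the reported rank and world size might drive data sharding with torch.utils.data.DistributedSampler; the toy dataset and batch size below are placeholders.

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

from gigl.common.utils.torch_training import get_rank, get_world_size

# Placeholder dataset; in practice this would be the training dataset.
dataset = TensorDataset(torch.arange(1000).float())

# Shard the dataset across processes using the rank and world size reported
# by the module (both derive from values set by the PyTorchJob launcher).
sampler = DistributedSampler(
    dataset,
    num_replicas=get_world_size(),
    rank=get_rank(),
)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
```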

gigl.common.utils.torch_training.is_distributed_available_and_initialized()[source]#

Returns:

True if distributed training is available and initialized, False otherwise

Return type:

bool
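
A sketch of a common guard pattern built on this check, assuming a scalar loss tensor named loss: collective operations run only when the default process group is initialized, so the same code also runs single-process.

```python
import torch
import torch.distributed as dist

from gigl.common.utils.torch_training import (
    is_distributed_available_and_initialized,
)

loss = torch.tensor(1.0)  # placeholder for a computed loss

# Average the loss across workers only when distributed training is active;
# otherwise the tensor is left untouched.
if is_distributed_available_and_initialized():
    dist.all_reduce(loss, op=dist.ReduceOp.SUM)
    loss /= dist.get_world_size()
```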

gigl.common.utils.torch_training.is_distributed_local_debug()[source]#

For local debugging purposes only. Sets the environment variables required for distributed training on a local machine.

Returns:

If True, should_distribute() exits early and distributed training is enabled

Return type:

bool
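
A minimal sketch of the documented interaction with should_distribute(); the specific environment variables this helper sets are intentionally not spelled out here.

```python
from gigl.common.utils.torch_training import (
    is_distributed_local_debug,
    should_distribute,
)

# Per the docstring: when the local-debug helper has populated the distributed
# environment variables, should_distribute() short-circuits and reports that
# distributed training should be enabled, even on a single local machine.
if is_distributed_local_debug():
    assert should_distribute()
```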

gigl.common.utils.torch_training.should_distribute()[source]#

Determines whether the process should be configured for distributed training.

Returns:

True if the process should be configured for distributed training, False otherwise

Return type:

bool
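
A hedged end-to-end sketch that gates process-group setup and teardown on this check; the fallback to a "gloo" backend when no backend is returned is a defensive assumption, not documented module behavior.

```python
import torch
import torch.distributed as dist

from gigl.common.utils.torch_training import (
    get_distributed_backend,
    should_distribute,
)


def train() -> None:
    # Only set up (and later tear down) the default process group when the
    # environment is configured for distributed training.
    is_distributed = should_distribute()
    if is_distributed:
        backend = get_distributed_backend(use_cuda=torch.cuda.is_available())
        dist.init_process_group(backend=backend.lower() if backend else "gloo")
    try:
        pass  # training loop would go here
    finally:
        if is_distributed and dist.is_initialized():
            dist.destroy_process_group()
```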

gigl.common.utils.torch_training.logger[source]#