gigl.common.utils.vertex_ai_context#
Utility functions to be used by machines running on Vertex AI.
Attributes#
Classes#
| ClusterSpec | Represents the cluster specification for a Vertex AI custom job. |
| TaskInfo | Information about the current task running on this node. |
Functions#
| connect_worker_pool() | Used to connect the worker pool. This function should be called by all workers to get the leader worker’s internal IP address and to ensure that the workers can all communicate with the leader worker. |
| get_cluster_spec() | Parse the cluster specification from the CLUSTER_SPEC environment variable. |
| get_host_name() | Get the current machine’s hostname. |
| get_leader_hostname() | Hostname of the machine that will host the process with rank 0. It is used to synchronize the workers. |
| get_leader_port() | A free port on the machine that will host the process with rank 0. |
| get_rank() | Rank of the current VAI process, which tells the process whether it is the master or a worker. |
| get_vertex_ai_job_id() | Get the Vertex AI job ID. |
| get_world_size() | The total number of processes that VAI creates. Note that VAI only creates one process per machine. |
|  | Check if the code is running in a Vertex AI job. |
Module Contents#
- class gigl.common.utils.vertex_ai_context.ClusterSpec[source]#
- Represents the cluster specification for a Vertex AI custom job. See the docs for more info: https://cloud.google.com/vertex-ai/docs/training/distributed-training#cluster-variables 
- class gigl.common.utils.vertex_ai_context.TaskInfo[source]#
- Information about the current task running on this node. 
- gigl.common.utils.vertex_ai_context.connect_worker_pool()[source]#
- Used to connect the worker pool. This function should be called by all workers to get the leader worker’s internal IP address and to ensure that the workers can all communicate with the leader worker.
- Return type:
 
- gigl.common.utils.vertex_ai_context.get_cluster_spec()[source]#
- Parse the cluster specification from the CLUSTER_SPEC environment variable. Based on the spec given at: https://cloud.google.com/vertex-ai/docs/training/distributed-training#cluster-variables
- Returns:
- Parsed cluster specification data. 
- Return type:
- Raises:
- ValueError – If not running in a Vertex AI job or CLUSTER_SPEC is not found. 
- json.JSONDecodeError – If CLUSTER_SPEC contains invalid JSON. 
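As a rough illustration of the parsing and error behavior described above, the sketch below reads CLUSTER_SPEC and pulls out the task fields. It is not GiGL's implementation: `parse_cluster_spec` and the sample payload are hypothetical, and the field names (`cluster`, `task`, `type`, `index`) are taken from the Vertex AI documentation linked above.

```python
import json
import os


def parse_cluster_spec(environ=os.environ):
    """Hypothetical helper: parse CLUSTER_SPEC into a plain dict."""
    raw = environ.get("CLUSTER_SPEC")
    if raw is None:
        raise ValueError("CLUSTER_SPEC not found; is this a Vertex AI job?")
    # json.loads raises json.JSONDecodeError on malformed input.
    return json.loads(raw)


# Illustrative payload only; real values are set by Vertex AI.
sample_env = {
    "CLUSTER_SPEC": json.dumps(
        {
            "cluster": {
                "workerpool0": ["host-0:2222"],
                "workerpool1": ["host-1:2222"],
            },
            "task": {"type": "workerpool0", "index": 0},
        }
    )
}

spec = parse_cluster_spec(sample_env)
task_type = spec["task"]["type"]
task_index = spec["task"]["index"]
```

Passing the environment as an argument keeps the sketch testable outside a Vertex AI job; the real get_cluster_spec() takes no arguments and returns a typed object rather than a dict.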
 
 
- gigl.common.utils.vertex_ai_context.get_host_name()[source]#
- Get the current machine’s hostname. Raises an error if not running on Vertex AI.
- Return type:
- str 
 
- gigl.common.utils.vertex_ai_context.get_leader_hostname()[source]#
- Hostname of the machine that will host the process with rank 0. It is used to synchronize the workers.
- VAI does not automatically set this for single-replica jobs, hence the default value of “localhost”. Raises an error if not running on Vertex AI.
- Return type:
- str 
 
- gigl.common.utils.vertex_ai_context.get_leader_port()[source]#
- A free port on the machine that will host the process with rank 0.
- VAI does not automatically set this for single-replica jobs, hence the default value of 29500. This is a PyTorch convention (see pytorch/pytorch). Raises an error if not running on Vertex AI.
- Return type:
- int 
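The single-replica defaults described for get_leader_hostname() and get_leader_port() can be pictured with the sketch below. It assumes the values come from the MASTER_ADDR and MASTER_PORT environment variables, which is the usual PyTorch convention but is not stated on this page; `leader_address` is a hypothetical helper, not part of GiGL.

```python
import os


def leader_address(environ=os.environ):
    """Hypothetical helper: (host, port) with single-replica fallbacks."""
    # Assumption: MASTER_ADDR / MASTER_PORT are the variables consulted.
    host = environ.get("MASTER_ADDR", "localhost")
    port = int(environ.get("MASTER_PORT", "29500"))
    return host, port


# No variables set, as in a single-replica job: fall back to the defaults.
single_replica = leader_address({})

# Variables set, as in a multi-replica job (illustrative values).
multi_replica = leader_address({"MASTER_ADDR": "10.0.0.2", "MASTER_PORT": "12345"})

# A TCP rendezvous URL of the kind torch.distributed accepts as init_method.
init_method = f"tcp://{multi_replica[0]}:{multi_replica[1]}"
```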
 
- gigl.common.utils.vertex_ai_context.get_rank()[source]#
- Rank of the current VAI process, which tells the process whether it is the master or a worker. Note that VAI only creates one process per machine; it is the user’s responsibility to create multiple processes per machine. This function therefore returns a single integer, for the main process that VAI creates.
- VAI does not automatically set this for single-replica jobs, hence the default value of 0. Raises an error if not running on Vertex AI.
- Return type:
- int 
 
- gigl.common.utils.vertex_ai_context.get_vertex_ai_job_id()[source]#
- Get the Vertex AI job ID. Raises an error if not running on Vertex AI.
- Return type:
- str 
 
- gigl.common.utils.vertex_ai_context.get_world_size()[source]#
- The total number of processes that VAI creates. Note that VAI only creates one process per machine; it is the user’s responsibility to create multiple processes per machine.
- VAI does not automatically set this for single-replica jobs, hence the default value of 1. Raises an error if not running on Vertex AI.
- Return type:
- int 
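Because VAI starts only one process per machine, a job that spawns several processes per machine has to derive its own global ranks. One common layout is sketched below. `global_rank` and `procs_per_machine` are hypothetical names, and the machine rank and machine count stand in for the values that get_rank() and get_world_size() would return.

```python
def global_rank(machine_rank: int, local_rank: int, procs_per_machine: int) -> int:
    """Hypothetical helper: give each machine a contiguous block of ranks."""
    if not 0 <= local_rank < procs_per_machine:
        raise ValueError("local_rank must be in [0, procs_per_machine)")
    return machine_rank * procs_per_machine + local_rank


num_machines = 2        # stand-in for get_world_size()
procs_per_machine = 4   # chosen by the user, e.g. one process per GPU

# World size seen by the user's processes, not by VAI.
total_world_size = num_machines * procs_per_machine

ranks = [
    global_rank(machine, proc, procs_per_machine)
    for machine in range(num_machines)
    for proc in range(procs_per_machine)
]
```

With this layout, global rank 0 is local process 0 on the machine whose VAI rank is 0, so it runs on the leader host.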
 
