gigl.common.utils.vertex_ai_context#

Utility functions to be used by machines running on Vertex AI.

Attributes#

logger

Classes#

ClusterSpec

Represents the cluster specification for a Vertex AI custom job.

TaskInfo

Information about the current task running on this node.

Functions#

connect_worker_pool()

Used to connect the worker pool. This function should be called by all workers.

get_cluster_spec()

Parse the cluster specification from the CLUSTER_SPEC environment variable.

get_host_name()

Get the current machine's hostname.

get_leader_hostname()

Hostname of the machine that will host the process with rank 0. It is used to synchronize the workers.

get_leader_port()

A free port on the machine that will host the process with rank 0.

get_rank()

Rank of the current VAI process, which indicates whether it is the master or a worker.

get_vertex_ai_job_id()

Get the Vertex AI job ID.

get_world_size()

The total number of processes that VAI creates. Note that VAI only creates one process per machine.

is_currently_running_in_vertex_ai_job()

Check if the code is running in a Vertex AI job.

Module Contents#

class gigl.common.utils.vertex_ai_context.ClusterSpec[source]#

Represents the cluster specification for a Vertex AI custom job. See the docs for more info: https://cloud.google.com/vertex-ai/docs/training/distributed-training#cluster-variables

classmethod from_json(json_str)[source]#

Instantiates ClusterSpec from a JSON string.

Parameters:

json_str (str)

Return type:

ClusterSpec

cluster: dict[str, list[str]][source]#
environment: str[source]#
job: google.cloud.aiplatform_v1.types.CustomJobSpec | None = None[source]#
task: TaskInfo[source]#
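
A minimal sketch of building a ClusterSpec from a raw JSON string. The payload below is illustrative only; it follows the CLUSTER_SPEC format documented at the link above, and assumes from_json populates the nested task field as a TaskInfo.

```python
import json

from gigl.common.utils.vertex_ai_context import ClusterSpec

# Illustrative CLUSTER_SPEC payload in the documented Vertex AI format;
# hostnames and pool sizes are made up for this example.
raw_spec = json.dumps(
    {
        "cluster": {
            "workerpool0": ["replica-0:2222"],
            "workerpool1": ["replica-1:2222", "replica-2:2222"],
        },
        "environment": "cloud",
        "task": {"type": "workerpool0", "index": 0},
    }
)

cluster_spec = ClusterSpec.from_json(raw_spec)
print(cluster_spec.task.type, cluster_spec.task.index)  # workerpool0 0
print(cluster_spec.cluster["workerpool1"])  # the two worker hosts
```
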
class gigl.common.utils.vertex_ai_context.TaskInfo[source]#

Information about the current task running on this node.

index: int[source]#
trial: str | None = None[source]#
type: str[source]#
gigl.common.utils.vertex_ai_context.connect_worker_pool()[source]#

Used to connect the worker pool. This function should be called by all workers to get the leader worker’s internal IP address and to ensure that the workers can all communicate with the leader worker.

Return type:

gigl.env.distributed.DistributedContext
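
A hedged usage sketch: every replica calls connect_worker_pool() from its entrypoint, and the returned DistributedContext is then handed to whatever downstream code needs the cluster topology.

```python
from gigl.common.utils.vertex_ai_context import (
    connect_worker_pool,
    is_currently_running_in_vertex_ai_job,
)

# All replicas in the Vertex AI worker pool run the same entrypoint and call
# connect_worker_pool(); the call resolves the leader's internal IP address
# and verifies that every worker can reach the leader before returning.
if is_currently_running_in_vertex_ai_job():
    distributed_context = connect_worker_pool()
    # Pass the gigl.env.distributed.DistributedContext on to components
    # that need to know about the cluster topology.
```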

gigl.common.utils.vertex_ai_context.get_cluster_spec()[source]#

Parse the cluster specification from the CLUSTER_SPEC environment variable. Based on the spec given at: https://cloud.google.com/vertex-ai/docs/training/distributed-training#cluster-variables

Returns:

Parsed cluster specification data.

Return type:

ClusterSpec

Raises:
  • ValueError – If not running in a Vertex AI job or CLUSTER_SPEC is not found.

  • json.JSONDecodeError – If CLUSTER_SPEC contains invalid JSON.
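
A sketch of calling get_cluster_spec() with the documented failure modes handled; the local fallback shown here is only an example of what a caller might do.

```python
import json

from gigl.common.utils.vertex_ai_context import get_cluster_spec

try:
    spec = get_cluster_spec()
except ValueError:
    # Not running on Vertex AI, or CLUSTER_SPEC is missing; a caller might
    # fall back to a local, single-machine code path here.
    spec = None
except json.JSONDecodeError:
    # CLUSTER_SPEC was present but contained invalid JSON.
    raise

if spec is not None:
    print(f"Task {spec.task.type}[{spec.task.index}] in a {len(spec.cluster)}-pool cluster")
```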

gigl.common.utils.vertex_ai_context.get_host_name()[source]#

Get the current machine's hostname. Throws if not on Vertex AI.

Return type:

str

gigl.common.utils.vertex_ai_context.get_leader_hostname()[source]#

Hostname of the machine that will host the process with rank 0. It is used to synchronize the workers.

VAI does not automatically set this for single-replica jobs, hence the default value of “localhost”. Throws if not on Vertex AI.

Return type:

str

gigl.common.utils.vertex_ai_context.get_leader_port()[source]#

A free port on the machine that will host the process with rank 0.

VAI does not automatically set this for single-replica jobs, hence the default value of 29500, which follows the PyTorch convention (see pytorch/pytorch). Throws if not on Vertex AI.

Return type:

int
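
One common use of the leader hostname and port, shown here as a hedged sketch rather than as how GiGL wires things internally, is to build the rendezvous address for a torch.distributed process group (one process per machine, matching what VAI launches).

```python
import torch.distributed as dist

from gigl.common.utils.vertex_ai_context import (
    get_leader_hostname,
    get_leader_port,
    get_rank,
    get_world_size,
)

# Rank 0 lives on the leader machine, so its hostname and port serve as the
# rendezvous endpoint for every process in the job.
dist.init_process_group(
    backend="gloo",  # or "nccl" on GPU machines
    init_method=f"tcp://{get_leader_hostname()}:{get_leader_port()}",
    rank=get_rank(),
    world_size=get_world_size(),
)
```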

gigl.common.utils.vertex_ai_context.get_rank()[source]#

Rank of the current VAI process, which indicates whether it is the master or a worker. Note that VAI only creates one process per machine; it is the user's responsibility to create multiple processes per machine. This function therefore returns a single integer for the main process that VAI creates.

VAI does not automatically set this for single-replica jobs, hence the default value of 0. Throws if not on Vertex AI.

Return type:

int

gigl.common.utils.vertex_ai_context.get_vertex_ai_job_id()[source]#

Get the Vertex AI job ID. Throws if not on Vertex AI.

Return type:

str

gigl.common.utils.vertex_ai_context.get_world_size()[source]#

The total number of processes that VAI creates. Note that VAI only creates one process per machine. It is the user’s responsibility to create multiple processes per machine.

VAI does not automatically set this for single-replica jobs, hence the default value of 1. Throws if not on Vertex AI.

Return type:

int
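
Since VAI assigns one rank per machine, a caller that spawns several worker processes per machine (e.g. one per GPU) has to derive per-process global ranks itself. A sketch of the usual arithmetic, with local_world_size as an illustrative value:

```python
from gigl.common.utils.vertex_ai_context import get_rank, get_world_size

local_world_size = 4             # illustrative: e.g. number of GPUs per machine
machine_rank = get_rank()        # one rank per machine, assigned by VAI
machine_count = get_world_size() # one process per machine, created by VAI

global_world_size = machine_count * local_world_size
for local_rank in range(local_world_size):
    global_rank = machine_rank * local_world_size + local_rank
    # Spawn a worker process here (e.g. via torch.multiprocessing) and hand
    # it (global_rank, global_world_size, local_rank).
```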

gigl.common.utils.vertex_ai_context.is_currently_running_in_vertex_ai_job()[source]#

Check if the code is running in a Vertex AI job.

Returns:

True if running in a Vertex AI job, False otherwise.

Return type:

bool
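
Accessors such as get_vertex_ai_job_id() throw when not on Vertex AI, so code paths that may also run locally typically guard them; a small sketch:

```python
from gigl.common.utils.vertex_ai_context import (
    get_vertex_ai_job_id,
    is_currently_running_in_vertex_ai_job,
)

if is_currently_running_in_vertex_ai_job():
    print(f"Running inside Vertex AI job {get_vertex_ai_job_id()}")
else:
    print("Running outside of Vertex AI (e.g. local development)")
```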

gigl.common.utils.vertex_ai_context.logger[source]#