gigl.common.services.dataproc#

Attributes#

logger

Classes#

DataprocService

A service class that provides methods to interact with Google Cloud Dataproc.

Module Contents#

class gigl.common.services.dataproc.DataprocService(project_id, region)[source]#

A service class that provides methods to interact with Google Cloud Dataproc.

Parameters:
  • project_id (str) – The ID of the Google Cloud project.

  • region (str) – The region where the Dataproc cluster is located.
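
A minimal construction sketch; the project ID and region below are placeholders, not values from this library:

    from gigl.common.services.dataproc import DataprocService

    # Placeholder project and region; substitute your own values.
    service = DataprocService(project_id="my-gcp-project", region="us-central1")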

create_cluster(cluster_spec)[source]#

Creates a Dataproc cluster.

Parameters:

cluster_spec (dict) – A dictionary containing the cluster specification. For more details, refer to the documentation at: https://cloud.google.com/python/docs/reference/dataproc/latest/google.cloud.dataproc_v1.types.Cluster

Returns:

None

Return type:

None
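
An illustrative cluster_spec following the google.cloud.dataproc_v1.types.Cluster schema linked above; the cluster name, machine types, and instance counts are placeholders:

    # Minimal spec sketch; tune machine types and sizes for your workload.
    cluster_spec = {
        "project_id": "my-gcp-project",
        "cluster_name": "my-cluster",
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        },
    }
    service.create_cluster(cluster_spec=cluster_spec)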

delete_cluster(cluster_name)[source]#

Deletes a cluster with the given name.

Parameters:

cluster_name (str) – The name of the cluster to delete.

Returns:

None

Return type:

None

does_cluster_exist(cluster_name)[source]#

Checks if a cluster with the given name exists.

Parameters:

cluster_name (str) – The name of the cluster to check.

Returns:

True if the cluster exists, False otherwise.

Return type:

bool
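
does_cluster_exist pairs naturally with delete_cluster for idempotent teardown; a short sketch with a placeholder cluster name:

    # Delete the cluster only if it is actually present.
    if service.does_cluster_exist(cluster_name="my-cluster"):
        service.delete_cluster(cluster_name="my-cluster")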

get_running_job_ids_on_cluster(cluster_name)[source]#

Retrieves the running job IDs on the specified cluster.

Parameters:

cluster_name (str) – The name of the cluster.

Returns:

The running job IDs on the cluster.

Return type:

List[str]

get_submitted_job_ids(cluster_name)[source]#

Retrieves the job IDs of all active jobs submitted to a specific cluster.

Parameters:

cluster_name (str) – The name of the cluster.

Returns:

The job IDs of all active jobs submitted to the cluster.

Return type:

List[str]
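
A sketch that inspects cluster activity before scheduling more work; the cluster name is a placeholder:

    running = service.get_running_job_ids_on_cluster(cluster_name="my-cluster")
    submitted = service.get_submitted_job_ids(cluster_name="my-cluster")
    print(f"{len(running)} running jobs, {len(submitted)} active jobs on the cluster")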

submit_and_wait_scala_spark_job(cluster_name, max_job_duration, main_jar_file_uri, runtime_args=[], extra_jar_file_uris=[], properties={}, fail_if_job_already_running_on_cluster=True)[source]#

Submits a Scala Spark job to a Dataproc cluster and waits for its completion.

Parameters:
  • cluster_name (str) – The name of the Dataproc cluster.

  • max_job_duration (datetime.timedelta) – The maximum duration allowed for the job to run.

  • main_jar_file_uri (Uri) – The URI of the main jar file for the Spark job.

  • runtime_args (Optional[List[str]]) – Additional runtime arguments for the Spark job. Defaults to [].

  • extra_jar_file_uris (Optional[List[str]]) – Additional jar file URIs for the Spark job. Defaults to [].

  • properties (Optional[dict]) – Properties used to configure the Spark job. Defaults to {}.

  • fail_if_job_already_running_on_cluster (Optional[bool]) – Whether to fail if there are already running jobs on the cluster. Defaults to True.

Returns:

None

Return type:

None
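
A submission sketch. The jar path, runtime arguments, and Spark properties are placeholders, and how the Uri for the main jar is constructed is an assumption; adapt it to your codebase's Uri helper:

    from datetime import timedelta

    # Hypothetical: build the Uri however your codebase does.
    main_jar = ...  # a Uri pointing at gs://my-bucket/jobs/my-job.jar

    # Blocks until the job completes or max_job_duration is exceeded.
    service.submit_and_wait_scala_spark_job(
        cluster_name="my-cluster",
        max_job_duration=timedelta(hours=2),
        main_jar_file_uri=main_jar,
        runtime_args=["--env", "prod"],
        properties={"spark.executor.memory": "8g"},
    )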

cluster_client[source]#
job_client[source]#
project_id[source]#
region[source]#
gigl.common.services.dataproc.logger[source]#