gigl.common.services.dataproc#
Classes#
| DataprocService | A service class that provides methods to interact with Google Cloud Dataproc. |
Module Contents#
- class gigl.common.services.dataproc.DataprocService(project_id, region)[source]#
- A service class that provides methods to interact with Google Cloud Dataproc. - Parameters:
- project_id (str) – The ID of the Google Cloud project. 
- region (str) – The region where the Dataproc cluster is located. 
 
 - create_cluster(cluster_spec)[source]#
- Creates a Dataproc cluster. - Parameters:
- cluster_spec (dict) – A dictionary containing the cluster specification. For more details, refer to the documentation at: https://cloud.google.com/python/docs/reference/dataproc/latest/google.cloud.dataproc_v1.types.Cluster 
- Returns:
- None 
- Return type:
- None 
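A minimal `cluster_spec` might look like the following sketch. The field names follow the Dataproc v1 `Cluster` message linked above, but the project, cluster name, and machine types are illustrative assumptions:

```python
# Sketch of a cluster_spec dict for create_cluster(); field names follow the
# Dataproc v1 Cluster message, but every value below is a placeholder.
cluster_spec = {
    "project_id": "my-gcp-project",        # hypothetical project ID
    "cluster_name": "gigl-spark-cluster",  # hypothetical cluster name
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}
```

With a `DataprocService` instance in hand, passing this dict to `create_cluster(cluster_spec)` would provision the cluster.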
 
 - delete_cluster(cluster_name)[source]#
- Deletes a cluster with the given name. - Parameters:
- cluster_name (str) – The name of the cluster to delete. 
- Returns:
- None 
- Return type:
- None 
 
 - does_cluster_exist(cluster_name)[source]#
- Checks if a cluster with the given name exists. - Parameters:
- cluster_name (str) – The name of the cluster to check. 
- Returns:
- True if the cluster exists, False otherwise. 
- Return type:
- bool 
 
 - get_running_job_ids_on_cluster(cluster_name)[source]#
- Retrieves the running job IDs on the specified cluster. - Parameters:
- cluster_name (str) – The name of the cluster. 
- Returns:
- The running job IDs on the cluster. 
- Return type:
- list[str] 
 
 - get_submitted_job_ids(cluster_name)[source]#
- Retrieves the job IDs of all active jobs submitted to a specific cluster. - Parameters:
- cluster_name (str) – The name of the cluster. 
- Returns:
- The job IDs of all active jobs submitted to the cluster. 
- Return type:
- list[str] 
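Because `submit_and_wait_scala_spark_job` can be configured to fail when jobs are already running, a caller may want to pre-check idleness with `get_running_job_ids_on_cluster`. A minimal sketch (the helper name is ours; `service` is assumed to be a `DataprocService` instance, or anything exposing the same method):

```python
def cluster_is_idle(service, cluster_name):
    """Return True if no jobs are currently running on `cluster_name`.

    `service` is assumed to expose get_running_job_ids_on_cluster(cluster_name),
    which returns a list[str] of running job IDs (empty when idle).
    """
    return not service.get_running_job_ids_on_cluster(cluster_name)
```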
 
 - submit_and_wait_scala_spark_job(cluster_name, max_job_duration, main_jar_file_uri, runtime_args=[], extra_jar_file_uris=[], properties={}, fail_if_job_already_running_on_cluster=True)[source]#
- Submits a Scala Spark job to a Dataproc cluster and waits for its completion. - Parameters:
- cluster_name (str) – The name of the Dataproc cluster. 
- max_job_duration (datetime.timedelta) – The maximum duration allowed for the job to run. 
- main_jar_file_uri (Uri) – The URI of the main jar file for the Spark job. 
- runtime_args (Optional[list[str]]) – Additional runtime arguments for the Spark job. Defaults to []. 
- extra_jar_file_uris (Optional[list[str]]) – Additional jar file URIs for the Spark job. Defaults to []. 
- properties (Optional[dict]) – Properties to set on the Spark job. Defaults to {}. 
- fail_if_job_already_running_on_cluster (Optional[bool]) – Whether to fail if there are already running jobs on the cluster. Defaults to True. 
 
- Returns:
- None 
- Return type:
- None 
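A call might be sketched as follows. The jar URI, runtime arguments, and Spark properties are illustrative assumptions, and `service` is assumed to be a `DataprocService(project_id=..., region=...)` instance:

```python
from datetime import timedelta

# Illustrative values; the bucket path and argument names are assumptions,
# not defaults shipped with gigl.
MAX_JOB_DURATION = timedelta(hours=2)
MAIN_JAR = "gs://my-bucket/jars/my-spark-job.jar"  # hypothetical GCS path

def run_scala_job(service, cluster_name):
    """Submit the Scala Spark job and block until it completes.

    `service` is assumed to be a DataprocService instance; the call returns
    None and raises if the job fails or exceeds max_job_duration.
    """
    service.submit_and_wait_scala_spark_job(
        cluster_name=cluster_name,
        max_job_duration=MAX_JOB_DURATION,
        main_jar_file_uri=MAIN_JAR,
        runtime_args=["--date", "2024-01-01"],
        properties={"spark.executor.memory": "4g"},
        fail_if_job_already_running_on_cluster=True,
    )
```

Leaving `fail_if_job_already_running_on_cluster=True` avoids queueing a second job behind one already occupying the cluster; set it to False only when concurrent jobs are intended.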
 
 
