gigl.src.data_preprocessor.lib.ingest.bigquery#

Attributes#

logger

Classes#

BigqueryEdgeDataReference

Data reference for running enumeration on edge data in BigQuery.

BigqueryNodeDataReference

Data reference for running enumeration on node data in BigQuery.

Module Contents#

class gigl.src.data_preprocessor.lib.ingest.bigquery.BigqueryEdgeDataReference[source]#

Bases: gigl.src.data_preprocessor.lib.ingest.reference.EdgeDataReference

Data reference for running enumeration on edge data in BigQuery. We provide the ability to perform sharded reads using the sharded_read_config field, where the input table is split into smaller shards and each shard is read separately. This is useful for large tables that would otherwise cause oversized status update payloads, leading to job failures. The sharded_read_config field is optional; if it is not provided, the table will not be sharded and will be read in a single ReadFromBigQuery call. General guidance is to use 10-30 shards for large tables, though this may need tuning depending on table size.

Parameters:
  • reference_uri (str) – BigQuery table URI for the edge data.

  • edge_type (EdgeType) – Edge type for the current reference.

  • edge_usage_type (EdgeUsageType) – Edge usage type for the current reference. Defaults to EdgeUsageType.MAIN.

  • src_identifier (Optional[str]) – Identifier for the source node. This field is overridden by the src identifier from the corresponding edge data preprocessing spec.

  • dst_identifier (Optional[str]) – Identifier for the destination node. This field is overridden by the dst identifier from the corresponding edge data preprocessing spec.

  • sharded_read_config (Optional[BigQueryShardedReadConfig]) – Configuration for performing sharded reads for the edge table. If not provided, the table will not be sharded and will be read in one ReadFromBigQuery call.

yield_instance_dict_ptransform(*args, **kwargs)[source]#

Returns a PTransform whose expand method returns a PCollection of InstanceDicts, which can subsequently be ingested and transformed via TensorFlow Transform.

TODO: extend to support multiple edge types being in the same table.

Return type:

gigl.src.data_preprocessor.lib.types.InstanceDictPTransform

sharded_read_config: gigl.common.beam.sharded_read.BigQueryShardedReadConfig | None = None[source]#
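The sharded read described above amounts to splitting one table read into num_shards smaller, disjoint reads. A minimal sketch of that idea, assuming a deterministic hash-based split on a key column (the actual BigQueryShardedReadConfig implementation may partition the table differently):

```python
def build_shard_queries(table: str, key_column: str, num_shards: int) -> list[str]:
    """Split a single-table read into `num_shards` disjoint filtered reads.

    Each query selects the rows whose hashed key falls into one bucket, so
    the union of all shards covers the table exactly once. This illustrates
    the concept only; the real sharded-read config may shard differently.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be >= 1")
    return [
        f"SELECT * FROM `{table}` "
        f"WHERE MOD(ABS(FARM_FINGERPRINT(CAST({key_column} AS STRING))), "
        f"{num_shards}) = {shard}"
        for shard in range(num_shards)
    ]

# Guidance from above: 10-30 shards for large tables.
queries = build_shard_queries("my_project.my_dataset.edges", "src_id", 10)
```

Because each shard is a separate read, no single ReadFromBigQuery result (and hence no single status update payload) has to cover the whole table.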
class gigl.src.data_preprocessor.lib.ingest.bigquery.BigqueryNodeDataReference[source]#

Bases: gigl.src.data_preprocessor.lib.ingest.reference.NodeDataReference

Data reference for running enumeration on node data in BigQuery. We provide the ability to perform sharded reads using the sharded_read_config field, where the input table is split into smaller shards and each shard is read separately. This is useful for large tables that would otherwise cause oversized status update payloads, leading to job failures. The sharded_read_config field is optional; if it is not provided, the table will not be sharded and will be read in a single ReadFromBigQuery call. General guidance is to use 10-30 shards for large tables, though this may need tuning depending on table size.

Parameters:
  • reference_uri (str) – BigQuery table URI for the node data.

  • node_type (NodeType) – Node type for the current reference

  • identifier (Optional[str]) – Identifier for the node. This field is overridden by the identifier from the corresponding node data preprocessing spec.

  • sharded_read_config (Optional[BigQueryShardedReadConfig]) – Configuration for performing sharded reads for the node table. If not provided, the table will not be sharded and will be read in one ReadFromBigQuery call.
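The 10-30 shard guidance above can be turned into a simple sizing rule. A sketch, where rows_per_shard and the clamping thresholds are illustrative assumptions rather than GiGL constants:

```python
def suggest_num_shards(estimated_rows: int,
                       min_shards: int = 10,
                       max_shards: int = 30,
                       rows_per_shard: int = 50_000_000) -> int:
    """Pick a shard count in the guided 10-30 range from a row estimate.

    Small tables are read in a single ReadFromBigQuery call (1 shard);
    larger tables are clamped into the guided range. rows_per_shard is an
    assumed sizing knob for this sketch, not a GiGL constant.
    """
    if estimated_rows <= rows_per_shard:
        return 1  # no sharding needed; one ReadFromBigQuery call
    wanted = -(-estimated_rows // rows_per_shard)  # ceiling division
    return max(min_shards, min(max_shards, wanted))
```

For example, an estimated 600M-row table yields 12 shards under these assumptions, while a 10B-row table is clamped to 30.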

yield_instance_dict_ptransform(*args, **kwargs)[source]#

Returns a PTransform whose expand method returns a PCollection of InstanceDicts, which can subsequently be ingested and transformed via TensorFlow Transform.

TODO: extend to support multiple node types being in the same table.

Return type:

gigl.src.data_preprocessor.lib.types.InstanceDictPTransform

sharded_read_config: gigl.common.beam.sharded_read.BigQueryShardedReadConfig | None = None[source]#
gigl.src.data_preprocessor.lib.ingest.bigquery.logger[source]#