gigl.src.data_preprocessor.lib.ingest.bigquery#
Classes#
- BigqueryEdgeDataReference: Data reference for running enumeration on edge data in BigQuery.
- BigqueryNodeDataReference: Data reference for running enumeration on node data in BigQuery.
Module Contents#
- class gigl.src.data_preprocessor.lib.ingest.bigquery.BigqueryEdgeDataReference[source]#
Bases:
gigl.src.data_preprocessor.lib.ingest.reference.EdgeDataReference
Data reference for running enumeration on edge data in BigQuery. Sharded reads can be enabled via the sharded_read_config field, which splits the input table into smaller shards that are each read separately. This is useful for large tables that would otherwise cause oversized status update payloads, leading to job failures. If sharded_read_config is not provided, the table is not sharded and is read in a single ReadFromBigQuery call. General guidance is to use 10-30 shards for large tables, though the shard count may need tuning depending on the table size.
- Parameters:
reference_uri (str) – BigQuery table URI for the edge data.
edge_type (EdgeType) – Edge type for the current reference.
edge_usage_type (EdgeUsageType) – Edge usage type for the current reference. Defaults to EdgeUsageType.MAIN.
src_identifier (Optional[str]) – Identifier for the source node. This field is overridden by the src identifier from the corresponding edge data preprocessing spec.
dst_identifier (Optional[str]) – Identifier for the destination node. This field is overridden by the dst identifier from the corresponding edge data preprocessing spec.
sharded_read_config (Optional[BigQueryShardedReadConfig]) – Configuration for performing sharded reads for the edge table. If not provided, the table will not be sharded and will be read in one ReadFromBigQuery call.
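As a rough illustration of the fields above, here is a minimal stand-in sketch. The dataclass below is hypothetical and only mirrors the documented parameter names; in real usage you would import BigqueryEdgeDataReference from gigl.src.data_preprocessor.lib.ingest.bigquery, and edge_type / edge_usage_type would be the library's EdgeType / EdgeUsageType values rather than plain strings.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in mirroring the documented fields of
# BigqueryEdgeDataReference; types are simplified to strings.
@dataclass
class EdgeDataReferenceSketch:
    reference_uri: str                    # BigQuery table URI, e.g. "project.dataset.table"
    edge_type: str                        # stands in for EdgeType
    edge_usage_type: str = "MAIN"         # stands in for the EdgeUsageType.MAIN default
    src_identifier: Optional[str] = None  # overridden by the edge preprocessing spec
    dst_identifier: Optional[str] = None  # overridden by the edge preprocessing spec
    sharded_read_config: Optional[object] = None  # None -> one ReadFromBigQuery call

ref = EdgeDataReferenceSketch(
    reference_uri="my-project.my_dataset.user_follows_user",
    edge_type="user-follows-user",
    src_identifier="src_user_id",
    dst_identifier="dst_user_id",
)
```

Leaving sharded_read_config as None keeps the default single-call read path described above.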
- yield_instance_dict_ptransform(*args, **kwargs)[source]#
Returns a PTransform whose expand method returns a PCollection of InstanceDicts, which can subsequently be ingested and transformed via TensorFlow Transform.
TODO: extend to support multiple edge types being in the same table.
- sharded_read_config: gigl.common.beam.sharded_read.BigQueryShardedReadConfig | None = None[source]#
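The exact fields of BigQueryShardedReadConfig are not shown on this page, but the general idea of sharded reads can be sketched as splitting the table's row range into contiguous shards that are read separately. The helper below is illustrative only, not the library's implementation.

```python
from typing import List, Tuple

def shard_ranges(total_rows: int, num_shards: int) -> List[Tuple[int, int]]:
    """Split [0, total_rows) into num_shards contiguous half-open ranges.

    Each range would correspond to one shard read as a separate
    ReadFromBigQuery-style call, keeping per-read payloads small.
    """
    base, rem = divmod(total_rows, num_shards)
    ranges, start = [], 0
    for i in range(num_shards):
        size = base + (1 if i < rem else 0)  # spread the remainder over early shards
        ranges.append((start, start + size))
        start += size
    return ranges

# 10-30 shards is the suggested range for large tables.
print(shard_ranges(100, 3))  # [(0, 34), (34, 67), (67, 100)]
```

Every row falls in exactly one shard, so the shards together cover the full table.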
- class gigl.src.data_preprocessor.lib.ingest.bigquery.BigqueryNodeDataReference[source]#
Bases:
gigl.src.data_preprocessor.lib.ingest.reference.NodeDataReference
Data reference for running enumeration on node data in BigQuery. Sharded reads can be enabled via the sharded_read_config field, which splits the input table into smaller shards that are each read separately. This is useful for large tables that would otherwise cause oversized status update payloads, leading to job failures. If sharded_read_config is not provided, the table is not sharded and is read in a single ReadFromBigQuery call. General guidance is to use 10-30 shards for large tables, though the shard count may need tuning depending on the table size.
- Parameters:
reference_uri (str) – BigQuery table URI for the node data.
node_type (NodeType) – Node type for the current reference
identifier (Optional[str]) – Identifier for the node. This field is overridden by the identifier from the corresponding node data preprocessing spec.
sharded_read_config (Optional[BigQueryShardedReadConfig]) – Configuration for performing sharded reads for the node table. If not provided, the table will not be sharded and will be read in one ReadFromBigQuery call.
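A node reference with sharding enabled might be constructed along these lines. Both dataclasses below are hypothetical stand-ins: the real classes are BigqueryNodeDataReference and gigl.common.beam.sharded_read.BigQueryShardedReadConfig, and the num_shards field is an assumed name, since this page does not list BigQueryShardedReadConfig's fields.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ShardedReadConfigSketch:
    # Assumed field name; 10-30 shards is the suggested range for large tables.
    num_shards: int

# Hypothetical stand-in mirroring the documented fields of BigqueryNodeDataReference.
@dataclass
class NodeDataReferenceSketch:
    reference_uri: str                # BigQuery table URI for the node data
    node_type: str                    # stands in for NodeType
    identifier: Optional[str] = None  # overridden by the node preprocessing spec
    sharded_read_config: Optional[ShardedReadConfigSketch] = None

ref = NodeDataReferenceSketch(
    reference_uri="my-project.my_dataset.users",
    node_type="user",
    identifier="user_id",
    sharded_read_config=ShardedReadConfigSketch(num_shards=20),
)
```

Here the node table would be split into 20 shards, each read with its own call, rather than one large read.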
- yield_instance_dict_ptransform(*args, **kwargs)[source]#
Returns a PTransform whose expand method returns a PCollection of InstanceDicts, which can subsequently be ingested and transformed via TensorFlow Transform.
TODO: extend to support multiple node types being in the same table.
- sharded_read_config: gigl.common.beam.sharded_read.BigQueryShardedReadConfig | None = None[source]#