gigl.experimental.knowledge_graph_embedding.lib.data.edge_dataset#
Attributes#
Classes#
| EdgeDatasetFormat | Enumeration of supported edge dataset output formats. |
| PerSplitFilteredEdgeBigqueryMetadata | Configuration parameters to filter BigQuery tables by split (train/val/test). |
| PerSplitFilteredEdgeDatasetBuilder | Handles creation of edge dataset resources (BigQuery tables and GCS exports). |
| PerSplitFilteredEdgeDatasetConfig | Configuration parameters to build filtered datasets by split (train/val/test). |
| PerSplitIterableDatasetBigqueryStrategy | Strategy for creating BigQuery-based iterable datasets with filtered datasets for each split. |
| PerSplitIterableDatasetFactory | Factory for creating per-split edge datasets using appropriate strategies. |
| PerSplitIterableDatasetGcsStrategy | Strategy for creating GCS-based edge datasets (JSONL or Parquet). |
| PerSplitIterableDatasetStrategy | Protocol for creating different types of iterable datasets with filtered datasets for each split. |
Functions#
| build_edge_datasets | Build edge datasets for training, validation, and testing. |
Module Contents#
- class gigl.experimental.knowledge_graph_embedding.lib.data.edge_dataset.EdgeDatasetFormat[source]#
Bases: str, enum.Enum
Enumeration of supported edge dataset output formats.
This enum defines the different formats in which edge datasets can be stored and accessed. Each format has different performance characteristics and use cases:
- GCS_JSONL: Stores data as JSONL (JSON Lines) files in Google Cloud Storage. Good for debugging and human-readable data inspection.
- GCS_PARQUET: Stores data as Parquet files in Google Cloud Storage. Optimized for analytical workloads with efficient compression and columnar storage.
- BIGQUERY: Keeps data in BigQuery tables for direct querying. Best for large-scale datasets that benefit from BigQuery’s distributed processing.
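For orientation, a minimal sketch of choosing a format; only member access is shown, since member values are not documented here:

```python
from gigl.experimental.knowledge_graph_embedding.lib.data.edge_dataset import (
    EdgeDatasetFormat,
)

# Parquet on GCS is a reasonable default for analytical/training reads;
# JSONL is handier when rows need to be inspected by hand.
fmt = EdgeDatasetFormat.GCS_PARQUET

# The enum subclasses str, so members behave as plain strings and
# serialize cleanly into configs and logs.
assert isinstance(fmt, str)
```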
- class gigl.experimental.knowledge_graph_embedding.lib.data.edge_dataset.PerSplitFilteredEdgeBigqueryMetadata[source]#
Configuration parameters to filter BigQuery tables by split (train/val/test).
- clause_per_split()[source]#
- Return type:
list[tuple[gigl.src.common.types.dataset_split.DatasetSplit, str]]
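A hedged sketch of consuming the per-split clauses; `metadata` is a hypothetical, already-populated instance:

```python
from gigl.experimental.knowledge_graph_embedding.lib.data.edge_dataset import (
    PerSplitFilteredEdgeBigqueryMetadata,
)

metadata: PerSplitFilteredEdgeBigqueryMetadata = ...  # hypothetical instance

# Each pair is (DatasetSplit, SQL clause string), per the return type above,
# e.g. suitable for composing a per-split WHERE filter.
for split, clause in metadata.clause_per_split():
    print(f"{split}: WHERE {clause}")
```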
- class gigl.experimental.knowledge_graph_embedding.lib.data.edge_dataset.PerSplitFilteredEdgeDatasetBuilder(config)[source]#
Handles creation of edge dataset resources (BigQuery tables and GCS exports).
- Parameters:
config (PerSplitFilteredEdgeDatasetConfig)
- class gigl.experimental.knowledge_graph_embedding.lib.data.edge_dataset.PerSplitFilteredEdgeDatasetConfig[source]#
Configuration parameters to build filtered datasets by split (train/val/test).
- distributed_context: gigl.distributed.dist_context.DistributedContext[source]#
- enumerated_edge_metadata: list[gigl.src.data_preprocessor.lib.enumerate.utils.EnumeratorEdgeTypeMetadata][source]#
- split_config: PerSplitFilteredEdgeBigqueryMetadata[source]#
- split_dataset_format: EdgeDatasetFormat[source]#
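A minimal construction sketch, assuming the config is a dataclass-style container whose constructor mirrors the attributes above; every right-hand-side binding is a hypothetical stand-in:

```python
from gigl.experimental.knowledge_graph_embedding.lib.data.edge_dataset import (
    EdgeDatasetFormat,
    PerSplitFilteredEdgeDatasetConfig,
)

config = PerSplitFilteredEdgeDatasetConfig(
    distributed_context=distributed_context,   # gigl DistributedContext (hypothetical binding)
    enumerated_edge_metadata=edge_metadata,    # list[EnumeratorEdgeTypeMetadata]
    split_config=split_metadata,               # PerSplitFilteredEdgeBigqueryMetadata
    split_dataset_format=EdgeDatasetFormat.GCS_PARQUET,
)
```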
- class gigl.experimental.knowledge_graph_embedding.lib.data.edge_dataset.PerSplitIterableDatasetBigqueryStrategy[source]#
Strategy for creating BigQuery-based iterable datasets with filtered datasets for each split.
- class gigl.experimental.knowledge_graph_embedding.lib.data.edge_dataset.PerSplitIterableDatasetFactory(config)[source]#
Factory for creating per-split edge datasets using appropriate strategies.
- Parameters:
config (PerSplitFilteredEdgeDatasetConfig)
- create_datasets()[source]#
Create and return the edge datasets for each data split.
- Return type:
dict[gigl.src.common.types.dataset_split.DatasetSplit, torch.utils.data.IterableDataset]
- strategy_map: dict[EdgeDatasetFormat, PerSplitIterableDatasetStrategy][source]#
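A usage sketch; the dispatch-on-format detail is inferred from the `strategy_map` attribute rather than documented behavior:

```python
from gigl.experimental.knowledge_graph_embedding.lib.data.edge_dataset import (
    PerSplitIterableDatasetFactory,
)

# `config` as in the PerSplitFilteredEdgeDatasetConfig sketch above.
factory = PerSplitIterableDatasetFactory(config=config)

# Presumably selects a strategy from strategy_map by config.split_dataset_format,
# then returns one torch IterableDataset per DatasetSplit (the documented return type).
datasets = factory.create_datasets()
```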
- class gigl.experimental.knowledge_graph_embedding.lib.data.edge_dataset.PerSplitIterableDatasetGcsStrategy(format_type)[source]#
Strategy for creating GCS-based edge datasets (JSONL or Parquet).
- Parameters:
format_type (EdgeDatasetFormat)
- class gigl.experimental.knowledge_graph_embedding.lib.data.edge_dataset.PerSplitIterableDatasetStrategy[source]#
Bases: Protocol
Protocol for creating different types of iterable datasets with filtered datasets for each split.
- gigl.experimental.knowledge_graph_embedding.lib.data.edge_dataset.build_edge_datasets(distributed_context, enumerated_edge_metadata, applied_task_identifier, output_bq_dataset, graph_metadata, split_columns=None, train_split_clause='rand_split >= 0 AND rand_split < 0.8', val_split_clause='rand_split >= 0.8 AND rand_split < 0.9', test_split_clause='rand_split >= 0.9 AND rand_split <= 1', format=EdgeDatasetFormat.GCS_PARQUET)[source]#
Build edge datasets for training, validation, and testing. This function reads edge data from BigQuery, filters it according to the provided split clauses, and writes the filtered data to either BigQuery or GCS in the specified format. A usage sketch follows the parameter list below.
This function is designed for distributed environments (e.g., at the start of training), where multiple processes may run in parallel: it ensures that the resources are created exactly once and that all processes wait for one another before proceeding. Coordination relies on PyTorch’s distributed package, and the distributed process group is initialized and destroyed as needed.
- Parameters:
distributed_context (gigl.distributed.dist_context.DistributedContext) – The distributed context for the current process.
enumerated_edge_metadata (list[gigl.src.data_preprocessor.lib.enumerate.utils.EnumeratorEdgeTypeMetadata]) – Metadata for the edges to be processed.
applied_task_identifier (gigl.src.common.types.AppliedTaskIdentifier) – Identifier for the applied task.
output_bq_dataset (str) – BigQuery dataset to write the output to.
graph_metadata (gigl.src.common.types.pb_wrappers.graph_metadata.GraphMetadataPbWrapper) – Metadata for the graph.
split_columns (Optional[list[str]]) – List of columns to use for splitting the data.
train_split_clause (str) – SQL clause for training data split.
val_split_clause (str) – SQL clause for validation data split.
test_split_clause (str) – SQL clause for testing data split.
format (EdgeDatasetFormat) – Format of the output datasets (GCS or BigQuery).
- Return type:
dict[gigl.src.common.types.dataset_split.DatasetSplit, torch.utils.data.IterableDataset]
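A minimal end-to-end sketch under stated assumptions: the distributed context, edge metadata, task identifier, and graph metadata come from upstream pipeline steps (all bindings hypothetical), and `DatasetSplit.TRAIN` is assumed to be the training member of the split enum:

```python
from gigl.experimental.knowledge_graph_embedding.lib.data.edge_dataset import (
    EdgeDatasetFormat,
    build_edge_datasets,
)
from gigl.src.common.types.dataset_split import DatasetSplit

datasets = build_edge_datasets(
    distributed_context=distributed_context,    # DistributedContext for this process
    enumerated_edge_metadata=edge_metadata,      # from the preprocessing/enumeration step
    applied_task_identifier=task_id,             # AppliedTaskIdentifier for this run
    output_bq_dataset="my_project.edge_splits",  # hypothetical BigQuery dataset
    graph_metadata=graph_metadata,               # GraphMetadataPbWrapper
    format=EdgeDatasetFormat.GCS_PARQUET,        # the default, shown explicitly
)

# One torch IterableDataset per split; the member name TRAIN is assumed.
train_dataset = datasets[DatasetSplit.TRAIN]
```

Leaving the split clauses at their defaults yields an 80/10/10 split over a `rand_split` column distributed in [0, 1].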