gigl.common.data.dataloaders#

Attributes#

logger

Classes#

SerializedTFRecordInfo

Stores information pertaining to how a single entity (node, edge, positive label, or negative label), and a single node/edge type in the heterogeneous case, is serialized on disk.

TFDatasetOptions

Options for tuning a tf.data.Dataset.

TFRecordDataLoader

Loads torch tensors from serialized TFRecord files.

Module Contents#

class gigl.common.data.dataloaders.SerializedTFRecordInfo[source]#

Stores information pertaining to how a single entity (node, edge, positive label, or negative label), and a single node/edge type in the heterogeneous case, is serialized on disk. Instances of this class are used as input to the TFRecordDataLoader.load_as_torch_tensors() function for loading torch tensors.

entity_key: str | Tuple[str, str][source]#
feature_dim: int[source]#
feature_keys: Sequence[str][source]#
feature_spec: gigl.src.data_preprocessor.lib.types.FeatureSpecDict[source]#
property is_node_entity: bool[source]#

Returns whether this serialized entity contains node or edge information, determined by checking the type of entity_key.

Return type:

bool

tfrecord_uri_pattern: str = '.*-of-.*\\.tfrecord(\\.gz)?$'[source]#
tfrecord_uri_prefix: gigl.common.Uri[source]#
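
For illustration, here is a minimal sketch of describing a set of serialized node TFRecords. It assumes SerializedTFRecordInfo accepts the fields listed above as keyword arguments, that FeatureSpecDict maps feature names to tf.io feature specs, and that a Uri can be built from a path string; the bucket path and feature names are hypothetical.

    import tensorflow as tf

    from gigl.common import Uri
    from gigl.common.data.dataloaders import SerializedTFRecordInfo

    # Hypothetical feature spec for node records: an id plus two float features.
    feature_spec = {
        "node_id": tf.io.FixedLenFeature([1], tf.int64),
        "embedding": tf.io.FixedLenFeature([16], tf.float32),
        "degree_bucket": tf.io.FixedLenFeature([16], tf.float32),
    }

    node_info = SerializedTFRecordInfo(
        entity_key="node_id",                        # single key -> node entity (is_node_entity is True)
        feature_keys=["embedding", "degree_bucket"],
        feature_dim=32,                              # assumed: total feature dimension across feature_keys
        feature_spec=feature_spec,
        tfrecord_uri_prefix=Uri("gs://my-bucket/preprocessed/nodes/"),  # assumed Uri(str) constructor
        # tfrecord_uri_pattern keeps its default '.*-of-.*\.tfrecord(\.gz)?$'
    )

For an edge entity, entity_key would instead be a Tuple[str, str] of key names, which is what is_node_entity inspects to distinguish the two cases.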
class gigl.common.data.dataloaders.TFDatasetOptions[source]#

Options for tuning a tf.data.Dataset.

Choosing whether or not to use interleave is not straightforward. We’ve found that interleave is faster for large numbers (>100) of small (<20M) files, though this is highly variable; you should run your own benchmarks to find the best settings for your use case.

Deterministic processing is much slower (roughly 100%) for larger (>10M entities) datasets, but has very little impact on smaller datasets.

Parameters:
  • batch_size (int) – How large each batch should be while processing the data.

  • file_buffer_size (int) – The size of the buffer to use when reading files.

  • deterministic (bool) – Whether to use deterministic processing; if False, the order of elements may be non-deterministic.

  • use_interleave (bool) – Whether to use tf.data.Dataset.interleave to read files in parallel; if False, num_parallel_file_reads is used instead.

  • num_parallel_file_reads (int) – The number of files to read in parallel if use_interleave is False.

  • ram_budget_multiplier (float) – The multiplier of the total system memory to set as the tf.data RAM budget.

batch_size: int = 10000[source]#
deterministic: bool = False[source]#
file_buffer_size: int = 104857600[source]#
num_parallel_file_reads: int = 64[source]#
ram_budget_multiplier: float = 0.5[source]#
use_interleave: bool = True[source]#
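
As a sketch of how the trade-offs above translate into configuration, the following assumes TFDatasetOptions accepts its documented fields as keyword arguments; the values are illustrative, not recommendations.

    from gigl.common.data.dataloaders import TFDatasetOptions

    # Few, large files: skip interleave and rely on plain parallel file reads.
    few_large_files = TFDatasetOptions(
        batch_size=50_000,
        use_interleave=False,
        num_parallel_file_reads=16,
    )

    # Many small files: interleave tends to win; deterministic=False trades
    # element ordering for speed on large datasets.
    many_small_files = TFDatasetOptions(
        use_interleave=True,
        deterministic=False,
        ram_budget_multiplier=0.5,  # cap the tf.data RAM budget at half of system memory
    )
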
class gigl.common.data.dataloaders.TFRecordDataLoader(rank, world_size)[source]#

Loads torch tensors from serialized TFRecord files.

Parameters:
  • rank (int)

  • world_size (int)

load_as_torch_tensors(serialized_tf_record_info, tf_dataset_options=TFDatasetOptions())[source]#

Loads torch tensors from a set of TFRecord files.

Parameters:
  • serialized_tf_record_info (SerializedTFRecordInfo) – Information for how TFRecord files are serialized on disk.

  • tf_dataset_options (TFDatasetOptions) – The options to use when building the dataset.

Returns:

The (id_tensor, feature_tensor) for the loaded entities.

Return type:

Tuple[torch.Tensor, Optional[torch.Tensor]]
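
Putting the pieces together, a minimal end-to-end sketch. It assumes rank and world_size follow the usual distributed convention (this worker’s index and the total worker count) and reuses the hypothetical node_info object from the SerializedTFRecordInfo sketch above.

    from gigl.common.data.dataloaders import TFDatasetOptions, TFRecordDataLoader

    # Single-process example: rank 0 of 1 worker reads every matching file.
    loader = TFRecordDataLoader(rank=0, world_size=1)

    id_tensor, feature_tensor = loader.load_as_torch_tensors(
        serialized_tf_record_info=node_info,   # hypothetical object from the sketch above
        tf_dataset_options=TFDatasetOptions(batch_size=10_000),
    )

    # feature_tensor is Optional per the return type and may be None.
    print(id_tensor.shape)
    if feature_tensor is not None:
        print(feature_tensor.shape)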

gigl.common.data.dataloaders.logger[source]#