gigl.common.data.dataloaders#

Attributes#

logger

Classes#

SerializedTFRecordInfo

Stores information pertaining to how a single entity (node, edge, positive label, or negative label), and a single node/edge type in the heterogeneous case, is serialized on disk.

TFDatasetOptions

Options for tuning a tf.data.Dataset.

TFRecordDataLoader

Loads torch tensors from serialized TFRecord files.

Module Contents#

class gigl.common.data.dataloaders.SerializedTFRecordInfo[source]#

Stores information pertaining to how a single entity (node, edge, positive label, or negative label), and a single node/edge type in the heterogeneous case, is serialized on disk. Instances of this class are used as input to the TFRecordDataLoader.load_as_torch_tensors() function for loading torch tensors.

entity_key: str | Tuple[str, str][source]#
feature_dim: int[source]#
feature_keys: Sequence[str][source]#
feature_spec: gigl.src.data_preprocessor.lib.types.FeatureSpecDict[source]#
property is_node_entity: bool[source]#

Returns whether this serialized entity contains node or edge information, determined by checking the type of entity_key.

Return type:

bool

tfrecord_uri_pattern: str = '.*-of-.*\\.tfrecord(\\.gz)?$'[source]#
tfrecord_uri_prefix: gigl.common.Uri[source]#
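
For illustration, here is a minimal sketch of describing a set of serialized node TFRecords. It assumes SerializedTFRecordInfo accepts the fields listed above as keyword arguments, that FeatureSpecDict maps feature names to tf.io feature specs, and that a Uri can be built from a path string; the bucket path and feature names are hypothetical.

    import tensorflow as tf

    from gigl.common import Uri
    from gigl.common.data.dataloaders import SerializedTFRecordInfo

    # Hypothetical feature spec for node records: an id plus two float features.
    feature_spec = {
        "node_id": tf.io.FixedLenFeature([1], tf.int64),
        "embedding": tf.io.FixedLenFeature([16], tf.float32),
        "degree_bucket": tf.io.FixedLenFeature([16], tf.float32),
    }

    node_info = SerializedTFRecordInfo(
        entity_key="node_id",                        # single key -> node entity (is_node_entity is True)
        feature_keys=["embedding", "degree_bucket"],
        feature_dim=32,                              # assumed: total feature dimension across feature_keys
        feature_spec=feature_spec,
        tfrecord_uri_prefix=Uri("gs://my-bucket/preprocessed/nodes/"),  # assumed Uri(str) constructor
        # tfrecord_uri_pattern keeps its default '.*-of-.*\.tfrecord(\.gz)?$'
    )

For an edge entity, entity_key would instead be a Tuple[str, str] of key names, which is what is_node_entity inspects to distinguish the two cases.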
class gigl.common.data.dataloaders.TFDatasetOptions[source]#

Options for tuning a tf.data.Dataset.

Choosing whether or not to use interleave is not straightforward. We’ve found that interleave is faster for large numbers (>100) of small (<20M) files, though this is highly variable; you should run your own benchmarks to find the best settings for your use case.

Deterministic processing is much slower (roughly 100%) for larger (>10M entities) datasets, but has very little impact on smaller datasets.

Parameters:
  • batch_size (int) – How large each batch should be while processing the data.

  • file_buffer_size (int) – The size of the buffer to use when reading files.

  • deterministic (bool) – Whether to use deterministic processing; if False, the order of elements may be non-deterministic.

  • use_interleave (bool) – Whether to use tf.data.Dataset.interleave to read files in parallel; if False, num_parallel_file_reads is used instead.

  • num_parallel_file_reads (int) – The number of files to read in parallel if use_interleave is False.

  • ram_budget_multiplier (float) – The multiplier of the total system memory to set as the tf.data RAM budget.

batch_size: int = 10000[source]#
deterministic: bool = False[source]#
file_buffer_size: int = 104857600[source]#
num_parallel_file_reads: int = 64[source]#
ram_budget_multiplier: float = 0.5[source]#
use_interleave: bool = True[source]#
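
As a sketch of how the trade-offs above translate into configuration, the following assumes TFDatasetOptions accepts its documented fields as keyword arguments; the values are illustrative, not recommendations.

    from gigl.common.data.dataloaders import TFDatasetOptions

    # Few, large files: skip interleave and rely on plain parallel file reads.
    few_large_files = TFDatasetOptions(
        batch_size=50_000,
        use_interleave=False,
        num_parallel_file_reads=16,
    )

    # Many small files: interleave tends to win; deterministic=False trades
    # element ordering for speed on large datasets.
    many_small_files = TFDatasetOptions(
        use_interleave=True,
        deterministic=False,
        ram_budget_multiplier=0.5,  # cap the tf.data RAM budget at half of system memory
    )
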
class gigl.common.data.dataloaders.TFRecordDataLoader(rank, world_size)[source]#

Loads torch tensors from serialized TFRecord files.

Parameters:
  • rank (int)

  • world_size (int)

load_as_torch_tensors(serialized_tf_record_info, tf_dataset_options=TFDatasetOptions())[source]#

Loads torch tensors from a set of TFRecord files.

Parameters:
  • serialized_tf_record_info (SerializedTFRecordInfo) – Information for how TFRecord files are serialized on disk.

  • tf_dataset_options (TFDatasetOptions) – The options to use when building the dataset.

Returns:

The (id_tensor, feature_tensor) for the loaded entities.

Return type:

Tuple[torch.Tensor, Optional[torch.Tensor]]
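
Putting the pieces together, a minimal end-to-end sketch. It assumes rank and world_size follow the usual distributed convention (this worker’s index and the total worker count) and reuses the hypothetical node_info object from the SerializedTFRecordInfo sketch above.

    from gigl.common.data.dataloaders import TFDatasetOptions, TFRecordDataLoader

    # Single-process example: rank 0 of 1 worker reads every matching file.
    loader = TFRecordDataLoader(rank=0, world_size=1)

    id_tensor, feature_tensor = loader.load_as_torch_tensors(
        serialized_tf_record_info=node_info,   # hypothetical object from the sketch above
        tf_dataset_options=TFDatasetOptions(batch_size=10_000),
    )

    # feature_tensor is Optional per the return type and may be None.
    print(id_tensor.shape)
    if feature_tensor is not None:
        print(feature_tensor.shape)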

gigl.common.data.dataloaders.logger[source]#