gigl.common.data.dataloaders#
Attributes#
Classes#
- SerializedTFRecordInfo: Stores information pertaining to how a single entity (node, edge, positive label, or negative label) and, in the heterogeneous case, a single node/edge type is serialized on disk.
- TFDatasetOptions: Options for tuning a tf.data.Dataset.
- TFRecordDataLoader
Module Contents#
- class gigl.common.data.dataloaders.SerializedTFRecordInfo[source]#
Stores information pertaining to how a single entity (node, edge, positive label, or negative label) and, in the heterogeneous case, a single node/edge type is serialized on disk. Instances of this class are used as input to the TFRecordDataLoader.load_as_torch_tensors() function for loading torch tensors.
- property is_node_entity: bool[source]#
Returns whether this serialized entity contains node or edge information, determined by checking the type of entity_key.
- Return type:
bool
- tfrecord_uri_prefix: gigl.common.Uri[source]#
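A minimal, hypothetical sketch of what "checking the type of entity_key" means in practice. The concrete key shapes below (a node-type string vs. an (src, relation, dst) edge triple) and the standalone helper are illustrative assumptions, not the library's implementation.

```python
from typing import Tuple, Union

NodeKey = str                   # assumed: a node entity is keyed by a single node type
EdgeKey = Tuple[str, str, str]  # assumed: an edge entity is keyed by (src, relation, dst)

def is_node_entity(entity_key: Union[NodeKey, EdgeKey]) -> bool:
    # Mirrors the documented behavior of SerializedTFRecordInfo.is_node_entity:
    # the decision is made purely from the type of entity_key.
    return isinstance(entity_key, str)

assert is_node_entity("user")
assert not is_node_entity(("user", "follows", "user"))
```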
- class gigl.common.data.dataloaders.TFDatasetOptions[source]#
Options for tuning a tf.data.Dataset.
Choosing whether or not to use interleave is not straightforward. We've found that interleave is faster for large numbers (>100) of small (<20M) files, but this is highly variable; benchmark your own workload to find the best settings for your use case.
Deterministic processing is much slower (roughly 2x) on larger datasets (>10M entities) but has very little impact on smaller ones.
- Parameters:
batch_size (int) – How large each batch should be while processing the data.
file_buffer_size (int) – The size of the buffer to use when reading files.
deterministic (bool) – Whether to use deterministic processing; if False, the order of elements may be non-deterministic.
use_interleave (bool) – Whether to use tf.data.Dataset.interleave to read files in parallel; if False, num_parallel_file_reads is used instead.
num_parallel_file_reads (int) – The number of files to read in parallel if use_interleave is False.
ram_budget_multiplier (float) – The multiplier of the total system memory to set as the tf.data RAM budget.
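A short sketch of tuning these options for a workload with many small files, following the guidance above. The specific values are illustrative assumptions (and rely on TFDatasetOptions accepting keyword arguments with defaults for omitted fields), not recommended settings.

```python
from gigl.common.data.dataloaders import TFDatasetOptions

# Illustrative tuning: many small files tend to benefit from interleave,
# and non-deterministic ordering avoids the slowdown seen on large datasets.
options = TFDatasetOptions(
    batch_size=10_000,          # assumed value; tune for your workload
    deterministic=False,        # allow non-deterministic element order for speed
    use_interleave=True,        # read files in parallel via tf.data.Dataset.interleave
    ram_budget_multiplier=0.5,  # assumed fraction of system memory for the tf.data RAM budget
)
```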
- class gigl.common.data.dataloaders.TFRecordDataLoader(rank, world_size)[source]#
- Parameters:
rank (int)
world_size (int)
- load_as_torch_tensors(serialized_tf_record_info, tf_dataset_options=TFDatasetOptions())[source]#
Loads torch tensors from a set of TFRecord files.
- Parameters:
serialized_tf_record_info (SerializedTFRecordInfo) – Information for how TFRecord files are serialized on disk.
tf_dataset_options (TFDatasetOptions) – The options to use when building the dataset.
- Returns:
The (id_tensor, feature_tensor) for the loaded entities.
- Return type:
Tuple[torch.Tensor, Optional[torch.Tensor]]
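A hedged end-to-end sketch of using the loader. It assumes a SerializedTFRecordInfo has already been produced elsewhere (e.g. by an upstream serialization step) and that the illustrative TFDatasetOptions values suit the workload; the wrapper function and its name are not part of this API.

```python
from typing import Optional, Tuple

import torch

from gigl.common.data.dataloaders import (
    SerializedTFRecordInfo,
    TFDatasetOptions,
    TFRecordDataLoader,
)

def load_entity_tensors(
    serialized_info: SerializedTFRecordInfo,  # assumed to come from an upstream GiGL step
    rank: int,
    world_size: int,
) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
    """Load (id_tensor, feature_tensor) for one entity on this rank."""
    loader = TFRecordDataLoader(rank=rank, world_size=world_size)
    ids, features = loader.load_as_torch_tensors(
        serialized_tf_record_info=serialized_info,
        tf_dataset_options=TFDatasetOptions(deterministic=False),  # assumed tuning
    )
    return ids, features
```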