gigl.common.data.dataloaders
Attributes
Classes
| SerializedTFRecordInfo | Stores information pertaining to how a single entity (node, edge, positive label, negative label) and single node/edge type in the heterogeneous case is serialized on disk. |
| TFDatasetOptions | Options for tuning the loading of a tf.data.Dataset. Note that this dataclass is tied to TFRecord loading specifically for the load_as_torch_tensors function. |
Module Contents#
- class gigl.common.data.dataloaders.SerializedTFRecordInfo
- Stores information pertaining to how a single entity (node, edge, positive label, negative label) and single node/edge type in the heterogeneous case is serialized on disk. An instance of this class is used as input to the TFRecordDataLoader.load_as_torch_tensors() function for loading torch tensors.
 - property is_node_entity: bool
- Returns True if this serialized entity contains node information and False if it contains edge information, determined by checking the type of entity_key.
- Return type:
- bool 
 
 - tfrecord_uri_prefix: gigl.common.Uri
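The type check behind is_node_entity can be sketched as follows. This is a hypothetical stand-in, not the real SerializedTFRecordInfo. It assumes that a node entity is keyed by a single string and an edge entity by a pair of strings, which this page does not confirm. The class name, the example URIs, and the column names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class SerializedInfoSketch:
    """Hypothetical stand-in; NOT the real SerializedTFRecordInfo."""

    # The real field is a gigl.common.Uri; a plain string keeps this runnable.
    tfrecord_uri_prefix: str
    # Assumption: one column name for nodes, a (source, destination) pair
    # of column names for edges.
    entity_key: Union[str, Tuple[str, str]]

    @property
    def is_node_entity(self) -> bool:
        # Decide node vs. edge purely from the type of entity_key.
        return isinstance(self.entity_key, str)


node_info = SerializedInfoSketch("gs://bucket/nodes/", "node_id")
edge_info = SerializedInfoSketch("gs://bucket/edges/", ("src", "dst"))
```

Under these assumptions, node_info.is_node_entity is True and edge_info.is_node_entity is False.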
 
- class gigl.common.data.dataloaders.TFDatasetOptions
- Options for tuning the loading of a tf.data.Dataset. Note that this dataclass is tied to TFRecord loading specifically for the load_as_torch_tensors function.
- Choosing whether to use interleave is not straightforward. We’ve found that interleave is faster for large numbers (>100) of small (<20M) files. This is highly variable, though, so you should run your own benchmarks to find the best settings for your use case.
- Deterministic processing is much slower (about 100%) for larger (>10M entities) datasets, but has very little impact on smaller datasets.
- Parameters:
- batch_size (int) – How large each batch should be while processing the data. 
- file_buffer_size (int) – The size of the buffer to use when reading files. 
- deterministic (bool) – Whether to use deterministic processing. If False, the order of elements can be non-deterministic. 
- use_interleave (bool) – Whether to use tf.data.Dataset.interleave to read files in parallel. If False, num_parallel_file_reads is used instead. 
- num_parallel_file_reads (int) – The number of files to read in parallel if use_interleave is False. 
- ram_budget_multiplier (float) – The multiplier applied to the total system memory to set the tf.data RAM budget. 
- log_every_n_batch (int) – How often, in batches, to log information while looping through the dataset. 
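To make the interaction between these parameters concrete, here is a minimal sketch using a stand-in dataclass. DatasetOptionsSketch, ram_budget_bytes, and read_strategy are hypothetical names, and every default value is invented for illustration. The real TFDatasetOptions defaults are not listed on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetOptionsSketch:
    """Hypothetical mirror of the documented fields; defaults are made up."""

    batch_size: int = 10_000
    file_buffer_size: int = 100 * 1024 * 1024
    deterministic: bool = False
    use_interleave: bool = True
    num_parallel_file_reads: int = 64
    ram_budget_multiplier: float = 0.5
    log_every_n_batch: int = 1_000


def ram_budget_bytes(options: DatasetOptionsSketch, total_memory_bytes: int) -> int:
    # The budget is the configured multiple of total system memory.
    return int(options.ram_budget_multiplier * total_memory_bytes)


def read_strategy(options: DatasetOptionsSketch) -> str:
    # Per the parameter descriptions, num_parallel_file_reads only
    # applies when use_interleave is False.
    if options.use_interleave:
        return "interleave"
    return f"parallel reads x{options.num_parallel_file_reads}"


# Many small files: the guidance above suggests trying interleave first.
many_small_files = DatasetOptionsSketch(use_interleave=True)
# Fewer, larger files: fall back to a fixed number of parallel reads.
few_large_files = DatasetOptionsSketch(use_interleave=False, num_parallel_file_reads=8)
```

With a multiplier of 0.5 on a machine with 64 GiB of memory, ram_budget_bytes yields a 32 GiB budget. Which read strategy is actually faster still needs to be benchmarked on your own data.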
 
 
- class gigl.common.data.dataloaders.TFRecordDataLoader(rank, world_size)
- Parameters:
- rank (int) 
- world_size (int) 
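This page does not describe how rank and world_size are used. A common convention in distributed loading is for each process to read a disjoint slice of the input files. The following sketch illustrates that convention only and is not a description of TFRecordDataLoader's internals. The function files_for_rank and the file names are hypothetical.

```python
from typing import List, Sequence


def files_for_rank(files: Sequence[str], rank: int, world_size: int) -> List[str]:
    """Round-robin split: rank r takes files r, r + world_size, ..."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank must be in [0, {world_size}), got {rank}")
    # Sort first so every rank sees the same ordering and shards stay disjoint.
    return sorted(files)[rank::world_size]


all_files = [f"part-{i:05d}.tfrecord" for i in range(5)]
shards = [files_for_rank(all_files, r, world_size=2) for r in range(2)]
```

Here rank 0 would read files 0, 2, and 4, and rank 1 would read files 1 and 3, so every file is read exactly once across the two processes.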
 
 - load_as_torch_tensors(serialized_tf_record_info, tf_dataset_options=TFDatasetOptions())
- Loads torch tensors from a set of TFRecord files.
- Parameters:
- serialized_tf_record_info (SerializedTFRecordInfo) – Information about how the TFRecord files are serialized on disk. 
- tf_dataset_options (TFDatasetOptions) – The options to use when building the dataset. 
 
- Returns:
- The (id_tensor, feature_tensor, label_tensor) for the loaded entities. 
- Return type:
 
 
