gigl.common.beam.better_tfrecordio
Internal fork of WriteToTFRecord with an improved TFRecord sink. Specifically, we add the ability to cap the max bytes per shard, a feature supported by file-based sinks but not implemented for TensorFlow sinks. It also supports specifying deferred tft.tf_metadata.dataset_metadata.DatasetMetadata, so it can be used in pipelines where DatasetMetadata is derived at runtime.
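For example, here is a minimal usage sketch (not taken from the library's own docs) that writes generic protobuf messages; google.protobuf.wrappers_pb2.Int64Value and the output prefix are placeholders chosen only for illustration:

```python
import apache_beam as beam
from google.protobuf import wrappers_pb2

from gigl.common.beam.better_tfrecordio import BetterWriteToTFRecord

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | "CreateProtos" >> beam.Create(
            [wrappers_pb2.Int64Value(value=i) for i in range(3)]
        )
        # transformed_metadata defaults to None, so each element is treated
        # as a generic protobuf message and encoded via SerializeToString().
        | "WriteTFRecords" >> BetterWriteToTFRecord(
            file_path_prefix="/tmp/records/part",  # placeholder prefix
            max_bytes_per_shard=int(1e8),  # cap each shard at ~100 MB
        )
    )
```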
Classes

BetterWriteToTFRecord | Transform for writing to TFRecord sinks.
Module Contents
- class gigl.common.beam.better_tfrecordio.BetterWriteToTFRecord(file_path_prefix, transformed_metadata=None, file_name_suffix='.tfrecord', compression_type=CompressionTypes.AUTO, max_bytes_per_shard=int(2e8), num_shards=0)
Bases: apache_beam.transforms.PTransform
Transform for writing to TFRecord sinks.
Initializes the BetterWriteToTFRecord transform.
We improve on the default WriteToTFRecord implementation by simplifying the required parameters and adding the ability to cap the max bytes per shard. We also add support for easily serializing both generic protobuf messages and tf.train.Example messages, with the option to specify deferred (computed at runtime) tft.tf_metadata.dataset_metadata.DatasetMetadata.
- Parameters:
file_path_prefix (str) – The file path to write to. The files written will begin with this prefix, followed by a shard identifier, and end in a common extension, if given by file_name_suffix.
transformed_metadata (Optional[Union[dataset_metadata.DatasetMetadata, apache_beam.pvalue.AsSingleton]]) – Used for encoding tf.train.Example: pass a dataset_metadata.DatasetMetadata when reading an existing TFTransform fn, or an apache_beam.pvalue.AsSingleton wrapping a dataset_metadata.DatasetMetadata when the metadata is built for the first time at runtime. Defaults to None, in which case elements are assumed to be generic protobuf messages and are encoded using SerializeToString().
file_name_suffix (str, optional) – Suffix for the files written. Defaults to “.tfrecord”.
compression_type (str, optional) – Used to handle compressed output files. The typical value is CompressionTypes.AUTO, in which case the extension of file_path_prefix is used to detect the compression.
max_bytes_per_shard (int, optional) – The data is sharded into separate files to promote faster, distributed writes. This parameter controls the maximum size of those shards. Defaults to int(2e8), i.e. ~200 MB.
num_shards (int, optional) – The number of files (shards) used for output. If not set, the service decides the optimal number of shards based on max_bytes_per_shard. WARNING: Constraining the number of shards is likely to reduce pipeline performance; only set this if you know what you are doing.
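As an illustrative sketch of the transformed_metadata path (again, not from the library's docs), the following encodes dicts as tf.train.Example records using a schema known ahead of time; the feature name and output prefix are placeholder assumptions:

```python
import apache_beam as beam
import tensorflow as tf
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

from gigl.common.beam.better_tfrecordio import BetterWriteToTFRecord

# Schema known ahead of time; "node_id" is an illustrative feature name.
metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec(
        {"node_id": tf.io.FixedLenFeature([], tf.int64)}
    )
)

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | "CreateRows" >> beam.Create([{"node_id": 1}, {"node_id": 2}])
        # Passing transformed_metadata tells the sink to encode each dict
        # as a tf.train.Example conforming to the schema above.
        | "WriteExamples" >> BetterWriteToTFRecord(
            file_path_prefix="/tmp/examples/part",  # placeholder prefix
            transformed_metadata=metadata,
        )
    )
```

When the metadata is instead computed by the pipeline at runtime (e.g. produced by tf.Transform analyzers), wrap the deferred metadata in apache_beam.pvalue.AsSingleton and pass that instead, per the transformed_metadata description above.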