gigl.src.data_preprocessor.lib.transform.utils#
Attributes#
Classes#
| Class | Description |
|---|---|
| AnalyzeAndBuildTFTransformFn | A transform object used to modify one or more PCollections. |
| GenerateAndVisualizeStats | A transform object used to modify one or more PCollections. |
| IngestRawFeatures | A transform object used to modify one or more PCollections. |
| InstanceDictToTFExample | Uses a feature spec to process a raw instance dict (read from some tabular data) as a TFExample. |
| ReadExistingTFTransformFn | A transform object used to modify one or more PCollections. |
| WriteTFSchema | A transform object used to modify one or more PCollections. |
Functions#
| Function | Description |
|---|---|
| get_load_data_and_transform_pipeline_component | Generate a Beam pipeline to conduct transformation, given a source feature table in BQ and an output path in GCS. |
Module Contents#
- class gigl.src.data_preprocessor.lib.transform.utils.AnalyzeAndBuildTFTransformFn(tensor_adapter_config, preprocessing_fn)[source]#
Bases:
apache_beam.PTransform
A transform object used to modify one or more PCollections.
Subclasses must define an expand() method that will be used when the transform is applied to some arguments. Typical usage pattern will be:
input | CustomTransform(…)
The expand() method of the CustomTransform object passed in will be called with input as an argument.
- Parameters:
tensor_adapter_config (tfx_bsl.tfxio.tensor_adapter.TensorAdapterConfig)
preprocessing_fn (Callable[[gigl.src.data_preprocessor.lib.types.TFTensorDict], gigl.src.data_preprocessor.lib.types.TFTensorDict])
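A minimal sketch of how this transform might be applied, following the `input | CustomTransform(...)` pattern described above. The helper `build_transform_fn`, the `raw_records` PCollection, the `tensor_adapter_config` value, and the z-scored feature name are illustrative assumptions; only the constructor parameters come from the signature documented here.

```python
import apache_beam as beam
import tensorflow_transform as tft

from gigl.src.data_preprocessor.lib.transform.utils import AnalyzeAndBuildTFTransformFn


def scale_features(inputs):
    # Illustrative TFT preprocessing_fn: z-scores a single assumed feature column.
    return {"feature_scaled": tft.scale_to_z_score(inputs["feature"])}


def build_transform_fn(raw_records, tensor_adapter_config):
    # `raw_records` and `tensor_adapter_config` are assumed to be produced by the
    # ingestion step; they are not defined in this module's documentation.
    return raw_records | "AnalyzeAndBuildTFTransform" >> AnalyzeAndBuildTFTransformFn(
        tensor_adapter_config=tensor_adapter_config,
        preprocessing_fn=scale_features,
    )
```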
- class gigl.src.data_preprocessor.lib.transform.utils.GenerateAndVisualizeStats(facets_report_uri, stats_output_uri)[source]#
Bases:
apache_beam.PTransform
A transform object used to modify one or more PCollections.
Subclasses must define an expand() method that will be used when the transform is applied to some arguments. Typical usage pattern will be:
input | CustomTransform(…)
The expand() method of the CustomTransform object passed in will be called with input as an argument.
- Parameters:
facets_report_uri (gigl.common.GcsUri)
stats_output_uri (gigl.common.GcsUri)
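A hedged sketch of applying this transform. The `ingested_records` PCollection, the bucket paths, and the assumption that a GcsUri can be built directly from a gs:// string are all illustrative; the constructor parameters follow the signature above.

```python
from gigl.common import GcsUri
from gigl.src.data_preprocessor.lib.transform.utils import GenerateAndVisualizeStats

# Hypothetical GCS destinations; constructing GcsUri from a gs:// string is an assumption.
facets_report_uri = GcsUri("gs://my-bucket/preprocessor/facets_report.html")
stats_output_uri = GcsUri("gs://my-bucket/preprocessor/stats")

# `ingested_records` is an assumed upstream PCollection of ingested examples.
_ = ingested_records | "GenerateAndVisualizeStats" >> GenerateAndVisualizeStats(
    facets_report_uri=facets_report_uri,
    stats_output_uri=stats_output_uri,
)
```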
- class gigl.src.data_preprocessor.lib.transform.utils.IngestRawFeatures(data_reference, feature_spec, schema, beam_record_tfxio)[source]#
Bases:
apache_beam.PTransform
A transform object used to modify one or more PCollections.
Subclasses must define an expand() method that will be used when the transform is applied to some arguments. Typical usage pattern will be:
input | CustomTransform(…)
The expand() method of the CustomTransform object passed in will be called with input as an argument.
- Parameters:
data_reference (gigl.src.data_preprocessor.lib.ingest.reference.DataReference)
feature_spec (gigl.src.data_preprocessor.lib.types.FeatureSpecDict)
schema (tensorflow_metadata.proto.v0.schema_pb2.Schema)
beam_record_tfxio (tfx_bsl.tfxio.record_based_tfxio.RecordBasedTFXIO)
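A sketch of wiring IngestRawFeatures into a pipeline. The `pipeline` object and all four argument values are assumed to be prepared by the caller (for example, from the data reference and an inferred schema); only the parameter names are taken from the signature above.

```python
from gigl.src.data_preprocessor.lib.transform.utils import IngestRawFeatures

raw_records = pipeline | "IngestRawFeatures" >> IngestRawFeatures(
    data_reference=data_reference,        # DataReference for the source table/files (assumed)
    feature_spec=feature_spec,            # FeatureSpecDict describing the raw features (assumed)
    schema=schema,                        # tensorflow_metadata Schema proto (assumed)
    beam_record_tfxio=beam_record_tfxio,  # RecordBasedTFXIO used to decode records (assumed)
)
```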
- class gigl.src.data_preprocessor.lib.transform.utils.InstanceDictToTFExample(feature_spec, schema)[source]#
Bases:
apache_beam.DoFn
Uses a feature spec to process a raw instance dict (read from some tabular data) as a TFExample. These instance dict inputs allow us to read tabular input data from BQ, GCS, or anything else. As long as we have a way of yielding instance dicts and parsing them with a feature spec, we should be able to transform this data into TFRecords during ingestion, which allows for more efficient operations in TFT. See https://www.tensorflow.org/tfx/transform/get_started#the_tfxio_format.
- Parameters:
feature_spec (gigl.src.data_preprocessor.lib.types.FeatureSpecDict)
schema (tensorflow_metadata.proto.v0.schema_pb2.Schema)
- process(element)[source]#
Method to use for processing elements.
This is invoked by DoFnRunner for each element of an input PCollection.
The following parameters can be used as default values on process arguments to indicate that a DoFn accepts the corresponding parameters. For example, a DoFn might accept the element and its timestamp with the following signature:
def process(element=DoFn.ElementParam, timestamp=DoFn.TimestampParam): ...
The full set of parameters is:
- DoFn.ElementParam: element to be processed, should not be mutated.
- DoFn.SideInputParam: a side input that may be used when processing.
- DoFn.TimestampParam: timestamp of the input element.
- DoFn.WindowParam: Window the input element belongs to.
- DoFn.TimerParam: a userstate.RuntimeTimer object defined by the spec of the parameter.
- DoFn.StateParam: a userstate.RuntimeState object defined by the spec of the parameter.
- DoFn.KeyParam: key associated with the element.
- DoFn.RestrictionParam: an iobase.RestrictionTracker will be provided here to allow treatment as a SplittableDoFn. The restriction tracker will be derived from the restriction provider in the parameter.
- DoFn.WatermarkEstimatorParam: a function that can be used to track the output watermark of SplittableDoFn implementations.
- Parameters:
element (gigl.src.data_preprocessor.lib.types.InstanceDict) – The element to be processed
*args – side inputs
**kwargs – other keyword arguments.
- Returns:
An Iterable of output elements or None.
- Return type:
Iterable[bytes]
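Since InstanceDictToTFExample is a beam.DoFn whose process() yields bytes, it would typically be applied with beam.ParDo. In this sketch, `instance_dicts`, `feature_spec`, and `schema` are assumed to come from an upstream ingestion step.

```python
import apache_beam as beam

from gigl.src.data_preprocessor.lib.transform.utils import InstanceDictToTFExample

# Each yielded element is a serialized TFExample (Iterable[bytes], per the docs above).
serialized_examples = instance_dicts | "InstanceDictToTFExample" >> beam.ParDo(
    InstanceDictToTFExample(feature_spec=feature_spec, schema=schema)
)
```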
- class gigl.src.data_preprocessor.lib.transform.utils.ReadExistingTFTransformFn(tf_transform_directory)[source]#
Bases:
apache_beam.PTransform
A transform object used to modify one or more PCollections.
Subclasses must define an expand() method that will be used when the transform is applied to some arguments. Typical usage pattern will be:
input | CustomTransform(…)
The expand() method of the CustomTransform object passed in will be called with input as an argument.
- Parameters:
tf_transform_directory (gigl.common.Uri)
- class gigl.src.data_preprocessor.lib.transform.utils.WriteTFSchema(schema, target_uri, schema_descriptor)[source]#
Bases:
apache_beam.PTransform
A transform object used to modify one or more PCollections.
Subclasses must define an expand() method that will be used when the transform is applied to some arguments. Typical usage pattern will be:
input | CustomTransform(…)
The expand() method of the CustomTransform object passed in will be called with input as an argument.
- Parameters:
schema (tensorflow_metadata.proto.v0.schema_pb2.Schema)
target_uri (gigl.common.GcsUri)
schema_descriptor (str)
- gigl.src.data_preprocessor.lib.transform.utils.get_load_data_and_transform_pipeline_component(applied_task_identifier, data_reference, preprocessing_spec, transformed_features_info, num_shards, custom_worker_image_uri=None)[source]#
Generate a Beam pipeline to conduct transformation, given a source feature table in BQ and an output path in GCS.
- Parameters:
applied_task_identifier (gigl.src.common.types.AppliedTaskIdentifier)
data_reference (gigl.src.data_preprocessor.lib.ingest.reference.DataReference)
preprocessing_spec (Union[gigl.src.data_preprocessor.lib.types.NodeDataPreprocessingSpec, gigl.src.data_preprocessor.lib.types.EdgeDataPreprocessingSpec])
transformed_features_info (gigl.src.data_preprocessor.lib.transform.transformed_features_info.TransformedFeaturesInfo)
num_shards (int)
custom_worker_image_uri (Optional[str])
- Return type:
apache_beam.Pipeline
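A hedged end-to-end sketch of using this component. All argument values are placeholders that would come from the preprocessor's configuration; running the returned pipeline is assumed to follow the standard Beam pattern, since the documented return type is apache_beam.Pipeline.

```python
from gigl.src.data_preprocessor.lib.transform.utils import (
    get_load_data_and_transform_pipeline_component,
)

pipeline = get_load_data_and_transform_pipeline_component(
    applied_task_identifier=applied_task_identifier,      # AppliedTaskIdentifier (assumed)
    data_reference=data_reference,                        # source BQ table reference (assumed)
    preprocessing_spec=preprocessing_spec,                 # Node/EdgeDataPreprocessingSpec (assumed)
    transformed_features_info=transformed_features_info,   # transformed output locations (assumed)
    num_shards=10,                                          # illustrative shard count
    custom_worker_image_uri=None,
)

# Standard Beam execution of the returned pipeline.
result = pipeline.run()
result.wait_until_finish()
```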