gigl.src.data_preprocessor.lib.transform.utils#
Attributes#
Classes#
| Class | Description |
| --- | --- |
| AnalyzeAndBuildTFTransformFn | A transform object used to modify one or more PCollections. |
| GenerateAndVisualizeStats | A transform object used to modify one or more PCollections. |
| IngestRawFeatures | A transform object used to modify one or more PCollections. |
| InstanceDictToTFExample | Uses a feature spec to process a raw instance dict (read from some tabular data) as a TFExample. |
| ReadExistingTFTransformFn | A transform object used to modify one or more PCollections. |
| WriteTFSchema | A transform object used to modify one or more PCollections. |
Functions#
| Function | Description |
| --- | --- |
| get_load_data_and_transform_pipeline_component | Generate a Beam pipeline to conduct the transformation, given a source feature table in BQ and an output path in GCS. |
Module Contents#
- class gigl.src.data_preprocessor.lib.transform.utils.AnalyzeAndBuildTFTransformFn(tensor_adapter_config, preprocessing_fn)[source]#
- Bases: apache_beam.PTransform

  A transform object used to modify one or more PCollections.

  Subclasses must define an expand() method that will be used when the transform is applied to some arguments. Typical usage pattern will be:

  input | CustomTransform(…)

  The expand() method of the CustomTransform object passed in will be called with input as an argument.

- Parameters:
- tensor_adapter_config (tfx_bsl.tfxio.tensor_adapter.TensorAdapterConfig) 
- preprocessing_fn (Callable[[gigl.src.data_preprocessor.lib.types.TFTensorDict], gigl.src.data_preprocessor.lib.types.TFTensorDict]) 
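As a hedged illustration of the usage pattern above, the sketch below applies this transform to an already-ingested PCollection; the helper name and its arguments (raw_record_batches, tensor_adapter_config, preprocessing_fn) are hypothetical placeholders assumed to be built elsewhere from the dataset's feature spec and schema.

```python
import apache_beam as beam

from gigl.src.data_preprocessor.lib.transform.utils import AnalyzeAndBuildTFTransformFn


def build_tf_transform_fn(raw_record_batches, tensor_adapter_config, preprocessing_fn):
    # Hypothetical helper: `raw_record_batches` is assumed to be a PCollection produced
    # by an upstream ingestion step, and `tensor_adapter_config` / `preprocessing_fn`
    # are assumed to have been constructed from the dataset's feature spec and schema.
    return raw_record_batches | "AnalyzeAndBuildTFTransformFn" >> AnalyzeAndBuildTFTransformFn(
        tensor_adapter_config=tensor_adapter_config,
        preprocessing_fn=preprocessing_fn,
    )
```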
 
 
- class gigl.src.data_preprocessor.lib.transform.utils.GenerateAndVisualizeStats(facets_report_uri, stats_output_uri)[source]#
- Bases: apache_beam.PTransform

  A transform object used to modify one or more PCollections.

  Subclasses must define an expand() method that will be used when the transform is applied to some arguments. Typical usage pattern will be:

  input | CustomTransform(…)

  The expand() method of the CustomTransform object passed in will be called with input as an argument.

- Parameters:
- facets_report_uri (gigl.common.GcsUri) 
- stats_output_uri (gigl.common.GcsUri) 
 
 
- class gigl.src.data_preprocessor.lib.transform.utils.IngestRawFeatures(data_reference, feature_spec, schema, beam_record_tfxio)[source]#
- Bases: apache_beam.PTransform

  A transform object used to modify one or more PCollections.

  Subclasses must define an expand() method that will be used when the transform is applied to some arguments. Typical usage pattern will be:

  input | CustomTransform(…)

  The expand() method of the CustomTransform object passed in will be called with input as an argument.

- Parameters:
- data_reference (gigl.src.data_preprocessor.lib.ingest.reference.DataReference) 
- feature_spec (gigl.src.data_preprocessor.lib.types.FeatureSpecDict) 
- schema (tensorflow_metadata.proto.v0.schema_pb2.Schema) 
- beam_record_tfxio (tfx_bsl.tfxio.record_based_tfxio.RecordBasedTFXIO) 
 
 
- class gigl.src.data_preprocessor.lib.transform.utils.InstanceDictToTFExample(feature_spec, schema)[source]#
- Bases: apache_beam.DoFn

  Uses a feature spec to process a raw instance dict (read from some tabular data) as a TFExample. These instance dict inputs could allow us to read tabular input data from BQ, GCS, or anything else. As long as we have a way of yielding instance dicts and parsing them with a feature spec, we should be able to transform this data into TFRecords during ingestion, which allows for more efficient operations in TFT. See https://www.tensorflow.org/tfx/transform/get_started#the_tfxio_format.

- Parameters:
- feature_spec (gigl.src.data_preprocessor.lib.types.FeatureSpecDict) 
- schema (tensorflow_metadata.proto.v0.schema_pb2.Schema) 
 
 - process(element)[source]#
- Method to use for processing elements.

  This is invoked by DoFnRunner for each element of an input PCollection.

  The following parameters can be used as default values on process arguments to indicate that a DoFn accepts the corresponding parameters. For example, a DoFn might accept the element and its timestamp with the following signature:

  def process(element=DoFn.ElementParam, timestamp=DoFn.TimestampParam): ...

  The full set of parameters is:

  - DoFn.ElementParam: element to be processed, should not be mutated.
- DoFn.SideInputParam: a side input that may be used when processing.
- DoFn.TimestampParam: timestamp of the input element.
- DoFn.WindowParam: Window the input element belongs to.
- DoFn.TimerParam: a userstate.RuntimeTimer object defined by the spec of the parameter.
- DoFn.StateParam: a userstate.RuntimeState object defined by the spec of the parameter.
- DoFn.KeyParam: key associated with the element.
- DoFn.RestrictionParam: an iobase.RestrictionTracker will be provided here to allow treatment as a Splittable DoFn. The restriction tracker will be derived from the restriction provider in the parameter.
- DoFn.WatermarkEstimatorParam: a function that can be used to track output watermark of Splittable DoFn implementations.
 - Parameters:
- element (gigl.src.data_preprocessor.lib.types.InstanceDict) – The element to be processed 
- *args – side inputs 
- **kwargs – other keyword arguments. 
 
- Returns:
- An Iterable of output elements or None. 
- Return type:
- Iterable[bytes] 
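As a hedged sketch of how a DoFn such as this one is typically applied, the snippet below wraps it in beam.ParDo to turn a PCollection of instance dicts into serialized tf.Example bytes; the helper name and the instance_dicts input are illustrative assumptions, not part of this module.

```python
import apache_beam as beam

from gigl.src.data_preprocessor.lib.transform.utils import InstanceDictToTFExample


def instance_dicts_to_serialized_tf_examples(instance_dicts, feature_spec, schema):
    # Hypothetical helper: `instance_dicts` is assumed to be a PCollection of instance
    # dicts read from tabular data (e.g. BQ rows). The DoFn yields serialized
    # tf.Example bytes, so the returned PCollection can be written out as TFRecords.
    return instance_dicts | "InstanceDictToTFExample" >> beam.ParDo(
        InstanceDictToTFExample(feature_spec=feature_spec, schema=schema)
    )
```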
 
 
- class gigl.src.data_preprocessor.lib.transform.utils.ReadExistingTFTransformFn(tf_transform_directory)[source]#
- Bases: apache_beam.PTransform

  A transform object used to modify one or more PCollections.

  Subclasses must define an expand() method that will be used when the transform is applied to some arguments. Typical usage pattern will be:

  input | CustomTransform(…)

  The expand() method of the CustomTransform object passed in will be called with input as an argument.

- Parameters:
- tf_transform_directory (gigl.common.Uri) 
 
- class gigl.src.data_preprocessor.lib.transform.utils.WriteTFSchema(schema, target_uri, schema_descriptor)[source]#
- Bases: apache_beam.PTransform

  A transform object used to modify one or more PCollections.

  Subclasses must define an expand() method that will be used when the transform is applied to some arguments. Typical usage pattern will be:

  input | CustomTransform(…)

  The expand() method of the CustomTransform object passed in will be called with input as an argument.

- Parameters:
- schema (tensorflow_metadata.proto.v0.schema_pb2.Schema) 
- target_uri (gigl.common.GcsUri) 
- schema_descriptor (str) 
 
 
- gigl.src.data_preprocessor.lib.transform.utils.get_load_data_and_transform_pipeline_component(applied_task_identifier, data_reference, preprocessing_spec, transformed_features_info, num_shards, custom_worker_image_uri=None)[source]#
- Generate a Beam pipeline to conduct the transformation, given a source feature table in BQ and an output path in GCS.

- Parameters:
- applied_task_identifier (gigl.src.common.types.AppliedTaskIdentifier) 
- data_reference (gigl.src.data_preprocessor.lib.ingest.reference.DataReference) 
- preprocessing_spec (Union[gigl.src.data_preprocessor.lib.types.NodeDataPreprocessingSpec, gigl.src.data_preprocessor.lib.types.EdgeDataPreprocessingSpec]) 
- transformed_features_info (gigl.src.data_preprocessor.lib.transform.transformed_features_info.TransformedFeaturesInfo) 
- num_shards (int) 
- custom_worker_image_uri (Optional[str]) 
 
- Return type:
- apache_beam.Pipeline 
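Because the function returns an apache_beam.Pipeline, a caller can run it directly. The driver below is a minimal sketch; its name and the way the argument values are assembled are assumptions made for illustration.

```python
from gigl.src.data_preprocessor.lib.transform.utils import (
    get_load_data_and_transform_pipeline_component,
)


def run_transform_pipeline(applied_task_identifier, data_reference, preprocessing_spec,
                           transformed_features_info, num_shards):
    # Hypothetical driver: the arguments are assumed to be constructed by the caller
    # (see the parameter types documented above).
    pipeline = get_load_data_and_transform_pipeline_component(
        applied_task_identifier=applied_task_identifier,
        data_reference=data_reference,
        preprocessing_spec=preprocessing_spec,
        transformed_features_info=transformed_features_info,
        num_shards=num_shards,
    )
    # The returned object is a standard apache_beam.Pipeline, so it can be run and
    # waited on like any other Beam pipeline.
    result = pipeline.run()
    result.wait_until_finish()
    return result
```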
 
