-
Dataflow
-
Is a managed service for executing a wide variety of data processing patterns
- Serverless, fully managed data processing
- Batch and stream processing with autoscale
- Open source programming using Apache Beam
-
Dataprep
-
Is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis, reporting, and machine learning
- Serverless, works at any scale
- Suggests ideal data transformations
- Focus on data analysis
- Integrated partner service operated by Trifacta
-
Dataproc
-
Fully managed cloud service for running Apache Spark and Apache Hadoop clusters
- Low cost (per-second billing, preemptible instances)
- Super fast to start, scale, and shut down
- Integrated with other Google Cloud services (BigQuery, Cloud Storage, Cloud Bigtable)
-
Dataproc or Dataflow
-
Both can be used for data processing
- If you have dependencies on specific tools or packages in the Apache Hadoop or Spark ecosystem, use Dataproc
- If you prefer a hands-on or DevOps approach to operations, use Dataproc
- If you prefer a hands-off or serverless approach, use Dataflow
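
The three rules above can be sketched as a small decision helper. This is only an illustration of the rule of thumb, not an official selection tool; the `Workload` class, its field names, and `recommend_service` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # Hypothetical flags that mirror the rules in these notes.
    needs_hadoop_spark_tooling: bool  # depends on specific Hadoop/Spark tools or packages
    prefers_hands_on_ops: bool        # team wants a hands-on, DevOps-style approach


def recommend_service(workload: Workload) -> str:
    """Return 'Dataproc' or 'Dataflow' following the rule of thumb above."""
    # Either a Hadoop/Spark dependency or a hands-on preference points to Dataproc.
    if workload.needs_hadoop_spark_tooling or workload.prefers_hands_on_ops:
        return "Dataproc"
    # Otherwise the hands-off, serverless option applies.
    return "Dataflow"


if __name__ == "__main__":
    print(recommend_service(Workload(needs_hadoop_spark_tooling=True, prefers_hands_on_ops=False)))
    print(recommend_service(Workload(needs_hadoop_spark_tooling=False, prefers_hands_on_ops=False)))
```

The sketch assumes a Hadoop/Spark dependency outweighs a preference for serverless operation. The notes do not state a priority between the rules, so treat that ordering as an assumption.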
-
Data Catalog
- Automatically catalogs metadata from Google Cloud sources such as BigQuery, Vertex AI, Pub/Sub, Spanner, and Bigtable
- Indexes table and fileset metadata from Cloud Storage
-
Three main functions
- Searching for data entries to which you have access
- Tagging data entries with metadata
- Providing column-level security for BigQuery tables