-
BigQuery
-
Overview
- Fully managed, serverless data warehousing
- Big data exploration and processing
- Not ideal for operational database
- Analyze using familiar SQL queries
- Hierarchy is project > dataset > table
- IAM roles applied to project/dataset, not tables
-
IAM
- BigQuery Admin can view all jobs
- BigQuery User and Job User roles can only view their own jobs
- BigQuery Data Viewer role can read data; running queries also requires job-creation permission (e.g. Job User)
-
Authorized Views
- Access controls cannot be directly assigned to tables or views
- Authorized views are used to enable 3rd party dataset access
- Allows users to share query results with other users/groups
- Allows users to restrict access to underlying tables
- Allows users to restrict access to columns (fields)
- Must be created in a separate dataset
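The access pattern above can be sketched as a toy Python model (illustration only, not the BigQuery API): consumers query the view, which exposes only approved rows and columns, and never touch the underlying table directly. All names here (`salaries`, `eng_headcount_view`) are hypothetical.

```python
# Hypothetical "private" table that only the view owner can read.
salaries = [
    {"name": "Ada", "dept": "eng", "salary": 180000},
    {"name": "Grace", "dept": "eng", "salary": 175000},
    {"name": "Alan", "dept": "research", "salary": 160000},
]

def eng_headcount_view():
    """Acts like an authorized view: hides the salary column
    and restricts rows to a single department."""
    return [{"name": r["name"], "dept": r["dept"]}
            for r in salaries if r["dept"] == "eng"]

# Consumers see names and departments only; salaries stay private.
print(eng_headcount_view())
```

In BigQuery the same idea is expressed as a SQL view saved in a separate dataset, with the view (not the end users) granted read access to the source dataset.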
-
Partitioning
- Queries against a large table (or all entries in a column) increase costs, since billing is based on bytes scanned
- Partitioning divides a large table into smaller logical segments called partitions
- Improves query performance and reduces cost
- Can partition data at ingest time, when data arrives at BQ table
- Table can be partitioned by timestamp/date in a certain column
- Table sharding is an alternative to partitioning
- Completely separate tables divided by date
- Partitioning is recommended over sharding for performance
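The cost benefit above comes from partition pruning, which can be sketched with a toy in-memory model (not the BigQuery engine): rows are grouped by a date column at ingest, so a query filtered on that column scans only the matching partition instead of the whole table.

```python
from collections import defaultdict

rows = [
    {"day": "2024-01-01", "amount": 10},
    {"day": "2024-01-01", "amount": 20},
    {"day": "2024-01-02", "amount": 30},
]

# "Partition" the table by the date column at ingest time.
partitions = defaultdict(list)
for row in rows:
    partitions[row["day"]].append(row)

def query_total(day):
    """Scans a single partition -- fewer bytes read, lower cost."""
    return sum(r["amount"] for r in partitions[day])

print(query_total("2024-01-01"))  # scans 2 rows, not all 3
```

Sharded tables achieve a similar split with separate per-date tables, but BigQuery then carries per-table overhead for each shard, which is why partitioning is the recommended approach.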
-
Expiration
- Set expiration date to automatically remove data older than x days
- When using partitioned tables, expiration is applied to individual partitions
- bq mk --time_partitioning_type=DAY --time_partitioning_expiration=259200 [DATASET].[TABLE]
- Expiration flag is specified in seconds (259200 = 3 days)
- Tables not edited for 90 days auto-convert to long term storage pricing
- Same rate as Google Cloud Storage Nearline
- Each partition qualifies separately for long term storage pricing
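Since the `--time_partitioning_expiration` flag takes seconds, a quick sanity check of the 259200 value used above:

```python
# Convert a partition-expiration policy in days to the seconds
# value expected by --time_partitioning_expiration.
days = 3
seconds = days * 24 * 60 * 60
print(seconds)  # 259200
```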
-
External Data
- BigQuery can run queries on external Cloud Storage, Bigtable, Google Drive data
- Enables users to clean and load data from external source
- Eliminates the need to load changing data into BigQuery
- May not be recommended for large quantities of data
- BigQuery storage scales effectively without limit, so consider loading large datasets directly into BigQuery
-
Exporting Data
- Data can be exported in CSV, JSON, or Avro format (Avro works well with Dataproc)
- Up to 1GB of data per file; use a wildcard URI to split exports >1GB across multiple files
- Can only export to Cloud Storage
-
Operations
- View BQ jobs per person + details
- bq ls -j -a myproject or BigQuery > Job history
- Job history persists for 6 months
- Contact support to delete sooner
- Export and manage lifecycle in Cloud Storage
- Configure expiration settings to set time limit on retention
-
Dataproc
- Hadoop ecosystem is a suite of popular big data products
- Dataproc is managed Hadoop/Spark service on GCP
- Spark is a general-purpose data processing engine in the Hadoop ecosystem (includes the MLlib machine learning library)
- Supports existing Hadoop/Spark workflows/jobs
- Dataproc manages infrastructure to enable users to focus on Hadoop/Spark workflows
- Enables on-premises Hadoop cluster to be migrated to the cloud
- Typical use case is data processing and analytics
- Use Dataproc when tied to Hadoop ecosystem or for more configuration control
-
Dataflow
- Built on Apache Beam
- Transforms data from one format or structure into another
- Dataflow is suitable where data needs to be transformed before storing
- Can process both streaming and batch data in the same pipeline
- Streaming data is typically a continuous asynchronous stream
- Usually small bits of data from many sources, e.g. sensor data
- Batch typically represents large amounts of stored data
- Transferred in bulk from few sources, once in a while
- Can consolidate multiple streaming and batch data sources
- Use Dataflow over Dataproc for a serverless managed service
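The "same pipeline for streaming and batch" idea above can be sketched in plain Python (no Apache Beam dependency; `pipeline`, `batch_source`, and `stream_source` are hypothetical names): the same transform steps run unchanged over a bounded batch source and a generator standing in for a continuous stream.

```python
def pipeline(source):
    """Parse, filter, and reformat records from any iterable source."""
    for record in source:
        value = int(record)
        if value >= 0:                 # drop bad/negative readings
            yield {"reading": value}

batch_source = ["5", "-1", "7"]        # bulk, stored data

def stream_source():                   # continuously arriving data
    for record in ["3", "9"]:
        yield record

print(list(pipeline(batch_source)))
print(list(pipeline(stream_source())))
```

In Beam the equivalent pipeline is a graph of PTransforms that Dataflow runs against bounded or unbounded PCollections without code changes.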
-
Pub/Sub
- Asynchronous messaging - many to many
- Decouples senders and receivers to provide great flexibility
- Source publishes message, and other services subscribe to published messages
- Ingest streaming data from anywhere without worrying about capacity
- Typically paired with Dataflow for processing after ingest
- Global, infinite capacity data ingestion
- Use case is data ingestion
- Similar to Apache Kafka
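The many-to-many decoupling above can be sketched as a toy in-memory model (illustration only, not the Cloud Pub/Sub API; the `Topic` class and subscription names are hypothetical): publishers push to a topic, and each subscription receives its own copy of every message, so senders never need to know who is listening.

```python
from collections import deque

class Topic:
    """Minimal fan-out topic: one queue per subscription."""
    def __init__(self):
        self.subscriptions = {}

    def subscribe(self, name):
        self.subscriptions[name] = deque()
        return self.subscriptions[name]

    def publish(self, message):
        # Fan out: every subscription gets the message independently.
        for queue in self.subscriptions.values():
            queue.append(message)

topic = Topic()
dataflow_sub = topic.subscribe("dataflow")  # e.g. processing pipeline
archive_sub = topic.subscribe("archive")    # e.g. raw-data archiver

topic.publish({"sensor": "s1", "temp": 21})
print(dataflow_sub.popleft())
print(archive_sub.popleft())
```

The real service adds durable storage, push/pull delivery, and acknowledgements, but the decoupling pattern is the same.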