-
BigQuery
-
Overview
- Fully managed, serverless data warehousing
- Big data exploration and processing
- Not ideal for operational database
- Analyze using familiar SQL queries
- Hierarchy is project > dataset > table
- IAM roles applied to project/dataset, not tables
-
IAM
- BigQuery Admin can view all jobs
- BigQuery User and Job User roles can only view their own jobs
- BigQuery Data Viewer role can read data; running queries also requires job-creation permission (e.g. Job User)
-
Authorized Views
- Access controls cannot be directly assigned to tables or views
- Authorized views are used to enable 3rd party dataset access
- Allows users to share query results with other users/groups
- Allows users to restrict access to underlying tables
- Allows users to restrict access to columns (fields)
- Must be created in a separate dataset
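The access pattern above can be sketched as a toy Python model (illustration only, not the BigQuery API): consumers query the view, which exposes only approved rows and columns, and never touch the underlying table directly. All names here (`salaries`, `eng_headcount_view`) are hypothetical.

```python
# Hypothetical "private" table that only the view owner can read.
salaries = [
    {"name": "Ada", "dept": "eng", "salary": 180000},
    {"name": "Grace", "dept": "eng", "salary": 175000},
    {"name": "Alan", "dept": "research", "salary": 160000},
]

def eng_headcount_view():
    """Acts like an authorized view: hides the salary column
    and restricts rows to a single department."""
    return [{"name": r["name"], "dept": r["dept"]}
            for r in salaries if r["dept"] == "eng"]

# Consumers see names and departments only; salaries stay private.
print(eng_headcount_view())
```

In BigQuery the same idea is expressed as a SQL view saved in a separate dataset, with the view (not the end users) granted read access to the source dataset.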
-
Partitioning
- Queries against a large table (or all entries in a column) increase costs, since billing is based on bytes scanned
- Partitioning divides a large table into smaller logical segments called partitions
- Improves query performance and reduces cost
- Can partition data at ingest time, when data arrives at BQ table
- Table can be partitioned by timestamp/date in a certain column
- Table sharding is an alternative to partitioning
- Completely separate tables divided by date
- Partitioning is recommended over sharding for performance
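The cost benefit above comes from partition pruning, which can be sketched with a toy in-memory model (not the BigQuery engine): rows are grouped by a date column at ingest, so a query filtered on that column scans only the matching partition instead of the whole table.

```python
from collections import defaultdict

rows = [
    {"day": "2024-01-01", "amount": 10},
    {"day": "2024-01-01", "amount": 20},
    {"day": "2024-01-02", "amount": 30},
]

# "Partition" the table by the date column at ingest time.
partitions = defaultdict(list)
for row in rows:
    partitions[row["day"]].append(row)

def query_total(day):
    """Scans a single partition -- fewer bytes read, lower cost."""
    return sum(r["amount"] for r in partitions[day])

print(query_total("2024-01-01"))  # scans 2 rows, not all 3
```

Sharded tables achieve a similar split with separate per-date tables, but BigQuery then carries per-table overhead for each shard, which is why partitioning is the recommended approach.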
-
Expiration
- Set expiration date to automatically remove data older than x days
- When using partitioned tables, expiration is applied to individual partitions
- bq mk --time_partitioning_type=DAY --time_partitioning_expiration=259200 [DATASET].[TABLE]
- Expiration flag is specified in seconds (259200 = 3 days)
- Tables not edited for 90 days auto-convert to long term storage pricing
- Same rate as Google Cloud Storage Nearline
- Each partition qualifies separately for long term storage pricing
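Since the `--time_partitioning_expiration` flag takes seconds, a quick sanity check of the 259200 value used above:

```python
# Convert a partition-expiration policy in days to the seconds
# value expected by --time_partitioning_expiration.
days = 3
seconds = days * 24 * 60 * 60
print(seconds)  # 259200
```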
-
External Data
- BigQuery can run queries on external Cloud Storage, Bigtable, Google Drive data
- Enables users to clean and load data from external source
- Eliminates the need to load changing data into BigQuery
- May not be recommended for large quantities of data
- BigQuery storage scales effectively without limit, so consider loading large datasets directly into BigQuery
-
Exporting Data
- Data can be exported in CSV, JSON, or Avro format (Avro works well with Dataproc)
- Up to 1GB of data per file; use a wildcard URI to split exports >1GB across multiple files
- Can only export to Cloud Storage
-
Operations
- View BQ jobs per person + details
- bq ls -j -a myproject or BigQuery > Job history
- Job history persists for 6 months
- Contact support to delete sooner
- Export and manage lifecycle in Cloud Storage
- Configure expiration settings to set time limit on retention
-
Dataproc
- Hadoop ecosystem is a suite of popular big data products
- Dataproc is managed Hadoop/Spark service on GCP
- Spark is a general-purpose data processing engine in the Hadoop ecosystem (includes the MLlib machine learning library)
- Supports existing Hadoop/Spark workflows/jobs
- Dataproc manages infrastructure to enable users to focus on Hadoop/Spark workflows
- Enables on-premises Hadoop cluster to be migrated to the cloud
- Typical use case is data processing and analytics
- Use Dataproc when tied to Hadoop ecosystem or for more configuration control
-
Dataflow
- Built on Apache Beam
- Transforms data from one format or structure into another
- Dataflow is suitable where data needs to be transformed before storing
- Can process both streaming and batch data in the same pipeline
- Streaming data is typically a continuous asynchronous stream
- Usually small bits of data from many sources, e.g. sensor data
- Batch typically represents large amounts of stored data
- Transferred in bulk from few sources, once in a while
- Can consolidate multiple streaming and batch data sources
- Use Dataflow over Dataproc for a serverless managed service
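The "same pipeline for streaming and batch" idea above can be sketched in plain Python (no Apache Beam dependency; `pipeline`, `batch_source`, and `stream_source` are hypothetical names): the same transform steps run unchanged over a bounded batch source and a generator standing in for a continuous stream.

```python
def pipeline(source):
    """Parse, filter, and reformat records from any iterable source."""
    for record in source:
        value = int(record)
        if value >= 0:                 # drop bad/negative readings
            yield {"reading": value}

batch_source = ["5", "-1", "7"]        # bulk, stored data

def stream_source():                   # continuously arriving data
    for record in ["3", "9"]:
        yield record

print(list(pipeline(batch_source)))
print(list(pipeline(stream_source())))
```

In Beam the equivalent pipeline is a graph of PTransforms that Dataflow runs against bounded or unbounded PCollections without code changes.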
-
Pub/Sub
- Asynchronous messaging - many to many
- Decouples senders and receivers to provide great flexibility
- Source publishes message, and other services subscribe to published messages
- Ingest streaming data from anywhere without worrying about capacity
- Typically paired with Dataflow for processing after ingest
- Global, infinite capacity data ingestion
- Use case is data ingestion
- Similar to Apache Kafka
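The many-to-many decoupling above can be sketched as a toy in-memory model (illustration only, not the Cloud Pub/Sub API; the `Topic` class and subscription names are hypothetical): publishers push to a topic, and each subscription receives its own copy of every message, so senders never need to know who is listening.

```python
from collections import deque

class Topic:
    """Minimal fan-out topic: one queue per subscription."""
    def __init__(self):
        self.subscriptions = {}

    def subscribe(self, name):
        self.subscriptions[name] = deque()
        return self.subscriptions[name]

    def publish(self, message):
        # Fan out: every subscription gets the message independently.
        for queue in self.subscriptions.values():
            queue.append(message)

topic = Topic()
dataflow_sub = topic.subscribe("dataflow")  # e.g. processing pipeline
archive_sub = topic.subscribe("archive")    # e.g. raw-data archiver

topic.publish({"sensor": "s1", "temp": 21})
print(dataflow_sub.popleft())
print(archive_sub.popleft())
```

The real service adds durable storage, push/pull delivery, and acknowledgements, but the decoupling pattern is the same.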