Characteristics
- Suitable for big data users who wish to manage cluster size
- Fast, easy, managed way to run Hadoop, Spark, Hive, and Pig on Google Cloud
- Apache Hadoop is an open source framework for Big Data
- Apache Hadoop is based on the MapReduce programming model
- The Map function operates on input data in parallel to produce intermediate results, and the Reduce function builds a final result set from all of those intermediate results (a word-count sketch in plain Python follows this list)
- Hadoop is commonly used to refer to Apache Hadoop and related projects such as Apache Spark, Apache Pig, and Apache Hive
- When using Dataproc, Spark and Spark SQL can be used for data mining
- MLlib, Apache Spark's machine learning library, can be used to discover patterns through machine learning (a PySpark sketch covering Spark SQL and MLlib follows this list)
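
As a minimal illustration of the MapReduce model described above (plain Python, with made-up input text): the Map step emits intermediate key/value pairs, a shuffle step groups them by key, and the Reduce step aggregates each group into the final result.

```python
from collections import defaultdict

# Hypothetical input: each "document" is a line of text.
documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: emit an intermediate (key, value) pair for every word.
intermediate = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group intermediate values by key.
grouped = defaultdict(list)
for word, count in intermediate:
    grouped[word].append(count)

# Reduce: build the final result set from all intermediate results.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```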
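
And a sketch of data mining with Spark SQL plus pattern discovery with MLlib, as they might run on a Dataproc cluster; the `purchases` table and its columns are hypothetical, and on Dataproc the Spark session is typically preconfigured.

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("dataproc-mining-sketch").getOrCreate()

# Spark SQL for data mining: aggregate a (hypothetical) registered table.
per_customer = spark.sql(
    "SELECT customer_id, SUM(amount) AS total_spend, COUNT(*) AS num_orders "
    "FROM purchases GROUP BY customer_id"
)

# MLlib for pattern discovery: cluster customers by spending behavior.
features = VectorAssembler(
    inputCols=["total_spend", "num_orders"], outputCol="features"
).transform(per_customer)
model = KMeans(k=3, seed=42).fit(features)
model.transform(features).select("customer_id", "prediction").show()
```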
Performance
- Dataproc can build a Hadoop cluster in 90 seconds or less on Compute Engine virtual machines
- Dataproc users have control over the number and type of virtual machines in the cluster
- Dataproc clusters can be scaled up or down while running, based on need (a client-library sketch showing creation and resizing follows this list)
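
A minimal sketch of creating and then resizing a cluster with the google-cloud-dataproc Python client library; the project, region, cluster name, and machine types are made-up values, and this assumes the library is installed and credentials are configured.

```python
from google.cloud import dataproc_v1  # pip install google-cloud-dataproc

project_id, region = "my-project", "us-central1"  # hypothetical values
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Create a cluster, choosing the number and type of VMs.
cluster = {
    "project_id": project_id,
    "cluster_name": "example-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}
client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
).result()

# Scale the running cluster up to five workers.
client.update_cluster(
    request={
        "project_id": project_id,
        "region": region,
        "cluster_name": "example-cluster",
        "cluster": {"config": {"worker_config": {"num_instances": 5}}},
        "update_mask": {"paths": ["config.worker_config.num_instances"]},
    }
).result()
```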
Operations
- Hadoop clusters can be monitored using Google Cloud's operations suite capabilities (Cloud Monitoring and Cloud Logging)
Billing
- Running Hadoop jobs in Dataproc means users pay only for the hardware resources used during the life of the cluster
- Dataproc is billed in one-second clock-time increments, subject to a one-minute minimum (a short arithmetic sketch follows this list)
- A Dataproc cluster must be deleted to stop billing
- Dataproc clusters can use Spot Compute Engine instances for batch processing
- Spot VMs give Dataproc users a significant discount on the cost of the instances
- Because Spot VMs can be reclaimed at any time, it must be possible to restart jobs cleanly in order to use them (a cluster-config sketch follows this list)
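
As a small arithmetic sketch of the billing rule above; rounding partial seconds up is an assumption on my part.

```python
import math

def billed_seconds(runtime_seconds: float) -> int:
    # One-second increments, subject to a one-minute minimum;
    # rounding partial seconds up is an assumption.
    return max(60, math.ceil(runtime_seconds))

print(billed_seconds(42.5))  # 60 -> under the one-minute minimum
print(billed_seconds(95.2))  # 96 -> rounded up to the next second
```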
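
And a sketch of a cluster configuration that puts batch capacity on Spot VM secondary workers, using the same google-cloud-dataproc client shape as the Performance sketch; the names and sizes are made up, and the `preemptibility` field reflects my understanding of the InstanceGroupConfig API.

```python
# Passed to client.create_cluster(...) as in the Performance sketch.
cluster = {
    "project_id": "my-project",  # hypothetical
    "cluster_name": "spot-batch-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        # Secondary workers marked as Spot VMs: significantly cheaper, but
        # they can be reclaimed at any time, so jobs must restart cleanly.
        "secondary_worker_config": {
            "num_instances": 4,
            "preemptibility": "SPOT",
        },
    },
}
```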