Characteristics
- Suitable for big data users who wish to manage cluster size
- Fast, easy, managed way to run Hadoop, Spark, Hive, and Pig on Google Cloud
- Apache Hadoop is an open source framework for Big Data
- Apache Hadoop is based on the MapReduce programming model
- The Map function operates on input data in parallel to produce intermediate results, and the Reduce function builds a final result set from all of those intermediate results (a word-count sketch in plain Python follows this list)
- Hadoop is commonly used to refer to Apache Hadoop and related projects such as Apache Spark, Apache Pig, and Apache Hive
- When using Dataproc, Spark and Spark SQL can be used for data mining
- MLlib, Apache Spark's machine learning library, can be used to discover patterns through machine learning (a PySpark sketch covering Spark SQL and MLlib follows this list)
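
As a minimal illustration of the MapReduce model described above (plain Python, with made-up input text): the Map step emits intermediate key/value pairs, a shuffle step groups them by key, and the Reduce step aggregates each group into the final result.

```python
from collections import defaultdict

# Hypothetical input: each "document" is a line of text.
documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: emit an intermediate (key, value) pair for every word.
intermediate = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group intermediate values by key.
grouped = defaultdict(list)
for word, count in intermediate:
    grouped[word].append(count)

# Reduce: build the final result set from all intermediate results.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```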
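
And a sketch of data mining with Spark SQL plus pattern discovery with MLlib, as they might run on a Dataproc cluster; the `purchases` table and its columns are hypothetical, and on Dataproc the Spark session is typically preconfigured.

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("dataproc-mining-sketch").getOrCreate()

# Spark SQL for data mining: aggregate a (hypothetical) registered table.
per_customer = spark.sql(
    "SELECT customer_id, SUM(amount) AS total_spend, COUNT(*) AS num_orders "
    "FROM purchases GROUP BY customer_id"
)

# MLlib for pattern discovery: cluster customers by spending behavior.
features = VectorAssembler(
    inputCols=["total_spend", "num_orders"], outputCol="features"
).transform(per_customer)
model = KMeans(k=3, seed=42).fit(features)
model.transform(features).select("customer_id", "prediction").show()
```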
Performance
- Dataproc can build a Hadoop cluster in 90 seconds or less on Compute Engine virtual machines
- Dataproc users have control over the number and type of virtual machines in the cluster
- Dataproc clusters can be scaled up or down while running, based on need (a client-library sketch showing creation and resizing follows this list)
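
A minimal sketch of creating and then resizing a cluster with the google-cloud-dataproc Python client library; the project, region, cluster name, and machine types are made-up values, and this assumes the library is installed and credentials are configured.

```python
from google.cloud import dataproc_v1  # pip install google-cloud-dataproc

project_id, region = "my-project", "us-central1"  # hypothetical values
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Create a cluster, choosing the number and type of VMs.
cluster = {
    "project_id": project_id,
    "cluster_name": "example-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}
client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
).result()

# Scale the running cluster up to five workers.
client.update_cluster(
    request={
        "project_id": project_id,
        "region": region,
        "cluster_name": "example-cluster",
        "cluster": {"config": {"worker_config": {"num_instances": 5}}},
        "update_mask": {"paths": ["config.worker_config.num_instances"]},
    }
).result()
```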
Operations
- Hadoop clusters can be monitored using Google Cloud's operations suite capabilities (Cloud Monitoring and Cloud Logging)
Billing
- Running Hadoop jobs in Dataproc means users pay only for the hardware resources used during the life of the cluster
- Dataproc is billed in one-second clock-time increments, subject to a one-minute minimum (a short arithmetic sketch follows this list)
- A Dataproc cluster must be deleted to stop billing
- Dataproc clusters can use Spot Compute Engine instances for batch processing
- Spot VMs give Dataproc users a significant discount on the cost of the instances
- Because Spot VMs can be reclaimed at any time, it must be possible to restart jobs cleanly in order to use them (a cluster-config sketch follows this list)
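
As a small arithmetic sketch of the billing rule above; rounding partial seconds up is an assumption on my part.

```python
import math

def billed_seconds(runtime_seconds: float) -> int:
    # One-second increments, subject to a one-minute minimum;
    # rounding partial seconds up is an assumption.
    return max(60, math.ceil(runtime_seconds))

print(billed_seconds(42.5))  # 60 -> under the one-minute minimum
print(billed_seconds(95.2))  # 96 -> rounded up to the next second
```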
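
And a sketch of a cluster configuration that puts batch capacity on Spot VM secondary workers, using the same google-cloud-dataproc client shape as the Performance sketch; the names and sizes are made up, and the `preemptibility` field reflects my understanding of the InstanceGroupConfig API.

```python
# Passed to client.create_cluster(...) as in the Performance sketch.
cluster = {
    "project_id": "my-project",  # hypothetical
    "cluster_name": "spot-batch-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        # Secondary workers marked as Spot VMs: significantly cheaper, but
        # they can be reclaimed at any time, so jobs must restart cleanly.
        "secondary_worker_config": {
            "num_instances": 4,
            "preemptibility": "SPOT",
        },
    },
}
```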