
Summary:

Google rolled out a slew of new cloud services at I/O, including one called Dataflow that’s meant to put standard MapReduce to shame. It’s advertised as a much simpler way to build data pipelines that can handle both batch processing and streaming data.

Photo by Janko Roettgers/Gigaom

The cloud computing news coming out of Google I/O might not set the larger world afire, but it might light a fire under market leader Amazon Web Services. Google used its annual developer conference to unveil a slew of new cloud services Wednesday, including one called Dataflow that makes it easy to write data-processing pipelines that incorporate both batch and stream-processing capabilities.

Based on Google’s FlumeJava data-pipeline tool and its MillWheel stream-processing system, Dataflow is the company’s answer to Amazon’s Elastic MapReduce and Kinesis, all in one package. Although users can still run their own Hadoop clusters on Google Compute Engine, the company’s infrastructure-as-a-service cloud, Google Cloud Platform marketing head Brian Goldfarb described Dataflow’s underlying technologies as having been created, essentially, to overcome the complexity and latency limitations inherent in MapReduce (both the Google version and Hadoop MapReduce).

(A collection of open source tools that cover these same capabilities would be Apache Storm or Apache Spark Streaming for stream processing, and Cascading or Apache Crunch, which is based on FlumeJava, for writing data pipelines.)

“[MapReduce] was good for simple jobs, but when you needed to run pipelines it wasn’t so easy,” Goldfarb said. He added, “Internally, we don’t use it anymore because we don’t think it’s the right solution for the overwhelming number of situations.”

A screenshot of Dataflow from the Google I/O keynote.


He said Dataflow is designed to handle very large datasets and complex workflows, and to be relatively simple. Batch and streaming jobs both use the same code, and Dataflow automatically optimizes the pipelines and manages the infrastructure. Dataflow itself is language-agnostic, he added, although the first SDK will be for Java — presumably because Hadoop is written in Java, so its users are already used to programming in that language.
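To make the "same code for batch and streaming" idea concrete, here is a minimal sketch in plain Python. It is not the Dataflow SDK (which was announced for Java first); the function names, the event tuples and the fixed-size windowing are all assumptions chosen for illustration. The point is only that one transform can consume either a finite list or a generator standing in for a live feed.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Tuple[str, int]  # (key, value); a hypothetical event shape


def count_by_key(events: Iterable[Event], window_size: int) -> Iterator[Dict[str, int]]:
    """Sum values per key over fixed-size windows of consecutive events.

    The function only iterates, so it does not care whether `events`
    is a stored batch or a source that keeps producing items.
    """
    window: Counter = Counter()
    seen = 0
    for key, value in events:
        window[key] += value
        seen += 1
        if seen == window_size:
            yield dict(window)
            window = Counter()
            seen = 0
    if seen:  # emit the final, partially filled window
        yield dict(window)


def live_feed() -> Iterator[Event]:
    """Hypothetical stand-in for a streaming source."""
    for event in [("goal", 1), ("foul", 1), ("goal", 1), ("foul", 1)]:
        yield event


stored_events: List[Event] = [("goal", 1), ("foul", 1), ("goal", 1), ("foul", 1)]

batch_result = list(count_by_key(stored_events, window_size=2))
stream_result = list(count_by_key(live_feed(), window_size=2))
```

Both calls go through the identical transform; only the source differs. A real system would also have to deal with event time, late data and distribution across machines, none of which this sketch attempts.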

Goldfarb cited real-time anomaly detection as a prime use case for Dataflow, which is the same type of use case highlighted in the MillWheel paper from 2013. A live demonstration at Google I/O involved analyzing streaming World Cup data against historical data in order to spot anomalies. The system could be set to automatically take actions when something is detected, although Goldfarb noted a user could also immediately begin investigating events in Google BigQuery with just a few lines of SQL code.
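As a rough illustration of comparing streaming values against historical data, the sketch below flags incoming values that sit far from a historical baseline. The z-score rule, the threshold of 3 and the sample numbers are assumptions made for this example; the article does not say which detection method the demo used.

```python
from statistics import mean, stdev
from typing import Iterable, Iterator, List, Tuple


def detect_anomalies(
    history: List[float],
    stream: Iterable[float],
    threshold: float = 3.0,
) -> Iterator[Tuple[int, float]]:
    """Yield (position, value) for stream values far from the historical mean.

    "Far" means more than `threshold` sample standard deviations away,
    which is one simple choice among many possible detection rules.
    """
    baseline = mean(history)
    spread = stdev(history)
    for index, value in enumerate(stream):
        if spread > 0 and abs(value - baseline) / spread > threshold:
            yield index, value


# Hypothetical numbers, e.g. events per minute during earlier matches.
history = [100.0, 102.0, 98.0, 101.0, 99.0]
incoming = [100.0, 103.0, 250.0, 97.0]

anomalies = list(detect_anomalies(history, incoming))
```

Here only the spike to 250.0 is flagged. Because the detector is a generator, each anomaly is reported as soon as the offending value arrives, which is where an automated action could be triggered.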

“I think this will become the centerpoint of a lot of the [data] work we’re going to do,” he added.

World Cup data sent to BigQuery from Dataflow.


Google’s other cloud computing announcements were more about making sure applications launched in Compute Engine keep running smoothly. There’s Cloud Debugger to identify problems in production applications without affecting their performance; Cloud Trace to identify performance bottlenecks and trace them back to their cause; and Cloud Monitoring — which incorporates some of the intellectual property Google acquired from Stackdriver — to provide dashboards, monitoring and alerts.

The latter feature is also tuned to understand the intricacies of more than a dozen open source technologies, including MySQL, MongoDB, Elasticsearch and Apache Tomcat.

A screenshot of Cloud Monitoring from the Google I/O keynote.


Google’s final new cloud feature, Cloud Save, targets Android developers specifically. It’s a synchronization service between users’ devices and Google Cloud Datastore (similar in theory to what the Couchbase Mobile database provides) that Google claims involves minimal backend coding. The classic applications for a service like this are syncing work done offline, or even progress made in a mobile game, once a device is back online, and ensuring that data stays consistent across devices or when someone needs to re-download an app on a new device.
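The offline-then-sync pattern described above can be sketched in a few lines. This is not the Cloud Save API; the record shape, the `sync` function and the last-write-wins rule are all assumptions used to show the general idea of replaying queued offline changes against a remote store.

```python
from typing import Dict, List, Tuple

Record = Tuple[int, str]               # (timestamp, value), a hypothetical shape
PendingWrite = Tuple[str, int, str]    # (key, timestamp, value)


def sync(remote: Dict[str, Record], pending: List[PendingWrite]) -> Dict[str, Record]:
    """Merge writes queued while offline into a copy of the remote store.

    Conflicts are resolved with last-write-wins: a queued write replaces
    the remote record only if its timestamp is newer.
    """
    merged = dict(remote)
    for key, timestamp, value in pending:
        current = merged.get(key)
        if current is None or timestamp > current[0]:
            merged[key] = (timestamp, value)
    return merged


# Hypothetical game progress: the device was offline while the player
# advanced a level, but its queued score is older than the remote one.
remote_store = {"level": (10, "3"), "score": (12, "900")}
offline_writes = [("level", 15, "4"), ("score", 11, "850")]

synced = sync(remote_store, offline_writes)
```

After the merge, the newer offline level replaces the remote one, while the stale offline score is discarded. Last-write-wins is the simplest policy to show; a production service may well use something more careful.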

Google’s approach to cloud computing clearly seems to be positioning itself as the provider most determined to make developers’ lives easy. The things it enables with its growing set of features and services aren’t necessarily impossible to do on other cloud platforms, such as the still much-larger AWS, but they would often require stitching together a handful of services and writing some clever code. Google, on the other hand, is trying to automate as much of the process as possible, exposing some of the technologies it has built in-house in order to make this happen.

Urs Holzle at Structure 2014. (c) Jakub Mosur


As Urs Hölzle, Google’s SVP of Technical Infrastructure and a Google Fellow, made clear at our Structure conference last week, the company believes its strategy, deep pockets and engineering prowess will make it a force to be reckoned with in the years to come. We might get a good sense of how much its biggest competition shares that sentiment in November at the annual Amazon Web Services re:Invent conference. AWS has used it to announce several products over the past couple of years, but it’s also used to having the spotlight largely to itself.

Now that Google’s cloud keeps on making waves, we’ll see if AWS cranks up its pace of innovation even more.


  1. Thanks Derrick and Gigaom for the fantastic coverage! It’s great to see Google coming of age and progressing its cloud offering. Competition is a wonderful thing in the free markets.

    It is noteworthy that Dataflow is language agnostic, considering that new programming languages suited to “big data” processing have emerged in recent years, especially from open source projects such as Clojure and Julia, as well as closed-source (so far) languages from major corporations, such as Apple’s new Swift.
