(cache) Big Data の調査：Google の DataFlow は、MapReduce の正当な継承者になり得るのか？

Big Data の調査：Google の DataFlow は、MapReduce の正当な継承者になり得るのか？

Posted in Big Data, Google, Hadoop, MapReduce, On Monday by Agile Cat on July 28, 2014

Data Cloud/Big Data: Google Introduces DataFlow as Successor to MapReduce
http://wp.me/pwo1E-7HE
By Dick Weisinger – July 25, 2014
http://formtek.com/blog/data-cloudbig-data-google-introduces-dataflow-as-successor-to-mapreduce/

Do you feel left behind when it comes to technologies like Hadoop and MapReduce? The great thing about the rapid speed that technology is changing and obsolescing is that if you miss one trend it’s not long before it’s been superseded by something else. That lets you leapfrog directly into the newer technology without having wasted time and resources on the older technology. Although you’ve got to jump in sometime!

Hadoop や MapReduce といったテクノロジーの話になると、時代に取り残されていると感じるだろうか？そして、それらのテクノロジーにおける素晴らしいスピードは、それ自身を変化させ、また、旧式化させていく。したがって、何らかのトレンドを見逃したとしても、それほど時間を置くことなく、それらに取って代わるものを見出すことができる。つまり、古いテクノロジーに時間と資源を浪費することなく、新しいテクノロジーへ向けて、ダイレクトにジャンプすることが可能なのだ。どんなタイミングでジャンプするのかという、課題は残されるのだけどね！

Google announced in June that they’ve long ago dropped MapReduce technologies like Hadoop. And in fact they’re even going to open up their ‘better way’ of analyzing Big Data sets to the public. It’s part of the Google Cloud Platform. And the components of the new Google technology called DataFlow have cool names like Flume and MillWheel.

Google が 6月に発表したのは、ずっと以前に MapReduce（Hadoop の原型）テクノロジーを廃止していたことである。実際のところ、Big Data の分析を開かれたものにするために、Google としての Better Way に取り組もうとしているのだ。それは、Google Cloud Platform の一部も構成する。この、Google における新しいテクノロジー･コンポーネントは、Flume や MillWheel のようにクールな、DataFlow という名前を与えられている。

The limitation of MapReduce strategies are that they are run as batch jobs. To use MapReduce and standard Hadoop, all the data needs to already exist and to have been collected before the job begins.

MapReduce ストラテジーにおける制約は、バッチ･ジョブとして実行される点にある。MapReduce や標準的な Hadoop を使用するには、そのジョブの開始する前に、存在すべき全データが揃っていなくてはならない。

Greg DeMichillie, Director of Product Management, wrote that ”a decade ago, Google invented MapReduce to process massive data sets using distributed computing. Since then, more devices and information require more capable analytics pipelines—though they are difficult to create and maintain. Cloud Dataflow makes it easy for you to get actionable insights from your data while lowering operational costs without the hassles of deploying, maintaining or scaling infrastructure. You can use Cloud Dataflow for use cases like ETL, batch data processing and streaming analytics, and it will automatically optimize, deploy and manage the code and resources required.”

Director of Product Management である Greg DeMichillie は、「 Google は 10年前に発明した MapReduce は、分散コンピューティングを用いて、大規模なデータセットを処理するためのものである。それ以来、より高機能な分析パイプラインが、数多くのデバイスと情報のために必要とされてきたが、それらを開発／維持していくのは困難なことであった。Cloud Dataflow を用いれば、それらのデータから、実用的な洞察を容易に得られるようになる。その一方で、インフラストラクチャのディプロイ／メンテナンス／スケーリングに煩わされることもなく、運用コストを削減できる。この Cloud Dataflow は、ETL／バッチ･データ処理／ストリーミング分析のようなユースケースに対して、持ちることが可能になっている。そして、必要とされるコードとリソースを、自動的に最適化し、展開し、管理していく」と述べている。

Brian Goldfarb, Google Cloud Platform head of marketing, said that with Big Data that “the program models are different. The technologies are different. It requires developers to learn a lot and manage a lot to make it happen. It [Google DataFlow] is a fully managed service that lets you create data pipelines for ingesting, transforming and analyzing arbitrary amounts of data in both batch or streaming mode, using the same programming model.”

Google Cloud Platform の Head of Marketing である Brian Goldfarb は、Big Data との対比について、「プログラム·モデルが異なり、また、テクノロジーも異なる。それを実現するためには、デベロッパーが必要とするのは、より多くのことを学び、より多くのことを管理することである。Google DataFlow は、バッチとストリーミングのモードにおいて、同じプログラミング･モデルを用いて、大量のデータを洞察／変換／分析する、データ･パイプラインを作成するための完全なマネージド･サービスである」と発言している。

Urs Hölzle, senior vice president of technical infrastructure Google, said that ”Cloud Dataflow is the result of over a decade of experience in analytics. It will run faster and scale better than pretty much any other system out there.”

Google の Senior VP of Technical Infrastructure である Urs Hölzle は、「 Cloud Dataflow は、分析における、私たちの 10年以上にもおよぶ経験から生まれたものである。それは、他のシステムと比べて、高速で動作し、スケーリングにも優れている」と、述べている。

ーーーーー

Todd Hoff さんの、「Google Instant では、リアルタイム検索のために MapReduce を排除！」というポストによると、Google が MapReduce を止めたのは 2010年ということになります。それから、すでに、4年が経っているのですね。 Hoff さんは、「 Google の 3つの世代を振り返る – Batch, Warehouse, Instant 」という素晴らしい記事も書いています。どちらも、読み応え十分の記事ですが、よろしければど〜ぞ！