20140908 spark sql & catalyst

Presentation material for Spark Meetup 2014/09/08.

Presentation Transcript

  • Introduction to Spark SQL & Catalyst. Takuya UESHIN. Spark Meetup 2014/09/08 (Mon)
  • Who am I? Takuya UESHIN (@ueshin, github.com/ueshin), Nautilus Technologies, Inc. A Spark contributor.
  • Agenda: What is Spark SQL? Catalyst in depth. SQL core in depth. Interesting issues. How to contribute.
  • What is Spark SQL?
  • What is Spark SQL? Spark SQL is one of the Spark components. It executes SQL on Spark, builds SchemaRDDs through a LINQ-like API, and optimizes the execution plan.
  • What is Spark SQL? Catalyst provides an execution planning framework for relational operations, including: a SQL parser & analyzer; logical operators & general expressions; a logical optimizer; a framework to transform operator trees.
  • What is Spark SQL? Executing a query takes several steps: Parse, Analyze, Logical Plan, Optimize, Physical Plan, Execute.
  • What is Spark SQL? Executing a query takes several steps: Parse, Analyze, Logical Plan, Optimize, Physical Plan, Execute. (Slide label: Catalyst)
  • What is Spark SQL? Executing a query takes several steps: Parse, Analyze, Logical Plan, Optimize, Physical Plan, Execute. (Slide label: SQL core)
  • Catalyst in depth
  • Catalyst in depth: Provides an execution planning framework for relational operations. Topics: Row & DataTypes; Trees & Rules; Logical Operators; Expressions; Optimizations.
  • Row & DataTypes. o.a.s.sql.catalyst.types.DataType: Long, Int, Short, Byte, Float, Double, Decimal; String, Binary, Boolean, Timestamp; Array, Map, Struct. o.a.s.sql.catalyst.expressions.Row: represents a single row; can contain complex types.
  • Trees & Rules. o.a.s.sql.catalyst.trees.TreeNode: provides tree transformations (foreach, map, flatMap, collect; transform, transformUp, transformDown). Used for operator trees and expression trees.
  • Trees & Rules. o.a.s.sql.catalyst.rules.Rule: represents a tree transformation rule. o.a.s.sql.catalyst.rules.RuleExecutor: a framework to transform trees based on rules.
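To make the TreeNode / Rule / RuleExecutor idea concrete, here is a minimal Python sketch. It is a toy model, not Catalyst's Scala API: `Node`, `execute`, and `remove_double_negation` are hypothetical names, and running rules until the tree stops changing is an assumption about how a rule executor could behave.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical immutable tree node: a label plus child nodes."""
    label: str
    children: Tuple["Node", ...] = ()

    def transform_up(self, rule: Callable[["Node"], "Node"]) -> "Node":
        # Rewrite the children first, then offer the rebuilt node to the rule.
        new_children = tuple(c.transform_up(rule) for c in self.children)
        return rule(Node(self.label, new_children))

    def transform_down(self, rule: Callable[["Node"], "Node"]) -> "Node":
        # Apply the rule to this node first, then recurse into the result's children.
        node = rule(self)
        return Node(node.label, tuple(c.transform_down(rule) for c in node.children))


def execute(plan: Node, rules, max_iterations: int = 100) -> Node:
    """Apply all rules repeatedly until the tree no longer changes."""
    for _ in range(max_iterations):
        new_plan = plan
        for rule in rules:
            new_plan = new_plan.transform_up(rule)
        if new_plan == plan:
            return plan
        plan = new_plan
    return plan


def remove_double_negation(node: Node) -> Node:
    # Not(Not(x)) => x; any other node is returned unchanged.
    if node.label == "Not" and node.children and node.children[0].label == "Not":
        return node.children[0].children[0]
    return node


x = Node("x")
plan = Node("Not", (Node("Not", (Node("Not", (Node("Not", (x,)),)),)),))
result = execute(plan, [remove_double_negation])
```

Here `result` is the bare `x` node: the bottom-up pass removes both pairs of negations, and a second pass confirms that nothing else changes.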
  • Logical Operators. Basic operators: Project, Filter, … Binary operators: Join, Except, Intersect, Union, … Others: Aggregate; Generate, Distinct; Sort, Limit; InsertInto, WriteToFile. (Diagram: an operator tree of Project, Filter, Join, and two Table nodes.)
  • Expressions. Literal. Arithmetic: UnaryMinus, Sqrt, MaxOf; Add, Subtract, Multiply, … Predicates: EqualTo, LessThan, LessThanOrEqual, GreaterThan, GreaterThanOrEqual; Not, And, Or, In, If, CaseWhen. (Diagram: an expression tree with + over 1 and 2.)
  • Expressions. Cast. GetItem, GetField. Coalesce, IsNull, IsNotNull. String operations: Like, Upper, Lower, Contains, StartsWith, EndsWith, Substring, …
  • Optimizations. ConstantFolding: NullPropagation, ConstantFolding, BooleanSimplification, SimplifyFilters. FilterPushdown: CombineFilters, PushPredicateThroughProject, PushPredicateThroughJoin, ColumnPruning.
  • Optimizations: NullPropagation, ConstantFolding. Replace expressions that can be evaluated to a literal value with that value. ex) 1 + null => null; 1 + 2 => 3; Count(null) => 0
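The three examples on this slide can be reproduced with a small Python sketch. This is a toy model, not Catalyst code: the `Literal`, `Attribute`, `Add`, and `Count` classes and the `fold` function are hypothetical, and `None` stands in for SQL NULL.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Literal:
    value: Any  # None stands in for SQL NULL


@dataclass(frozen=True)
class Attribute:
    name: str  # a column reference whose value is unknown at planning time


@dataclass(frozen=True)
class Add:
    left: Any
    right: Any


@dataclass(frozen=True)
class Count:
    child: Any


NULL = Literal(None)


def fold(expr):
    """Rewrite an expression tree bottom-up."""
    if isinstance(expr, Add):
        left, right = fold(expr.left), fold(expr.right)
        if left == NULL or right == NULL:
            return NULL  # null propagation: 1 + null => null
        if isinstance(left, Literal) and isinstance(right, Literal):
            return Literal(left.value + right.value)  # constant folding: 1 + 2 => 3
        return Add(left, right)
    if isinstance(expr, Count):
        child = fold(expr.child)
        if child == NULL:
            return Literal(0)  # Count(null) => 0
        return Count(child)
    return expr  # literals and attributes are left as they are
```

An expression that mixes a column with constants, such as `Add(Attribute("i"), Add(Literal(1), Literal(2)))`, is only partially folded: the constant subtree becomes `Literal(3)` while the column reference stays.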
  • Optimizations: BooleanSimplification. Simplifies boolean expressions whose result can be determined. ex) false AND $right => false; true AND $right => $right; true OR $right => true; false OR $right => $right; If(true, $then, $else) => $then
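A minimal Python sketch of these rewrites, assuming a made-up encoding in which `True`/`False` are literals, strings are opaque sub-expressions, and operators are tuples such as `("and", left, right)`. The `simplify` function is hypothetical and is not Catalyst's implementation.

```python
def simplify(expr):
    """Simplify a boolean expression encoded as nested tuples."""
    if not isinstance(expr, tuple):
        return expr  # a literal or an opaque sub-expression
    op, *args = expr
    args = [simplify(a) for a in args]  # simplify the operands first
    if op == "and":
        left, right = args
        if left is False or right is False:
            return False  # false AND $right => false
        if left is True:
            return right  # true AND $right => $right
        if right is True:
            return left
        return ("and", left, right)
    if op == "or":
        left, right = args
        if left is True or right is True:
            return True  # true OR $right => true
        if left is False:
            return right  # false OR $right => $right
        if right is False:
            return left
        return ("or", left, right)
    if op == "if":
        condition, then, otherwise = args
        if condition is True:
            return then  # If(true, $then, $else) => $then
        if condition is False:
            return otherwise
        return ("if", condition, then, otherwise)
    return (op, *args)
```

Because operands are simplified before the operator, nested cases collapse in one call: `simplify(("and", ("or", False, "x"), ("if", True, True, "y")))` reduces to `"x"`.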
  • Optimizations: SimplifyFilters. Removes filters that can be evaluated trivially. ex) Filter(true, child) => child; Filter(false, child) => empty
  • Optimizations: CombineFilters. Merges two filters. ex) Filter($fc, Filter($nc, child)) => Filter(AND($fc, $nc), child)
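The two filter rules (SimplifyFilters and CombineFilters) can be sketched together on a toy plan tree. The `Relation`, `Empty`, and `Filter` classes and the `optimize` function below are hypothetical Python stand-ins, and conditions are kept as opaque values rather than real expressions.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Relation:
    name: str


@dataclass(frozen=True)
class Empty:
    """Stands in for a plan that produces no rows."""


@dataclass(frozen=True)
class Filter:
    condition: Any  # True, False, or an opaque predicate
    child: Any


def optimize(plan):
    if not isinstance(plan, Filter):
        return plan
    child = optimize(plan.child)  # optimize the subtree first
    if plan.condition is True:
        return child  # Filter(true, child) => child
    if plan.condition is False:
        return Empty()  # Filter(false, child) => empty
    if isinstance(child, Filter):
        # Filter($fc, Filter($nc, child)) => Filter(AND($fc, $nc), child)
        return Filter(("and", plan.condition, child.condition), child.child)
    return Filter(plan.condition, child)


table = Relation("t")
combined = optimize(Filter("fc", Filter("nc", table)))
```

Here `combined` is a single `Filter` over `table` whose condition is `("and", "fc", "nc")`, so the rows are scanned once instead of passing through two operators.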
  • Optimizations: PushPredicateThroughProject. Pushes Filter operators through a Project operator. ex) Filter('i === 1, Project('i, 'j, child)) => Project('i, 'j, Filter('i === 1, child))
  • Optimizations: PushPredicateThroughJoin. Pushes Filter operators through a Join operator. ex) Filter("left.i".attr === 1, Join(left, right)) => Join(Filter('i === 1, left), right)
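Both pushdown rules can be illustrated with one toy Python sketch. Everything here is an assumption made for illustration: the plan classes are hypothetical, a filter is limited to `column = value`, a Project only selects existing columns (no aliases or computed columns), and the Join is treated as an inner join, for which pushing a one-sided predicate below the join is safe.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class Relation:
    name: str
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: Any


@dataclass(frozen=True)
class Join:
    left: Any
    right: Any


@dataclass(frozen=True)
class Filter:
    column: str  # the predicate is "column = value"
    value: Any
    child: Any


def output(plan):
    """Columns produced by a plan node."""
    if isinstance(plan, (Relation, Project)):
        return set(plan.columns)
    if isinstance(plan, Join):
        return output(plan.left) | output(plan.right)
    return output(plan.child)  # a Filter keeps its child's columns


def push_down(plan):
    if isinstance(plan, Project):
        return Project(plan.columns, push_down(plan.child))
    if isinstance(plan, Join):
        return Join(push_down(plan.left), push_down(plan.right))
    if not isinstance(plan, Filter):
        return plan
    child = push_down(plan.child)
    if isinstance(child, Project):
        # Move the filter below the projection.
        pushed = push_down(Filter(plan.column, plan.value, child.child))
        return Project(child.columns, pushed)
    if isinstance(child, Join):
        # Move the filter to whichever join side produces the column.
        if plan.column in output(child.left):
            return Join(push_down(Filter(plan.column, plan.value, child.left)), child.right)
        if plan.column in output(child.right):
            return Join(child.left, push_down(Filter(plan.column, plan.value, child.right)))
    return Filter(plan.column, plan.value, child)


left = Relation("left", ("left.i", "left.j"))
right = Relation("right", ("right.i",))
pushed_join = push_down(Filter("left.i", 1, Join(left, right)))
```

Here `pushed_join` is `Join(Filter("left.i", 1, left), right)`: the predicate is evaluated before the join, so fewer rows from `left` take part in it.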
  • Optimizations: ColumnPruning. Eliminates the reading of unused columns. ex) Join(left, right, LeftSemi, "left.id".attr === "right.id".attr) => Join(left, Project('id, right), LeftSemi)
  • SQL core in depth
  • SQL core in depth. Provides: physical operators to build RDDs; conversion from an existing RDD of Product to SchemaRDD; Parquet file read/write support; JSON file read support; columnar in-memory table support.
  • SQL core in depth: o.a.s.sql.SchemaRDD. Extends RDD[Row]. Has a logical plan tree. Provides LINQ-like interfaces to construct the logical plan (select, where, join, orderBy, …). Executes the plan.
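The idea of a LINQ-like interface that only builds a logical plan, and executes it later, can be sketched in a few lines of Python. `Query` is a hypothetical stand-in for SchemaRDD, not its real API: rows are plain dicts, the plan is a nested tuple, and only `where` (equality) and `select` are modelled.

```python
class Query:
    """Hypothetical stand-in for a SchemaRDD: source rows plus a logical plan."""

    def __init__(self, rows, plan=("Scan",)):
        self._rows = rows
        self.plan = plan

    def where(self, column, value):
        # Nothing is evaluated here; the plan tree just grows by one node.
        return Query(self._rows, ("Filter", column, value, self.plan))

    def select(self, *columns):
        return Query(self._rows, ("Project", columns, self.plan))

    def collect(self):
        # Execution happens only when results are requested.
        return list(self._run(self.plan))

    def _run(self, plan):
        op = plan[0]
        if op == "Scan":
            return iter(self._rows)
        if op == "Filter":
            _, column, value, child = plan
            return (row for row in self._run(child) if row[column] == value)
        if op == "Project":
            _, columns, child = plan
            return ({c: row[c] for c in columns} for row in self._run(child))
        raise ValueError(f"unknown operator: {op}")


people = [{"name": "a", "age": 30}, {"name": "b", "age": 25}]
query = Query(people).where("age", 30).select("name")
```

Inspecting `query.plan` shows a Project over a Filter over a Scan before any row is touched; in a real engine this is the point where an optimizer could rewrite the plan ahead of execution.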
  • SQL core in depth: o.a.s.sql.execution.SparkStrategies. Converts a logical plan to a physical plan. Some rules are based on statistics of the operators.
  • SQL core in depth: Parquet read/write support. Parquet is a columnar storage format for Hadoop. Reads existing Parquet files. Converts the Parquet schema to a row schema. Writes new Parquet files. Currently DecimalType and TimestampType are not supported.
  • SQL core in depth: JSON read support. Loads a JSON file (one object per line). Infers the row schema from the entire dataset. Specifying the schema explicitly is experimental. Inferring the schema by sampling is also experimental.
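Schema inference over a file with one JSON object per line can be sketched as a single pass that merges the types seen for each field. This Python toy is not Spark's inference algorithm: the type names, the `merge` rules (null yields to any type, long and double widen to double, other conflicts fall back to string), and the flat handling of nested values are all assumptions.

```python
import json


def type_name(value):
    """Map a parsed JSON value to a coarse type name (hypothetical names)."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # checked before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "struct"


def merge(a, b):
    """Combine two observed types for the same field."""
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    if {a, b} == {"long", "double"}:
        return "double"
    return "string"  # assumption: conflicting types fall back to string


def infer_schema(lines):
    schema = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        for key, value in json.loads(line).items():
            observed = type_name(value)
            schema[key] = merge(schema[key], observed) if key in schema else observed
    return schema


lines = [
    '{"id": 1, "name": "a"}',
    '{"id": 2.5, "tags": ["x"]}',
    '{"id": null, "name": null}',
]
schema = infer_schema(lines)
```

Since every line contributes to the result, the cost grows with the dataset, which is one reason sampling-based inference is attractive.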
  • SQL core in depth: Columnar in-memory table support. Caches a table like RDD.cache, but in a columnar layout. Can prune unnecessary columns when reading data.
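Why a columnar layout makes column pruning cheap can be seen in a small Python sketch. `ColumnarTable` is a hypothetical class, not Spark's in-memory columnar implementation: it stores one list per column, so a scan touches only the requested columns.

```python
class ColumnarTable:
    """Toy columnar cache: one Python list per column."""

    def __init__(self, rows, columns):
        # Pivot row-oriented dicts into column-oriented lists.
        self.columns = {c: [row[c] for row in rows] for c in columns}
        self.num_rows = len(rows)

    def scan(self, wanted):
        # Only the requested columns are read; the others are never touched.
        selected = [self.columns[c] for c in wanted]
        return [tuple(col[i] for col in selected) for i in range(self.num_rows)]


rows = [
    {"id": 1, "name": "a", "age": 30},
    {"id": 2, "name": "b", "age": 25},
]
table = ColumnarTable(rows, ["id", "name", "age"])
names = table.scan(["name"])
```

A row-oriented cache would have to walk every full row to answer the same `scan(["name"])` request.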
  • Interesting issues
  • Interesting issues. Support GroupingSet/ROLLUP/CUBE (https://issues.apache.org/jira/browse/SPARK-2663). Use statistics to skip partitions when reading from in-memory columnar data (https://issues.apache.org/jira/browse/SPARK-2961).
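The general idea behind skipping partitions with statistics can be sketched as follows. This Python toy does not reflect the design discussed in SPARK-2961: the batch layout, the min/max statistics, and the `build_batches` and `scan_equal` helpers are assumptions for illustration.

```python
def build_batches(values, batch_size):
    """Split a column into batches and record min/max statistics for each."""
    batches = []
    for start in range(0, len(values), batch_size):
        chunk = values[start:start + batch_size]
        batches.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return batches


def scan_equal(batches, target):
    """Return matching values and the number of batches actually scanned."""
    matches, scanned = [], 0
    for batch in batches:
        if target < batch["min"] or target > batch["max"]:
            continue  # the statistics prove this batch cannot contain the target
        scanned += 1
        matches.extend(v for v in batch["values"] if v == target)
    return matches, scanned


batches = build_batches([1, 2, 3, 10, 11, 12, 20, 21, 22], batch_size=3)
matches, scanned = scan_equal(batches, 11)
```

With these three batches, a lookup for 11 scans one batch and skips the other two. The benefit depends on the data being clustered: if every batch spans the full value range, nothing can be skipped.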
  • Interesting issues. Pluggable interface for shuffles (https://issues.apache.org/jira/browse/SPARK-2044). Sort-merge join (https://issues.apache.org/jira/browse/SPARK-2213). Cost-based join reordering (https://issues.apache.org/jira/browse/SPARK-2216).
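For readers unfamiliar with the sort-merge join mentioned above, here is a textbook single-machine version in Python. It is a sketch of the general algorithm (an inner equi-join), not the implementation proposed in SPARK-2213, and `sort_merge_join` is a hypothetical helper.

```python
def sort_merge_join(left, right, key):
    """Inner equi-join of two lists of rows on key(row)."""
    left = sorted(left, key=key)
    right = sorted(right, key=key)
    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = key(left[i]), key(right[j])
        if left_key < right_key:
            i += 1  # no partner on the right for this left row
        elif left_key > right_key:
            j += 1  # no partner on the left for this right row
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right) and key(right[j_end]) == left_key:
                j_end += 1
            # Pair every left row in the matching run with every such right row.
            while i < len(left) and key(left[i]) == left_key:
                joined.extend((left[i], r) for r in right[j:j_end])
                i += 1
            j = j_end
    return joined


left_rows = [(2, "b"), (1, "a"), (2, "c")]
right_rows = [(2, "x"), (3, "y"), (2, "z")]
pairs = sort_merge_join(left_rows, right_rows, key=lambda row: row[0])
```

After sorting, each input is consumed in one forward pass, so the join itself needs no hash table; the cost is dominated by the two sorts.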
  • How to contribute
  • How to contribute. See: Contributing to Spark. Open an issue on JIRA. Send a pull request on GitHub. Communicate with committers and reviewers. Congratulations!
  • Conclusion. Introduced Spark SQL & Catalyst. Now you know them very well! And you know how to contribute. Let’s contribute to Spark & Spark SQL!!
  • an addition
  • What are we doing?
  • What are we doing? Executing a query takes several steps: Parse, Analyze, Logical Plan, Optimize, Physical Plan, Execute.
  • What are we doing? Executing a query takes several steps: Parse, Analyze, Logical Plan, Optimize, Physical Plan, Execute. (Diagram labels: DSL, ASG, ++, ++, ++, business logic.)
  • Thanks!