20140908 spark sql & catalyst

Introduction to Spark SQL & Catalyst. Takuya UESHIN. Spark Meetup, 2014/09/08 (Mon).

Who am I? Takuya UESHIN, @ueshin, github.com/ueshin. Nautilus Technologies, Inc. A Spark contributor.

Agenda: What is Spark SQL? Catalyst in depth. SQL core in depth. Interesting issues. How to contribute.

What is Spark SQL?

What is Spark SQL? Spark SQL is one of the Spark components. It executes SQL on Spark, builds SchemaRDDs through a LINQ-like interface, and optimizes the execution plan.

What is Spark SQL? Catalyst provides an execution planning framework for relational operations. It includes: a SQL parser and analyzer; logical operators and general expressions; a logical optimizer; a framework to transform operator trees.

What is Spark SQL? Executing a query takes several steps: Parse, Analyze, Logical Plan, Optimize, Physical Plan, Execute.

What is Spark SQL? Executing a query takes several steps: Parse, Analyze, Logical Plan, Optimize, Physical Plan, Execute. (This slide marks the steps handled by Catalyst.)

What is Spark SQL? Executing a query takes several steps: Parse, Analyze, Logical Plan, Optimize, Physical Plan, Execute. (This slide marks the steps handled by SQL core.)

Catalyst in depth

Catalyst in depth. Provides an execution planning framework for relational operations. Topics: Row & DataTypes; Trees & Rules; Logical Operators; Expressions; Optimizations.

Row & DataTypes. o.a.s.sql.catalyst.types.DataType: Long, Int, Short, Byte, Float, Double, Decimal; String, Binary, Boolean, Timestamp; Array, Map, Struct. o.a.s.sql.catalyst.expressions.Row: represents a single row and can contain complex types.

Trees & Rules. o.a.s.sql.catalyst.trees.TreeNode provides tree transformations: foreach, map, flatMap, collect; transform, transformUp, transformDown. It is used for both operator trees and expression trees.
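The difference between transformUp and transformDown is easiest to see on a small tree. The following is only a toy model in Python, not Catalyst's Scala API: the `Node` class, its snake_case method names, and the predicate-based `collect` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Node:
    """A toy immutable tree node: a label plus child nodes."""
    name: str
    children: Tuple["Node", ...] = ()

    def transform_down(self, rule: Callable[["Node"], "Node"]) -> "Node":
        # Pre-order: rewrite this node first, then recurse into the result's children.
        new = rule(self)
        return Node(new.name, tuple(c.transform_down(rule) for c in new.children))

    def transform_up(self, rule: Callable[["Node"], "Node"]) -> "Node":
        # Post-order: rewrite the children first, then this node.
        rebuilt = Node(self.name, tuple(c.transform_up(rule) for c in self.children))
        return rule(rebuilt)

    def collect(self, pred: Callable[["Node"], bool]) -> list:
        # Gather the nodes that satisfy pred, in pre-order.
        found = [self] if pred(self) else []
        for c in self.children:
            found.extend(c.collect(pred))
        return found


def rename_table(node: Node) -> Node:
    # A rule returns the node unchanged when it does not apply.
    return Node("Relation", node.children) if node.name == "Table" else node


plan = Node("Project", (Node("Filter", (Node("Table"),)),))
renamed = plan.transform_up(rename_table)
leaves = renamed.collect(lambda n: not n.children)
```

Because nodes are immutable, every transformation returns a new tree and leaves `plan` untouched, which is what makes it safe to chain many rules.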

Trees & Rules. o.a.s.sql.catalyst.rules.Rule represents a tree transformation rule. o.a.s.sql.catalyst.rules.RuleExecutor is a framework that transforms trees based on rules.
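One way to picture a rule executor is as a loop that applies batches of rules until the tree stops changing. This is a minimal Python sketch under that assumption; `execute`, the `(name, max iterations, rules)` batch shape, and the `drop_double_not` rule are hypothetical names, not the RuleExecutor API.

```python
from typing import Callable, List, Tuple

Rule = Callable[[object], object]
Batch = Tuple[str, int, List[Rule]]  # (name, max iterations, rules)


def execute(plan, batches: List[Batch]):
    """Run each batch repeatedly until the plan stops changing
    or the batch's iteration limit is reached."""
    for _name, max_iterations, rules in batches:
        for _ in range(max_iterations):
            before = plan
            for rule in rules:
                plan = rule(plan)
            if plan == before:  # fixed point reached
                break
    return plan


def drop_double_not(tree):
    # Toy rule: Not(Not(x)) => x. Trees are tuples, leaves are plain strings.
    if isinstance(tree, str):
        return tree
    if tree[0] == "Not" and isinstance(tree[1], tuple) and tree[1][0] == "Not":
        return tree[1][1]  # one step only; deeper matches wait for the next pass
    return (tree[0],) + tuple(drop_double_not(c) for c in tree[1:])


plan = ("Not", ("Not", ("Not", ("Not", "x"))))
one_pass = execute(plan, [("once", 1, [drop_double_not])])
fixed_point = execute(plan, [("until stable", 100, [drop_double_not])])
```

A single pass removes only the outermost pair of Nots; running the same rule to a fixed point removes them all, which is why a rule can stay simple and rely on the executor to repeat it.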

Logical Operators. Basic operators: Project, Filter, … Binary operators: Join, Except, Intersect, Union, … Also: Aggregate; Generate, Distinct; Sort, Limit; InsertInto, WriteToFile. (Diagram: an operator tree built from Project, Filter, Join, and two Table nodes.)

Expressions. Literal. Arithmetic: UnaryMinus, Sqrt, MaxOf; Add, Subtract, Multiply, … Predicates: EqualTo, LessThan, LessThanOrEqual, GreaterThan, GreaterThanOrEqual; Not, And, Or, In, If, CaseWhen. (Diagram: an expression tree with + over 1 and 2.)

Expressions. Cast. GetItem, GetField. Coalesce, IsNull, IsNotNull. String operations: Like, Upper, Lower, Contains, StartsWith, EndsWith, Substring, …
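An expression tree such as + over 1 and 2 can be evaluated by walking it recursively against a row. The sketch below is a toy Python interpreter covering a handful of the expressions named above; the tuple encoding, the `Attr` leaf, and the blanket "NULL in, NULL out" shortcut are simplifying assumptions, not Catalyst's evaluation code.

```python
def evaluate(expr, row):
    """Evaluate a toy expression tree against a row (a dict).
    None stands in for SQL NULL."""
    op = expr[0]
    if op == "Literal":
        return expr[1]
    if op == "Attr":
        return row[expr[1]]
    args = [evaluate(a, row) for a in expr[1:]]
    if op == "IsNull":
        return args[0] is None
    if op == "Coalesce":
        # First non-NULL argument, or NULL if there is none.
        return next((a for a in args if a is not None), None)
    if any(a is None for a in args):
        return None  # simplification: the remaining operators return NULL on NULL input
    if op == "Add":
        return args[0] + args[1]
    if op == "LessThan":
        return args[0] < args[1]
    if op == "Upper":
        return args[0].upper()
    raise ValueError("unsupported expression: " + op)


three = evaluate(("Add", ("Literal", 1), ("Literal", 2)), {})
```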

Optimizations, grouped by batch. ConstantFolding: NullPropagation, ConstantFolding, BooleanSimplification, SimplifyFilters. FilterPushdown: CombineFilters, PushPredicateThroughProject, PushPredicateThroughJoin, ColumnPruning.

Optimizations: NullPropagation, ConstantFolding. Replace expressions that can be evaluated to a literal with that literal value. Examples: 1 + null => null; 1 + 2 => 3; Count(null) => 0.
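The three examples above can be reproduced with a short bottom-up rewrite. This is a toy Python sketch, not the Catalyst rules themselves: the tuple encoding and the single `fold` function that merges both rules are assumptions, and only Add and Count are handled.

```python
def fold(expr):
    """Bottom-up folding over toy expressions.
    Leaves are ("Lit", value), with None standing in for NULL, or ("Attr", name)."""
    op = expr[0]
    if op in ("Lit", "Attr"):
        return expr
    args = tuple(fold(a) for a in expr[1:])
    if op == "Add":
        left, right = args
        # Null propagation: NULL on either side makes the whole sum NULL.
        if ("Lit", None) in args:
            return ("Lit", None)
        # Constant folding: both sides are literals, so evaluate now.
        if left[0] == "Lit" and right[0] == "Lit":
            return ("Lit", left[1] + right[1])
    if op == "Count" and args[0] == ("Lit", None):
        # COUNT skips NULLs, so counting a NULL literal gives 0.
        return ("Lit", 0)
    return (op,) + args


folded = fold(("Add", ("Attr", "a"), ("Add", ("Lit", 1), ("Lit", 2))))
```

Here `a + (1 + 2)` becomes `a + 3`: the constant subtree is folded once at planning time instead of once per row at execution time.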

Optimizations: BooleanSimplification. Simplifies boolean expressions whose result can be determined in advance. Examples: false AND $right => false; true AND $right => $right; true OR $right => true; false OR $right => $right; If(true, $then, $else) => $then.
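These rewrites amount to a small pattern match on And, Or, and If nodes. The Python sketch below is a toy illustration using an assumed tuple encoding; it also handles the mirrored cases where the literal is on the right, which go beyond the examples listed above.

```python
TRUE, FALSE = ("Lit", True), ("Lit", False)


def simplify(expr):
    """Bottom-up simplification of toy boolean expressions."""
    if expr[0] in ("Lit", "Attr"):
        return expr
    op, args = expr[0], tuple(simplify(a) for a in expr[1:])
    if op == "And":
        left, right = args
        if FALSE in (left, right):
            return FALSE          # false AND x => false
        if left == TRUE:
            return right          # true AND x => x
        if right == TRUE:
            return left           # mirrored case (added in this sketch)
    elif op == "Or":
        left, right = args
        if TRUE in (left, right):
            return TRUE           # true OR x => true
        if left == FALSE:
            return right          # false OR x => x
        if right == FALSE:
            return left           # mirrored case (added in this sketch)
    elif op == "If":
        cond, then, _otherwise = args
        if cond == TRUE:
            return then           # If(true, then, else) => then
    return (op,) + args


x = ("Attr", "x")
nested = simplify(("And", ("Or", FALSE, x), TRUE))
```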

Optimizations: SimplifyFilters. Removes filters whose condition can be evaluated trivially. Examples: Filter(true, child) => child; Filter(false, child) => empty (an empty relation).

Optimizations: CombineFilters. Merges two adjacent filters. Example: Filter($fc, Filter($nc, child)) => Filter(AND($fc, $nc), child).
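SimplifyFilters and CombineFilters both rewrite Filter nodes, so a toy version fits in one function. This is a Python sketch under assumed names: plans are tuples such as `("Filter", condition, child)`, `EMPTY` is a hypothetical stand-in for the empty relation, and only a chain of Filters at the top of the plan is rewritten.

```python
TRUE, FALSE = ("Lit", True), ("Lit", False)
EMPTY = ("EmptyRelation",)  # hypothetical stand-in for an empty relation


def rewrite_filters(plan):
    """Simplify trivial Filters and merge adjacent ones.
    Only a chain of Filter nodes at the top of the plan is handled."""
    if plan[0] != "Filter":
        return plan
    cond, child = plan[1], rewrite_filters(plan[2])
    if cond == TRUE:
        return child                      # Filter(true, child) => child
    if cond == FALSE:
        return EMPTY                      # Filter(false, child) => empty
    if child[0] == "Filter":
        # Filter(fc, Filter(nc, grandchild)) => Filter(AND(fc, nc), grandchild)
        return ("Filter", ("And", cond, child[1]), child[2])
    return ("Filter", cond, child)


table = ("Table", "t")
fc, nc = ("Attr", "fc"), ("Attr", "nc")
combined = rewrite_filters(("Filter", fc, ("Filter", nc, table)))
```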

Optimizations: PushPredicateThroughProject. Pushes Filter operators through a Project operator. Example: Filter('i === 1, Project('i, 'j, child)) => Project('i, 'j, Filter('i === 1, child)).
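Swapping a Filter with the Project beneath it is safe when the condition only uses columns that the Project passes through unchanged. The Python sketch below illustrates that check on an assumed tuple encoding; the helper names are hypothetical, and projections with aliases or computed columns are outside its scope.

```python
def references(expr):
    """Column names used by a toy expression such as ("Eq", ("Attr", "i"), ("Lit", 1))."""
    if expr[0] == "Attr":
        return {expr[1]}
    if expr[0] == "Lit":
        return set()
    return set().union(*(references(a) for a in expr[1:]))


def push_through_project(plan):
    """Filter(cond, Project(cols, child)) => Project(cols, Filter(cond, child))
    when cond only uses plain pass-through columns of the Project."""
    if plan[0] == "Filter" and plan[2][0] == "Project":
        cond, (_, cols, grandchild) = plan[1], plan[2]
        if references(cond) <= set(cols):
            return ("Project", cols, ("Filter", cond, grandchild))
    return plan


cond = ("Eq", ("Attr", "i"), ("Lit", 1))
before = ("Filter", cond, ("Project", ("i", "j"), ("Table", "t")))
after = push_through_project(before)
```

After the rewrite the Filter sits directly on the table, so rows are discarded before the projection does any work.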

Optimizations: PushPredicateThroughJoin. Pushes Filter operators through a Join operator. Example: Filter("left.i".attr === 1, Join(left, right)) => Join(Filter('i === 1, left), right).
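The core of this rewrite is deciding which side of the join a predicate belongs to by looking at the columns it references. The following is a toy Python sketch assuming an inner join and a single predicate; the tuple encoding, `output`, and `references` are hypothetical helpers rather than Catalyst code.

```python
def references(expr):
    """Column names used by a toy expression."""
    if expr[0] == "Attr":
        return {expr[1]}
    if expr[0] == "Lit":
        return set()
    return set().union(*(references(a) for a in expr[1:]))


def output(plan):
    """Columns produced by a toy plan node."""
    if plan[0] == "Table":          # ("Table", name, columns)
        return set(plan[2])
    if plan[0] == "Filter":         # ("Filter", condition, child)
        return output(plan[2])
    return output(plan[1]) | output(plan[2])  # ("Join", left, right)


def push_through_join(plan):
    """Move a Filter below an inner Join when it only touches one side."""
    if plan[0] == "Filter" and plan[2][0] == "Join":
        cond, (_, left, right) = plan[1], plan[2]
        refs = references(cond)
        if refs <= output(left):
            return ("Join", ("Filter", cond, left), right)
        if refs <= output(right):
            return ("Join", left, ("Filter", cond, right))
    return plan  # predicates spanning both sides stay above the join


left = ("Table", "left", ("left.i", "left.j"))
right = ("Table", "right", ("right.k",))
cond = ("Eq", ("Attr", "left.i"), ("Lit", 1))
pushed = push_through_join(("Filter", cond, ("Join", left, right)))
```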

Optimizations: ColumnPruning. Eliminates the reading of unused columns. Example: Join(left, right, LeftSemi, "left.id".attr === "right.id".attr) => Join(left, Project('id, right), LeftSemi).
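A left semi join only returns columns from the left side, so the right side needs nothing beyond the columns used in the join condition. The Python sketch below illustrates that one case; the tuple encoding and helper names are assumptions, and unlike the shortened example above it keeps the join condition in the rewritten plan.

```python
def references(expr):
    """Column names used by a toy expression."""
    if expr[0] == "Attr":
        return {expr[1]}
    if expr[0] == "Lit":
        return set()
    return set().union(*(references(a) for a in expr[1:]))


def prune_semi_join(join):
    """For ("Join", left, right, join_type, cond) with a LeftSemi type,
    project the right table down to the columns the condition uses."""
    _, left, right, join_type, cond = join
    right_cols = right[2]  # right is ("Table", name, columns)
    needed = tuple(c for c in right_cols if c in references(cond))
    if join_type != "LeftSemi" or len(needed) == len(right_cols):
        return join  # nothing to prune
    return ("Join", left, ("Project", needed, right), join_type, cond)


left = ("Table", "left", ("left.id", "left.name"))
right = ("Table", "right", ("right.id", "right.payload"))
cond = ("Eq", ("Attr", "left.id"), ("Attr", "right.id"))
pruned = prune_semi_join(("Join", left, right, "LeftSemi", cond))
```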

SQL core in depth

SQL core in depth. Provides: physical operators to build RDDs; conversion from an existing RDD of Product to SchemaRDD; Parquet file read/write support; JSON file read support; columnar in-memory table support.

SQL core in depth. o.a.s.sql.SchemaRDD: extends RDD[Row]; holds a logical plan tree; provides LINQ-like interfaces (select, where, join, orderBy, …) to construct the logical plan; executes the plan.
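The key idea is that each LINQ-like call only grows a plan tree, and nothing runs until the result is requested. The following Python sketch models that laziness with a hypothetical `Query` class over in-memory rows; it is not the SchemaRDD API, and predicates here are plain Python functions rather than Catalyst expressions.

```python
class Query:
    """Toy lazy query: each method wraps the current plan in a new node."""

    def __init__(self, plan):
        self.plan = plan

    def where(self, predicate):
        return Query(("Filter", predicate, self.plan))

    def select(self, *columns):
        return Query(("Project", columns, self.plan))

    def collect(self):
        # Execution happens only here.
        return list(_run(self.plan))


def _run(plan):
    op = plan[0]
    if op == "Scan":        # ("Scan", rows)
        yield from plan[1]
    elif op == "Filter":    # ("Filter", predicate, child)
        yield from (row for row in _run(plan[2]) if plan[1](row))
    elif op == "Project":   # ("Project", columns, child)
        yield from ({c: row[c] for c in plan[1]} for row in _run(plan[2]))


people = [{"name": "a", "age": 15}, {"name": "b", "age": 30}]
query = Query(("Scan", people)).where(lambda r: r["age"] > 20).select("name")
adults = query.collect()
```

Because `query.plan` is an ordinary tree until `collect()` is called, an optimizer has the chance to rewrite it first.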

SQL core in depth. o.a.s.sql.execution.SparkStrategies converts the logical plan to a physical plan. Some rules are based on statistics of the operators.
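As an illustration of a statistics-based choice, a planner might broadcast the smaller side of a join when its estimated size is below a threshold. The Python sketch below is purely hypothetical: the operator names, the `size_in_bytes` field, and the threshold value are assumptions, not taken from SparkStrategies.

```python
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024  # hypothetical threshold


def plan_join(left, right):
    """Pick a toy physical join from size estimates.
    left/right are dicts with "name" and "size_in_bytes"."""
    smaller = min(left, right, key=lambda rel: rel["size_in_bytes"])
    if smaller["size_in_bytes"] <= BROADCAST_THRESHOLD_BYTES:
        # Small enough to ship a full copy to every worker.
        return ("BroadcastJoin", smaller["name"])
    # Otherwise both sides are repartitioned by the join key.
    return ("ShuffledJoin", None)


facts = {"name": "facts", "size_in_bytes": 50 * 1024 ** 3}
dims = {"name": "dims", "size_in_bytes": 2 * 1024 ** 2}
choice = plan_join(facts, dims)
```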

SQL core in depth: Parquet read/write support. Parquet is a columnar storage format for Hadoop. Spark SQL reads existing Parquet files, converts the Parquet schema to a row schema, and writes new Parquet files. Currently DecimalType and TimestampType are not supported.

SQL core in depth: JSON read support. Loads a JSON file (one object per line) and infers the row schema from the entire dataset. Supplying the schema explicitly is experimental. Inferring the schema by sampling is also experimental.
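Inferring a schema from the entire dataset means merging the type seen for each field across every line. The following Python sketch shows one way that merge could work; the type names and the conflict rules (long and double widen to double, anything else falls back to string) are assumptions for illustration, and nested objects are not descended into.

```python
import json


def type_name(value):
    if value is None:
        return "null"
    if isinstance(value, bool):   # checked before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    return "array" if isinstance(value, list) else "struct"


def merge(a, b):
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    if {a, b} == {"long", "double"}:
        return "double"
    return "string"  # assumed fallback for conflicting types


def infer_schema(lines):
    """Scan every line (one JSON object each) and merge field types."""
    schema = {}
    for line in lines:
        for key, value in json.loads(line).items():
            schema[key] = merge(schema.get(key, "null"), type_name(value))
    return schema


lines = [
    '{"id": 1, "name": "a"}',
    '{"id": 2.5, "tags": null}',
    '{"id": 3, "tags": ["x"]}',
]
schema = infer_schema(lines)
```

Fields that are missing from some lines, like `name` and `tags` here, still end up in the schema, which is why the whole dataset has to be scanned unless sampling is used.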

SQL core in depth: Columnar in-memory table support. Caches a table like RDD.cache, but in columnar form. Can prune unnecessary columns when reading data.
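Storing a cached table column by column is what makes pruning cheap: a scan touches only the lists it needs. The Python sketch below is a toy model of that layout; the `ColumnarTable` class and its `columns_read` bookkeeping field are hypothetical and exist only for the demonstration.

```python
class ColumnarTable:
    """Toy columnar cache: one Python list per column."""

    def __init__(self, rows):
        names = list(rows[0].keys()) if rows else []
        self.columns = {name: [row[name] for row in rows] for name in names}
        self.columns_read = []  # bookkeeping for the demo only

    def scan(self, wanted):
        """Rebuild rows from the requested columns; other columns are never touched."""
        self.columns_read = list(wanted)
        selected = [self.columns[name] for name in wanted]
        return [dict(zip(wanted, values)) for values in zip(*selected)]


rows = [
    {"id": 1, "name": "a", "age": 30},
    {"id": 2, "name": "b", "age": 40},
]
table = ColumnarTable(rows)
names = table.scan(["name"])
```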

Interesting issues

Interesting issues. Support the GroupingSet/ROLLUP/CUBE: https://issues.apache.org/jira/browse/SPARK-2663. Use statistics to skip partitions when reading from in-memory columnar data: https://issues.apache.org/jira/browse/SPARK-2961.

Interesting issues. Pluggable interface for shuffles: https://issues.apache.org/jira/browse/SPARK-2044. Sort-merge join: https://issues.apache.org/jira/browse/SPARK-2213. Cost-based join reordering: https://issues.apache.org/jira/browse/SPARK-2216.

How to contribute

How to contribute. See: Contributing to Spark. Open an issue on JIRA. Send a pull request on GitHub. Communicate with committers and reviewers. Congratulations!

Conclusion. Introduced Spark SQL & Catalyst. Now you know them very well! And you know how to contribute. Let's contribute to Spark & Spark SQL!!

an addition

What are we doing?

What are we doing? Executing a query takes several steps: Parse, Analyze, Logical Plan, Optimize, Physical Plan, Execute.

What are we doing? Executing a query takes several steps: Parse, Analyze, Logical Plan, Optimize, Physical Plan, Execute. (Diagram labels: DSL, ASG, ++, ++, ++, business logic.)

Thanks!

by Takuya UESHIN, working at Nautilus Technologies, Inc.

on Sep 08, 2014
