Scala at Treasure Data

1. TREASURE DATA • Scala at Treasure Data • Taro L. Saito, Ph.D. (GitHub: @xerial), Software Engineer at Treasure Data, Inc. • Treasure Data Tech Talk @ Tokyo, June 13, 2017

2. Why Scala? • Scala is not an official programming language at Treasure Data • Three years ago, I was the only engineer at TD who could write Scala • Now all of my team members can write Scala • Fact: Java experts can learn Scala quickly • https://www.treasuredata.com/company/careers/

3. Challenge: Increased Presto Usage at Treasure Data (2017) • Processing 15 trillion rows/day (≈ 173 million rows/sec) • 150,000+ queries/day • 1,500+ users • How can we improve the service using this massive volume of query logs? • [Diagram labels: Query Logs, Store, Analyze, SQL, Improve & Optimize]
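
The per-second figure on this slide follows from dividing the daily row count by the number of seconds in a day. A quick check in Scala (the object and value names are only for illustration):

```scala
object RowsPerSecond {
  val rowsPerDay: Double = 15e12          // 15 trillion rows per day (from the slide)
  val secondsPerDay: Int = 24 * 60 * 60   // 86,400 seconds in a day
  val rowsPerSec: Double = rowsPerDay / secondsPerDay

  def main(args: Array[String]): Unit =
    // Prints the rate in millions of rows per second
    println(f"${rowsPerSec / 1e6}%.1f million rows/sec")
}
```

The result is a little over 173.6 million rows per second, which the slide rounds down to 173 million.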

4. A Success Story: Using Scala in Genome Science

5. Scala Use Cases in TD • Analyzing Query Engine Logs • Data analytics workflows written in Scala • For finding effective optimization approaches • Prestobase • Management base for Presto • Gateway for accessing Presto (Finagle + Presto) • Monitoring + Runtime Analysis • Spark Integration • Accessing Treasure Data from Spark

6. Open-Source Scala Libraries Developed at TD • Libraries that make Scala programming fun • wvlet-log: handy logging library: https://github.com/wvlet/log • Airframe: Dependency Injection Library http://wvlet.org/airframe • Airframe Config: YAML-based configuration library (a module in Airframe) • Heavy use of meta-programming via Scalamacros • sbt plugins • Data analytics • sbt-sql: https://github.com/xerial/sbt-sql • Deployment • sbt-pack: https://github.com/xerial/sbt-pack • sbt-sonatype: https://github.com/xerial/sbt-sonatype

7. What is Scalamacros? • Generates Scala code at compile time • Meta-programming (writing a program that writes programs) • Experimental status in Scala 2.10, 2.11, and 2.12 • Scalamacros will no longer be experimental • Productization planned within 2017 • https://github.com/scalamacros/scalamacros • Scala macros author (@xeno-by), IntelliJ team, and an EPFL Ph.D. student • Support for Scala 2.12 (and maybe Scala 2.11) and 3.0 • Announced at the Scala Meetup at Twitter HQ, San Francisco

8. What is Scala 3.x? • Scala 3.x • Replaces the compiler with Dotty for faster compilation and better IDE integration • Dotty: Compilers Are Databases (Martin Odersky, Scala's creator) • https://www.youtube.com/watch?v=WxyyJyB_Ssc • Because the compiler needs to answer … • Q: What is the signature of method A.f at a given point in time? • class A[T] { def f(x: T): T = … } • The compiler itself, IDEs (e.g., IntelliJ), etc. need to know these temporal types (Denotation)

9. Open-Source Scala Libraries in TD

10. Logging Library: Hard to Use • Logging configuration is hard • slf4j, log4j, logback-classic, etc. • XML configuration, etc. • Need redundant getLogger calls • [Figure: Embulk log configuration with logback-classic]

11. Dependency Hell of slf4j • slf4j (Simple Logging Facade for Java) • The de facto standard logging library for Java • scala-logging: slf4j wrapper for Scala • Switches log outputs • Using a binding library in the classpath • slf4j-nop (no output) • slf4j-simple (console output) • slf4j-log4j (output to log4j) • Pitfall • Cannot have multiple binders • But must have exactly 1 binder (!!!) • De facto standard = many badly behaved users • e.g., Hadoop • Does not consider downstream users: includes slf4j-log4j as a direct dependency • Need to exclude the slf4j-log4j binding from all Hadoop-related dependencies
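
The exclusion mentioned in the last bullet can be expressed in an sbt build roughly as follows. This is a sketch, not taken from the talk: the Hadoop artifact and version are only examples, and `slf4j-log4j12` is the published name of the log4j binding that the slide is assumed to abbreviate as slf4j-log4j.

```scala
// build.sbt (sketch): the Hadoop artifact and version are only examples
libraryDependencies += ("org.apache.hadoop" % "hadoop-client" % "2.7.3")
  .exclude("org.slf4j", "slf4j-log4j12")

// Alternatively, exclude the binding from every dependency of the project
excludeDependencies += ExclusionRule("org.slf4j", "slf4j-log4j12")
```

After the exclusion, the project still has to supply exactly one binding of its own choice.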

12. wvlet-log github.com/wvlet/log • Favors Simplicity • Uses Scalamacros to simplify user code • Only need to extend the LogSupport trait • No getLogger call • Uses standard java.util.logging • No other dependency required • Features • Shows source code locations of logs • Log format is configurable in code (no XML or plugins!) • Changing log levels with files or JMX • log.properties • log-test.properties • Built-in log handlers • log-rotate handler, async handler • Works with Scala.js to show logs in the Web browser console
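
The idea of mixing in a trait instead of calling getLogger can be sketched with plain java.util.logging. The trait below is hypothetical and is not wvlet-log's implementation: it uses by-name parameters to skip building disabled messages, whereas the slides describe wvlet-log generating its logging code with macros.

```scala
import java.util.logging.{Level, Logger}

// Hypothetical trait for illustration only; NOT wvlet-log's LogSupport.
trait SimpleLogSupport {
  // One logger per concrete class, created lazily; no getLogger call in user code
  protected lazy val logger: Logger = Logger.getLogger(getClass.getName)

  // The by-name `message` is evaluated only when the level is enabled
  protected def info(message: => Any): Unit =
    if (logger.isLoggable(Level.INFO)) logger.info(s"$message")

  protected def debug(message: => Any): Unit =
    if (logger.isLoggable(Level.FINE)) logger.fine(s"$message")
}

class QueryRunner extends SimpleLogSupport {
  // Counts how often the debug message was actually constructed
  var debugMessagesBuilt = 0

  def run(sql: String): Int = {
    info(s"Running query: $sql")
    debug { debugMessagesBuilt += 1; s"Query length: ${sql.length}" }
    sql.length
  }
}

object LogSupportExample {
  def main(args: Array[String]): Unit = {
    val runner = new QueryRunner
    runner.run("select 1")
    // Under the default java.util.logging configuration (INFO),
    // the debug block is never evaluated
    println(s"debug messages built: ${runner.debugMessagesBuilt}")
  }
}
```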

13. wvlet-log: Logging code generation with Scalamacros • Generates low-overhead logging code • Quasiquote • q"… scala code" • Just write Scala code templates inside macros

14. Airframe: wvlet.org/airframe/ • Dependency Injection Library for Scala • Best practices of building objects in Scala • We needed Google Guice for Scala • But there is no good alternative • Guice, Dagger2, Scaldi, Macwire, etc. • http://wvlet.org/airframe/docs/comparison.html • Using Google Guice in Scala • PlayFramework • Weird syntax • Airframe uses Scalamacros to simplify DI in Scala

15. Airframe • Three-step DI in Scala • Bind • Design • Build • Built-in life cycle manager • Session start/shutdown • e.g., connection open/close • Session • Manages singletons and binding rules

16. Clear Separation of Concerns • Traditional Service Building: • With Airframe: • Clear separation of concerns: • How to build objects (design) • How to use objects (bind) • Simplest DI pattern for Scala • [Code annotations: "Need to remember argument orders", "How to build dependencies", "Just use components!"]
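
The separation between design (how to build objects), bind (how to use them), and a session with life cycle hooks can be sketched in plain Scala. Everything below is a hypothetical miniature written for illustration; the names `Design`, `Session`, `bind`, `get`, and `withSession` here are assumptions and are not Airframe's actual API.

```scala
import scala.collection.mutable
import scala.reflect.ClassTag

// Hypothetical mini-DI for illustration only; not Airframe's API.
final case class Design(bindings: Map[Class[_], Session => Any] = Map.empty) {
  // Design: record how to build a T without building it yet
  def bind[T](factory: Session => T)(implicit ct: ClassTag[T]): Design =
    copy(bindings = bindings.updated(ct.runtimeClass, factory))

  // Build: start a session, run the body, then always shut the session down
  def withSession[A](body: Session => A): A = {
    val session = new Session(this)
    try body(session) finally session.shutdown()
  }
}

final class Session(design: Design) {
  private val singletons    = mutable.Map.empty[Class[_], Any]
  private val shutdownHooks = mutable.ListBuffer.empty[() => Unit]

  // Bind: look up the singleton for T, creating it on first use
  def get[T](implicit ct: ClassTag[T]): T = {
    val cls = ct.runtimeClass
    singletons.get(cls) match {
      case Some(instance) => instance.asInstanceOf[T]
      case None =>
        val instance = design.bindings(cls)(this)
        singletons(cls) = instance
        instance.asInstanceOf[T]
    }
  }

  // Hooks run in reverse registration order at shutdown
  def onShutdown(hook: () => Unit): Unit = shutdownHooks.prepend(hook)
  def shutdown(): Unit = shutdownHooks.foreach(hook => hook())
}

// Example components (hypothetical)
class Database(val url: String) {
  var open = true
  def close(): Unit = { open = false }
}
class UserService(val db: Database)

object DiExample {
  // All construction knowledge lives in the design
  val design: Design = Design()
    .bind[Database] { session =>
      val db = new Database("jdbc:example")
      session.onShutdown(() => db.close())
      db
    }
    .bind[UserService](session => new UserService(session.get[Database]))

  def main(args: Array[String]): Unit = {
    val db = design.withSession { session =>
      // User code just asks for components; no constructor argument order to remember
      session.get[UserService].db
    }
    println(s"database open after session: ${db.open}")
  }
}
```

The session closes the database when it ends, so components never manage the life cycle of their dependencies themselves.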

17. Airframe Internals (Advanced) • Code generation with Scalamacros • Passing a Session when building App and A • http://wvlet.org/airframe/docs/internals.html

18. Customizing Prestobase Filters with Airframe • Prestobase Proxy: Gateway to access Presto • Adding TD-specific bindings • Finagle filters -> Injecting TD-specific filters

19. VCR Record/Replay for Testing Presto • Launching Presto requires a lot of memory (e.g., 2GB or more) • Often crashes CI service containers (TravisCI, CircleCI, etc.) • Recording Presto responses (prestobase-vcr) • with sqlite-jdbc: https://github.com/xerial/sqlite-jdbc • DB file for each test suite • Enables testing with a small memory footprint • Can run many Presto tests in CI
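
The record/replay idea can be sketched as a cache keyed by the query text. This is a hypothetical illustration: an in-memory map stands in for the per-suite SQLite file mentioned on the slide, and none of the names come from prestobase-vcr.

```scala
import scala.collection.mutable

// Hypothetical sketch; an in-memory map replaces the SQLite file used by prestobase-vcr.
final class QueryRecorder(runOnBackend: String => Seq[String]) {
  private val cassette = mutable.Map.empty[String, Seq[String]]
  // Number of times the real backend was actually called
  var backendCalls = 0

  def query(sql: String): Seq[String] =
    cassette.get(sql) match {
      case Some(recorded) =>
        recorded                       // replay: no backend needed
      case None =>
        backendCalls += 1
        val result = runOnBackend(sql) // record: run once and remember the response
        cassette(sql) = result
        result
    }
}

object VcrExample {
  def main(args: Array[String]): Unit = {
    // A stand-in for a real Presto call
    val vcr    = new QueryRecorder(sql => Seq(s"result of: $sql"))
    val first  = vcr.query("select 1")
    val second = vcr.query("select 1")
    println(s"same result: ${first == second}, backend calls: ${vcr.backendCalls}")
  }
}
```

Persisting the recorded responses to a file is what lets later CI runs replay them without launching Presto at all.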

20. Airframe Config • YAML is useful for configuring applications • Embedding YAML configurations inside Docker images • Provide credentials separately • password, API keys, instance-specific parameters, etc. • via properties files, environment variables, etc. • YAML + overrides + object mapping • http://wvlet.org/airframe/docs/config.html
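
The "defaults + overrides + object mapping" flow can be sketched without any library. The code below is a hypothetical illustration and does not use Airframe Config's API: plain maps stand in for the YAML file and for the override sources, and the key names are invented.

```scala
// Hypothetical sketch of "defaults + overrides + object mapping"; not Airframe Config's API.
final case class ServerConfig(host: String, port: Int, apiKey: String)

object ConfigExample {
  // Values that would come from a YAML file embedded in the Docker image
  val defaults: Map[String, String] =
    Map("server.host" -> "localhost", "server.port" -> "8080", "server.apiKey" -> "")

  // Later sources win, e.g. a properties file or environment variables
  def merge(sources: Map[String, String]*): Map[String, String] =
    sources.foldLeft(Map.empty[String, String])(_ ++ _)

  // Object mapping: turn the merged key-value pairs into a typed object
  def toServerConfig(m: Map[String, String]): ServerConfig =
    ServerConfig(m("server.host"), m("server.port").toInt, m("server.apiKey"))

  def main(args: Array[String]): Unit = {
    // Credentials and instance-specific values are supplied outside the image
    val overrides = Map("server.apiKey" -> "secret-from-env", "server.port" -> "9090")
    println(toServerConfig(merge(defaults, overrides)))
  }
}
```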

21. Airframe Internal: Surface • Surface: Object surface (shape) inspector library • https://github.com/wvlet/airframe/tree/master/surface • case class A(id:Int, name:String) • surface.of[A] • => Surface("A", Seq(Param("id", surface.of[Int]), Param("name", surface.of[String]))) • Extracts object type parameters with Scala Runtime Reflection • Scala generates this type information at compile time • Used as Type Identifiers of Airframe and Airframe Config • e.g., [A], [Seq[B]], [Map[Int, String]], [A @@ Tag], etc. • Generating serializers/deserializers of Scala classes • Surface => Serialize object parameters => Encoding in MessagePack.gz => Embulk

22. td-spark • Access TD from Spark • Binding components with Airframe • IO Manager, Presto Client, etc. • Passing Design through SparkContext • Integration • TD -> Spark DataFrame • TD Presto Query -> DataFrame

23. Data Analytics with Scala

24. New Directions Explored By Presto • Traditional Database Usage • Required a Database Administrator (DBA) • DBA designs the schema and queries • DBA tunes query performance • After Presto • Schema is designed by data providers • 1st-party data (user's customer data) • 3rd-party data sources • Analysts or marketers explore the data with Presto • They don't know the schema in advance • Many analytical SQL queries

25. Bridging Gaps Between SQL and Programming Language • Traditional Approach • O/R mapper: app developers design objects and schemas, then generate SQL • New Approach: SQL First • Need to manage various SQL results inside a programming language • But how?

26. An Instinct

27. sbt-sql: https://github.com/xerial/sbt-sql • sbt plugin for generating Scala model classes from SQL files • src/main/sql/presto/*.sql (Presto queries) • Using SQL as a function • Read Presto SQL results as objects • Enables managing SQL queries in GitHub • Type-safe data analysis

28. Scala in Production

29. Packaging • Do you need to install Scala? • No. Only a JDK is required • sbt-pack • https://github.com/xerial/sbt-pack • Creates Scala code packages for release • In the ./target/pack folder • Folder structure: • bin/ - launch scripts • lib/ - Scala/Java libraries • Makes it easier to create Docker images • Also used for creating distributable packages of td-spark
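
A minimal sbt-pack setup might look like the fragment below. It is a sketch based on recent sbt-pack releases rather than on the version current at the time of the talk, so the settings may differ for older versions; the program name `hello` and main class `example.Hello` are hypothetical, and `<version>` is a placeholder for a released plugin version.

```scala
// project/plugins.sbt (replace <version> with a released sbt-pack version)
addSbtPlugin("org.xerial.sbt" % "sbt-pack" % "<version>")

// build.sbt
enablePlugins(PackPlugin)
// Hypothetical program name and main class; a launch script is generated under bin/
packMain := Map("hello" -> "example.Hello")
```

Running `sbt pack` then produces the bin/ and lib/ folders under ./target/pack described above.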

30. Deploying to Maven Central • Necessary Steps • Upload artifacts -> Close -> Release -> Drop • Painful • Need to log in to the Nexus Web UI • Many manual steps • Bintray? • Uploading to Bintray -> Automatic sync to Maven Central

31. sbt-sonatype plugin • Enables one-command release to Maven Central • Uses REST APIs of Sonatype Nexus Repository Manager • Developed during the 2015 New Year holiday • Jan 5: Test Nexus REST API • Jan 20: First release (just 1 day of effort) • Released sbt-sonatype using sbt-sonatype • 2,000+ projects are using sbt-sonatype • Supports sbt 0.13.x and 1.0.0 • Can be used for Java projects too • Nexus to Maven Central sync is now fast • Less than 10 minutes (June 2017)

32. Summary • TD is a heavy user of Scala • Analytics pipelines • Production services • Many libraries helping development • Airframe, wvlet-log • sbt plugins • For details about Presto analysis • Join the Presto Meetup on Thursday! • Presto Meetup Tokyo: June 15, 2017 (Thu)
