Your SlideShare is downloading. ×
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×

Saving this for later?

Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime - even offline.

Text the download link to your phone

Standard text messaging rates apply

BigQuery JavaScript User-Defined Functions by THOMAS PARK and FELIPE HOFFA at Big Data Spain 2014

355
views

Published on

Performing ETL on big data can be slow, expensive and painful - but it doesn't have to be! In this session, we'll take an in-depth look at several real-world examples of computations that don't fit …

Performing ETL on big data can be slow, expensive and painful - but it doesn't have to be! In this session, we'll take an in-depth look at several real-world examples of computations that don't fit well with the SQL language model and how to solve them with user-defined functions in Google BigQuery.

Session presented at Big Data Spain 2014 Conference
18th Nov 2014
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/2014/conference/hands-on-with-bigquery-javascript-user-defined-functions

Published in: Technology

0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
355
On Slideshare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
2
Comments
0
Likes
2
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. HANDS ON WITH BIGQUERY JAVASCRIPT UDFS THOMAS PARK SOFTWARE ENGINEER - GOOGLE
  • 2. Hands-on with BigQuery JavaScript User-Defined Functions Thomas Park Software Engineer - Google Felipe Hoffa @felipehoffa Developer Advocate - Google
  • 3. Agenda Background Example: Cross-row intervals Under the hood Example: Codebreaking I. II. III. IV.
  • 4. Agenda Background Example: Cross-row intervals Under the hood Example: Codebreaking I. II. III. IV.
  • 5. What is BigQuery?
  • 6. BigQuery: Big Data Analytics in the Cloud Unrivaled Performance and Scale ● Scan multiple TB’s in seconds ● Interactive query performance ● No limits on amount of data Ease of Use and Adoption ● No administration / provisioning ● Convenience of SQL ● Open interfaces (REST, WebUI, ODBC) ● First 1 TB of data processed per month is free Advanced “Big Data” Storage ● Familiar database structure ● Easy data management and ACL’s ● Fast, atomic imports
  • 7. Google confidential │ Do not distribute How many pageviews does Wikipedia have in a month? SELECT COUNT(*)FROM [fh-bigquery:wikipedia.wikipedia_views_201308] https://bigquery.cloud.google.com/table/fh-bigquery:wikipedia.pagecounts_20140602_18
  • 8. Google confidential │ Do not distribute $500 in Cloud Platform credit to launch your idea! Build. Store. Analyze. On the same infrastructure that powers Google Start building Click ‘Apply Now’ and complete the application with promo code: bigdata-spain Starter Pack Offer Description 1 2 3 Go to cloud.google.com/developers/starterpack
  • 9. Agenda Background Example: Cross-row intervals Under the hood Example: Codebreaking I. II. III. IV.
  • 10. Images by Connie Zhou Scenario: Door access records from a Very Well-Secured Lab where users must badge in to enter or leave Image by Tod Kurt
  • 11. Images by Connie Zhou Example: Time-series analysis from discrete user action data Image by Tod Kurt
  • 12. user_id timestamp Beep!! 9h: arrive @ lab thomas 2014.07.15 09:00 Beep!! 10h: leave to pick up prototype thomas 2014.07.15 10:00 Beep!! 10h15: return with prototype thomas 2014.07.15 10:15 Beep!! 12h: out for lunch thomas 2014.07.15 12:00
  • 13. How can we find out how much time each user spent in the lab? ...where each scan of the user’s access card is represented as a discrete row?
  • 14. rownum user_id timestamp 1 thomas 2014.07.15 09:00 2 thomas 2014.07.15 10:00 3 thomas 2014.07.15 10:15 4 thomas 2014.07.15 12:00 60 minutes 105 minutes Our analysis with data in this format via SQL is horrid and painful A BigQuery + JS friendly format: data for each user in separate rows user_id timestamps thomas [ 09:00, 10:00, 10:15, 12:00, ... ] hoffa [ 08:10, 11:30, 12:00, 12:15, ... ] SELECT user_id, NEST(timestamp) AS timestamps FROM T GROUP BY user_id; Producing this format is trivial in BigQuery...
  • 15. // This function will be called once for each user, // and receive an array of timestamps. function(record, emit) { var total_time = 0; // Order of records built by NEST are not guaranteed! // Sort to guarantee ascending timestamps. var ts = record.ts.sort( function (a, b) {return a > b;}); // Loop over timestamp pairs, calculate interval. for (var i = 0; i < ts.length - 1; i += 2) { total_time += (ts[i+1] - ts[i]); } // Emit total time for this user. emit({user: record.user_id, total_time: total_time}); } JS: Total time for each user
  • 16. SELECT * FROM js( // Input table or query. [secret-lab:door_scans.201411] // Input columns. user_id, timestamps, // Output schema. "[{name: 'user_id', type:'string'}, {name: 'tot_time', type:'integer'}]", // The function. "function(r, emit) { ... emit(...); }")
  • 17. SELECT * FROM js( // Input table or query. [secret-lab:door_scans.201411] // Input columns. user_id, timestamps, // Output schema. "[{name: 'user_id', type:'string'}, {name: 'tot_time', type:'integer'}]", // The function. "function(r, emit) { ... emit(...); }") The JS function
  • 18. SELECT * FROM js( // Input table or query. [secret-lab:door_scans.201411] // Input columns. user_id, timestamps, // Output schema. "[{name: 'user_id', type:'string'}, {name: 'tot_time', type:'integer'}]", // The function. "function(r, emit) { ... emit(...); }") Input schema (column names only!)
  • 19. SELECT * FROM js( // Input table or query. [secret-lab:door_scans.201411] // Input columns. user_id, timestamps, // Output schema. "[{name: 'user_id', type:'string'}, {name: 'tot_time', type:'integer'}]", // The function. "function(r, emit) { ... emit(...); }") Output schema (full declaration)
  • 20. SELECT * FROM js( // Input table or query. [secret-lab:door_scans.201411] // Input columns. user_id, timestamps, // Output schema. "[{name: 'user_id', type:'string'}, {name: 'tot_time', type:'integer'}]", // The function. "function(r, emit) { ... emit(...); }") Input table or subquery
  • 21. SELECT * FROM js( // Input table or query. [secret-lab:door_scans.201411] // Input columns. user_id, timestamps, // Output schema. "[{name: 'user_id', type:'string'}, {name: 'tot_time', type:'integer'}]", // The function. "function(r, emit) { ... emit(...); }")
  • 22. Agenda Background Example: Cross-row intervals Under the hood Example: Codebreaking I. II. III. IV.
  • 23. How BigQuery works Get data from lower levels, filter / join / transform, send rows up Tree Structured Query Dispatch and Aggregation Distributed Storage SELECT title, requests Leaf Leaf Leaf Leaf SUM(requests) GROUP BY title WHERE REGEX_MATCH(title, 'pat.*rn') Mixer 1 Mixer 1 SUM(requests) GROUP BY title Mixer 0 LIMIT 10 ORDER BY c DESC SUM(requests) GROUP BY title
  • 24. Data for each row is calculated and streamed through a “Row Iterator” Subquery0 Subquery1 JOIN Row Iterator 0 Row Iterator 1 Row Iterator 2
  • 25. Can insert JavaScript Functions wherever we have a Row Iterator Subquery0 Subquery1 JOIN Row Iterator 0 Row Iterator 1 Row Iterator 2 UDF1 UDF0
  • 26. Join order item info with web hits info SELECT item FROM orders SELECT query string FROM hits JOIN Row Iterator 0 Row Iterator 1 Row Iterator 2 UDF1 UDF0
  • 27. http://www.store.com/?q=7%2e1+Speakers SELECT item FROM orders SELECT query string FROM hits JOIN Row Iterator 0 Row Iterator 1 Row Iterator 2 UDF1 UDF0
  • 28. http://www.store.com/?q=7%2e1+Speakers Extract and decode query term => “7.1 Speakers” SELECT item FROM orders SELECT query string FROM hits JOIN Row Iterator 0 Row Iterator 1 Row Iterator 2 UDF1 UDF0
  • 29. UDF execution Subquery0 Subquery1 JOIN UDF1 Process boundary UDF0UDF0 User Code
  • 30. Agenda Background Example: Cross-row intervals Under the hood Example: Codebreaking I. II. III. IV.
  • 31. Demos: Ñ
  • 32. Image: El Hormiguero (Flickr CC)
  • 33. http://jsfiddle.net/fhoffa/y4pt9s23/ Image: TheVanCats (Flickr CC)
  • 34. Questions? News: reddit.com/r/bigquery Ask: stackoverflow.com Share: bigqueri.es Thomas Park Felipe Hoffa @felipehoffa +FelipeHoffa Rate us? http://goo.gl/k3bzdw
  • 35. Backup slides / screenshots
  • 36. 17TH ~ 18th NOV 2014 MADRID (SPAIN)