BigQuery supports user-defined functions (UDFs) written in JavaScript. A UDF is similar to the "Map" function in a MapReduce: it takes a single row as input and produces zero or more rows as output. The output can potentially have a different schema than the input.
Contents
- UDF example
- UDF structure
- UDFs and the web UI
- UDFs and the BigQuery API
- Best practices
- Limits
- Limitations
UDF example
// UDF definition function urlDecode(row, emit) { emit({title: decodeHelper(row.title), requests: row.num_requests}); } // Helper function with error handling function decodeHelper(s) { try { return decodeURI(s); } catch (ex) { return s; } } // UDF registration bigquery.defineFunction( 'urlDecode', // Name used to call the function from SQL ['title', 'num_requests'], // Input column names // JSON representation of the output schema [{name: 'title', type: 'string'}, {name: 'requests', type: 'integer'}], urlDecode // The function reference );
UDF structure
function name(row, emit) { emit(<output data>); }
BigQuery UDFs operate on individual rows of a table or subselect query results. The UDF has two formal parameters:
row
: an input row.emit
: a hook used by BigQuery to collect output data. Theemit
function takes one parameter: a JavaScript object that represents a single row of output data. Theemit
function can be called more than once, such as in a loop, to output multiple rows of data.
The following code example shows a basic UDF.
function urlDecode(row, emit) { emit({title: decodeURI(row.title), requests: row.num_requests}); }
Registering the UDF
You must register a name for your function so that it can be invoked from BigQuery SQL. The registered name doesn't have to match the name you used for your function in JavaScript.
bigquery.defineFunction( '<UDF name>', // Name used to call the function from SQL ['<col1>', '<col2>'], // Input column names // JSON representation of the output schema [<output schema>], // UDF definition or reference <UDF definition or reference> );
The input column names must match the names (or aliases, if applicable) of the columns in the input table or subquery.
Output schema
You must provide BigQuery with the schema or structure of the records your UDF produces, represented as JSON. The schema can contain any supported BigQuery data types, including nested records.
The following code example shows the syntax for records in the output schema. Each output field requires a name
and type
attribute. Nested fields must also contain a fields
attribute.
[{name: 'foo_bar', type: 'record', fields: [{name: 'a', type: 'string'}, {name: 'b', type: 'integer'}, {name: 'c', type: 'boolean'}] }]
Rows passed to the emit()
function must match the data types of the output schema. Fields represented in the output schema that are omitted in the emit function will output as nulls.
UDF definition or reference
If you prefer, you can define the UDF inline in bigquery.defineFunction
. For example:
bigquery.defineFunction( 'urlDecode', // Name used to call the function from SQL ['title', 'num_requests'], // Input column names // JSON representation of the output schema [{name: 'title', type: 'string'}, {name: 'requests', type: 'integer'}], // The UDF function(row, emit) { title: decodeURI(row.title), requests: row.num_requests }); );
Otherwise, you can define the UDF separately, and pass a reference to the function in bigquery.defineFunction
. For example:
// The UDF function urlDecode(row, emit) { emit({title: decodeURI(row.title), requests: row.num_requests}); } // UDF registration bigquery.defineFunction( 'urlDecode', // Name used to call the function from SQL ['title', 'num_requests'], // Input column names // JSON representation of the output schema [{name: 'title', type: 'string'}, {name: 'requests', type: 'integer'}], urlDecode // The function reference );
Error handling
If an exception or error is thrown during the processing of a UDF, the entire query will fail. You can use a try-catch block to handle errors. For example:
// The UDF function urlDecode(row, emit) { emit({title: decodeHelper(row.title), requests: row.num_requests}); } // Helper function with error handling function decodeHelper(s) { try { return decodeURI(s); } catch (ex) { return s; } } // UDF registration bigquery.defineFunction( 'urlDecode', // Name used to call the function from SQL ['title', 'num_requests'], // Input column names // JSON representation of the output schema [{name: 'title', type: 'string'}, {name: 'requests', type: 'integer'}], urlDecode // The function reference );
UDFs and the web UI
You can use the BigQuery web UI to add UDFs and use them when running queries.
Prerequisites
To use the BigQuery web UI, your account needs to have access to a BigQuery-enabled project in the Google Developers Console.
-
If you haven't used the Developers Console before, go to the console, accept the terms of service, and create a new project.
Adding the UDF
Click the COMPOSE QUERY button.
-
Click the UDF Editor tab to add the UDF.
-
Copy and paste the following code into the UDF Editor text area:
// The UDF function urlDecode(row, emit) { emit({title: decodeHelper(row.title), requests: row.num_requests}); } // Helper function for error handling function decodeHelper(s) { try { return decodeURI(s); } catch (ex) { return s; } } // UDF registration bigquery.defineFunction( 'urlDecode', // Name used to call the function from SQL ['title', 'num_requests'], // Input column names // JSON representation of the output schema [{name: 'title', type: 'string'}, {name: 'requests', type: 'integer'}], urlDecode // The function reference );
Running a query in the web UI
-
Click the Query Editor tab.
-
Copy and paste the following query into the Query Editor text area.
SELECT requests, title FROM urlDecode( SELECT title, sum(requests) AS num_requests FROM [fh-bigquery:wikipedia.pagecounts_201504] WHERE language = 'fr' GROUP EACH BY title ) WHERE title LIKE '%ç%' ORDER BY requests DESC LIMIT 100
The above query looks for the most visited French Wikipedia articles from April 2015 that contain a cédille character (ç) in the title.
-
Click the RUN QUERY button. The query results display underneath the buttons.
Referencing code from Google Cloud Storage
In the above example, you added the UDF directly into the BigQuery web UI. Alternately, you can store all or part of your JavaScript code in Google Cloud Storage. Regardless of where the UDF code is located, each UDF must be registered using a bigquery.defineFunction
invocation in your code. The bigquery.definefunction
invocation can be provided in the web UI or in a remote code resources. Remote source files must have a ".js" extension.
For example, you could maintain third-party libraries, your own UDF code, and UDF registration function calls in separate files. These files would be loaded as separate external resource in your query.
Referencing an external UDF in the web UI
-
Click the Show Options button underneath the text area.
-
Click the Edit button next to the UDF Source URIs heading.
-
For each remote source file, click the Add UDF Source URI button and input the Google Cloud Storage URI.
If you'd like to try the previous URL-decoding example using an external UDF, paste
bigquery-sandbox-udf/url_decode.js
into the URI field. After completing these steps, ensure that the contents of the UDF Editor are removed. -
Click the OK button.
You can then switch to the Query Editor tab and follow the same steps as the above example to use the UDF in a query.
In general, when using external UDFs, you can also add additional JavaScript code in the UDF Editor text area, as long as the additional UDFs are registered in a defineFunction
block in the web UI or in an external file.
UDFs and the BigQuery API
configuration.query
Queries that use UDFs must contain userDefinedFunctionResources
elements that provide the code, or locations to code resources, to be used in the query. The supplied code must include registration function invocations for any UDFs referenced by the query.
Code resources
Your query configuration may include JavaScript code blobs, as well as references to JavaScript source files stored in Google Cloud Storage.
Inline JavaScript code blobs are populated in the inlineCode
section of a userDefinedFunctionResource
element. However, code that will be reused or referenced across multiple queries should be persisted in Google Cloud Storage and referenced as an external resource.
To reference a JavaScript source file from Google Cloud Storage, set the resourceURI
section of the userDefinedFunctionResource
element to the file's gs://
URI.
The query configuration may contain multiple userDefinedFunctionResource
elements. Each element may contain either an inlineCode
or a resourceUri
section.
Example
The following JSON example illustrates a query request that references two UDF resources: one blob of inline code, and one file lib.js
to be read from Google Cloud Storage. In this example, myFunc
and the registration invocation for myFunc
are provided by lib.js
.
{ "configuration": { "query": { "userDefinedFunctionResources": [ { "inlineCode": "var someCode = 'here';" }, { "resourceUri": "gs://some-bucket/js/lib.js" } ], "query": "select a from myFunc(T);" } } }
Best practices
Developing your UDF
You can use our UDF test tool to test and debug your UDF without running up your BigQuery bill.
Pre-filter your input
If your input can be easily filtered down before being passed to a UDF, your query will likely be faster and cheaper.
In the running a query example, a subquery is passed as the input to urlDecode
, instead of a full table. The [fh-bigquery:wikipedia.pagecounts_201504]
table has approximately 5.6 billion rows, and if we ran the UDF on the entire table, the JavaScript framework would need to process over 21 times more rows than it would with the filtered subquery.
Avoid persistant mutable state
Do not store or access mutable state across UDF calls. The following code example describes this scenario:
// myCode.js var numRows = 0; function dontDoThis(r, emit) { emit(rowCount: ++numRows); } // The query. SELECT max(rowCount) FROM dontDoThis(t);
The above example will not behave as expected, because BigQuery shards your query across many nodes. Each node has a standalone JavaScript processing environment that accumulates separate values for numRows
.
Use memory efficiently
The JavaScript processing environment has limited memory available per query. UDF queries that accumulate too much local state may fail due to memory exhaustion.
Expand select queries
You must explicitly list the columns being selected from a UDF. SELECT * FROM <UDF name>(...)
isn't supported.
Limits
- The amount of data that your UDF outputs when processing a single row should be approxiately 5 Mb or less.
- Each user is limited to running approximately 3 UDF queries in a specific project at the same time. If you receive an error that you're over the concurrent query limit, wait a few minutes and try again.
Limitations
- The DOM objects
Window
,Document
andNode
, and functions that require them, are unsupported. - JavaScript functions that rely on native code are unsupported.
- Bitwise operations in JavaScript handle only the most significant 32 bits.