S3 Select and Glacier Select – Retrieving Subsets of Objects

Amazon Simple Storage Service (S3) stores data for millions of applications used by market leaders in every industry. Many of these customers also use Amazon Glacier for secure, durable, and extremely low-cost archival storage. With S3, I can store as many objects as I want and individual objects can be as large as 5 terabytes. Data in object storage have traditionally been accessed as a whole entities, meaning when you ask for a 5 gigabyte object you get all 5 gigabytes. It’s the nature of object storage. Today we’re challenging that paradigm by announcing two new capabilities for S3 and Glacier that allow you to use simple SQL expressions to pull out only the bytes you need from those objects. This fundamentally enhances virtually every application that accesses objects in S3 or Glacier.

S3 Select

S3 Select, launching in preview, enables applications to retrieve only a subset of data from an object by using simple SQL expressions. By using S3 Select to retrieve only the data needed by your application, you can achieve drastic performance increases – in many cases you can get as much as a 400% improvement.

As an example let’s imagine you’re a developer at a large retailer and you need to analyze the weekly sales data from a single store, but the data for all 200 stores is saved in a new GZIP-ed CSV every day. Without S3 Select, you would need to download, decompress and process the entire CSV to get the data you needed. With S3 Select, you can use a simple SQL expression to return only the data from the store you’re interested in, instead of retrieving the entire object. This means you’re dealing with an order of magnitude less data which improves the performance of your underlying applications.

Let’s look at a quick Python example.

Python

import boto3
from s3select import ResponseHandler

class PrintingResponseHandler(ResponseHandler):
    def handle_records(self, record_data):
        print(record_data.decode('utf-8'))

handler = PrintingResponseHandler()
s3 = boto3.client('s3')
response = s3.select_object_content(
    Bucket="super-secret-reinvent-stuff",
    Key="stuff.csv",
    SelectRequest={
        'ExpressionType': 'SQL',
     'Expression': 'SELECT s._1 FROM S3Object AS s'',
        'InputSerialization': {
            'CompressionType': 'NONE',
            'CSV': {
                'FileHeaderInfo': 'IGNORE',
                'RecordDelimiter': '\n',
                'FieldDelimiter': ',',
            }
        },
        'OutputSerialization': {
            'CSV': {
                'RecordDelimiter': '\n',
                'FieldDelimiter': ',',
            }
        }
    }
)
handler.handle_response(response['Body'])

Pretty cool! To enable this behavior S3 Select uses a binary wire-protocol to return the objects. For now, that requires the use of a small additional library to help with deserialization.

We expect customers to use S3 Select to accelerate all sorts of applications. For example, this partial data retrieval ability is especially useful for serverless applications built with AWS Lambda. When we modified the Serverless MapReduce reference architecture to retrieve only the data needed using S3 Select we saw a 2X improvement in performance and an 80% reduction in cost.

The S3 Select team also created a Presto connector which can provide an immediate performance boost for Amazon EMR with no changes to your queries. We put this connector to test by running a complex query that filtered almost 99% of the data retrieved from S3. With S3 Select disabled, Presto had to scan and filter entire objects from S3 but with S3 Select enabled, Presto leveraged S3 Select to retrieve only the data needed for the query.

Bash

[hadoop@ip-172-31-19-123 ~]$ time presto-cli --catalog hive --schema default --session hive.s3_optimized_select_enabled=false -f query.sql
"31.965496","127178","5976","70.89902","130147","6996","37.17715","138092","8678","135.49536","103926","11446","82.35177","116816","8484","67.308304","135811","10104"
 
real  0m35.910s
user  0m2.320s
sys   0m0.124s
[hadoop@ip-172-31-19-123 ~]$ time presto-cli --catalog hive --schema default --session hive.s3_optimized_select_enabled=true -f query.sql
"31.965496","127178","5976","70.89902","130147","6996","37.17715","138092","8678","135.49536","103926","11446","82.35177","116816","8484","67.308304","135811","10104"
 
real  0m6.566s
user  0m2.136s
sys   0m0.088s

That query took 35.9 seconds without S3 Select and only 6.5 seconds with S3 Select. That’s 5X faster!

Things To Know

While in preview S3 Select supports CSV or JSON files with or without GZIP compression. During the preview objects that are encrypted at rest are not supported.
There are no charges for S3 Select while in preview.
Amazon Athena, Amazon Redshift, and Amazon EMR as well as partners like Cloudera, DataBricks, and Hortonworks will all support S3 Select.

Glacier Select

Some companies in highly regulated industries like Financial Services, Healthcare, and others, write data directly to Amazon Glacier to satisfy compliance needs like SEC Rule 17a-4 or HIPAA. Many S3 users have lifecycle policies designed to save on storage costs by moving their data into Glacier when they no longer need to access it on a regular basis. Most legacy archival solutions, like on premise tape libraries, have highly restricted data retrieval throughput and are unsuitable for rapid analytics or processing. If you want to make use of data stored on one of those tapes you might have to wait for weeks to get useful results. In contrast, cold data stored in Glacier can now be easily queried within minutes.

This unlocks a lot of exciting new business value for your archived data. Glacier Select allows you to to perform filtering directly against a Glacier object using standard SQL statements.

Glacier Select works just like any other retrieval job except it has an additional set of parameters you can pass in initiate job request. SelectParameters

Here a quick example:

Python

import boto3
glacier = boto3.client("glacier")

jobParameters = {
    "Type": "select", "ArchiveId": "ID",
    "Tier": "Expedited",
    "SelectParameters": {
        "InputSerialization": {"csv": {}},
        "ExpressionType": "SQL",
        "Expression": "SELECT * FROM archive WHERE _5='498960'",
        "OutputSerialization": {
            "csv": {}
        }
    },
    "OutputLocation": {
        "S3": {"BucketName": "glacier-select-output", "Prefix": "1"}
    }
}

glacier.initiate_job(vaultName="reInventSecrets", jobParameters=jobParameters)

Things To Know

Glacier Select is generally available in all commercial regions that have Glacier.

Glacier is priced in 3 dimensions.

GB of Data Scanned
GB of Data Returned
Select Requests

Pricing for each dimension is determined by the speed at which you want your results returned: expedited (1-5 minutes), standard (3-5 hours), and bulk (5-12 hours).

Soon, in 2018, Athena will be integrated with Glacier using Glacier Select.

I hope you’re able to get started enhancing your applications or building new ones with these capabilities. You can apply for this preview at the S3 Select Preview Application Page.

– Randall

View Comments

Related Posts

S3 Select and Glacier Select – Retrieving Subsets of Objects

S3 Select

Things To Know

Glacier Select

Things To Know