Best Practices for Using Amazon S3

Amazon S3 is storage for the Internet, designed to make web-scale computing easier for developers. While Amazon strives to make use of the service as straightforward as possible, observing some easy-to-implement best practices can dramatically improve your experience.

Details

Submitted By: Dan@AWS
AWS Products Used: Amazon S3
Created On: November 26, 2008 8:27 PM GMT
Last Updated: May 20, 2009 9:40 PM GMT

By the Amazon Simple Storage Service Team

Setting Up Your Account

When using Amazon S3 for a production system, it's a good idea not to use an individual's email address to set up the account. Many organizations create an email alias specifically for their use of Amazon Web Services and can then manage who receives mail sent to that alias when responsibility for the AWS account shifts from one individual to another. It's also often a good idea to set up separate development and production accounts to isolate the production system from development work.

Choosing a Development Library

The sample client libraries published by Amazon were designed for ease of code perusal, not for production use. There are great, production-grade libraries available in the community for most major platforms. Use them.

Some best-of-breed libraries:

Choosing Tools

Make development easier by taking advantage of some of the great tools available for use with S3.

For a list of reviewed tools, see the Amazon S3 Solutions Catalog.

Choosing your Bucket Location

Data stored in any given Amazon S3 bucket is replicated across multiple datacenters in a geographical region. Because response latencies grow when requests have to travel long distances over the internet, you should consider the region you place your objects in.

The service API lets you choose explicitly whether to place your bucket in the US or the EU. For best performance, choose the region closer to your most latency-sensitive customers.
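For illustration, here is a minimal sketch using the boto3 Python library (which post-dates this article); the bucket name is a placeholder and the modern region identifier eu-west-1 is used for the EU:

  import boto3

  # Credentials are read from the environment or AWS config files.
  s3 = boto3.client("s3", region_name="eu-west-1")

  # Omitting CreateBucketConfiguration creates the bucket in the US default region.
  s3.create_bucket(
      Bucket="my-example-bucket",  # placeholder name
      CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
  )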

Naming Buckets and Keys

Though buckets can be named with any alphanumeric character, following some simple naming rules will ensure that you can reference your bucket using the convention <bucketname>.s3.amazonaws.com.

  1. Use 3 to 63 characters.
  2. Use only lower case letters (at least one), numbers, '.' and '-'.
  3. Don't start or end the bucket name with '.' and don't follow or precede a '.' with a '-'.

Keys can be named with any properly encoded UTF-8 character. Literal '+' characters should always be URL encoded.
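As a rough illustration of these rules (not the service's authoritative validation), the sketch below checks a bucket name against the three points above and URL-encodes a literal '+' in a key, using only the Python standard library:

  import re
  from urllib.parse import quote

  # 3-63 characters, only lower-case letters (at least one), digits, '.' and '-';
  # no leading or trailing '.', and no '.' adjacent to a '-'.
  BUCKET_RE = re.compile(r"^(?=.{3,63}$)(?=.*[a-z])[a-z0-9-](?:[a-z0-9.-]*[a-z0-9-])?$")

  def follows_naming_rules(name: str) -> bool:
      return bool(BUCKET_RE.match(name)) and ".-" not in name and "-." not in name

  # Literal '+' characters in keys should be URL encoded ('%2B').
  encoded_key = quote("reports/2009+q1-results.csv", safe="/")

  print(follows_naming_rules("my-bucket.example"), encoded_key)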

Protecting Your Data

Your "secret key" is crucial to the security of your account. Divulging it could expose data in your account to being read or changed by others. As such, don't embed your secret key in a web page or other publicly accessible source code. Also, don't transmit it over insecure channels.

It's also generally a good idea to encrypt highly sensitive data.

Ensuring Data Integrity

Data being sent to or retrieved from S3 must often pass over many miles of network and through many network devices. Though unlikely for any given request, data loss or corruption in transit does occasionally occur. Fortunately, Amazon S3 provides a mechanism to detect this and retransmit the data.

Amazon S3's REST PUT operation provides the ability to specify an MD5 checksum (http://en.wikipedia.org/wiki/Checksum) for the data being sent to S3. When the request arrives at S3, an MD5 checksum is recalculated for the object data received and compared to the provided MD5 checksum. If there's a mismatch, the PUT fails, preventing data that was corrupted on the wire from being written to S3. At that point, you can retry the PUT.
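A minimal sketch of this check, assuming the boto3 library and placeholder bucket, key, and payload:

  import base64
  import hashlib

  import boto3

  s3 = boto3.client("s3")
  body = b"hello, s3"  # placeholder payload

  # The Content-MD5 header carries the base64 encoding of the binary MD5 digest.
  content_md5 = base64.b64encode(hashlib.md5(body).digest()).decode("ascii")

  # If the data is corrupted in transit, S3 detects the mismatch and fails the
  # PUT, at which point the request can simply be retried.
  s3.put_object(
      Bucket="my-example-bucket",  # placeholder
      Key="example/object.txt",    # placeholder
      Body=body,
      ContentMD5=content_md5,
  )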

MD5 checksums are also returned in the response to REST GET requests and may be used client-side to ensure that the data returned by the GET wasn't corrupted in transit. If you need to ensure that values returned by a GET request are byte-for-byte what was stored in the service, calculate the returned value's MD5 checksum and compare it to the checksum returned along with the value by the service.
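On the retrieval side, a short sketch of that comparison; it assumes boto3 and reads the checksum from the ETag header, which matches the hex MD5 for objects stored with a single PUT:

  import hashlib

  import boto3

  s3 = boto3.client("s3")
  resp = s3.get_object(Bucket="my-example-bucket", Key="example/object.txt")  # placeholders
  data = resp["Body"].read()

  # For objects stored with a single PUT, the ETag is the hex MD5 of the data.
  if hashlib.md5(data).hexdigest() != resp["ETag"].strip('"'):
      raise IOError("GET response corrupted in transit; retry the request")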

Handling Errors

Any code you write to call Amazon S3 APIs should expect to receive and handle errors from the service. A given error may be returned for multiple reasons, so it's a good idea to look at (and handle in code) the error messages returned along with the error number.

400-series errors indicate that you cannot perform the requested action, the most common reasons being that you don't have permission or that a referred-to entity doesn't exist. Another occasional cause of 400-series errors is that the client machine's clock isn't set properly, which will result in a 403 "RequestTimeTooSkewed" error.

500-series errors indicate that a request didn't succeed, but may be retried. Though infrequent, these errors are to be expected as part of normal interaction with the service and should be explicitly handled with an exponential backoff algorithm (ideally one that utilizes jitter). One such algorithm can be found at http://en.wikipedia.org/wiki/Truncated_binary_exponential_backoff.

Particularly if you suddenly begin executing hundreds of PUTs per second into a single bucket, you may find that some requests return a 503 "Slow Down" error while the service works to repartition the load. As with all 500-series errors, these should be handled with exponential backoff.

Failed connection attempts should also be handled with exponential backoff.
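The sketch below, assuming boto3 (which also performs some retries internally), retries 500-series errors with exponential backoff and full jitter; the attempt limit and sleep cap are illustrative:

  import random
  import time

  import boto3
  import botocore.exceptions

  s3 = boto3.client("s3")

  def put_with_backoff(bucket, key, body, max_attempts=8):
      """Retry 500-series errors (including 503 Slow Down) with backoff and jitter."""
      for attempt in range(max_attempts):
          try:
              return s3.put_object(Bucket=bucket, Key=key, Body=body)
          except botocore.exceptions.ClientError as err:
              status = err.response["ResponseMetadata"]["HTTPStatusCode"]
              if status < 500 or attempt == max_attempts - 1:
                  raise  # don't retry 400-series errors; give up after the last attempt
              # Full jitter: sleep a random amount up to an exponentially growing cap.
              time.sleep(random.uniform(0, min(30, 2 ** attempt)))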

Improving PUT and GET Throughput

Amazon S3 evolves its storage and load partitioning automatically over time to improve your performance and distribute pressure across the system. There are, however, a number of strategies you can employ that may have a significant impact on your throughput with the service when making a high volume of PUT and GET requests.

Performing PUTs against a particular bucket in alphanumerically increasing order by key name can reduce the total response time of each individual call. Performing GETs in any sorted order can have a similar effect. The smaller the objects, the more significantly this will likely impact overall throughput.

When executing many requests from a single client, use multi-threading to enable concurrent request execution.
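For example, a sketch using a thread pool to issue GETs concurrently; the bucket, keys, and worker count are placeholders, and boto3 clients are generally safe to share across threads:

  from concurrent.futures import ThreadPoolExecutor

  import boto3

  s3 = boto3.client("s3")
  bucket = "my-example-bucket"                    # placeholder
  keys = ["logs/a.gz", "logs/b.gz", "logs/c.gz"]  # placeholder keys

  def fetch(key):
      return key, s3.get_object(Bucket=bucket, Key=key)["Body"].read()

  # Issue several GETs at once instead of one at a time.
  with ThreadPoolExecutor(max_workers=10) as pool:
      for key, data in pool.map(fetch, keys):
          print(key, len(data))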

Consider prefacing keys with a hash utilizing a small set of characters. Decimal hashes work nicely.
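One way to do this (the fan-out width and helper below are illustrative) is to derive a short decimal prefix from the key itself:

  import hashlib

  def prefixed_key(key, fanout=16):
      """Prefix the key with a short decimal hash to spread keys across partitions."""
      digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
      return "{:02d}/{}".format(digest % fanout, key)

  # The same key always maps to the same prefix, so it can be recomputed on reads.
  print(prefixed_key("2009-05-20/access.log"))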

Consider utilizing multiple buckets that start with different alphanumeric characters. This will ensure a degree of partitioning from the start. The higher your volume of concurrent PUT and GET requests, the more impact this will likely have.

If you'll be making GET requests against Amazon S3 from within Amazon EC2 instances, you can minimize network latency on these calls by performing the PUT for these objects from within Amazon EC2 as well.

Efficiently Deleting Objects

Deleting a large number of objects for which you don't have an external list can be made faster and easier by using the marker feature of the LIST operation. Call LIST to get an initial set of objects to DELETE. Step through the list, deleting each object and saving the key of the last one. Call LIST again using the last key deleted as the marker, and repeat this process until all desired objects have been deleted.
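A sketch of that loop, assuming boto3's list_objects call (which exposes the marker semantics described above) and a placeholder bucket:

  import boto3

  s3 = boto3.client("s3")
  bucket = "my-example-bucket"  # placeholder

  marker = None
  while True:
      kwargs = {"Bucket": bucket}
      if marker:
          kwargs["Marker"] = marker  # continue listing after the last key deleted
      resp = s3.list_objects(**kwargs)
      contents = resp.get("Contents", [])
      if not contents:
          break
      for obj in contents:
          s3.delete_object(Bucket=bucket, Key=obj["Key"])
      marker = contents[-1]["Key"]
      if not resp.get("IsTruncated"):
          break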

Avoiding Unnecessary Requests

Avoiding unnecessary requests can improve the performance of your application and reduce your usage bill.

  • When making a PUT or DELETE request, there's no need to call HEAD or LIST to verify the success of the request. If the PUT or DELETE returns a success code (200 for PUT, 204 for DELETE), the object has been stored or deleted, respectively.
  • Rather than changing metadata (such as Content-Type) after uploading an object to Amazon S3, set it properly in the initial PUT.
  • Consider caching bucket and key names locally if your application logic allows it.
  • Don't over-check the existence of buckets. If your application uses fixed buckets, you don't need to check their existence prior to PUTs. Also, object PUTs to non-existent buckets return 404 "NoSuchBucket" errors, which can be handled explicitly by a client.
  • If you have to check the existence of a bucket, do so by making a ListBucket or HEAD request on the bucket, specifying the max-keys query-string parameter as 0. In REST, this would translate to the URL http://bucket.s3.amazonaws.com/?max-keys=0 with the appropriate signature. Avoid using bucket PUTs to test bucket existence (a sketch of this check follows this list).
  • If you're mapping your bucket as a virtual drive on a desktop computer, be conscious of any tools such as virus scanners which may drive unnecessary and unwanted usage of your bucket.
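As referenced in the existence-check item above, here is a minimal sketch using boto3's head_bucket call and a placeholder bucket name; it distinguishes a missing bucket from other failures by the HTTP status:

  import boto3
  import botocore.exceptions

  s3 = boto3.client("s3")

  def bucket_exists(bucket):
      """Check bucket existence with a HEAD request rather than a bucket PUT."""
      try:
          s3.head_bucket(Bucket=bucket)
          return True
      except botocore.exceptions.ClientError as err:
          if err.response["ResponseMetadata"]["HTTPStatusCode"] == 404:
              return False
          raise  # e.g. 403 Access Denied or a transient error

  print(bucket_exists("my-example-bucket"))  # placeholder name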

Optimizing Network Performance

Ensure that your operating system's TCP settings are tuned appropriately. This is taken care of by default on the newest versions of Linux (kernel v2.6.8 and beyond) and Windows (Vista). Other or older operating systems may require manually setting two particular options that can have a significant impact on throughput over high speed internet connections:

Also, don't overuse a connection. Amazon S3 will accept up to 100 requests before it closes a connection (resulting in a 'connection reset'). Rather than letting this happen, use a connection for 80-90 requests before closing it and opening a new one.

Finally, if you use a CNAME to map one of your subdomains to Amazon S3, you can avoid redirects and other performance issues by making sure that the CNAME record target uses the style <bucketname>.s3.amazonaws.com.

Utilizing Compression

Consider compressing data you store in Amazon S3. You'll minimize data transfer over the network as well as storage utilization. If your application or site is bound more by network latency than by CPU, you'll also likely improve overall performance. All the major modern browsers support compression transparently. For further details on transmitting compressed data over HTTP see http://www.http-compression.com/.
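For instance, a sketch that gzips a file before upload and records the encoding so HTTP clients can decompress transparently (boto3, with placeholder names):

  import gzip

  import boto3

  s3 = boto3.client("s3")

  with open("report.csv", "rb") as f:  # placeholder local file
      raw = f.read()

  s3.put_object(
      Bucket="my-example-bucket",      # placeholder
      Key="reports/report.csv.gz",
      Body=gzip.compress(raw),
      ContentType="text/csv",
      ContentEncoding="gzip",          # lets clients decompress transparently
  )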

Seeking Help with Troubleshooting

The Amazon S3 Forum is a great place to get help with troubleshooting problems you encounter. It's generally worth doing a search there before starting a new thread on an issue you've bumped into.

When posting to the forum for help with an error you're receiving, make sure to include the request and response HTTP headers and as much of the information returned with the error as possible, including the error number, message, date, and time. If you have it, post the full XML returned by the service.

If you're posting about connectivity issues, it's usually a good idea to get and post the traceroute between your client and Amazon S3.

The following tools can help gather this data:
