Google BigQuery vs AWS Redshift vs AWS Athena
6/22/17 update
Almost 3,000 people have read the article and I have received a lot of feedback. This is the first update of the article and I will try to update it further later.
- I converted the CSV file to Parquet and re-tested Athena, which gave much better results, as expected (Thanks Rahul Pathak, Alex Casalboni, openasocket, Robert Synnott, the Amazon Redshift team with Joe Harris, Jenny Chen and Maor Kleider, and the Amazon Athena/EMR team with Abhishek Sinha)
- Using Redshift admin tables I was able to add the data scanned per query for Redshift (Thanks rockostrich)
- I added a note about partitioning
- I added a links section with useful articles
Introduction
For this test, we will be loading a CSV/Parquet file which is basically an enlarged version of the STAR Experiment star2002 dataset. The final file is close to 1 TB (997 GB).
Some information regarding the dataset:
- CSV size: 997 GB (~1 TB)
- Parquet size: 232 GB
- 7,928,812,500 lines (~8 billion)
- 16 comma-separated columns
- All columns are integers, double precision values or floats (a rough schema sketch follows)
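To make the queries below easier to read, here is a minimal, hypothetical sketch of the table definition. Only the five columns that actually appear in the queries are listed, their types are assumptions based on the description above, and the remaining columns are omitted:

```sql
-- Hypothetical schema sketch: column names are taken from the queries below,
-- the types are assumed, and the other columns of the dataset are omitted.
CREATE TABLE t (
    eventNumber  BIGINT,
    eventFile    BIGINT,
    eventTime    DOUBLE PRECISION,
    multiplicity DOUBLE PRECISION,
    runNumber    BIGINT
    -- ... remaining columns
);
```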
Loading time
To calculate load times, I initially sent the files to both Amazon S3 and Google Cloud Storage, then loaded them into each datastore. The times below reflect the load time from S3 / Google Cloud Storage into the datastore, not the time it took to transfer the file from a server to the two cloud storage solutions.

| BigQuery | Redshift Dense Compute dc1.8xlarge | Redshift Dense Storage ds2.xlarge | Athena (CSV) | Athena (Parquet) |
|---|---|---|---|---|
| 46 m | 9h 30m | 8h 23m | 0s (no need to load the data) | 0s (no need to load the data) |
Regardless, I was surprised by how slow Redshift was compared to BigQuery, even with a "Dense Compute" cluster. This is very likely due to bandwidth limitations when reading from Amazon S3.
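For context, this is roughly what the load step looks like on the AWS side. The bucket paths, the IAM role and the shortened column list are placeholders, and the Athena statement illustrates why its load time is zero (the data stays on S3 and only a table definition is created):

```sql
-- Redshift: load the CSV files from S3 into an existing table
-- (the S3 path and IAM role below are placeholders).
COPY t
FROM 's3://my-bucket/star2002/csv/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftLoadRole'
CSV;

-- Athena: there is no load step; an external table simply points at the
-- data already sitting on S3 (column list shortened, location is a placeholder).
CREATE EXTERNAL TABLE t (
    eventNumber BIGINT,
    eventFile   BIGINT,
    eventTime   DOUBLE
    -- ... remaining columns
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/star2002/csv/';
```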
Query speed
After loading the exact same dataset into each datastore, I tested the execution time of a few sample queries against each one. As expected, and since none of the datastores is MongoDB, the number of rows and the results were consistent across each datastore.
| Query | BigQuery | Redshift Dense Compute dc1.8xlarge | Redshift Dense Storage ds2.xlarge | Athena (CSV) | Athena (Parquet) | Rows found |
|---|---|---|---|---|---|---|
| SELECT count(*) FROM t | 2.2s | 3.2s | 16.7s | 1m 16s | 3.76s | 7,928,812,500 |
| SELECT count(*) FROM t WHERE eventnumber > 1 | 2.6s | 4.3s | 56.2s | 1m 34s | 8.72s | 7,928,486,500 |
| SELECT count(*) FROM t WHERE eventnumber > 20000 | 3.0s | 3.0s | 44.2s | 1m 32s | 7.75s | 3,871,550,500 |
| SELECT count(*) FROM t WHERE eventnumber > 500000 | 4.1s | 1.6s | 10.6s | 1m 32s | 7.25s | 42,853,500 |
| SELECT eventFile, count(*) FROM t GROUP BY eventFile | 17.5s | 15.2s | 2m 8s | 1m 36s | 8.1s | 102,021 |
| SELECT eventFile, count(*) FROM t WHERE eventnumber > 525000 GROUP BY eventFile | 2.8s | 3.0s | 19.4s | 1m 33s | 9.45s | 137 |
| SELECT eventFile, eventTime, count(*) FROM t WHERE eventnumber > 525000 GROUP BY eventFile, eventTime ORDER BY eventFile DESC, eventTime ASC | 7.2s | 12.4s | 27.1s | 1m 33s | 11.79s | 3,007 |
| SELECT MAX(runNumber) FROM t | 2.4s | 3.8s | 41.6s | 1m 34s | 6.29s | 1 |
| SELECT AVG(eventTime) FROM t WHERE eventnumber > 20000 | 3.3s | 13.0s | 1m 44s | 1m 44s | 9.71s | 1 |
| SELECT eventFile, AVG(eventTime), AVG(multiplicity), MAX(runNumber), count(*) FROM t WHERE eventnumber > 20000 GROUP BY eventFile | 14.5s | 18.2s | 4m 18s | 1m 54s | 22.49s | 38,400 |
Query cost
BigQuery and Athena both charge $5 per TB of data scanned. Looking at the amount of data scanned, I was able to approximate the cost of each query. Redshift charges hourly, which makes it very difficult to compare. If your dataset grows linearly and the volume of data scanned per query is predictable, you could calculate the point at which paying an hourly price becomes cheaper than paying per TB scanned. I could have calculated that here, but it would have very little value since these queries and this dataset are probably very different from most usages.
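As a purely illustrative sketch of that break-even idea, using only the list prices quoted in the table below (a 2-node dc1.8xlarge cluster is an assumption, not necessarily the configuration tested):

```sql
-- Toy break-even sketch: a 2-node dc1.8xlarge cluster at $4.80/hour/node
-- costs $9.60/hour; at $5 per TB scanned, the same money buys roughly
-- 1.9 TB of scanning per hour on BigQuery or Athena.
-- Assumed numbers, list pricing only.
SELECT (2 * 4.80) / 5.0 AS tb_scanned_per_hour_equivalent;  -- ~1.92
```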
| Query | BigQuery | Redshift Dense Compute dc1.8xlarge | Redshift Dense Storage ds2.xlarge | Athena (CSV) | Athena (Parquet) |
|---|---|---|---|---|---|
| Dataset storage cost | $19.80 / month | $0 | $0 | $23.00 / month (S3 pricing) | $5.34 / month (S3 pricing) |
| SELECT count(*) FROM t | $0 (cached) | $4.80 / hour / node (minimum 2 nodes) | $0.85 / hour / node | $4.43 | $0 (cached) |
| SELECT count(*) FROM t WHERE eventnumber > 1 | $0.30 | | | $4.43 | $0.14 |
| SELECT count(*) FROM t WHERE eventnumber > 20000 | $0.30 | | | $4.43 | $0.14 |
| SELECT count(*) FROM t WHERE eventnumber > 500000 | $0.30 | | | $4.43 | $0.14 |
| SELECT eventFile, count(*) FROM t GROUP BY eventFile | $0.30 | | | $4.43 | $0.01 |
| SELECT eventFile, count(*) FROM t WHERE eventnumber > 525000 GROUP BY eventFile | $0.59 | | | $4.43 | $0.15 |
| SELECT eventFile, eventTime, count(*) FROM t WHERE eventnumber > 525000 GROUP BY eventFile, eventTime ORDER BY eventFile DESC, eventTime ASC | $0.89 | | | $4.43 | $0.18 |
| SELECT MAX(runNumber) FROM t | $0.30 | | | $4.43 | $0.01 |
| SELECT AVG(eventTime) FROM t WHERE eventnumber > 20000 | $0.59 | | | $4.43 | $0.17 |
| SELECT eventFile, AVG(eventTime), AVG(multiplicity), MAX(runNumber), count(*) FROM t WHERE eventnumber > 20000 GROUP BY eventFile | $1.48 | | | $4.43 | $0.28 |
Data scanned per query
| Query | BigQuery | Redshift Dense Compute dc1.8xlarge | Redshift Dense Storage ds2.xlarge | Athena (CSV) | Athena (Parquet) |
|---|---|---|---|---|---|
| SELECT count(*) FROM t | 0 GB | 59.07 GB | 59.07 GB | 885.5 GB | 0 GB |
| SELECT count(*) FROM t WHERE eventnumber > 1 | 59.1 GB | 118.14 GB | 118.14 GB | 885.47 GB | 27.91 GB |
| SELECT count(*) FROM t WHERE eventnumber > 20000 | 59.1 GB | 57.69 GB | 57.69 GB | 885.5 GB | 27.91 GB |
| SELECT count(*) FROM t WHERE eventnumber > 500000 | 59.1 GB | 0.64 GB | 0.64 GB | 885.49 GB | 27.9 GB |
| SELECT eventFile, count(*) FROM t GROUP BY eventFile | 59.1 GB | 118.15 GB | 118.15 GB | 885.5 GB | 1.64 GB |
| SELECT eventFile, count(*) FROM t WHERE eventnumber > 525000 GROUP BY eventFile | 118 GB | 0.68 GB | 0.68 GB | 885.47 GB | 29.53 GB |
| SELECT eventFile, eventTime, count(*) FROM t WHERE eventnumber > 525000 GROUP BY eventFile, eventTime ORDER BY eventFile DESC, eventTime ASC | 177 GB | 0.90 GB | 0.90 GB | 885.51 GB | 36.33 GB |
| SELECT MAX(runNumber) FROM t | 59.1 GB | 118.15 GB | 118.15 GB | 885.49 GB | 1.41 GB |
| SELECT AVG(eventTime) FROM t WHERE eventnumber > 20000 | 118 GB | 86.53 GB | 86.53 GB | 885.51 GB | 34.66 GB |
| SELECT eventFile, AVG(eventTime), AVG(multiplicity), MAX(runNumber), count(*) FROM t WHERE eventnumber > 20000 GROUP BY eventFile | 295 GB | 173.07 GB | 173.07 GB | 885.51 GB | 56.38 GB |
A note about partitioning
For the tests above, the data was not partitioned. BigQuery, Redshift and Athena all support partitioning, but it seemed that partitioning would defeat the purpose of querying a large file if the queries ended up hitting a much smaller subset of it. However, it is likely that some datastores handle partitions better than others, and I will do my best to write a new article about partitioning performance in the future.
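For reference, this is roughly what partitioning looks like on the Athena side; the partition column, paths and shortened column list below are placeholders, not what was used in this test:

```sql
-- Sketch of a partitioned Athena table. Queries that filter on the
-- partition column only scan the matching S3 prefixes instead of the
-- whole dataset. All names and locations are placeholders.
CREATE EXTERNAL TABLE t_partitioned (
    eventNumber BIGINT,
    eventFile   BIGINT,
    eventTime   DOUBLE
    -- ... remaining columns
)
PARTITIONED BY (run_year STRING)
STORED AS PARQUET
LOCATION 's3://my-bucket/star2002/parquet/';

-- Register one partition stored under .../run_year=2002/
ALTER TABLE t_partitioned ADD PARTITION (run_year = '2002')
LOCATION 's3://my-bucket/star2002/parquet/run_year=2002/';
```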
Conclusion
Athena is a great solution to analyze large files in a variety of formats (CSV, JSON, Nginx logs...) stored on Amazon S3. With a columnar format like Parquet or ORC, it ends up being a powerful and cost-effective solution as well. Redshift benefits from being the big datastore living in the AWS ecosystem. Since Redshift was created on top of PostgreSQL, a lot of the features and syntax are identical, which greatly reduces the learning curve.
In most cases, I believe BigQuery will be the better solution. It provides less flexibility than Redshift, but beyond storage you only pay for what you query, which will save a lot of money for companies that don't query their dataset extensively. It would take a lot of queries on very large datasets for Redshift clusters to become cheaper than BigQuery.
Links
A few links shared by the Amazon teams:
- Using Amazon Redshift Spectrum to query external data
- Improving Amazon Redshift Spectrum query performance
- Top 10 performance tuning tips for Athena
- Converting a large dataset to Parquet
- Converting to columnar formats
Last update of this page:
June 24, 2017
Please do not copy any of the content of this website without asking me first.