
Comparison of AWS EC2, RDS, and Aurora for Database Solutions

When companies move their infrastructure from on-premise data centers to the cloud — AWS, Google Cloud Platform (GCP), Microsoft Azure, etc. — they have to decide whether to use a managed database service or to set up the database on virtual machines. On AWS specifically, one of three options — EC2, RDS, or Aurora — has to be chosen based on the requirements of the company.

If you are interested in Google Cloud, please check my other article comparing Cloud SQL and Compute Engine (VMs).

A detailed comparison of EC2, RDS, and Aurora for database solutions is presented first, followed by general suggestions for choosing among the three.

The background is that we need a database server with 16 CPU cores, 128 GB of memory, a 2 TB database, and a proper failover solution. The table below compares EC2, RDS, and Aurora in 7 aspects against these requirements.

[Table: comparison of EC2, RDS, and Aurora across the 7 aspects]

There are several advantages to using EC2 instances. First, EC2 instances are more cost-effective. From the 'Pricing per instance' row of the table, the per-instance cost of RDS is twice that of EC2, and Aurora costs approximately 23% more than RDS. From the 'Failover' row, setting up failover plus a read replica requires two additional instances for RDS, but only one additional instance for EC2 or Aurora, because the read replica can double as the failover target. In total, the EC2 or Aurora solution saves one third of the database instances.

In addition, EC2 instances give us more flexibility. From the 'Multiple read replicas' row, RDS and Aurora do not support a chain-like replication structure, and a replica cannot serve as a master/primary database, whereas with EC2 we can choose whatever replication structure we want. Since our current databases use chain-like replication, moving to RDS or Aurora would require some changes. From the 'MySQL compatibility' row, EC2 instances support all MySQL features, while RDS and Aurora lack a few of them. We do not use those features, so this does not affect our decision, but they may matter for other projects and must be considered when choosing among EC2, RDS, and Aurora.
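As a rough illustration of the cost arithmetic above, the short sketch below normalizes the EC2 per-instance price to 1.0 and applies the ratios from the table (RDS ≈ 2× EC2, Aurora ≈ 1.23× RDS) together with the instance counts needed for failover plus one read replica; the actual dollar figures depend on instance type and region.

```python
# Normalized per-instance prices based on the ratios in the comparison table.
ec2_price = 1.0
rds_price = 2.0 * ec2_price        # RDS costs roughly twice as much per instance
aurora_price = 1.23 * rds_price    # Aurora costs roughly 23% more than RDS

# Instances needed for failover + one read replica:
# RDS needs a primary, a standby, and a separate read replica;
# EC2 and Aurora can use the read replica as the failover target.
instances = {"EC2": 2, "RDS": 3, "Aurora": 2}
prices = {"EC2": ec2_price, "RDS": rds_price, "Aurora": aurora_price}

for name in instances:
    total = instances[name] * prices[name]
    print(f"{name}: {instances[name]} instances, relative cost {total:.2f}")
# EC2: 2.00, RDS: 6.00, Aurora: 4.92 (relative to a single EC2 instance)
```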

The only major disadvantage of the EC2 solution is that we have to do the work of setting up and maintaining the database ourselves. From the 'Fail-over', 'Maintenance', and 'Set up and configure' rows, with RDS or Aurora the OS, MySQL, and failover are set up and managed by AWS, which saves us time and work; with EC2 we have to set all of this up ourselves. Since we have been doing similar work of setting up and maintaining MySQL for quite a while, we should be able to set it up properly on EC2.

Suggestions for making the decision are as follows. On cost: if the database serves only a small project, the resources (CPUs and memory) will be small, so the cost difference between RDS/Aurora and the EC2 solution will not be as significant as for a big project; in that case Aurora is recommended over RDS, because Aurora can fail over to a read replica without needing an additional standby instance. Otherwise, the EC2 solution should be the first choice. On operational work: if the team has the expertise to set up the database, and the cost saved by the EC2 solution is worth the time spent setting it up and maintaining it, the EC2 solution should be preferred. Otherwise, Aurora should be considered, because the database, its failover, and its replicas are ready on AWS without requiring any in-house expertise.


As more and more companies move from on-premise data centers to the cloud — AWS, Google Cloud Platform (GCP), Microsoft Azure, etc. — one of the critical decisions is whether to use a managed database service or to set up the database on virtual machines. On GCP specifically, whether to use Cloud SQL or Compute Engine (VMs) with the database installed has to be decided based on the requirements of the company.

If you are interested in AWS, please check my other article comparing EC2, RDS, and Aurora.

A detailed comparison of Cloud SQL and database VMs is presented first, followed by general suggestions for deciding between the two. …


A concrete example to illustrate three fundamental PySpark programming methods for data transformation

Introduction

After using PySpark to develop data-processing code for a while, I have noticed that most of the programming falls into three methods: the first is to use pure SQL queries to extract data from the given data; the second is to apply a Python function to each row to compute new fields; and the third is to apply a Python function to each row where the function has to scan the whole table. This article uses a concrete example to explain how these three fundamental PySpark programming methods are used for data processing. …
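A minimal sketch of the three methods, using a hypothetical orders DataFrame (the column names and sample data are made up for illustration, not taken from the article's example):

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("three-methods-sketch").getOrCreate()

# Hypothetical input: one row per order, with an amount and an exchange rate.
df = spark.createDataFrame(
    [("u1", 10.0, 1.1), ("u2", 20.0, 0.9), ("u1", 5.0, 1.1)],
    ["user_id", "amount", "rate"],
)

# Method 1: pure SQL over a temporary view.
df.createOrReplaceTempView("orders")
per_user = spark.sql("SELECT user_id, SUM(amount) AS total FROM orders GROUP BY user_id")

# Method 2: a Python function applied to each row to derive a new field.
to_usd = F.udf(lambda amount, rate: amount * rate, DoubleType())
with_usd = df.withColumn("amount_usd", to_usd("amount", "rate"))

# Method 3: a per-row calculation that needs the whole table,
# e.g. each order's share of the global total (a window spanning all rows).
whole_table = Window.partitionBy()
with_share = df.withColumn("share", F.col("amount") / F.sum("amount").over(whole_table))

per_user.show()
with_usd.show()
with_share.show()
```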


Note: If you are interested in setting up an on-premise data lake with an on-premise database, please refer to my other article on setting up an on-premise data lake.

Introduction

Migration process: Data migrated from on-premise MySQL to AWS S3. After the migration, Amazon Athena can query the data directly from AWS S3

Most websites are built on a relational database management system (RDBMS), such as MySQL, but query speed gets slower and slower as the amount of data grows. Therefore, an RDBMS cannot handle big data analysis workloads. Fortunately, a number of good cloud platforms have emerged in the past decade that provide excellent tools for big data analysis, such as AWS. Amazon Athena offers much faster query speed than a traditional RDBMS by leveraging a distributed computing engine, which makes it look like a perfect solution for big data analysis. However, migrating the data from a traditional RDBMS to an AWS S3 data lake, from which AWS tools such as Amazon Athena can query directly, can be challenging. …
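As a hedged sketch of what querying the migrated data can look like once it sits in S3, the snippet below submits an Athena query with boto3. The database name, table, result bucket, and region are hypothetical placeholders, not values from the article.

```python
import boto3

# Region is an assumption; use whatever region holds the S3 data lake.
athena = boto3.client("athena", region_name="us-east-1")

# Hypothetical database/table registered over the migrated S3 data,
# plus a bucket for Athena's query results.
response = athena.start_query_execution(
    QueryString="SELECT user_id, COUNT(*) AS orders FROM orders GROUP BY user_id",
    QueryExecutionContext={"Database": "migrated_mysql"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])
```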


— Apache Spark + Hadoop + Sqoop to ingest data from an RDBMS (MySQL)

Note: Readers who are not yet ready to install an on-premise data lake but plan to can skip the middle part of the article and read only the Introduction and the Conclusion at the bottom; readers who are interested in an AWS data lake should refer to the article on migrating on-premise data to an AWS data lake.

Introduction

Data lake infrastructure diagram

In our scenario, we have a website with MySQL running for years, and the database has grown very quickly, with the largest table holding hundreds of millions of records. …
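The article's ingestion step uses Sqoop; purely as an illustration in the same spirit, the sketch below pulls a large MySQL table into the lake with Spark's JDBC reader instead. The host, credentials, table name, key bounds, connector version, and HDFS path are all hypothetical.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("mysql-to-datalake-sketch")
    # The MySQL JDBC driver must be on the classpath; version is an assumption.
    .config("spark.jars.packages", "mysql:mysql-connector-java:8.0.33")
    .getOrCreate()
)

# Hypothetical connection details and table name.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/shop")
    .option("dbtable", "orders")
    .option("user", "reader")
    .option("password", "secret")
    # Split the read into parallel partitions on a numeric key, which matters
    # for a table with hundreds of millions of rows.
    .option("partitionColumn", "id")
    .option("lowerBound", "1")
    .option("upperBound", "500000000")
    .option("numPartitions", "32")
    .load()
)

# Land the data in the lake as Parquet on HDFS.
orders.write.mode("overwrite").parquet("hdfs:///datalake/orders")
```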


Introduction

Deep convolutional neural networks (DCNNs) have shown promising success in image classification. Two recently proposed DCNNs have quickly gained a great reputation: the Residual Network (ResNet), proposed in 2015, reported outstanding classification accuracy and is still one of the best state-of-the-art classifiers on the CIFAR-10 dataset, achieving a very small error rate of 6.97% with 56 layers; the Densely Connected Network (DenseNet), proposed in 2018, reduced that error rate by a further 2+% with 110 layers.

As these two papers are written very precisely and there are quite a few implementations of both, I thought it would be straightforward to reproduce their results. However, a few tricks kept me from reaching the expected accuracy, and it took me quite a lot of time to figure them out. In this article, the lessons I learned are described in three aspects: data augmentation, the optimizer used to train the neural networks, and normalization. …
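The article presumably details each of these three aspects; as a minimal sketch of the commonly used CIFAR-10 training setup they refer to, assuming PyTorch/torchvision (the ResNet-18 stand-in, batch size, and learning-rate milestones below are illustrative choices, not the papers' exact configuration):

```python
import torch
import torchvision
import torchvision.transforms as T

# Standard CIFAR-10 augmentation and per-channel normalization:
# 4-pixel padding + random crop, random horizontal flip, then normalize
# with commonly used dataset statistics (the papers' exact recipe may differ).
mean, std = (0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)
train_tf = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(mean, std),
])
train_set = torchvision.datasets.CIFAR10("./data", train=True, download=True, transform=train_tf)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)

# Stand-in model for illustration, not the papers' ResNet-56 / DenseNet-110.
model = torchvision.models.resnet18(num_classes=10)

# SGD with momentum and weight decay, the optimizer family both papers train with,
# plus a step-wise learning-rate schedule.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80, 120], gamma=0.1)
```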