Fundamentals of Data Architecture to Help Data Scientists Understand Architectural Diagrams Better

Before you pretend to understand the diagram your smart colleague shows you.

Photo by Jared Murray on Unsplash

Introduction

In a company that uses data to derive business value, your data science skills may not always be appreciated, but your ability to manage the data infrastructure well always is. Everyone wants the data stored in an accessible location, cleaned up properly, and updated regularly.

Backed by these unobtrusive but steady demands, the salary of a data architect is as high as, or even higher than, that of a data scientist. In fact, salary research by PayScale (https://www.payscale.com/research/US/Country=United_States/Salary) shows the US average salary of a Data Architect is $121,816, while that of a Data Scientist is $96,089.

This is not to say that all data scientists should change jobs, but there is a lot to gain from learning at least the fundamentals of data architecture. In fact, there is one simple (but meaningful) framework that will help you understand any kind of real-world data architecture.

Table of Contents

  • Three Components in Data Architecture: Data Lake -> Data Warehouse -> Data Mart
  • Tools Used in Each Component
  • Case Study — Building Scheduled & Automatic Data Feed from BigQuery (Data Warehouse) to Google Sheets (Data Mart)
  • Ending Note

Three Components in Data Architecture: Data Lake -> Data Warehouse -> Data Mart

“Data Lake”, “Data Warehouse”, and “Data Mart” are typical components in the architecture of a data platform. In this order, data produced in the business is processed and progressively shaped toward its final use.

Diagram the author crafted using materials from Irasuto-ya (https://www.irasutoya.com/)

The three components take responsibility for three different functionalities:

  • Data Lake: holds an original copy of the data produced in the business. Processing of the original data should be minimal, if any; otherwise, if some processing later turns out to be wrong, it will be impossible to fix the error retrospectively.
  • Data Warehouse: holds data processed and structured by a managed data model, reflecting the global (rather than use-case-specific) direction of the final use of the data. In many cases, the data is in tabular format.
  • Data Mart: holds a subset and/or aggregated data set for use by a particular business function, e.g. a specific business unit or a specific geographical area. A typical example is a summary of KPIs for a specific business line, followed by visualization in a BI tool. Preparing this kind of separate, independent component downstream of the warehouse is especially worthwhile when users want the data mart updated regularly and frequently. On the contrary, this part can be skipped when users only need a data set for a one-off ad hoc analysis.

Summary of three data architecture components (exhibit created by author)

For more real-world examples beyond this bare-bones description, enjoy googling “data architecture”; you will find plenty of data architecture diagrams.

What you see when you image-google “data architecture”. (Image captured by author)

Why do we need to split into these three components?

Because different stages within the process have different requirements.

In the data lake stage, we want the data to stay close to the original, while the data warehouse is meant to keep data sets more structured, manageable with a clear maintenance plan, and under clear ownership. In the data warehouse, we also prefer the database to be analytics-oriented rather than transaction-oriented. On the other hand, the data mart should be easily accessible to non-technical people, who are the ones most likely to use the final outputs of the data journey.

System components with different purposes tend to be redesigned at different times, so keeping the components loosely coupled is an advantage for future maintenance and scale-up.

How do data engineers and data scientists work with these three components?

Roughly speaking, data engineers cover the extraction of data produced in the business into the data lake, data model building in the data warehouse, and the establishment of the ETL pipeline, while data scientists cover the extraction of data from the data warehouse, the building of the data mart, and the path to further business applications and value creation.

Of course, this division of roles between data engineers and data scientists is somewhat idealized; many companies do not hire both just to fit this definition, and in practice their job descriptions tend to overlap.

New trend beyond the three-component approach

Last but not least, it is worth noting that this three-component approach is a conventional one that has been around for more than two decades, and new technology arrives all the time. For example, Data Virtualization is an idea that provides a one-stop interface for managing and manipulating data across data sources, regardless of their formats and physical locations.

Tools Used in Each Component

Now that we understand the concept of the three data platform components, what tools do people actually use? Based on this “Data Platform Guide” (in Japanese), here are some ideas:

Data lake/warehouse

There are the following options for data lake and data warehouse.

Author crafted based on the “Data Platform Guide” (in Japanese)

ETL tools

ETL happens where data arrives in the data lake and is processed to fit the data warehouse. Because data arrives in real time, ETL favors event-driven messaging tools.

Author crafted based on the “Data Platform Guide” (in Japanese)

Workflow engine

A workflow engine is used to manage the overall pipelining of the data, for example, visualizing where the process currently stands in a flow chart, triggering automatic retries in case of errors, and so on.

Author crafted based on the “Data Platform Guide” (in Japanese)

Data mart/BI tools

The following tools can be used as data mart and/or BI solutions. The choice depends on the business context: which tools your company is familiar with (e.g. are you a Tableau person or a Power BI person?), the size of the aggregated data (e.g. if the data is small, why wouldn't a basic solution like Excel or Google Sheets meet the goal?), which data warehouse solution you use (e.g. if your data warehouse is on BigQuery, Google Data Studio is an easy choice because of its natural linkage within the Google ecosystem), and so on.

Author crafted based on the “Data Platform Guide” (in Japanese)

Case Study — Building Scheduled & Automatic Data Feed from BigQuery (Data Warehouse) to Google Sheets (Data Mart)

When the data size stays at or below tens of megabytes and there are no dependencies on other large data sets, it is fine to stick with spreadsheet-based tools to store, process, and visualize the data, because they are less costly and everyone can use them.

Once the data gets larger and starts to have dependencies on other data tables, it is beneficial to move to cloud storage as a one-stop data warehouse. (When the data grows even larger, to dozens of terabytes, it can make sense to use on-premises solutions for cost-efficiency and manageability.)

In this chapter, I will demonstrate a case where the data is stored in Google BigQuery as a data warehouse. The BigQuery data is processed and stored in real time or at short intervals, yet the end user only wants to see daily KPIs on a spreadsheet in a highly aggregated form. This means the data mart can be small and even a spreadsheet solution will fit. Instead of Excel, let's use Google Sheets here because it can live in the same environment as the data source in BigQuery. Oh, and by the way, do not think about running the query manually every day; the goal is a solution that keeps everything running automatically without any action on your side.

Data pipeline in case study (diagram created by author using materials from Irasuto-ya (https://www.irasutoya.com/))

Data to be used in this case study

In this case study, I am going to use a sample table that holds one record per NY taxi ride, including the following data fields:

  • Car ID
  • Driver ID
  • Date of ride
  • Number of passengers
  • Amount of fare
  • etc.

The sample data is stored in BigQuery, which serves as the data warehouse.
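If you want to double-check what the warehouse table actually looks like before writing queries, the BigQuery client library can list the schema from Python. A minimal sketch, with the project, dataset, and table names as placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses the project's default credentials

# Placeholder table ID; replace with the actual NY taxi table in your project.
table = client.get_table("your-project.your_dataset.nytaxi_rides")

# Print each column name and type, plus the row count.
for field in table.schema:
    print(field.name, field.field_type)
print(f"{table.num_rows} rows")
```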

Can Google Sheets pull data from BigQuery tables?

Technically yes, but at the moment this is only available through Connected Sheets, and you need a G Suite Enterprise, G Suite Enterprise for Education, or G Suite Enterprise Essentials account.

Diagram created by author.

Connected Sheets allows the user to manipulate BigQuery table data almost as if it lived in the spreadsheet. See the GIF demonstration on this page of the “BenCollins” blog.

Example of the use of Google Sheets connected to BigQuery through Connected Sheets (Captured by author)

Connected Sheets also allows automatic scheduling and refreshing of the sheets, which is a natural requirement for a data mart.

Although it looks like a great option, one possible issue is that owning a G Suite account is not very common.

For more details about the setups, see this blog post from “BenCollins”.

What can we do to push data from BigQuery to Google Sheets?

To extract data from BigQuery and push it to Google Sheets, BigQuery alone is not enough; we need the help of server functionality to call the API, post a query to BigQuery, receive the data, and pass it on to Google Sheets.

Diagram created by author.

The server functionality could sit on a server machine, inside or outside of GCP (e.g. a ‘Compute Engine’ instance on GCP, or an ‘EC2’ instance on AWS), and the code run could be scheduled with a unix-cron job. The downside is that maintaining and paying for an instance is too much overhead for such a small program.

‘Google Cloud Functions’ is a so-called “serverless” solution that runs code without launching a server machine. Once you put your code in Cloud Functions and set a trigger event (a scheduled time in this case study, but it could also be an HTTP request from some internet user), GCP automatically manages the execution of the code.
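To make this concrete, a Pub/Sub-triggered Cloud Function on the Python runtime is just a function that receives the event payload and a context object. The function name and the message handling below are illustrative only, not part of the case study code:

```python
import base64

def handle_cron(event, context):
    """Runs each time a message arrives on the subscribed Pub/Sub topic."""
    # Pub/Sub delivers the message body base64-encoded in event["data"].
    payload = base64.b64decode(event["data"]).decode("utf-8") if "data" in event else ""
    print(f"Triggered at {context.timestamp} with payload: {payload}")
```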

Set-ups in my case study

There are two steps in the configuration of my case study using NY taxi data.

Step 1: Set up scheduling — set Cloud Scheduler and Pub/Sub to trigger a Cloud Function.

Here, “Pub/Sub” is a messaging service that Cloud Functions can subscribe to in order to trigger its run every day at a certain time. “Cloud Scheduler” is functionality that kicks something off at a user-defined frequency based on the unix-cron format. Combining the two, we can create regular messages for the Cloud Function to subscribe to. See this official instruction on how to do it. Here are screenshots from my GCP set-up.

Set up in Cloud Scheduler (Captured by author)
Set up in Pub/Sub (Captured by author)
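For readers who prefer scripting over clicking through the console, the same topic and scheduler job could in principle be created with the Google client libraries. This is a rough sketch, assuming the google-cloud-pubsub and google-cloud-scheduler packages; the project ID, region, job name, and time zone below are placeholders:

```python
from google.cloud import pubsub_v1, scheduler_v1

project_id = "your-gcp-project"   # placeholder
topic_id = "cron-topic"
location = "us-central1"          # placeholder region

# Create the Pub/Sub topic that the Cloud Function will subscribe to.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)
publisher.create_topic(request={"name": topic_path})

# Create a Cloud Scheduler job that publishes to the topic every day at 6 am.
scheduler = scheduler_v1.CloudSchedulerClient()
parent = f"projects/{project_id}/locations/{location}"
job = {
    "name": f"{parent}/jobs/nytaxi-daily",   # placeholder job name
    "schedule": "0 6 * * *",                 # unix-cron format: 6 am daily
    "time_zone": "America/New_York",         # placeholder time zone
    "pubsub_target": {"topic_name": topic_path, "data": b"run"},
}
scheduler.create_job(request={"parent": parent, "job": job})
```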

Step 2: Set up code — prepare code on Cloud Functions to query the BigQuery table and push the result to Google Sheets.

The next step is to set up Cloud Functions. In Cloud Functions, you define 1) the trigger (in this case study, the “cron-topic” message sent from Pub/Sub, linked to Cloud Scheduler, which pulls the trigger at 6 am every morning) and 2) the code you want to run when the trigger is detected.

See this official instruction for further details, and here are screenshots from my set-up.

Set up in Cloud Functions (Captured by author)
Code input in Cloud Functions — here you can also set requirements.txt to use installable libraries in your main.py program. (Captured by author)

The code to run has to be enclosed in a function with a name of your choice (“nytaxi_pubsub” in my case). The code consists of two parts: part 1 runs a query on BigQuery to reduce the original BigQuery table to KPIs, saves the result as another data table in BigQuery, and turns it into a Pandas data frame; part 2 pushes the data frame to Sheets.

Here is the code I actually used. Importantly, authentication to BigQuery is automatic as long as it resides within the same GCP project as the Cloud Function (see this page for an explanation). This is not the case for Google Sheets, though, which at minimum requires sharing the target sheet with a Service Account. See the description in the gspread library for more details.

main.py (coded by author)
requirements.txt (coded by author)
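For orientation, a main.py along these lines would implement the two parts described above. This is a rough sketch under assumptions, not the author's exact code: the project, dataset, and table names, the KPI query and its column names, the spreadsheet key, the worksheet name, and the service-account file are all placeholders.

```python
# main.py — a minimal sketch, not the exact code from the case study.
import gspread
from google.cloud import bigquery

SHEET_KEY = "your-google-sheet-key"   # placeholder: key of the target spreadsheet
WORKSHEET_NAME = "daily_kpi"          # placeholder worksheet name


def nytaxi_pubsub(event, context):
    """Entry point triggered by the Pub/Sub message from Cloud Scheduler."""
    # Part 1: reduce the raw rides table to daily KPIs and save as another table.
    # Authentication is automatic inside the same GCP project.
    bq_client = bigquery.Client()
    query = """
        CREATE OR REPLACE TABLE `your-project.your_dataset.nytaxi_daily_kpi` AS
        SELECT
          DATE(ride_date)      AS ride_day,       -- column names are assumed;
          COUNT(*)             AS num_rides,      -- adjust to the actual schema
          SUM(passenger_count) AS total_passengers,
          SUM(fare_amount)     AS total_fare
        FROM `your-project.your_dataset.nytaxi_rides`
        GROUP BY ride_day
        ORDER BY ride_day
    """
    bq_client.query(query).result()  # materialize the KPI table

    # Pull the aggregated table into a Pandas data frame.
    df = bq_client.query(
        "SELECT * FROM `your-project.your_dataset.nytaxi_daily_kpi`"
    ).to_dataframe()

    # Part 2: push the data frame to Google Sheets via gspread.
    # The target sheet must be shared with the service account beforehand.
    gc = gspread.service_account(filename="service_account.json")
    worksheet = gc.open_by_key(SHEET_KEY).worksheet(WORKSHEET_NAME)
    worksheet.clear()
    worksheet.update([df.columns.tolist()] + df.astype(str).values.tolist())
```

A matching requirements.txt would then list at least google-cloud-bigquery, gspread, and pandas, pinned to whatever versions you test with.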

Final data mart on Google Sheets

Finally, I got the aggregated data in Google Sheets like this:

Automatically updated data mart after a long journey of the setup. (captured by author)

This sheet is updated automatically every morning, and because the data warehouse receives new data through ETL from the data lake, we can easily keep track of the NY taxi KPIs first thing every morning.

Ending Note

In a large company that hires data engineers and/or data architects along with data scientists, a primary role of the data scientists is not necessarily to prepare the data infrastructure and put it in place, but getting at least the gist of data architecture helps us understand where we stand in our daily work.

Data Lake -> Data Warehouse -> Data Mart is a typical framework for processing data from its origin to its use case. Separating the process into three system components brings many benefits in maintenance and in keeping each component fit for purpose.

There are many options in the choice of tools; they should be selected wisely based on the data environment (size, type, etc.) and the goals of the business.

Finally, I discussed a case study in which we prepared a small data mart on Google Sheets, pulling data from BigQuery as the data warehouse. With the use of Cloud Scheduler and Pub/Sub, the update was made automatic.

