Clean SQL

November 19, 2016

I analyse a lot of data daily, much of the time using SQL. As a programming language, SQL is relatively simple and self-contained, and newcomers usually don't take long to start writing code that works (after learning to think in declarative way). Like other languages, SQL can be written in a way that reads like poetry, where the logic of the query is clear to anyone who reads it. Typically, however, reading someone else's SQL code is a very unpleasant experience... and although it's usually code written by rookies, 'pro' code can be just as painful at times.

It seems that newcomers to SQL don't think it's important to write code that is easy to read and understand by others. Instead they focus solely on getting the data output needed. On the other hand, I feel experienced coders tend to over-engineer the formatting of the code (think over alignment).

This has prompted me to write this post, which will serve as a living document of what I feel is the best way to write SQL code...

Why clean code?

In any project where code is written, whether for data analysis or a complex app, the compounding effects of messy code affects productivity and costs. Bad code slows down the pace of the team, as any change becomes less and less trivial. However, the pressure to produce results urges the team to add more bad code. Eventually, you end-up with a colossal pile of crap that shatters when touched.

Code is written once, and read many times. So consistently write clear, human-readable code.

Example of bad code

Before I start looking at how to write clean SQL, let's explore the motivation behind this... What do you think is wrong in this first example below (TSQL)?

SELECT
    c.Date
    a.ProductKey,
    a.description,
    e0.Classification,
    ExampleFlag = 1,
    FutureFcstMethod = CASE WHEN c.Discount NOT NULL THEN 'Mthd1'
        WHEN (b.Uplift IS NOT NULL AND b.Forecast IS NOT NULL) THEN 'Mthd2' END,
    SUM(IIF(e0.Flag=0, cost, NULL))/NULLIF(SUM(IIF(e0.Flag=0, sales, NULL)),0) as Rate
FROM Dim.Product a
LEFT JOIN fst.ProductClass e0 ON a.ProductKey = e0.ProductKey
LEFT JOIN fact.Discounts c ON a.ProductKey = c.ProductKey 
INNER JOIN fst.A_Final_Product_Analysis b ON a.ProductKey = b.ProductKey
AND b.TransDate < c.Date
WHERE a.productKey NOT IN (1234, 5678, 90123)
AND e0.Classification IN ('A', 'BC', 'A21')
AND c.Discount > 0
GROUP BY
    c.Date
    a.Product,
    a.description,
    e0.Classification,
    CASE WHEN c.Discount NOT NULL THEN 'Mthd1'
        WHEN (b.Uplift IS NOT NULL AND b.Forecast IS NOT NULL) THEN 'Mthd2' END

That's right... everything.

The code is densely packed, mis-aligned, and uses obscure naming... it's hard to glean any information from it.

Clean code rules

Most experienced programmers agree with a version of these rules. For an in-depth analysis of how to write clean code, I really recommend reading Robert Martin's Clean Code.

Naming

  • Name objects in a way that tells the purpose of the object - calling a field acc_111 doesn't tell us anything. A rule of thumb is the ability to easily pronounce the name.
  • Use camelCase for fields and tables - call a table salesTransactionHistory instead of sales_transaction_history or SalesTransactionHistory.
  • Functions or stored procedures should start with a verb that tells what it does - getSalesForecast instead of forecast. Do not add any prefix like sp_ or fn_ or any other Hungarian Notation.
  • Give consistent, meaningful aliases to joined tables in a query - a, b, c just confuse people. Instead, use prd for a product dimension table, or sales for a sales fact table.

Formatting

  • Don't capitalise SQL clauses like select, from, ... capitals are hard on the eyes
  • In general, structure your code vertically rather than horizontally. For example, list fields vertically in a select. If you want to be strict about it, any line should not exceed more than 80 characters in length.
  • Do not horizontally-align code. This is where most experienced programmers fail. Code that has been overly aligned in this way creates too much separation between parts that correspond to each other, drawing attention away from that relevant correspondence. For example, separating fields and data types in a create table statement draws you to look at the list of field names and not their data types.
  • Lists of fields, where conditions, and tables after a from statement, must:
    • start on a new line
    • have one level of indentation
    • be left-aligned
  • Joins should be formatted as below, with the join clause double indented, and the joined fields triple indented.
  • select
        tbl1.field1
        , tbl1.field2
    from
        table1 tb1
            inner join
        table2 tb2
                on  tb1.field1 = tb2.field1
                and tb1.field2 = tb2.field2
            left join
        table3 tb3
                on ...
    

Comments

  • Use comments sparingly - a comment is only needed when the code is unclear enough that it needs to be explained. Messy code won't be made better with the addition of comments.
  • In writing a comment, restrict it to stating the general purpose of a block of code in the most concise manner possible.
  • Never leave commented code co-existing with live code.

Queries, Functions & Procedures

  • If the logic in a single query is hard to understand, then it is probably doing too much at once. Separate the logic into multiple steps using either CTEs or temporary tables.
  • Each function or procedure should do one thing, do it well, and do it only - if it's doing too much, then split the logic into more than one function or procedure.

Examples of clean code

It's easier to see how to write clean SQL code by seeing some examples. I am more than happy to add to this list upon request - either comment on Hacker News or email me directly.

Select statements

select
    rev.accountKey
    , rev.revenueType
    , coalesce(indexedChurnDate, 0) - coalesce(indexedGrossAddDate, 0) + 1 as totalMonths
    , sum(coalesce(rev.revenueValue, 0)) as revenueValue
from
    fact.revenue rev
        inner join
    dim.customer cus
            on rev.accountKey = cus.accountKey
where
    rev.accountKey in (select distinct accountKey from shortLifeCustomers)
group by
    rev.accountKey
    , rev.revenueType
    , totalMonths

Create statements

create table base.revenue (
		contractKey bigint
		, dataDate date 
		, category text 
		, value numeric
        constraint baseRevenuePrimaryKey primary key (dataDate, contractKey, category)
	);

Insert statements

insert into dim.customer (
    contractKey
    , customerKey
    , customerType
    , activationDate
    , deactivationDate
    , billingLanguage
    , birthDate
    , tenure
)
	select
		, contract_Id as contractKey
		, customer_Id as customerKey
		, cust_type_id as customerType
		, to_date(activation_dt, 'yyyy-mm-dd') as activationDate
		, to_date(deactivation_dt, 'yyyy-mm-dd') as deactivationDate
		, bill_address as billingAddress
		, bill_language as billingLanguage
		, to_date(birth_dt, 'yyyy-mm-dd') as birthDate
		, tenure
	from
		import.customer;

Update statements

update
    base.traffic
set
    dataDate = to_date('2016/11', 'YYYY-MM')
    , updateDate = now()
where
    customerKey in (select distinct customerKey from newCustomers);

Delete statements

delete from
    base.traffic
where
    dataDate = to_date('2013/03', 'YYYY-MM') ;