400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Spaces or Tabs?

Tabs or spaces. We are going to parse a billion files among 14 programming languages to decide which one is on top.

The rules:

Data source: GitHub files stored in BigQuery.
Stars matter: We’ll only consider the top 400,000 repositories — by number of stars they got on GitHub during the period Jan-May 2016.
No small files: Files need to have at least 10 lines that start with a space or a tab.
No duplicates: Duplicate files only have one vote, regardless of how many repos they live in.
One vote per file: Some files use a mix of spaces or tabs. We’ll count on which side depending on which method they use more.
Top languages: We’ll look into files with the extensions (.java, .h, .js, .c, .php, .html, .cs, .json, .py, .cpp, .xml, .rb, .cc, .go).

Numbers

How-to

I used the already existing [bigquery-public-data:github_repos.sample_files] table, that lists the files of the top 400,000 repositories. From there I extracted all the contents for the files with the top languages extensions:

SELECT a.id id, size, content, binary, copies, sample_repo_name , sample_path
FROM (
  SELECT id, FIRST(path) sample_path, FIRST(repo_name) sample_repo_name 
  FROM [bigquery-public-data:github_repos.sample_files] 
  WHERE REGEXP_EXTRACT(path, r'\.([^\.]*)$') IN ('java','h','js','c','php','html','cs','json','py','cpp','xml','rb','cc','go')
  GROUP BY id
) a
JOIN [bigquery-public-data:github_repos.contents] b
ON a.id=b.id

864.6s elapsed, 1.60 TB processed

That query took a relative long time since it involved joining a 190 million rows table with a 70 million rows one, and over 1.6 terabytes of contents. But don’t worry about having to run it, since I left the result publicly available at [fh-bigquery:github_extracts.contents_top_repos_top_langs].

In the [contents] table we have each unique file represented only once. To see the total number of files and size represented:

SELECT SUM(copies) total_files, SUM(copies*size) total_size
FROM [fh-bigquery:github_extracts.contents_top_repos_top_langs]

Then it was time to run the ranking according to the previously established rules:

SELECT ext, tabs, spaces, countext, LOG((spaces+1)/(tabs+1)) lratio
FROM (
  SELECT REGEXP_EXTRACT(sample_path, r'\.([^\.]*)$') ext, 
         SUM(best='tab') tabs, SUM(best='space') spaces, 
         COUNT(*) countext
  FROM (
    SELECT sample_path, sample_repo_name, IF(SUM(line=' ')>SUM(line='\t'), 'space', 'tab') WITHIN RECORD best,
           COUNT(line) WITHIN RECORD c
    FROM (
      SELECT LEFT(SPLIT(content, '\n'), 1) line, sample_path, sample_repo_name 
      FROM [fh-bigquery:github_extracts.contents_top_repos_top_langs]
      HAVING REGEXP_MATCH(line, r'[ \t]')
    )
    HAVING c>10 # at least 10 lines that start with space or tab
  )
  GROUP BY ext
)
ORDER BY countext DESC
LIMIT 100

16.0s elapsed, 133 GB processed

Analyzing each line of 133 GBs of code in 16 seconds? That’s why I love BigQuery.

More?

Want more stories? Check my medium, follow me on twitter, and subscribe to reddit.com/r/bigquery. And try BigQuery — every month you get a full terabyte of analysis for free.

All the open source code in GitHub now shared within BigQuery: Analyze all the code!
All the open source code in GitHub is now available in BigQuery.medium.com

Analyzing GitHub issues and comments with BigQuery
(Docker, Kubernetes, TensorFlow and other top projects)medium.com