Google Cloud Platform Blog
Product updates, customer stories, and tips and tricks on Google Cloud Platform
GitHub on BigQuery: Analyze all the open source code
Wednesday, June 29, 2016
Posted by Felipe Hoffa, Google Developer Advocate
Google, in collaboration with GitHub, is releasing an incredible new open dataset on Google BigQuery. So far you've been able to monitor and analyze GitHub's pulse since 2011 (thanks to the GitHub Archive project!), and today we're adding the perfect complement to this. What could you do if you had access to analyze all the open source software in the world, with just one SQL command?
The Google BigQuery Public Datasets program now offers a full snapshot of the content of more than 2.8 million open source GitHub repositories in BigQuery. Thanks to our new collaboration with GitHub, you'll have access to analyze the source code of almost 2 billion files with a simple (or complex) SQL query. This will open the doors to all kinds of new insights and advances that we're just beginning to envision.
For example, let's say you're the author of a popular open source library. Now you'll be able to find every open source project on GitHub that's using it. Even more, you'll be able to guide the future of your project by analyzing how it's being used, and improve your APIs based on what your users are actually doing with it.
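As a rough sketch of what such a search might look like, the Python helper below builds a BigQuery standard SQL query that counts files containing a given import line, grouped by repository. The helper name `build_usage_query`, the table `bigquery-public-data.github_repos.sample_contents`, and the columns `sample_repo_name` and `content` are assumptions for illustration and are not taken from this post; check the dataset schema before relying on them. The code only builds the query text and its parameter, and running it would still require a BigQuery client and a project.

```python
# Hypothetical helper; the table and column names below are assumptions.
SAMPLE_CONTENTS = "bigquery-public-data.github_repos.sample_contents"


def build_usage_query(import_line, table=SAMPLE_CONTENTS, limit=100):
    """Return (sql, params) listing repositories whose files contain import_line."""
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    sql = (
        "SELECT sample_repo_name AS repo, COUNT(*) AS matching_files\n"
        f"FROM `{table}`\n"
        "WHERE STRPOS(content, @needle) > 0\n"
        "GROUP BY repo\n"
        "ORDER BY matching_files DESC\n"
        f"LIMIT {limit}"
    )
    # The search text is passed as a named query parameter (@needle),
    # so it is never spliced into the SQL string itself.
    params = {"needle": import_line}
    return sql, params


sql, params = build_usage_query("import requests")
print(sql)
print(params)
```

Pointing the query at a sample table rather than the full contents table keeps the number of bytes scanned, and therefore the quota used, much lower.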
On the security side, we've seen how the most popular open source projects benefit from having multiple eyes and hands working on them. This visibility helps projects get hardened and buggy code cleaned up. What if you could search for errors with similar patterns in every other open source project? Would you notify their authors and send them pull requests? Well, now you can.
Some concepts to keep in mind while working with BigQuery and the GitHub contents dataset:
With BigQuery everyone gets a terabyte every month to run queries. If you've never tried BigQuery before, follow these getting started instructions.
The contents table has all the non-binary files in GitHub that are less than 1MB. It's a huge table, with more than 1.5 terabytes of data! This means the monthly terabyte for BigQuery queries won't last long if you want to query this table. To make your life easier, we've created extracts with only a sample of 10% of all files of the most popular projects, as well as another dataset with all the .go, .rb, .js, .php, .py, and .java code. Use them to make your free quota last!
If these tables are not enough, you can always create your own extracts (but you'll be billed for the respective storage). To do so, you could sign up for $300 in Google Cloud Platform credits. These credits could be used to store terabytes (and more) of data in BigQuery.
BigQuery makes it easy to join different datasets. How about ranking coding patterns by the number of stars their projects get? See a related post looking at the Hacker News effect on a project’s GitHub stars.
SQL is not enough? Learn how BigQuery allows you to run arbitrary JavaScript code inside SQL to enable a full range of possibilities.
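To make the quota point from the list above concrete, here is a minimal Python sketch of the arithmetic. The function name `full_scans_per_month` and the use of decimal terabytes are assumptions for illustration. The 1.5 terabyte figure is the lower bound quoted above, so the real number of affordable full scans is somewhat smaller.

```python
TB = 10**12  # assumption: decimal terabytes, used here only for illustration


def full_scans_per_month(table_bytes, free_bytes=1 * TB):
    """How many complete scans of a table the free monthly allowance covers."""
    if table_bytes <= 0:
        raise ValueError("table_bytes must be positive")
    return free_bytes / table_bytes


contents_bytes = 1.5 * TB  # "more than 1.5 terabytes", per the post
print(f"{full_scans_per_month(contents_bytes):.2f} full scans per month")
```

At less than one full scan per month, a single query over the whole contents table would use up the free terabyte, which is why the smaller extracts are worth using. BigQuery bills by the columns a query reads, so queries that avoid the file content column cost far less.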
To learn more, read GitHub's announcement and try some sample queries. Share your queries and findings in our reddit.com/r/bigquery and Hacker News posts. The ideas are endless, and I'll start collecting tips and links to other articles on this post on Medium.
Stay curious!