Hello folks, this is Hamel from GitHub -- I'm one of the Machine Learning Engineers who worked on this project. The reason we are excited to host this data is that we believe the community will be able to innovate and advance the state of the art much faster if the data is provided in a tractable format for machine learning researchers. This data is already public; however, acquiring, parsing, deduping, and cleaning code from many programming languages at massive scale on GitHub requires specialized knowledge. We strove to reduce these barriers to encourage greater involvement.
While we present the task of information retrieval as one possible use case of this dataset, we know there could be other practical applications of this data (e.g. code summarization). While we went to great lengths to pre-process the data for the community, the data is still messy, and you will often find that there is no high-quality comment aligned with a code snippet we parsed. However, we believe this is part of the excitement of the dataset: it poses challenges that machine learning practitioners will have to address.
Code is very different from natural language with regard to its structure and syntactic rules, and it may benefit from approaches different from standard natural language processing. Our baseline models and benchmarks mostly treat code as natural language; however, we are aware that there could be an opportunity to innovate on this front. If anyone creates any interesting projects from this dataset, please do get in touch. Happy to answer any questions!
<rant> GitHub has one of the worst search experiences in the modern history of the internet. It doesn't need fancy ML or DL. Just get the stupid deduping done, for God's sake. A simple TF-IDF similarity metric would make your search 10,000% better. You even know forks, names of files, folders, number of stars. You have no excuse for doing so badly that a teenager could beat you with simple handcrafted rules. First learn and do the basics, and then come here to talk about your need for ML. </rant>
PS: Sorry, I had to get this out. I've spent the better part of my productivity over the past two years sifting through pages 50 and 100 of the undeduped results that GitHub mercilessly throws at you. No one needs to suffer like this.
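For what it's worth, here's roughly what I mean -- a minimal sketch of TF-IDF near-duplicate filtering over search results with scikit-learn. The field names and the 0.9 cosine threshold are placeholders, not anything GitHub actually exposes:

```python
# Rough sketch: collapse near-duplicate search results with TF-IDF + cosine
# similarity. Field names and the 0.9 threshold are arbitrary placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def dedupe_results(results, threshold=0.9):
    """Keep the first result in each group of near-identical file contents."""
    texts = [r["file_content"] for r in results]
    tfidf = TfidfVectorizer().fit_transform(texts)
    sims = cosine_similarity(tfidf)

    kept = []
    for i, result in enumerate(results):
        # Drop this hit if it is nearly identical to something already kept.
        if any(sims[i, j] >= threshold for j in kept):
            continue
        kept.append(i)
    return [results[i] for i in kept]

results = [
    {"repo": "userA/project", "file_content": "def add(a, b): return a + b"},
    {"repo": "userB/project-fork", "file_content": "def add(a, b): return a + b"},
    {"repo": "userC/other", "file_content": "class Tree: pass"},
]
print(dedupe_results(results))  # the fork's identical copy is dropped
```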
Seriously, how dare someone at GitHub release a large dataset that might be interesting while their website’s basic search functionality is still terrible!? Why isn’t everyone at GitHub working to solve just this problem??
C'mon, cut these folks some slack -- GitHub is huge, and one annoying adjacent feature doesn't give you the right to attack them like this.
This is not an "attack", it's just straightforward feedback, and one I agree with. A company that has received more than 100 million in investment shouldn't have such a bad search feature.
How would you make sure that any derived works (resulting from the use of this dataset) are properly licensed? It is very likely that this dataset (as per 1909.09436.pdf: "2 million functions, obtained from mechanically scraping and preprocessing associated function documentation") is contaminated by code with dubious licenses, right? And any derived work as a result would also be contaminated, right? What's the plan to deal with this issue?
[I'm one of the Microsoft Research people who worked on this]
We used the repo-level license information from github to filter to free non-copyleft licenses, and the license files are actually stored next to the extracted data corpus (see https://github.com/github/CodeSearchNet#Licenses).
[I'm one of the Microsoft Research people who worked on this]
All code is from public repositories. We only used publicly available APIs to obtain the data, so that others can reproduce the results / build on top of the data extraction pipeline we built: https://github.com/github/CodeSearchNet/tree/master/function...
We only used code from repos that github has marked as using a non-copyleft open source license (i.e., we had an explicit license whitelist and used only repos matching that).
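Conceptually, the filter boils down to something like this sketch -- not our actual pipeline code; the whitelist below is illustrative, and it just reads the SPDX identifier the public GitHub API reports for a repo:

```python
# Illustrative only -- not the project's actual pipeline. Checks a repo's
# license (as reported by the public GitHub repos API) against a whitelist
# of permissive, non-copyleft SPDX identifiers.
import requests

PERMISSIVE_SPDX = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC", "Unlicense"}

def has_permissive_license(owner, repo):
    resp = requests.get(f"https://api.github.com/repos/{owner}/{repo}")
    resp.raise_for_status()
    license_info = resp.json().get("license") or {}  # null if no detected license
    return license_info.get("spdx_id") in PERMISSIVE_SPDX

print(has_permissive_license("github", "CodeSearchNet"))
```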
The unit of observation for this data is a function or method and its associated docstring or top-level comment. So there are no code reviews in the data. However, we do include metadata, including the SHA and owner/repo, which would allow you to retrieve this information! What were you thinking of doing with code reviews?
Shawn from Weights & Biases here. We've been working with Github and Microsoft Research on this for just about a year now and we're super excited to launch it today.
We've seen huge advances in human language modeling and translation due to the success of deep learning. Often new directions start with a really motivated team producing a new kind of dataset. Who better to do that for code as language than Github!
This started as a grassroots effort inside of Github, and went through many iterations. When it was presented to Github's CEO six months ago, he correctly pointed out that we needed to go back and include Github's most popular language (javascript). As the project went on many smart people chipped in, and we produced something that we think is truly useful.
We overcame plenty of challenges to pull this off. For example: how do you clean this data? how do you label it? We've got folks from Github, Microsoft Research and Weights & Biases here to answer any and all questions you might have. Can't wait to see where this goes!
> We've seen huge advances in human language modeling and translation due to the success of deep learning.
I wonder if we’ll eventually see a system where instead of writing code you describe in natural language what you want the program to do and then ML is applied in order to generate the code for that.
I mean, a lot of people have been interested in making programming languages closer to human language in the past, with varying degrees of success.
Personally I love writing code, but it could happen, couldn't it?
Write some unit tests, a human description of what it does and based on the source code and description of existing software the system would basically “debug the program into existence” for you.
That’d be kind of freaky, kind of cool and a little bit scary.
Find a human who wants a computer program written (A). Find a human who can program computers (B). Have A describe what they want and have B code it without asking questions to further clarify the issues. What do you expect the result to be like? For me, my experience tells me that it will be a total failure.
The problem with programming is, for the most part, not the encoding of the requirements in a programming language. The problem is that the specifier (A in this example) usually does not have a full grasp of what they actually want. In fact, they usually don't have any idea at all. "Give me an e-shopping system to sell comic books" is the level of detail they can understand.
The closer A can come to expressing the requirements they need, the closer they are to actually being B in reality. B's real skill is not in knowing the syntax and grammar of the computer language, it's in knowing that in order to make a system that will satisfy A we need to do X, Y, and Z to the tiniest detail.
Where we get into trouble with our software is when we write code that is dramatically more complex than the problem we are trying to represent. This doesn't happen so much because we don't know how to program. It happens because we are slowly extending the code base over time with imperfect knowledge of what we are ultimately building at any given moment. We also have to trade off the benefit of getting something done against discovering generalities that allow us to simplify the expression of the code we already have.
I don't think we will ever replace "programmers" with AI -- at least not until the AI can be trained to ask the important questions about what the system really needs to be (and for that we need Turing-test-passing AI). I think it's much more likely that we will build more and better tools that help programmers visualise and plan the programming situation. I think we will have automatic code generation because we already have it: look at "derive" in Haskell and Rust, for example. But I think that's the level of automatic code generation we're going to want for at least the next 20 years or so.
Interestingly for testing, I think we'll actually go the opposite direction: We will spend more time thinking about requirements and the computer will help us by writing tests that challenge our assumptions: I've broken this function, are you sure you got it right? Again, we already have these kinds of systems and I think that this is the most appropriate direction to invest in research.
> Write some unit tests, a human description of what it does and based on the source code and description of existing software the system would basically “debug the program into existence” for you.
Sounds more or less like the mechanism by which developer jobs succumb to automation. Hopefully those of us who are working class will have seized the capital by then.
Great idea -- I would love to see this. Tools for programming are going to make some interesting leaps with all the new work going into language modeling. So much of the code we write consists of tweaks and combinations of a not particularly large set of patterns: loops, functions, merge, reduce, sort, filter, interleave, etc. Generating large blocks that are near your target result would be really useful for saving time, especially on the more repetitive tasks like writing tests or simple CRUD API endpoints.
Microsoft was showing off similar work from their own code datasets this year at ICLR. I couldn't find a link online, but the demos had block suggestions from method signatures for C#. It should be possible to get similar results with natural language queries.
Yes, that would be really cool. The field of program synthesis has made strides in this area, but it doesn't appear you can create anything more than trivial programs from human language at the moment. I think that you are more likely to see technology that augments the human significantly -- for example better code completion, error detection etc. that allows you to work much faster. It will be exciting to see how machine learning shapes developer tools and workflows in the future.
[I'm one of the Microsoft Research people who worked on this]
There wasn't a technical reason (unless you count laziness as a technical reason) -- we simply had infrastructure for Python specifically lying around from past research projects, which we initially reused.
After we got Nat's feedback, we redid our data processing pipeline completely to be based on TreeSitter (which wasn't around when we started thinking about parsing Python), which makes it much easier to scale to the number of programming languages on GitHub.
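To give a sense of what the per-function extraction step boils down to, here is a toy Python-only version using the standard ast module. This is not our TreeSitter pipeline, just an illustration of how (function, docstring) pairs get pulled out of source files:

```python
# Toy illustration of the (function, docstring) extraction step, using only
# Python's standard ast module. The real pipeline is TreeSitter-based and
# covers several languages; this is just the core idea.
import ast

def extract_pairs(source):
    """Yield (function name, code, docstring) for documented functions."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            docstring = ast.get_docstring(node)
            if docstring:  # skip undocumented functions
                yield node.name, ast.get_source_segment(source, node), docstring

example = '''
def bubblesort(xs):
    """Sort a list in place using bubble sort."""
    for i in range(len(xs)):
        for j in range(len(xs) - i - 1):
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
    return xs
'''

for name, code, doc in extract_pairs(example):
    print(name, "->", doc)
```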
I haven't looked at this in detail, but where I could see this technology going is by doing a kind of lint. We've got hand built linters for most languages and some of them are really good (Clippy in Rust is amazingly good -- it practically writes my code sometimes). However, a tool that analysed my code, picked out things that were idiomatically strange and then suggested example code that might be better would be quite useful, I think.
[I'm one of the Microsoft Research people who worked on this]
There are many interesting ideas that you could build on top of this kind of data, and we only scratched the surface so far.
For example, the simple "search" technique we are using as a baseline is based on the idea of joint embeddings: We learn functions f_query and f_js/f_python/... to map from the inputs into some vector space such that, for example, for a (python method, docstring) pair, f_query(docstring) is near to f_python(method). To search given a query, we just do f_query(query) and look for nearest neighbours in all the code we indexed before.
Now, we could also just do f_python(def bubblesort(...): ...) and look for the nearest neighbour that is in C#, and we should get out a C# implementation of bubblesort. Similarly, we could apply all kinds of filters on the results (code from highly-starred repos, code that uses framework X, ...) to do more interesting things.
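In code, the search side of that joint-embedding idea looks roughly like this toy sketch (the encoders here are random stand-ins, not our actual baseline models):

```python
# Toy sketch of search over a joint embedding space. The encoders here are
# stand-ins; the real baselines learn f_query and one f_<language> per
# language so that f_query(docstring) lands near f_<language>(code).
import numpy as np

def f_query(query):   # placeholder query encoder
    return fake_embed(query)

def f_python(code):   # placeholder code encoder for Python
    return fake_embed(code)

def fake_embed(text, dim=128):
    # Deterministic-ish pseudo-embedding so the example runs; a real
    # encoder would be a trained neural network.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

# Index: embed every function in the corpus once, up front.
corpus = ["def bubblesort(xs): ...", "def parse_json(s): ...", "def get_user(id): ..."]
index = np.stack([f_python(snippet) for snippet in corpus])

# Query: embed the natural-language query and take nearest neighbours by
# cosine similarity (a dot product, since everything is unit-normalised).
query_vec = f_query("sort a list with bubble sort")
scores = index @ query_vec
for rank in np.argsort(-scores)[:2]:
    print(scores[rank], corpus[rank])
```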
Great effort in putting together this large corpus! While reading through your paper, I noticed the difficulties you faced in correctly annotating the code for quality and correctness and in hiring annotators for different languages. I can imagine how herculean this task must be.
I was wondering whether you considered including Stack Overflow questions and answers, which have been vetted by thousands of programmers over a long period of time. Stack Overflow might even want to participate in this effort to provide a clean ground truth for this great project.
Code challenges like this are interesting - thank you for putting this together!
I do have a question about the set-up, if that's alright. Netflix and others have found that shared tasks can lead to great models, but not necessarily ones that are suited for use in a production environment. Have you put much thought into how best to set up a challenge such as this to make the obvious "ensemble everything" solution less worthwhile?
Similarly, have you put much thought into how to encourage the sharing of information between participants?
1. We could log additional information about the model, such as inference time, number of parameters, memory usage, etc., and have the primary metric be overall efficiency (best NDCG with fewest parameters/fastest runtime/etc.) -- see the sketch at the end of this comment.
2. We're experimenting with different kinds of benchmarks, and I am most excited about explicitly collaborative ones. In these there is no contest/prize (hence no incentive to cheat/withhold information); only the shared goal of improving the model and our collective understanding of the problem. I hope we can incentivize information sharing by tracking and acknowledging individual contributions to the eventual best model in the benchmark.
We could approximate individual contribution by seeing which scripts, code segments, workflows, architectural changes, writeups, or discussion comments other participants rate as the most helpful or choose to include in their experiments most often as the benchmark evolves. Of course this could only be an estimate -- as Shawn says above, any idea could have "actually happened in a hallway conversation".
Still, this is much easier to achieve in a logging/visualization platform like W&B than in the current paradigm of "read research papers, clone relevant repos, spend weeks trying to synthesize/reproduce their results, run your own experiments, write them up in a research paper, hope it gets accepted to a conference before other people publish the same idea, try to integrate your changes/publish your own repo, repeat" -- and for hundreds of practitioners, ranging from brand new students to PhDs, working on related problems. This cycle is especially challenging for folks who are new to, working outside of, or trying to collaborate across the relatively few established/well-funded academic/industrial teams.
Collaborative benchmarks can be especially impactful for social good projects, where the primary incentive is to figure out and broadly implement the best solution ASAP (e.g. climate change!), not to make money or argue over the credit attribution. So, my long-term goal is for as much sharing of information and collaboration from as many folks as possible--the more inclusive and transparent the field of deep learning research becomes, the safer and better its outcomes. Very open to ideas on how to help make this happen.
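Here's the sketch I mentioned for point 1 -- a rough illustration of an efficiency-adjusted metric. The NDCG part is standard; the way it is combined with parameter count and latency, and the budget constants, are entirely made up:

```python
# Sketch of an efficiency-aware leaderboard metric: standard NDCG@k combined
# with a penalty for model size and latency. The combination rule and the
# reference budgets are made up for illustration.
import numpy as np

def dcg(relevances):
    relevances = np.asarray(relevances, dtype=float)
    ranks = np.arange(1, len(relevances) + 1)
    return np.sum((2**relevances - 1) / np.log2(ranks + 1))

def ndcg(relevances, k=10):
    rels = relevances[:k]
    ideal = sorted(relevances, reverse=True)[:k]
    return dcg(rels) / dcg(ideal) if dcg(ideal) > 0 else 0.0

def efficiency_adjusted_score(relevances, num_params, latency_ms,
                              param_budget=100e6, latency_budget_ms=50.0):
    quality = ndcg(relevances)
    # Penalise models that blow past the (hypothetical) budgets.
    cost = max(num_params / param_budget, latency_ms / latency_budget_ms, 1.0)
    return quality / cost

# Graded relevances of the top results for one query, as judged by annotators.
rels = [3, 2, 3, 0, 1, 2]
print(ndcg(rels), efficiency_adjusted_score(rels, num_params=350e6, latency_ms=30))
```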
What is exciting about the tools we provided in this competition -- especially the Weights & Biases leaderboard -- is the level of transparency, which you don't always get in a Kaggle competition (unless it is shared voluntarily). You can see:
- All the system logging information, including CPU/GPU utilization, runtime, and the type of GPU card used
- Extensive logging of model training progression
- All of the model artifacts and metadata
- A link to the code on GitHub that produced that run
- Anything emitted to stdout (for logging)
- etc.
This allows for extreme reproducibility and insight, which is very helpful. With these tools, the community can see whether an "ensemble everything" method is used, how long the model takes to train, what resources are consumed, and so on.
There's been a lot of thought at Weights & Biases about the tradeoff between competition and collaboration. Competition certainly fosters activity, but it encourages behaviors like ensembling everything to eke out a few more hundredths of a percent. This benchmark isn't incentivized in any way other than "let's drive the field forward", so we may see less of that behavior.
We've considered benchmarks that proceed in phases: a closed competitive phase for 3 months, then award a prize to the top result, and another prize for best user writeup. Follow that by a collaborative phase where it's more about sharing, teamwork etc. Rinse and repeat.
The question of attribution is really interesting. Who made the largest contribution to the development of a model? It could have happened in a hallway conversation, or something equally untrackable. We'd love to hear other people's thoughts on this.
Stacey on our team has put a lot of thought into these topics and may have more to say here!
The Netflix challenge was quite a while ago now. Since then Kaggle has added things like Kaggle kernels, where the models are trained on data they haven’t seen before (not just evaluated).
Resources and training time are also kept even between submissions.
Since the CodeSearchNet Corpus contains metadata such as owner/repository, it would be nice to create a search tool for the data set itself. That way you could check if, by chance, some of your open source code is part of the corpus.
The data set is apparently ~20GB [0], so a cheap VPS instance might do the job of hosting the data in a searchable format.
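Something as simple as the sketch below would already cover the "is my repo in here" check, assuming the corpus keeps a gzipped JSON-lines layout with a per-function repo field (I haven't verified the exact schema, so field names may need adjusting):

```python
# Quick-and-dirty "is my repo in the corpus?" check. Assumes the corpus is a
# set of gzipped JSON-lines files with a per-function "repo" field (the exact
# schema is an assumption here; adjust field names as needed).
import gzip
import json
import sys
from pathlib import Path

def find_repo(corpus_dir, repo_name):
    """Yield every function record whose repo matches repo_name."""
    for path in Path(corpus_dir).rglob("*.jsonl.gz"):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                if record.get("repo") == repo_name:
                    yield record

if __name__ == "__main__":
    corpus_dir, repo_name = sys.argv[1], sys.argv[2]  # e.g. ./python "myuser/myrepo"
    for record in find_repo(corpus_dir, repo_name):
        print(record.get("path"), record.get("func_name"))
```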
A phrase is never enough to describe what the task at hand is. It may work for simpler use cases like the results on stack overflow. Otherwise I do not see it doing better than a google search which leads me to stack overflow.
You're right. We'll need bigger datasets and more complex models to achieve more useful results, but machine learning models can give predictions in real time, so you could have this running as you type in your IDE instead of visiting a website, with a model fine-tuned to your code base. "simpler use cases like the results on stack overflow" could explain at least 10% of the work I do, so even just that tool would be very useful for mediocre programmers like myself.
Behind all the hype, predictive text is something machine learning models are beginning to do very well. Gmail has rolled out a lot of similar features based on advances in deep learning models.
It might be useful for scoped search and code navigation when you don't have a general-purpose question that is amenable to Stack Overflow. Let's say you are trying to find code in a repository that carries out a task but your keyword search turns up empty -- semantic search might be able to help you in that kind of situation.
> Our fully preprocessed CodeSearchNet Corpus is available for download on Amazon S3
I am surprised that GitHub went with S3 for this download. Isn't there an Azure equivalent of S3 for large object storage? This just shows the dominance of AWS.
It's a good thing that MS doesn't force every team to always choose MS products, especially for a trivial thing like download storage. Maybe the team was more familiar with AWS, who knows. I'm just glad that they can make this decision.
> Searching for code to reuse, call into, or to see how others handle a problem is one of the most common tasks in a software developer’s day
I don't disagree at all that this is how we code these days... but I distinctly remember a time when this wasn't so. We had to do everything ourselves. We engineered our solutions based on various requirements and constraints, and most importantly, we had to figure it out ourselves. The only external help we had was with the APIs we used... and they had to be studied to be understood.
Even in recent times, the most fun I've had programming has been when it's all come from my wee little head, rather than trawling for solutions and shoehorning/reworking something similar.
Only mildly related, but your comment reminded me of a recent experience that made me fond of Java's way of doing things.
I had to use a tool for work (name withheld to protect the guilty) that has awful documentation. The tool allows you to write snippets of your own code, but provides no IDE and no documentation (AFAIK) of anything but the most trivial aspects of the API.
I started using Python, but with no debugger and no interactive shell there was no way I was going to guess the names of the functions I needed. Lucky for me, someone uploaded the Javadoc of an older version of the API, and that was the missing piece of my puzzle: having the function names, the return types, and Java's stack traces, I now had all I needed.
Back to the topic: like you, I sometimes wonder if there's a downside to not having to scroll through hundreds of manual pages anymore. But until someone shows some kind of evidence of something being lost, I won't worry too much.
That said, I definitely wish more companies would make their documentation available offline, if only as a static version of the online version. For those of us who regularly program in trains and planes, offline docs are a lifesaver.