Complete Public Reddit Comments Corpus
- Publication date
- 2015-07
- Item Size
- 149.6G
(Below is the original Reddit comment announcing this collection and describing how the data was gathered.)
This is an archive of Reddit comments from October 2007 through May 2015 (the complete month). It reflects 14 months of work and a lot of API calls. This dataset includes nearly every publicly available Reddit comment; approximately 350,000 comments out of ~1.65 billion were unavailable due to Reddit API issues.
Q: How are the files structured?
Each file is compressed with bzip2. When uncompressed, each file is a series of JSON objects delimited by newlines (\n), i.e., one comment per line. The name of each file follows the format RC_yyyy-mm.bz2, where yyyy is the year and mm is the month. RC stands for "reddit comments."
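For readers who want to work with the dumps directly, here is a minimal Python sketch that streams one monthly file without fully decompressing it to disk. The filename RC_2015-05.bz2 is just an example month; "subreddit", "author", and "score" are standard Reddit API comment fields.

```python
import bz2
import json
from itertools import islice

# Stream a monthly dump line by line; each line holds one JSON comment object.
# "RC_2015-05.bz2" is an example name following the RC_yyyy-mm.bz2 scheme.
with bz2.open("RC_2015-05.bz2", mode="rt", encoding="utf-8") as f:
    for line in islice(f, 5):          # first five comments, as a smoke test
        comment = json.loads(line)
        print(comment["subreddit"], comment["author"], comment["score"])
```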
Q: What does Reddit use for comment ids?
Comment ids are in base 36. If an id starts with t1_, strip that prefix and convert the remainder from base 36 to base 10 to get an integer representation of the comment id. Most comments should be in sequential order.
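As a quick illustration, the conversion is a one-liner in Python, which parses base 36 natively ("t1_abc123" below is a made-up fullname, not a real comment id):

```python
def comment_id_to_int(fullname: str) -> int:
    """Convert a comment id such as 't1_abc123' to its base-10 integer."""
    if fullname.startswith("t1_"):      # strip the "type 1" (comment) prefix
        fullname = fullname[3:]
    return int(fullname, 36)            # int() accepts bases 2 through 36

print(comment_id_to_int("t1_abc123"))   # 623698779, for this hypothetical id
```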
Q: I noticed 1-5 comments are missing on average for each 100 sequential ids -- why?
Those comments were removed, private, or unavailable from the API. 99% of the time, it's due to the comment being posted in a private subreddit.
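A rough way to measure those gaps yourself is sketched below, under the assumption that a dump is approximately id-ordered, as noted above; out-of-order stretches would distort the count. The bare "id" field in each JSON object carries the base-36 id without the t1_ prefix.

```python
import bz2
import json

# Count id gaps in one monthly dump, assuming ids are roughly ascending.
prev_id = None
missing = 0
with bz2.open("RC_2015-05.bz2", mode="rt", encoding="utf-8") as f:
    for line in f:
        current = int(json.loads(line)["id"], 36)
        if prev_id is not None and current > prev_id + 1:
            missing += current - prev_id - 1   # ids skipped in between
        prev_id = current
print(f"approximate missing comments: {missing}")
```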
Q: I’m doing analysis on scores and need to know when the comments were fetched.
Most of the JSON objects have a "retrieved_on" key that I added to record when that particular comment was pulled from the Reddit API. There is also an "archived" key in each JSON object that tells you whether the comment has been archived (meaning people can no longer vote or reply to it).
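For example, here is a sketch that pairs each score with its retrieval time. It assumes "retrieved_on" is a Unix epoch timestamp in seconds, like the standard created_utc field, and skips the few records that lack the key.

```python
import bz2
import json
from datetime import datetime, timezone
from itertools import islice

# Report when each comment's score snapshot was taken, and whether the
# comment was already archived (frozen against votes and replies).
with bz2.open("RC_2015-05.bz2", mode="rt", encoding="utf-8") as f:
    for line in islice(f, 5):
        c = json.loads(line)
        fetched = c.get("retrieved_on")    # a few records may lack the key
        if fetched is None:
            continue
        when = datetime.fromtimestamp(int(fetched), tz=timezone.utc)
        print(c["id"], c["score"], c.get("archived"), when.isoformat())
```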
Q: I have additional questions and would like to contact you. What is your contact information?
No problem. My e-mail is jason@pushshift.io
- Addeddate
- 2015-07-09 01:23:38
- Identifier
- 2015_reddit_comments_corpus
- Scanner
- Internet Archive Python library 0.8.4
- Year
- 2015
IN COLLECTIONS
Unsorted Datasets
Uploaded by Sketch the Cow
SIMILAR ITEMS (based on metadata)
I took the Reddit comment archive and converted all the JSON into one SQLite database using this program that I wrote: https://gist.github.com/ers35/3b615a75fa0ed5e6d5cc I ran a few tests to make sure the number of database rows matches the number of JSON records. "SELECT MAX(rowid) FROM comment" and "SELECT COUNT(id) FROM comment" both return 1659361605. This gives me some confidence as to the integrity of the dataset, but I cannot be 100% sure. The compressed size is 163G....
Dataset published and compiled by /u/Stuck_In_the_Matrix , in r/datasets . The dataset is ~1.7 billion JSON objects complete with the comment, score, author, subreddit, position in comment tree and other fields that are available through Reddit's API. I'm currently doing NLP analysis and also putting the entire dataset into a large searchable database using Sphinxsearch (also testing ElasticSearch). This dataset is over 1 terabyte uncompressed, so this would be best for larger research...
Topics: reddit, datasets, comments, bigquery, Stuck_In_the_Matrix
To browse the repository: Click Here. This website is a repository for web content that has been deemed "legacy" and removed by its original publishers, and might otherwise be difficult or cumbersome to obtain. I started this at the end of 2018, in response to Mozilla removing all legacy extensions from its add-ons site, with plans to expand to more, similar "legacy" content; since then, a few things have changed, requiring me to re-evaluate both the need for this site and my...
( 2 reviews )
The Dataset Collection consists of large data archives from both sites and individuals.
i found this at a goodwill on june 1st, 2023 and it has a lot of wieners in it
( 8 reviews )
Topics: nsfw, gay porn, lgbtq, pride
14,566,367 Hacker News comments and stories archived by Grey Panther's Hacker News Archiver. See https://hn-archive.appspot.com/ for: more details; the source code; a newsletter subscription with notifications about the project; and contact information for Grey Panther for other inquiries.
Topics: Hacker News, Forum, Comments, Stories
Abstract: While playing around with the Nmap Scripting Engine (NSE), we discovered an amazing number of open embedded devices on the Internet. Many of them are based on Linux and allow login to a standard BusyBox shell with empty or default credentials. We used these devices to build a distributed port scanner to scan all IPv4 addresses. These scans include service probes for the most common ports, ICMP ping, reverse DNS, and SYN scans. We analyzed some of the data to get an estimation of the IP address...
Software for MS-DOS machines that represent entertainment and games. The collection includes action, strategy, adventure and other unique genres of game and entertainment software. Through the use of the EM-DOSBOX in-browser emulator, these programs are bootable and playable. Please be aware this browser-based emulation is still in beta - contact Jason Scott , Software Curator, if there are issues or questions. Thanks to eXo for contributions and assistance with this archive. Thank you for your...
All the "journal article" DOIs from CrossRef's OAI-PMH server; URLs of just under 50 million journal articles.
Topics: doi, dataset