Complete Public Reddit Comments Corpus
- Publication date
- 2015-07
- Item Size
- 149.6G
(Below is the original Reddit comment announcing this collection and describing how the data was gathered.)
This is an archive of Reddit comments from October 2007 through May 2015 (the complete month). It reflects 14 months of work and a lot of API calls. This dataset includes nearly every publicly available Reddit comment; approximately 350,000 comments out of ~1.65 billion were unavailable due to Reddit API issues.
Q: How are the files structured?
Each file is compressed with bzip2. When uncompressed, each file is a series of JSON objects delimited by newlines (\n), i.e., one comment per line. The name of each file follows the format RC_yyyy-mm.bz2, where yyyy is the year and mm is the month. RC stands for "reddit comments."
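For readers who want to work with the dumps directly, here is a minimal Python sketch that streams one monthly file without fully decompressing it to disk. The filename RC_2015-05.bz2 is just an example month; "subreddit", "author", and "score" are standard Reddit API comment fields.

```python
import bz2
import json
from itertools import islice

# Stream a monthly dump line by line; each line holds one JSON comment object.
# "RC_2015-05.bz2" is an example name following the RC_yyyy-mm.bz2 scheme.
with bz2.open("RC_2015-05.bz2", mode="rt", encoding="utf-8") as f:
    for line in islice(f, 5):          # first five comments, as a smoke test
        comment = json.loads(line)
        print(comment["subreddit"], comment["author"], comment["score"])
```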
Q: What does Reddit use for comment ids?
Comment ids are in base 36. If an id starts with t1_, strip that prefix and convert the remainder from base 36 to base 10 to get an integer representation of the comment id. Most comments should be in sequential order.
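As a quick illustration, the conversion is a one-liner in Python, which parses base 36 natively ("t1_abc123" below is a made-up fullname, not a real comment id):

```python
def comment_id_to_int(fullname: str) -> int:
    """Convert a comment id such as 't1_abc123' to its base-10 integer."""
    if fullname.startswith("t1_"):      # strip the "type 1" (comment) prefix
        fullname = fullname[3:]
    return int(fullname, 36)            # int() accepts bases 2 through 36

print(comment_id_to_int("t1_abc123"))   # 623698779, for this hypothetical id
```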
Q: I noticed 1-5 comments are missing on average for each 100 sequential ids -- why?
Those comments were removed, private, or unavailable from the API. 99% of the time, it's due to the comment being posted in a private subreddit.
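A rough way to measure those gaps yourself is sketched below, under the assumption that a dump is approximately id-ordered, as noted above; out-of-order stretches would distort the count. The bare "id" field in each JSON object carries the base-36 id without the t1_ prefix.

```python
import bz2
import json

# Count id gaps in one monthly dump, assuming ids are roughly ascending.
prev_id = None
missing = 0
with bz2.open("RC_2015-05.bz2", mode="rt", encoding="utf-8") as f:
    for line in f:
        current = int(json.loads(line)["id"], 36)
        if prev_id is not None and current > prev_id + 1:
            missing += current - prev_id - 1   # ids skipped in between
        prev_id = current
print(f"approximate missing comments: {missing}")
```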
Q: I’m doing analysis on scores and need to know when the comments were fetched.
Most of the JSON objects have a "retrieved_on" key that I added to record when that particular comment was pulled from the Reddit API. There is also an "archived" key in each JSON object that tells you whether the comment has been archived (meaning people can no longer vote or reply to it).
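For example, here is a sketch that pairs each score with its retrieval time. It assumes "retrieved_on" is a Unix epoch timestamp in seconds, like the standard created_utc field, and skips the few records that lack the key.

```python
import bz2
import json
from datetime import datetime, timezone
from itertools import islice

# Report when each comment's score snapshot was taken, and whether the
# comment was already archived (frozen against votes and replies).
with bz2.open("RC_2015-05.bz2", mode="rt", encoding="utf-8") as f:
    for line in islice(f, 5):
        c = json.loads(line)
        fetched = c.get("retrieved_on")    # a few records may lack the key
        if fetched is None:
            continue
        when = datetime.fromtimestamp(int(fetched), tz=timezone.utc)
        print(c["id"], c["score"], c.get("archived"), when.isoformat())
```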
Q: I have additional questions and would like to contact you. What is your contact information?
No problem. My e-mail is jason@pushshift.io
- Addeddate
- 2015-07-09 01:23:38
- Identifier
- 2015_reddit_comments_corpus
- Scanner
- Internet Archive Python library 0.8.4
- Year
- 2015
IN COLLECTIONS
Unsorted Datasets
Uploaded by Sketch the Cow
SIMILAR ITEMS (based on metadata)
I took the Reddit comment archive and converted all the JSON into one SQLite database using this program that I wrote: https://gist.github.com/ers35/3b615a75fa0ed5e6d5cc I ran a few tests to make sure the number of database rows matches the number of JSON records. "SELECT MAX(rowid) FROM comment" and "SELECT COUNT(id) FROM comment" both return 1659361605. This gives me some confidence as to the integrity of the dataset, but I cannot be 100% sure. The compressed size is 163G....
Dataset published and compiled by /u/Stuck_In_the_Matrix , in r/datasets . The dataset is ~1.7 billion JSON objects complete with the comment, score, author, subreddit, position in comment tree and other fields that are available through Reddit's API. I'm currently doing NLP analysis and also putting the entire dataset into a large searchable database using Sphinxsearch (also testing ElasticSearch). This dataset is over 1 terabyte uncompressed, so this would be best for larger research...
Topics: reddit, datasets, comments, bigquery, Stuck_In_the_Matrix
To browse the repository: Click Here. This website is a repository for web content that has been deemed "legacy" and removed by its original publishers, and might otherwise be difficult or cumbersome to obtain. I started this at the end of 2018, in response to Mozilla removing all legacy extensions from its add-ons site, with plans to expand to more, similar "legacy" content; since then, a few things have changed, requiring me to re-evaluate both the need for this site and my...
( 2 reviews )
The Dataset Collection consists of large data archives from both sites and individuals.
i found this at a goodwill on june 1st, 2023 and it has a lot of wieners in it
( 8 reviews )
Topics: nsfw, gay porn, lgbtq, pride
14,566,367 Hacker News comments and stories archived by Grey Panther's Hacker News Archiver. See https://hn-archive.appspot.com/ for: more details; the source code; a newsletter subscription with notifications about the project; and contact information for Grey Panther for other inquiries.
Topics: Hacker News, Forum, Comments, Stories
Abstract: While playing around with the Nmap Scripting Engine (NSE), we discovered an amazing number of open embedded devices on the Internet. Many of them are based on Linux and allow login to a standard BusyBox shell with empty or default credentials. We used these devices to build a distributed port scanner to scan all IPv4 addresses. These scans include service probes for the most common ports, ICMP ping, reverse DNS, and SYN scans. We analyzed some of the data to get an estimation of the IP address...
Software for MS-DOS machines that represent entertainment and games. The collection includes action, strategy, adventure and other unique genres of game and entertainment software. Through the use of the EM-DOSBOX in-browser emulator, these programs are bootable and playable. Please be aware this browser-based emulation is still in beta - contact Jason Scott , Software Curator, if there are issues or questions. Thanks to eXo for contributions and assistance with this archive. Thank you for your...
All the "journal article" DOIs from CrossRef's OAI-PMH server; URLs of just under 50 million journal articles.
Topics: doi, dataset