A concerned researcher of Reddit might want to avoid quoting or citing posts likely to be deleted by their authors (even when pseudonymous), especially on sensitive topics. How many authors delete their posts on Reddit and by when are they likely to have done so?

In [2]:
import numpy as np
import pandas as pd
In [3]:
def percent_true(s):  # in pd True=1 False=0, so mean() is percent True
    return round(s.mean() * 100, 1)
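As a quick sanity check of the helper, here is a minimal sketch on a made-up Series (the `flags` values are illustrative, not from the data). Two of the five values are True, so the helper should report 40.0.

```python
import pandas as pd


def percent_true(s):  # in pd True=1 False=0, so mean() is percent True
    return round(s.mean() * 100, 1)


# Hypothetical flags: 2 of 5 are True
flags = pd.Series([True, False, False, True, False])
print(percent_true(flags))
```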

r/AmItheAsshole in 2018¶

Let's read in no more than the first 500 posts starting in April 2018 on r/AmItheAsshole. The year 2018 is ancient history on Reddit (no activity), and back then 500 submissions covered almost the whole month. (The data also include a few posts that were removed by moderators, which I'll address shortly.)

In [4]:
df18m = pd.read_csv("reddit_20180401-20180630_AmItheAsshole_n__l500.csv")

This and all other data come from the Reddit and Pushshift APIs via my reddit-query.py script. Pushshift ingests submissions within a day and permits advanced queries, including over time periods, with aggregated results. The Reddit API can't do this, but it can then be queried per submission id for the latest state of each submission.
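To make the deleted/removed distinction concrete, here is a minimal sketch of how a submission body might be classified. It assumes the Reddit convention that a user-deleted post's body reads "[deleted]" and a moderator-removed one reads "[removed]"; the function name `classify_selftext` is hypothetical and is not taken from reddit-query.py.

```python
def classify_selftext(selftext):
    """Return 'deleted', 'removed', or 'present' for a submission body.

    Assumes Reddit's placeholder strings; not the script's actual logic.
    """
    text = (selftext or "").strip()
    if text == "[deleted]":
        return "deleted"  # the author deleted the post
    if text == "[removed]":
        return "removed"  # a moderator (or filter) removed the post
    return "present"


for body in ["[deleted]", "[removed]", "AITA for ...?"]:
    print(body, "->", classify_selftext(body))
```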

In [5]:
df18m.columns
Out[5]:
Index(['author_r', 'del_author_p', 'del_author_r', 'title', 'id',
       'created_utc', 'elapsed_hours', 'score_p', 'num_comments_p',
       'del_text_p', 'del_text_r', 'rem_text_r', 'url'],
      dtype='object')
In [6]:
df18m.shape
Out[6]:
(500, 13)
In [7]:
percent_true(df18m["del_text_p"]) # "_p" suffix means pushshift data
Out[7]:
6.6
In [8]:
df18m[df18m["del_text_p"] == True]["elapsed_hours"].max()
Out[8]:
21

Within the first 21 hours (Pushshift's longest delay before ingesting a post in this data), about 7% of redditors had already deleted their posts.

In [9]:
percent_true(df18m["del_text_r"])  # "_r" suffix means reddit data
Out[9]:
41.0

41% of those posts are deleted on Reddit.

Users can also delete their accounts independently of deleting a submission. How many of the posters have deleted their accounts?

In [10]:
percent_true(df18m["del_author_r"])
Out[10]:
55.8

56% of authors from this period in 2018 have now deleted their accounts.
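Because post deletion and account deletion are independent actions, one might also ask how often they coincide. The following is a minimal sketch on a made-up frame (the `toy` rows are hypothetical); the same boolean masks could be applied to the `del_text_r` and `del_author_r` columns of the real data.

```python
import pandas as pd


def percent_true(s):
    return round(s.mean() * 100, 1)


# Hypothetical rows: one flag pair per submission
toy = pd.DataFrame(
    {
        "del_text_r": [True, True, False, False, False],
        "del_author_r": [True, False, True, True, False],
    }
)

both = percent_true(toy["del_text_r"] & toy["del_author_r"])
post_only = percent_true(toy["del_text_r"] & ~toy["del_author_r"])
account_only = percent_true(~toy["del_text_r"] & toy["del_author_r"])
print(both, post_only, account_only)
```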


Are these numbers also true of popular posts? Let's look at no more than 500 posts starting in April with more than 150 comments. (Note, Pushshift records score and num_comments at ingest, but updates num_comments as it ingests arriving comments, so num_comments is a more recent representation of reddit and a better proxy for popularity.)

In [11]:
df18mp = pd.read_csv("reddit_20180401-20180630_AmItheAsshole_n150+_l500.csv")
In [12]:
df18mp.shape
Out[12]:
(18, 13)
In [13]:
percent_true(df18mp["del_text_r"])
Out[13]:
38.9
In [14]:
percent_true(df18mp["del_author_r"])
Out[14]:
44.4

There are only 18 posts that were sufficiently commented on back then: 39% of the posts are deleted, as are 44% of the authors' accounts.

r/AmItheAsshole in 2020¶

What of more recent data from 2020?

In [15]:
df20 = pd.read_csv("reddit_20200401-20200630_AmItheAsshole_n__l500.csv")
percent_true(df20["del_text_r"])
Out[15]:
10.0
In [16]:
percent_true(df20["del_author_r"])
Out[16]:
48.6

10% of posts in the 2020 data have been deleted by their users (much lower than in 2018) and 49% of the user accounts are deleted (similar to 2018). The likely reason we see fewer user-deleted posts is aggressive moderation since 2018 -- including removals for trivial violations such as how posts are titled. Let's discard moderator-removed submissions from the calculation.

In [17]:
df20m = df20[df20["rem_text_r"] == False]
In [18]:
len(df20m)
Out[18]:
150
In [19]:
percent_true(df20m["del_text_r"])
Out[19]:
33.3

Only 150 posts remain; the other 70% were removed by moderators! But of the remaining posts, 33% were deleted by their users, much closer to the 2018 figure. From now on, I'll use reddit-query.py to exclude moderator-removed posts via the Pushshift parameter ?selftext:not=[removed].
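The effect of the denominator can be seen in a small sketch. The `toy` frame below is made up (10 posts, 7 removed by moderators, 1 of the 3 survivors deleted by its author), chosen only to mirror the proportions above.

```python
import pandas as pd


def percent_true(s):
    return round(s.mean() * 100, 1)


# Hypothetical data: 7 of 10 posts removed; 1 of the 3 survivors deleted
toy = pd.DataFrame(
    {
        "rem_text_r": [True] * 7 + [False] * 3,
        "del_text_r": [False] * 7 + [True, False, False],
    }
)

overall = percent_true(toy["del_text_r"])  # removed posts dilute the rate
kept = toy[~toy["rem_text_r"]]  # drop moderator-removed posts
among_kept = percent_true(kept["del_text_r"])
print(overall, among_kept)
```

Removed posts cannot also be counted as user-deleted here, so leaving them in the denominator pulls the deletion rate down.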

Staying in 2020, are these numbers also true of popular posts? Let's look at no more than 500 posts starting in April with more than 150 comments.

In [20]:
df20p = pd.read_csv("reddit_20200401-20200630_AmItheAsshole_n150+_l500.csv")
In [21]:
df20p.shape
Out[21]:
(500, 13)
In [22]:
percent_true(df20p["del_text_r"])
Out[22]:
13.2

13% of even popular (and not removed by moderators) posts are deleted.

In [23]:
percent_true(df20["del_author_r"])
Out[23]:
48.6

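And 49% of the authors have deleted their accounts. (Note that this cell queries df20, the full sample, rather than df20p, so it repeats the earlier figure rather than describing popular posts specifically.)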

r/Advice and r/relationship_advice in 2020¶

Is the deletion of a third of all submissions that survive moderation on r/AmItheAsshole unusual? What about other advice subreddits, including r/Advice?

In [24]:
df20ad = pd.read_csv("reddit_20200401-20200630_Advice_n__l500.csv")
In [25]:
df20ad.shape
Out[25]:
(500, 13)
In [26]:
percent_true(df20ad["del_text_r"])
Out[26]:
42.4

No, it's not unusual: 42% of r/Advice's moderated submissions from this period are now deleted, similar to r/AmItheAsshole's 33%.

What about r/relationship_advice?

In [27]:
df20ra = pd.read_csv(
    "reddit_20200401-20200630_relationship_advice_n__l500.csv"
)
In [28]:
df20ra.shape
Out[28]:
(500, 13)
In [29]:
percent_true(df20ra["del_text_r"])
Out[29]:
50.6

Similarly, 51% of r/relationship_advice submissions from this period are now deleted.

Hence, a third to one half of (moderated) posts are deleted by their authors on these advice subreddits.

When are posts deleted by?¶

If a concerned researcher wanted to avoid quoting or citing posts likely to be deleted by their authors (even when pseudonymous), how long ought they wait before including them in their data? We only have two snapshots: when Pushshift ingested the posts (typically within a day) and what's on Reddit now. A granular report could be achieved with a long-running service that accumulates posts in real time and regularly polls to see if and when they are deleted; this would be non-trivial.
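To give a sense of what such a service would involve, here is a minimal sketch of a single polling pass. Everything in it is an assumption: `poll_once` is a hypothetical name, and `is_deleted` stands in for a per-id lookup against the Reddit API, which is not implemented here. A real service would also need scheduling, persistence, and rate limiting.

```python
from datetime import datetime, timezone


def poll_once(first_seen_deleted, post_ids, is_deleted, now=None):
    """Record the time at which each tracked post is first seen deleted.

    first_seen_deleted: dict mapping post id -> datetime (updated in place)
    is_deleted: hypothetical callable, post id -> bool
    """
    now = now or datetime.now(timezone.utc)
    for pid in post_ids:
        # Only record the first observation; later polls keep the early time
        if pid not in first_seen_deleted and is_deleted(pid):
            first_seen_deleted[pid] = now
    return first_seen_deleted


# Usage with a fake checker: "a" is deleted by the first poll, "b" by the second
seen = {}
t1 = datetime(2020, 8, 1, tzinfo=timezone.utc)
t2 = datetime(2020, 8, 8, tzinfo=timezone.utc)
poll_once(seen, ["a", "b"], lambda pid: pid == "a", now=t1)
poll_once(seen, ["a", "b"], lambda pid: True, now=t2)
print(seen)
```

The deletion time is only known to within the polling interval, which is why granularity would depend on how often the service polls.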

Instead, perhaps we can roughly infer the period by which most posts are deleted by looking at the percentage of posts deleted week by week. Let's return to r/AmItheAsshole, selecting (moderated) posts over a larger time frame: from March 1, 2020 until August 28, when this data was collected.

As with all the data from reddit-query.py, Pushshift is queried first, and then the resulting submission ids are checked at Reddit. Because each Reddit query takes roughly 2 seconds, and there are almost 2,000 posts a day, I sample the Pushshift set down to 8,500 submissions and then check their deletion status at Reddit.

In [54]:
df20ls = pd.read_csv( # in variable name: 'l' means limited, 's' means sampled
    "reddit_20200301-20200828_AmItheAsshole_n__l8500_sampled.csv",
    parse_dates=["created_utc"],
)
In [31]:
df20ls.shape
Out[31]:
(8500, 13)
In [32]:
percent_true(df20ls["del_text_r"])
Out[32]:
22.5

22% of posts through this six-month period are deleted; that's the baseline. Let's look at this on a weekly basis.

https://stackoverflow.com/questions/51650066/iterate-over-pd-df-with-date-column-by-week-python

In [33]:
df20ls["week_idx"] = df20ls["created_utc"].apply(
    lambda x: "%s-%s" % (x.year, "{:02d}".format(x.week))
)
In [34]:
def del_text_r_weekly(week_data):
    return (week_data["del_text_r"].mean() * 100).round(1)
In [35]:
s20ls = df20ls.groupby("week_idx").apply(del_text_r_weekly)
s20ls
Out[35]:
week_idx
2020-09    24.5
2020-10    25.9
2020-11    24.1
2020-12    16.0
2020-13    21.5
2020-14    24.0
2020-15    20.2
2020-16    27.1
2020-17    25.0
2020-18    25.1
2020-19    25.2
2020-20    22.6
2020-21    20.2
2020-22    25.3
2020-23    21.3
2020-24    20.8
2020-25    23.6
2020-26    21.6
2020-27    19.5
2020-28    23.6
2020-29    21.4
2020-30    26.2
2020-31    20.2
2020-32    24.1
2020-33    21.4
2020-34    20.1
2020-35    15.1
dtype: float64
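The week label above pairs the calendar year with the ISO week number, which can disagree around New Year (not an issue for March through August). As an alternative sketch, assuming pandas 1.1 or later for `Series.dt.isocalendar()`, the labels can be built without a row-wise `apply`. The `toy` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical posts: dates chosen from the start and end of the period
toy = pd.DataFrame(
    {
        "created_utc": pd.to_datetime(
            ["2020-03-01", "2020-03-02", "2020-08-28", "2020-08-28"]
        ),
        "del_text_r": [True, False, False, True],
    }
)

# ISO year and ISO week, zero-padded so labels sort correctly as strings
iso = toy["created_utc"].dt.isocalendar()
toy["week_idx"] = (
    iso["year"].astype(str) + "-" + iso["week"].astype(str).str.zfill(2)
)

# Percent deleted per week
weekly = (toy.groupby("week_idx")["del_text_r"].mean() * 100).round(1)
print(weekly)
```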
In [36]:
s20ls.describe()
Out[36]:
count    27.000000
mean     22.429630
std       2.937924
min      15.100000
25%      20.500000
50%      22.600000
75%      24.750000
max      27.100000
dtype: float64

The percent deleted varies from week to week, but what might be a steady baseline of deletion, and how long is it before we reach it? Counting back from the most recent week (week 35), we first exceed our mean (22.4%) and median (22.6%) three weeks ago (in the 32nd week of 2020). We first exceed our third quartile (24.75%) five weeks ago (in week 30). To avoid reporting on posts likely to be deleted, it seems reasonable to wait five weeks. A few people might still delete posts months or years later, but we'll have avoided reporting on most who do so.
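The "weeks ago" figures can be checked mechanically. The sketch below rebuilds the weekly series from the values printed in Out[35] and counts back from the latest week to the most recent week above a threshold; the helper name `weeks_since_last_above` is my own, hypothetical choice.

```python
import pandas as pd

# Weekly percent deleted, copied from Out[35] (weeks 2020-09 to 2020-35)
values = [
    24.5, 25.9, 24.1, 16.0, 21.5, 24.0, 20.2, 27.1, 25.0,
    25.1, 25.2, 22.6, 20.2, 25.3, 21.3, 20.8, 23.6, 21.6,
    19.5, 23.6, 21.4, 26.2, 20.2, 24.1, 21.4, 20.1, 15.1,
]
weeks = ["2020-%02d" % w for w in range(9, 36)]
weekly_pct = pd.Series(values, index=weeks)


def weeks_since_last_above(weekly, threshold):
    """Weeks between the latest week and the most recent week above threshold."""
    above = weekly[weekly > threshold]
    if above.empty:
        return None
    last_pos = weekly.index.get_loc(above.index[-1])
    return len(weekly) - 1 - last_pos


print(weeks_since_last_above(weekly_pct, weekly_pct.mean()))
print(weeks_since_last_above(weekly_pct, weekly_pct.median()))
print(weeks_since_last_above(weekly_pct, weekly_pct.quantile(0.75)))
```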


What about popular posts? Here, I need not sample explicitly because limiting myself to submissions with 150+ comments does so naturally.

In [37]:
df20lp = pd.read_csv(
    "reddit_20200301-20200828_AmItheAsshole_n150+_l__.csv",
    parse_dates=["created_utc"],
)
In [38]:
df20lp.shape
Out[38]:
(8674, 14)

Over all the data, what's the percent of posts deleted on Reddit?

In [39]:
percent_true(df20lp["del_text_r"])
Out[39]:
9.5

9% is close to but lower than the 13% we saw earlier for popular posts from April--June, but the current data includes recent posts. Let's look at this data's weekly breakdown.

In [40]:
df20lp["week_idx"] = df20lp["created_utc"].apply(
    lambda x: "%s-%s" % (x.year, "{:02d}".format(x.week))
)
In [41]:
def del_text_r_weekly(week_data):
    return (week_data["del_text_r"].mean() * 100).round(1)
In [42]:
s20lp = df20lp.groupby("week_idx").apply(del_text_r_weekly)
s20lp
Out[42]:
week_idx
2020-09    11.1
2020-10    12.6
2020-11    14.1
2020-12    12.4
2020-13     9.6
2020-14    12.4
2020-15    13.3
2020-16    12.9
2020-17    12.8
2020-18    14.4
2020-19    11.5
2020-20    11.8
2020-21     7.5
2020-22     9.6
2020-23     9.2
2020-24    10.5
2020-25     8.6
2020-26     8.0
2020-27     7.8
2020-28     8.6
2020-29     9.9
2020-30     9.0
2020-31     7.7
2020-32     4.2
2020-33     5.9
2020-34     5.9
2020-35     3.6
dtype: float64
In [43]:
s20lp.describe()
Out[43]:
count    27.000000
mean      9.811111
std       2.922503
min       3.600000
25%       7.900000
50%       9.600000
75%      12.400000
max      14.400000
dtype: float64

Popular posts are less likely to be deleted, and the gradient toward reaching a steady state of deletion is shallower. Counting back from week 35, we first exceed our mean (9.8%) and median (9.6%) six weeks ago (in week 29 of 2020). We first exceed our third quartile (12.4%) 17 weeks ago (in week 18). If we wished to avoid reporting on popular posts likely to be deleted, we would wait four months.

Conclusion¶

It's clear that many redditors delete their accounts and posts on advice subreddits, despite their pseudonymity.

It's less clear if waiting some time before quoting or citing submissions is necessarily more "ethical."

  • The vast majority of redditors use pseudonymous accounts and use additional "throwaways" on advice subreddits -- I've never encountered an obviously real name beyond my own.
  • Deleting posts can be useless. r/AmItheAsshole discourages deletion and threatens to ban users who delete their submissions while they are being actively discussed (often within 48 hours of posting). Additionally, the AutoMOD bot often posts a copy of the original submission as a comment.
  • For the dedicated, original posts can often be found on services using Pushshift's data (e.g., removeddit, cedit, camas).

Finally, I don't know why redditors delete their posts and accounts; that is an open question that requires speaking with them directly.

r/AmItheAsshole, r/Advice, and r/relationship_advice in 2021¶

At the start of July 2021 I returned to this project with the intention of asking Redditors why they delete. In August, I returned to this notebook to confirm whether my findings from last year's analysis of April's data persist for 2021.

AmItheAsshole:

In [44]:
df21am = pd.read_csv("reddit_20210401-20210603_AmItheAsshole_n__l500_.csv")
In [45]:
df21am.shape
Out[45]:
(500, 14)
In [46]:
percent_true(df21am["del_text_r"])
Out[46]:
14.4

14% is less than last year's 33%, though still significant.

In [47]:
df21ad = pd.read_csv("reddit_20210401-20210603_Advice_n__l500_.csv")
In [48]:
# Advice:
In [49]:
df21ad.shape
Out[49]:
(500, 14)
In [50]:
percent_true(df21ad["del_text_r"])
Out[50]:
43.6

44% is similar to last year's 42%.

What about r/relationship_advice?

In [51]:
df21ra = pd.read_csv(
    "reddit_20210401-20210603_relationship_advice_n__l500_.csv"
)
In [52]:
df21ra.shape
Out[52]:
(500, 14)
In [53]:
percent_true(df21ra["del_text_r"])
Out[53]:
50.6

Exactly the same as last year.
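For a side-by-side view, the figures reported above can be collected into one small table. This is only a sketch that re-enters numbers already printed in this notebook (using the 33.3% figure for r/AmItheAsshole in 2020, i.e. after excluding moderator-removed posts); the frame and column names are my own.

```python
import pandas as pd

# Percent of posts deleted on Reddit, as reported in the cells above
summary = pd.DataFrame(
    {
        "pct_2020": [33.3, 42.4, 50.6],
        "pct_2021": [14.4, 43.6, 50.6],
    },
    index=["AmItheAsshole", "Advice", "relationship_advice"],
)

# Change in percentage points from 2020 to 2021
summary["change"] = (summary["pct_2021"] - summary["pct_2020"]).round(1)
print(summary)
```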