Dissecting Trump’s Most Rabid Online Following

Editor’s note: The story below contains two slurs that appear in the names of subreddits. Links to Reddit may also contain offensive material.

President Donald Trump’s administration, in its turbulent first months, has drawn fire from both the left and the right, including the ACLU, government ethics accountability groups and former Bush administration officials. But one group has shown nothing but unbridled enthusiasm for the president’s actions thus far: the over 380,000 members of r/The_Donald, one of the thousands of comment boards on Reddit, the fifth-most-popular website in the U.S.

The subreddit, where posters refer to President Trump as the “God Emperor” and “daddy,” is arguably the epicenter of Trump fervor on the internet. Its membership has grown steadily since the 2016 presidential election, though its members were especially active during the campaign. They mobilized to comb through the hacked Democratic National Committee emails published on WikiLeaks and played a large role in spreading information and theories about those emails. More broadly, they waged the “Great Meme War”: an effort to get Trump elected by bombarding the internet with social-media-ready content promoting Trump or bashing Democratic candidate Hillary Clinton. Some of those memes played on Clinton’s campaign gaffes, such as her use of the phrase “basket of deplorables,” while others involved an emerging pro-Trump iconography centered around images of Pepe the Frog, a cartoon character with a convoluted history that gained particular prominence after it was co-opted by white nationalists as a sort of unofficial mascot. Members of r/The_Donald like to say they “shitposted” Donald Trump into office; regardless of whether the flood of memes swung the election, it did overwhelm the front page of Reddit to such an extent that the site’s CEO rushed to deploy a change in Reddit’s algorithm that limits the influence of any single subreddit.[1]

What can we say about the animating force behind r/The_Donald? For one, it’s not universal among Trump supporters; nearly 63 million Americans voted for Trump, and the 382,000 members of r/The_Donald represent less than 1 percent of that. But in the subreddit’s vocal and dedicated membership, you can find an influential strain of Trump boosterism. According to former staffers, the Trump campaign team monitored the subreddit for messages that resonated, and Trump himself participated in an “Ask Me Anything” on r/The_Donald in July 2016. Since the election, the subreddit has continued to serve as a conduit through which fringe conspiracy theories enter a larger online discourse. Those theories often start on sites like 4chan.org, a freewheeling image-based message board best known for creating memes, posting stolen celebrity nudes and birthing the hacker collective Anonymous. The most striking example has been “Pizzagate,” the false idea that a pizza parlor in Washington, D.C., is the center of a child-trafficking ring involving Clinton campaign chairman John Podesta, which prompted a man from North Carolina to “self-investigate” the shop, where he fired a rifle several times and threatened an employee.

r/The_Donald has repeatedly been accused of offering a safe harbor where racists and white nationalists can congregate and express their views, much the same way that Trump’s campaign is said to have mobilized and emboldened those same groups. And indeed, r/The_Donald is home to some pretty vile comment threads. The subreddit’s moderators declined to talk to us about their community and accused FiveThirtyEight of being “fake news.” Regardless, we think there’s a way to get at the nature of r/The_Donald that is more rigorous than doing a quick scan of its comments (and certainly more objective than simply soliciting the opinions of the group’s fans and detractors).

We’ve adapted a technique used in machine learning research, called latent semantic analysis, to characterize 50,323 active subreddits[2] based on 1.4 billion comments posted from Jan. 1, 2015, to Dec. 31, 2016, in a way that allows us to quantify how similar in essence one subreddit is to another. At its heart, the analysis is based on commenter overlap: Two subreddits are deemed more similar if many commenters have posted often to both. This also makes it possible to do what we call “subreddit algebra”: adding one subreddit to another and seeing if the result resembles some third subreddit, or subtracting out a component of one subreddit’s character and seeing what’s left. (There’s a detailed explanation of how this analysis works at the bottom of the article.)

Here’s a simple example: Using our technique, you can add the primary subreddit for talking about the NBA (r/nba) to the main subreddit for the state of Minnesota (r/minnesota) and the closest result is r/timberwolves, the subreddit dedicated to Minnesota’s pro basketball team. Similarly, you can take r/nba and subtract r/sports, and the result is r/Sneakers, a subreddit dedicated to the sneaker culture that is a prominent non-sport component of NBA fandom.

This may all seem pretty abstract, but that same algebra can be applied to r/The_Donald. What happens when you break r/The_Donald up into subgroups using subreddit subtraction? What happens when you add unrelated subreddits to r/The_Donald? Before we get into those questions, let’s take a look at the subreddits that are most similar to r/The_Donald, according to our analysis[3]:

r/Conservative and r/AskTrumpSupporters top the list, followed by r/HillaryForPrison, a subreddit that refers to Hillary Clinton by the pronoun “it” and notes in bold on the sidebar that “Putting It behind bars is fun!” After that it’s r/uncensorednews, a subreddit started by white nationalist moderators who found the existing, extremely popular r/news subreddit to be too liberal.

So does this mean that users who comment on r/The_Donald comment on r/Conservative more than any other subreddit? No. Eight percent of r/The_Donald’s users have also commented on r/Conservative, which is about one-fifth the size of r/The_Donald, and conversely, 51 percent of commenters on r/Conservative have commented on r/The_Donald. But the raw number of shared commenters isn’t very informative on its own because, for example, almost every subreddit will have a lot of overlap with big, really popular subreddits such as r/AskReddit, which has over 16 million members. Our analysis is a bit more subtle: We weight the overlaps in commenters according to, in essence, how surprising those overlaps are — that is, how much more two subreddits’ user bases overlap than we would expect them to based on chance alone. Since essentially every subreddit overlaps heavily with super popular groups like r/AskReddit, that result is no longer surprising and gets a lower weight. What rises to the top, then, are the more unlikely results that are characteristic of a specific subreddit rather than those that are common to Reddit as a whole. And by looking at these weighted commenter overlap rankings across thousands of subreddits, we built a profile for each subreddit that helps capture what defines the average commenter on each specific subreddit.
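The “how surprising is this overlap” weighting can be sketched in a few lines. The article says the original analysis was done in R and later names the weighting as positive pointwise mutual information, so the Python below is only an illustration under assumptions: the subreddit names and counts are made up, and the `ppmi` helper is hypothetical rather than the authors’ code.

```python
import numpy as np

def ppmi(counts):
    """Positive pointwise mutual information for a matrix of co-occurrence counts."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row_totals = counts.sum(axis=1, keepdims=True)  # overlap per profiled subreddit
    col_totals = counts.sum(axis=0, keepdims=True)  # overlap per reference subreddit
    expected = row_totals * col_totals / total      # overlap expected by chance alone
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(counts / expected)
    pmi[~np.isfinite(pmi)] = 0.0                    # zero counts carry no evidence
    return np.maximum(pmi, 0.0)                     # keep only "more than chance"

# Hypothetical shared-commenter counts (illustrative, not real Reddit data).
# Rows: subreddits being profiled. Columns: reference subreddits.
profiled = ["r/sub_a", "r/sub_b", "r/sub_c"]
references = ["r/huge_general", "r/niche_x", "r/niche_y"]
counts = np.array([
    [900, 50, 0],
    [950, 0, 40],
    [920, 5, 5],
])

weights = ppmi(counts)
for name, row in zip(profiled, weights):
    print(name, dict(zip(references, row.round(2))))
```

In this toy table, r/sub_a shares far more commenters with r/huge_general than with r/niche_x in raw terms, yet the large overlap gets a weight of zero because every row overlaps heavily with that column, while the small but unexpected overlap with r/niche_x gets a clearly positive weight.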

There’s nothing too revealing in that list above — all of those subreddits are explicitly pro-Trump, anti-Clinton or politically conservative. So let’s use subreddit algebra to dissect r/The_Donald into its constituent parts. What happens when you filter out commenters’ general interest in politics? To figure that out, we can subtract r/politics from r/The_Donald. The result most closely matches r/fatpeoplehate, a now-banned subreddit that was dedicated to ridiculing and bullying overweight people.

r/The_Donald – r/politics =

1. r/fatpeoplehate (similarity 0.275): For sharing insults aimed at overweight people (now banned)
2. r/TheRedPill (similarity 0.274): Virulently misogynistic subreddit, nominally devoted to “sexual strategy”
3. r/Mr_Trump (similarity 0.266): Now-dormant subreddit formed during a moderator schism at r/The_Donald
4. r/coontown (similarity 0.266): Open and enthusiastic racism against black people (now banned)
5. r/4chan (similarity 0.253): Screenshots of 4chan.org posts

Subreddit algebra isn’t quite as simple as A – B = C. It’s more like A – B is closer to C than anything else, but it’s also pretty similar to D and not far off from E. So when you subtract r/politics from r/The_Donald, you actually get a list of every subreddit in our analysis, ranked in order of their similarity to the result of that subtraction. We’re showing just the top five.
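To make the “ranked list, not a single answer” point concrete, here is a minimal Python sketch. The five three-dimensional vectors are invented for illustration, `rank_by_similarity` is a hypothetical helper, and leaving the two operands out of the ranking is an assumption of this sketch rather than something the article states.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_by_similarity(target, vectors, exclude=()):
    """Return every (name, score) pair, most similar to `target` first."""
    scores = [
        (name, cosine(target, vec))
        for name, vec in vectors.items()
        if name not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical subreddit profiles (illustrative values only).
vectors = {
    "r/a": np.array([0.9, 0.8, 0.1]),
    "r/b": np.array([0.9, 0.1, 0.1]),
    "r/c": np.array([0.1, 0.9, 0.0]),
    "r/d": np.array([0.2, 0.6, 0.5]),
    "r/e": np.array([0.8, 0.0, 0.3]),
}

# "A minus B" removes what r/a shares with r/b and keeps what is left over.
target = vectors["r/a"] - vectors["r/b"]
ranking = rank_by_similarity(target, vectors, exclude=("r/a", "r/b"))

for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Every remaining subreddit receives a score, so a published “top five” is simply the head of this sorted list.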

And that top five isn’t exactly pretty, though it does support the theory that at least a subset of Trump’s supporters are motivated by racism. The presence of r/fatpeoplehate at the top of the list echoes some of President Trump’s own behavior, including his referring to 1996 Miss Universe winner Alicia Machado as “Miss Piggy” and insulting Rosie O’Donnell about her weight. The second-closest result, r/TheRedPill, describes itself in its sidebar as a place for “discussion of sexual strategy in a culture increasingly lacking a positive identity for men”; named after a scene from “The Matrix,” the group believes that women run the world and men are an oppressed class, and from that belief springs an ideology that has been described as “the heart of modern misogyny.” r/Mr_Trump self-describes as “the #1 Alt-Right, most uncucked subreddit” (referring to a populist white-nationalist movement and an increasingly all-purpose insult meant to denigrate others’ masculinity), and the appallingly named r/coontown is the now-banned but previously central home to unrepentant racism on Reddit. Finally, coming in at No. 5 is r/4chan, a subreddit dedicated to posting screenshots of threads found on 4chan, where many users supported Trump for president and where the /pol/ board in particular has a strongly racist bent.

We dissected r/The_Donald in a bunch of other ways using subreddit algebra. Here are some of the more interesting results:

r/The_Donald – r/conspiracy =

1. r/CFB (similarity 0.269): For college football discussion
2. r/nfl (similarity 0.255): For NFL discussion
3. r/TrumpMinnesota (similarity 0.244): Small subreddit for Trump supporters in Minnesota

r/The_Donald + r/europe =

1. r/european (similarity 0.781): Now-private subreddit that hosted racist and anti-Semitic commentary on European affairs
2. r/worldnews (similarity 0.768): Main subreddit for discussion of world affairs
3. r/syriancivilwar (similarity 0.688): For discussion of the conflict in Syria

r/The_Donald + r/Games =

1. r/KotakuInAction (similarity 0.676): Main hub of Gamergate discussion on Reddit
2. r/gaming (similarity 0.619): Largest general gaming subreddit
3. r/Cynicalbrit (similarity 0.586): Unofficial fanpage for the internet personality TotalBiscuit

So even adding innocuous subreddits, such as r/europe and r/Games, to r/The_Donald can result in something ugly or hate-based: r/european frequently hosted anti-Semitism and racism, while r/KotakuInAction is Reddit’s main home for the misogynistic Gamergate movement. Which raises a question: Are these hateful communities linked specifically to Trump’s supporters on Reddit, or are they common to politically active Reddit users in general? To get at that question, let’s try subtracting r/politics from r/Conservative:

r/Conservative – r/politics =

1. r/Mary (similarity 0.265): Subreddit for devotees of the biblical Mary
2. r/RCIA (similarity 0.264): For those considering converting to Catholicism (RCIA means “rite of Christian initiation for adults”)
3. r/ak47 (similarity 0.241): For discussing the AK-47 rifle
4. r/TelaIgne (similarity 0.240): A space where Catholic redditors pray for other redditors (the name is Latin for “web on fire”)
5. r/ChristianJewishRoots (similarity 0.240): For discussion of the relationship between Christian and Jewish theology

When we do this, we find that the top result is a subreddit dedicated to the glorification of the biblical Mary, and the other related subreddits are similarly focused on Christianity, except for r/ak47, which is dedicated to the famous rifle.

So what about the other 2016 presidential candidates? How does Trump’s Reddit following compare to that of Hillary Clinton or Democratic primary candidate Bernie Sanders (whose r/SandersForPresident subreddit still has over 215,000 members)? This analysis lets us take any subreddit and say how “Trump-ish” it is vs. how “Clinton-ish” or “Sanders-ish” it is. Here’s a selection of subreddits plotted on a three-way spectrum from r/The_Donald to r/SandersForPresident to r/hillaryclinton.

Subreddits dedicated to politics and news are smack in the middle. r/Feminism is on the Sanders/Clinton side of the spectrum, though slightly closer to Clinton, as is r/TheBluePill, a feminist parody of r/TheRedPill; r/BasicIncome (a subreddit advocating for a universal basic income) is also on the liberal side, though slightly closer to Sanders.

And all of those hate-based subreddits? They’re decidedly in r/The_Donald’s corner.

How does this work?

Latent semantic analysis (LSA) — the technique from natural language processing that we’ve adapted for this analysis — is often used to determine how related one book, article or speech is to another. The basic idea is that documents using similar words with similar frequency are probably closely related. But what about the words themselves? LSA also allows you to assess how similar words are by looking at the other words that show up around them. So, for example, two words that might rarely show up together (say “dog” and “cat”) but often have the same words nearby (such as “pet” and “vet”) are deemed closely related. The way this works is that every word in, say, a book is assigned a value based on its co-occurrence with every other word in that book, and the result is a set of vectors — one for each word — that can be compared numerically. On a very technical level, the way you determine how similar two words like “dog” and “cat” are is by looking at the angle between their two vectors (there’s a visual guide to understanding these concepts below).
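The “dog” and “cat” idea can be tried on a toy corpus. The sketch below is a deliberately simplified illustration: the five sentences are made up, co-occurrence is counted per sentence, and it uses raw counts only, without any of the weighting or dimensionality reduction that fuller treatments of LSA may apply.

```python
from collections import Counter
from itertools import combinations
import math

# Made-up corpus: "dog" and "cat" never appear in the same sentence,
# but they share neighbours such as "pet" and "vet".
sentences = [
    "the dog went to the vet",
    "the cat went to the vet",
    "my dog is a good pet",
    "my cat is a good pet",
    "the car needs new tires",
]

# cooc[word] counts how many sentences each other word shares with it.
cooc = {}
for sentence in sentences:
    words = sorted(set(sentence.split()))
    for a, b in combinations(words, 2):
        cooc.setdefault(a, Counter())[b] += 1
        cooc.setdefault(b, Counter())[a] += 1

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

dog_cat = cosine(cooc["dog"], cooc["cat"])
dog_car = cosine(cooc["dog"], cooc["car"])
print(f"dog vs cat: {dog_cat:.2f}")
print(f"dog vs car: {dog_car:.2f}")
```

Because “dog” and “cat” keep the same company in this corpus, their vectors point in the same direction even though the two words never meet, while “dog” and “car” share only the word “the.”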

Vectors are interesting because they can be enormous, multidimensional things that contain a huge amount of information, but you can still use them to do grade-school arithmetic. When machine-learning researchers at Google tried adding word vectors together or subtracting one from another, they discovered semantically meaningful relationships.[4] For example, if you take the vector for “king,” subtract the vector for “man” and add the vector for “woman,” the closest result is the vector for “queen.” Slightly more subtle relationships were also exposed: for example, “Rome” plus “Germany” equals “Berlin.” It turned out to be a very powerful way of analyzing language. Here, we are also using co-occurrence to try to uncover the nature of different subreddits and their relationships to one another.

The idea of co-occurrence is clear when we’re talking about words, but what does it mean for subreddits? We found relationships by looking at how many commenters various subreddits have in common — that’s our measure of co-occurrence. Here’s a simplified example of how this works:

Let’s say we want to see how subreddits in the world of health and exercise are related to one another. To do that, we can plot every subreddit in terms of two key subreddits: r/nutrition and r/Outdoors.

Let’s start with r/running. That subreddit has, let’s say, one commenter who has also commented in r/nutrition and three who have also commented in r/Outdoors. So we give it a vector of [1,3].

Now let’s add two more subreddits: r/weightlifting and r/Fitness. r/weightlifting has three commenters in common with r/nutrition and one with r/Outdoors, and r/Fitness has four and three, respectively.

Now we can do some addition by combining the vectors. If we add r/weightlifting to r/running, we get a third vector that looks similar to r/Fitness. The angle between the two gives us a measure of just how similar.

So instead of (King – Man) + Woman = Queen, you get Running + Weightlifting = Fitness.
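The arithmetic in this simplified example is small enough to check by hand or in a few lines of Python. The vectors below are the ones given in the text; the helper names are just for illustration.

```python
import math

# Vectors from the simplified example: (overlap with r/nutrition, overlap with r/Outdoors)
running = (1, 3)
weightlifting = (3, 1)
fitness = (4, 3)

def add(u, v):
    """Component-wise sum of two vectors."""
    return tuple(a + b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

combined = add(running, weightlifting)   # (4, 4)
similarity = cosine(combined, fitness)
angle_degrees = math.degrees(math.acos(similarity))

print("running + weightlifting =", combined)
print(f"cosine similarity to fitness: {similarity:.3f}")
print(f"angle to fitness: {angle_degrees:.1f} degrees")
print(f"running alone vs fitness: {cosine(running, fitness):.3f}")
print(f"weightlifting alone vs fitness: {cosine(weightlifting, fitness):.3f}")
```

The combined vector sits at an angle of roughly 8 degrees from r/Fitness, a cosine similarity of about 0.99, which is closer than either r/running or r/weightlifting is on its own.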

For over 50,000 subreddits that span a huge range of topics, it gets a bit more complicated. Instead of characterizing all of them in terms of just two subreddits, like r/Outdoors and r/nutrition above, we ranked all of the subreddits by the number of unique commenters and then pulled out the 2,133 subreddits whose unique commenter rank was between 200 and 2,201 (there are some ties). We used this subset of subreddits to characterize all active subreddits.[5] We then combined all the resulting subreddit vectors into a big matrix with 50,323 rows and 2,133 columns and converted the raw co-occurrences to positive pointwise mutual information values.[6] Similarity between subreddits is based on the cosine similarity of their vectors, a measure of the angle between them. To perform subreddit algebra, subreddit vectors are added and subtracted using standard linear algebra, and then the cosine similarities are calculated to rank subreddits by their similarity to the combination.
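At matrix scale, the last two steps reduce to a normalization and a matrix-vector product. The sketch below is a Python illustration (the original analysis was done in R) with a made-up 4-by-3 matrix standing in for the 50,323-by-2,133 matrix of weighted values; `unit_rows` and the subreddit names are hypothetical.

```python
import numpy as np

def unit_rows(matrix):
    """Scale each row to length one so that only its direction matters."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # leave all-zero rows untouched
    return matrix / norms

# Stand-in for the matrix of weighted co-occurrence values (illustrative numbers).
names = ["r/sub_a", "r/sub_b", "r/sub_c", "r/sub_d"]
weights = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 3.0, 1.0],
    [2.0, 3.0, 2.0],
    [0.0, 0.0, 5.0],
])
unit = unit_rows(weights)

# Subreddit algebra: add two unit vectors, then rescale the result to length one.
combo = unit[0] + unit[1]
combo = combo / np.linalg.norm(combo)

# For unit vectors, cosine similarity is just a dot product,
# so one matrix-vector product scores every subreddit at once.
scores = unit @ combo
order = np.argsort(-scores)
for i in order:
    print(f"{names[i]}: {scores[i]:.3f}")
```

In this toy matrix the best match for r/sub_a + r/sub_b is r/sub_c, whose row mixes the columns that characterize the other two. The two operands themselves also score fairly high, so whether to drop them from the ranking is a choice the sketch leaves open.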

Are we sure this is meaningful?

To test our analysis, we looked at some cases of subreddit algebra where the results should be obvious — like the example above where adding r/nba to r/minnesota should (and does) yield r/timberwolves as the best fit. Other combinations of a sport and a location similarly result in location-specific discussions of that sport.

We also looked at a test case involving a harder-to-see relationship. If you take the subreddit for managing money and investing, r/personalfinance, and subtract the subreddit for frugality, r/Frugal, the resulting most similar subreddit is r/wallstreetbets, a subreddit about taking extreme risks in the stock market.

The data and code behind this analysis

The Reddit comments data is from a collection hosted on Google’s BigQuery of 1.4 billion comments from January 2015 to December 2016.[7] The analysis itself was done in R. You can find the code here.

Development by Justin McCraw

Footnotes

  1. The CEO of Reddit also became embroiled in controversy after “trolling the trolls” by taking negative comments posted about himself and swapping his username out for the usernames of r/The_Donald’s moderators.
  2. We define an active subreddit as one where at least one non-bot user has commented at least 10 times.
  3. As determined by their similarity scores, which range from 0 (totally dissimilar) to 1 (exactly the same). The scores are a measure of how close together subreddit vectors are in vector space, which is calculated by measuring the angle between them (the cosine similarity). Higher similarity scores mean vectors are closer together and therefore more similar. For example, the similarity score between r/gaming and r/Games, two very similar subreddits, is 0.79.
  4. Originally these word vectors were generated using a recently developed neural-network-based context model called word2vec (also see algorithms like GloVe), but research has shown that even simple co-occurrence models also encode semantic relationships.
  5. We could have performed this characterization using all 50,323 subreddits, but in order to save time and storage space, we excluded the largest and smallest subreddits as they likely provide the least amount of relevant information.
  6. The subreddit vectors are a unique fingerprint of commenter co-occurrence across thousands of subreddits. Also, each subreddit vector is normalized to have a length of one because we are most interested in their directions, not their lengths.
  7. Thanks to Reddit users /u/stuck_in_the_matrix (https://pushshift.io/) and /u/fhoffa (https://www.reddit.com/r/bigquery/) for, respectively, downloading the data and uploading it for public use.

Trevor Martin is a Ph.D. candidate at Stanford University, where his research straddles the fields of statistics and genetics. Outside of the lab, he works on using data to better understand our world.
