From r/WSB to Numerai Signals

1 day ago·3 min read

Using r/WallStreetBets data for a Numerai Signals submission.

I stumbled upon Arjun Rohlfing-Das’s excellent post, Sentiment Analysis for Trading with Reddit Text Data, which runs sentiment analysis on r/wallstreetbets data and seems to show some predictive power.

Just give me the code

WSB to Numerai Signals

Numerai Signals Submission from r/WSB

colab.research.google.com

This notebook builds on Arjun Rohlfing-Das’s notebook, which predicts for the entire market. I have modified it to work with all listed symbols.

This is a ‘Run All’ notebook. Once you have set up the PRAW credentials, just click Run all in Colab and it will generate a .csv that you can submit to the tournament.

Since sentiment is quantified using ML models, why not use it as a feature for the Numerai Signals tournament? It can be combined with a strategy you are currently using, or you can submit these scores directly, as I did.

Workflow

Symbols in the Signals universe
Collect Reddit data using PRAW
Symbol filtering
To the moon? (Sentiment analysis)
Rolling average of daily scores for top 400 extreme symbols
Submit

Symbols in the Signals universe

The latest tickers (symbols) in the universe can be downloaded using NumerAPI, together with a Bloomberg-to-Yahoo ticker mapping.

Tickers mapping and universe

Collecting Reddit data using PRAW

You’ll need to set up credentials for PRAW. Check this article on Scraping Reddit data. The “Daily Discussion” threads are scraped with all their comments. You can filter the comments by the number of upvotes they have, as not every comment will be useful.

List of comments

Scraping using PRAW
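The upvote filter mentioned above can be sketched as follows. This is a minimal sketch and not the notebook's code: the `Comment` dataclass is a hypothetical stand-in for PRAW comment objects (which expose `body` and `score`), and the `min_score=5` threshold is an arbitrary assumption.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    """Hypothetical stand-in for a PRAW comment (which exposes .body and .score)."""
    body: str
    score: int


def filter_by_upvotes(comments, min_score=5):
    """Keep the text of comments whose score is at least min_score."""
    return [c.body for c in comments if c.score >= min_score]


comments = [
    Comment("TSLA to the moon", 42),
    Comment("first", 0),
    Comment("GME looks weak today", 7),
]
kept = filter_by_upvotes(comments)
print(kept)
```

With real data you would pass the comment objects collected from the thread instead of the hand-written list.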

Symbol filtering

If you explore r/WSB and look at how stocks are mentioned, you’ll find some ambiguity. Users write bare symbols rather than Bloomberg tickers, so I decided to split each ticker on the space (i.e., TSLA US -> TSLA) and only consider the resulting symbols with length ≥ 2.

Another criterion for filtering symbols is stopwords. Some common words that Reddit users write are also in the symbol list, which gives a false impression that a stock is being discussed, so you might want to use a stopword list suited to Reddit.

Symbols made up of only numbers are also removed because they create ambiguity.
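The three filters above can be sketched in a few lines. This is an illustrative sketch, not the notebook's code: the function name `clean_symbols`, the example tickers, and the tiny stopword set are all assumptions.

```python
def clean_symbols(bloomberg_tickers, stopwords, min_len=2):
    """Map short symbols (e.g. 'TSLA') to Bloomberg tickers (e.g. 'TSLA US')."""
    symbol_to_bloomberg = {}
    for ticker in bloomberg_tickers:
        # 'TSLA US' -> 'TSLA'
        symbol = ticker.split(" ")[0].upper()
        if len(symbol) < min_len:
            continue  # too short: single letters match too much ordinary text
        if symbol.isdigit():
            continue  # numeric-only symbols are ambiguous
        if symbol in stopwords:
            continue  # common words that happen to be symbols
        # Keep the first Bloomberg ticker seen for a given symbol.
        symbol_to_bloomberg.setdefault(symbol, ticker)
    return symbol_to_bloomberg


tickers = ["TSLA US", "F US", "7203 JP", "ALL US", "GME US", "TSLA GR"]
mapping = clean_symbols(tickers, stopwords={"ALL", "FOR", "ARE"})
print(mapping)
```

Returning a symbol-to-ticker dictionary, rather than a plain list, keeps the information needed later to map symbols back to Bloomberg tickers.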

To the moon?

1. Score all comments for a day based on sentiment

I have used the VADER sentiment analysis model from NLTK. A better model can be used. Alternatively, you can take historical Signals targets and historical comment data and train a simple classification model on them, so that it is trained on Reddit data and optimized for Signals targets.

2. Log all tickers mentioned in those comments

3. Assign the daily sentiment to tickers involved

Sentiment analysis for all stocks
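Steps 1 to 3 can be sketched as below. This is one possible reading of the steps, not the notebook's code: each ticker receives the mean score of the comments that mention it that day, and the notebook may aggregate differently. `TOY_LEXICON` and `toy_score` are hypothetical stand-ins so the sketch runs offline; they are not VADER.

```python
import re
from collections import defaultdict

# Toy lexicon used only so the sketch runs offline; it is NOT VADER.
TOY_LEXICON = {"moon": 1.0, "buy": 0.5, "calls": 0.5,
               "sell": -0.5, "puts": -0.5, "crash": -1.0}


def toy_score(text):
    """Average toy-lexicon score of the words in text (0.0 if none match)."""
    words = re.findall(r"[a-z]+", text.lower())
    hits = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


def daily_ticker_sentiment(comments, symbols, score_fn=toy_score):
    """Score one day's comments and average the scores per mentioned symbol."""
    per_ticker = defaultdict(list)
    for text in comments:
        score = score_fn(text)  # step 1: score the comment
        # Step 2: upper-case tokens that are known symbols count as mentions.
        mentioned = set(re.findall(r"[A-Z0-9]+", text)) & symbols
        for symbol in mentioned:
            per_ticker[symbol].append(score)  # step 3: assign the score
    return {s: sum(v) / len(v) for s, v in per_ticker.items()}


# To use VADER instead, pass a scorer such as
#   lambda t: SentimentIntensityAnalyzer().polarity_scores(t)["compound"]
# from nltk.sentiment.vader (requires nltk.download("vader_lexicon")).
day_comments = ["TSLA to the moon", "GME will crash, buy puts", "TSLA calls printing"]
scores = daily_ticker_sentiment(day_comments, symbols={"TSLA", "GME"})
print(scores)
```

Passing the scorer as `score_fn` makes it easy to swap in a better model later without touching the ticker bookkeeping.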

Finding extreme stocks and scoring

Not all of the roughly 5k stocks will be discussed there. Here, the stocks with the most positive and most negative sentiment across all days are selected.

A rolling window of 14 days is applied, and the scores for the last day are used for the submission.

Scoring
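The selection and rolling average described above can be sketched as follows. This is a sketch under stated assumptions, not the notebook's code: `daily_scores` is a hypothetical chronological list of per-day `{symbol: score}` dictionaries, extremes are chosen by each symbol's mean over all days, and the window covers a symbol's last `window` observations rather than calendar days.

```python
def rolling_last_scores(daily_scores, window=14, top_n=400):
    """Return the latest rolling-mean score for the most extreme symbols."""
    # Collect each symbol's scores in chronological order.
    history = {}
    for day in daily_scores:
        for symbol, score in day.items():
            history.setdefault(symbol, []).append(score)

    # Mean per symbol across all days, used only to find the extremes.
    overall = {s: sum(v) / len(v) for s, v in history.items()}
    ranked = sorted(overall, key=overall.get)  # most negative first

    if len(ranked) <= top_n:
        extreme = ranked
    else:
        half = top_n // 2
        # Most negative half plus most positive half.
        extreme = ranked[:half] + ranked[len(ranked) - half:]

    # Rolling mean over the last `window` observations of each symbol.
    result = {}
    for symbol in extreme:
        recent = history[symbol][-window:]
        result[symbol] = sum(recent) / len(recent)
    return result


daily = [
    {"TSLA": 0.8, "GME": -0.6, "AAPL": 0.1},
    {"TSLA": 0.4, "GME": -0.2},
    {"TSLA": 0.6, "AAPL": -0.1},
]
latest = rolling_last_scores(daily, window=2, top_n=2)
print(latest)
```

In this small example AAPL is dropped because its average sentiment is the least extreme of the three symbols.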

Submission

Since the submission needs tickers in the Bloomberg format, we need to re-map the filtered tickers back to the original Bloomberg tickers. This can cause collisions, because several Bloomberg tickers can reduce to the same symbol, so I use the first occurrence of the Bloomberg ticker.

Submission
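The re-mapping with a first-occurrence rule can be sketched like this. It is an illustrative sketch, not the notebook's code: the function name `build_submission` is hypothetical, and the column names `bloomberg_ticker` and `signal` are assumptions, so check the current Signals submission format before uploading.

```python
import csv
import io


def build_submission(scores, bloomberg_tickers):
    """Return CSV text mapping each scored symbol back to a Bloomberg ticker."""
    first_ticker = {}
    for ticker in bloomberg_tickers:
        # setdefault keeps the first Bloomberg ticker seen for a symbol,
        # which resolves collisions such as 'TSLA US' vs. 'TSLA GR'.
        first_ticker.setdefault(ticker.split(" ")[0], ticker)

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["bloomberg_ticker", "signal"])  # assumed column names
    for symbol, score in scores.items():
        if symbol in first_ticker:
            writer.writerow([first_ticker[symbol], score])
    return buffer.getvalue()


csv_text = build_submission(
    {"TSLA": 0.75, "GME": 0.25},
    ["TSLA US", "TSLA GR", "GME US"],
)
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch side-effect free; swapping it for an open file gives the .csv to upload.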

As usual, this is a Run all Colab notebook once you set up PRAW. It creates submissions that the tournament will accept.

What’s next

Try a different model
Better cleaning of comments (use upvotes)
Combine the daily sentiment scores with your current Numerai strategy.
Predict for weeks in the validation data to see diagnostics.

Below is a plot of the Fourier transform of market sentiment vs. the SPY ticker. It suggests that there is some predictive power in r/WSB.

Figure: Market sentiment vs. S&P 500

Suraj Parmar

Published in Deterministic Algorithms Lab

·Updated Jan 18
