From r/WSB to Numerai Signals

Using r/WallStreetBets data for Numerai Signals submissions.

I stumbled upon Arjun Rohlfing-Das’s excellent post, Sentiment Analysis for Trading with Reddit Text Data, which runs sentiment analysis on r/wallstreetbets data and finds that the sentiment seems to hold some predictive power.

Just give me the code

This notebook builds on Arjun Rohlfing-Das’s notebook. The original predicts for the market as a whole; I have modified it to work with all listed symbols.

This is a ‘Run All’ notebook. Once you have set up the PRAW credentials, just click Run all in Colab and it will generate a .csv that you can submit to the tournament.

Since sentiment is quantified using ML models, why not use it as a feature for the Numerai Signals tournament? You can combine it with a strategy you are already using, or submit the scores directly as I did.

Workflow

  1. Symbols in the Signals universe
  2. Collect Reddit data using PRAW
  3. Symbol filtering
  4. To the moon? (Sentiment analysis)
  5. Rolling average of daily scores for top 400 extreme symbols
  6. Submit

Symbols in the Signals universe

The latest tickers (symbols) in the universe can be downloaded using NumerAPI, along with the Bloomberg-to-Yahoo ticker mapping.

Tickers mapping and universe

Collecting Reddit data using PRAW

You’ll need to set up credentials for PRAW. Check this article on scraping Reddit data. The “Daily Discussion” threads are scraped with all of their comments. You can filter the comments by number of upvotes, as not every comment will be useful.
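The upvote filter mentioned above can be sketched without calling the Reddit API at all. The `Comment` dataclass and the `min_score` threshold here are hypothetical; PRAW comment objects expose `body` and `score` attributes that would play the same roles.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    """Hypothetical stand-in for a PRAW comment (which exposes `body` and `score`)."""
    body: str
    score: int


def filter_by_upvotes(comments, min_score=5):
    """Keep only comments whose score is at least `min_score` (an arbitrary threshold)."""
    return [c for c in comments if c.score >= min_score]


comments = [
    Comment("TSLA to the moon", 42),
    Comment("lol", 1),
    Comment("GME puts are printing", 7),
]
kept = filter_by_upvotes(comments, min_score=5)
print([c.body for c in kept])
```

The right threshold depends on how busy the thread is, so it is worth trying a few values.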

List of comments
Scraping using PRAW

Symbol filtering

If you explore r/WSB and look at the symbols, you’ll find some ambiguity in the way stocks are mentioned. Users write bare symbols rather than Bloomberg tickers, so I split each ticker on the space and keep the first part (e.g., TSLA US -> TSLA), and only consider the resulting symbols with length ≥ 2.

Another criterion for filtering symbols is stopwords. Some words that Reddit users commonly use are also in the symbol list, so you might want to add them to your stopwords; otherwise they give a false impression that a stock is being discussed.

Symbols made up only of digits are also removed because they may create ambiguity.
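The filtering rules above can be sketched as one small function. The `STOPWORDS` set is purely illustrative, and `clean_symbol` is a hypothetical helper name, not taken from the notebook.

```python
# Illustrative stopwords only; a real list would be tuned to r/WSB vocabulary.
STOPWORDS = {"ALL", "FOR", "ARE", "IT", "ON"}


def clean_symbol(bloomberg_ticker):
    """Return the bare symbol for a Bloomberg ticker, or None if it is filtered out."""
    parts = bloomberg_ticker.split()
    if not parts:
        return None
    symbol = parts[0].upper()  # "TSLA US" -> "TSLA"
    if len(symbol) < 2:        # single-letter symbols are too ambiguous
        return None
    if symbol.isdigit():       # numeric-only symbols, e.g. "2330 TT"
        return None
    if symbol in STOPWORDS:    # everyday words that double as tickers
        return None
    return symbol


tickers = ["TSLA US", "F US", "ALL US", "2330 TT", "GME US"]
symbols = [s for s in map(clean_symbol, tickers) if s is not None]
print(symbols)
```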

To the moon?

  1. Score all comments for a day based on sentiment
  2. Log all tickers mentioned in those comments
  3. Assign the daily sentiment to the tickers involved

I have used the VADER sentiment analysis model from NLTK, but a better model could be used. Alternatively, you can take historical Signals targets and historical comment data and train a simple classification model on them, so that the model is trained on Reddit data and optimized for the Signals targets.
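Here is a minimal sketch of the three steps for a single day. It rests on two assumptions: a ticker's daily sentiment is taken to be the mean score of the comments that mention it (one possible reading of step 3), and `toy_score` is a placeholder word-count scorer. With NLTK installed, the `compound` value returned by `SentimentIntensityAnalyzer().polarity_scores(text)` could be passed in as `score_fn` instead.

```python
import re
from collections import defaultdict


def toy_score(text):
    """Placeholder scorer in [-1, 1]; swap in a real model such as VADER."""
    positive = {"moon", "buy", "calls"}
    negative = {"crash", "sell", "puts"}
    words = re.findall(r"[a-z]+", text.lower())
    hits = sum(w in positive for w in words) - sum(w in negative for w in words)
    return max(-1.0, min(1.0, hits / 2))


def daily_ticker_sentiment(comments, symbols, score_fn=toy_score):
    """Score each comment, log the tickers it mentions, and average per ticker."""
    scores = defaultdict(list)
    for text in comments:
        score = score_fn(text)                               # step 1
        mentioned = {tok for tok in re.findall(r"[A-Z0-9]+", text)
                     if tok in symbols}                      # step 2
        for ticker in mentioned:
            scores[ticker].append(score)                     # step 3
    return {t: sum(v) / len(v) for t, v in scores.items()}


day_comments = ["TSLA to the moon, buy calls", "GME will crash", "TSLA sell"]
result = daily_ticker_sentiment(day_comments, symbols={"TSLA", "GME"})
print(result)
```

Matching only upper-case tokens is a deliberate simplification: it misses lower-case mentions but avoids counting ordinary words as tickers.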

Sentiment analysis for all stocks

Finding extreme stocks and scoring

Not all of the 5k stocks will be discussed there. Here, the stocks with the most positive and the most negative sentiment across all days are selected.

A rolling window of 14 days is applied, and the scores for the last day are used for the submission.
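The selection and rolling steps can be sketched with pandas. This assumes a `daily` DataFrame with one row per day and one column per symbol, and it assumes "extreme" means the lowest and highest mean sentiment, split evenly. The function name and its parameters are hypothetical.

```python
import pandas as pd


def latest_rolling_scores(daily, n_extreme=400, window=14):
    """Rolling-mean score on the last day for the most extreme symbols."""
    ranked = daily.mean(axis=0).dropna().sort_values()
    half = max(n_extreme // 2, 1)
    # Most negative and most positive symbols by mean sentiment over all days.
    extreme = ranked.index[:half].union(ranked.index[-half:])
    rolled = daily[extreme].rolling(window=window, min_periods=1).mean()
    return rolled.iloc[-1].dropna()


dates = pd.date_range("2021-01-01", periods=3)
daily = pd.DataFrame(
    {"AAA": [1.0, 1.0, 1.0], "BBB": [0.0, 0.0, 0.0], "CCC": [-1.0, -0.5, 0.0]},
    index=dates,
)
scores = latest_rolling_scores(daily, n_extreme=2, window=2)
print(scores)
```

Setting `min_periods=1` keeps symbols that were only discussed on a few days inside the window; a stricter value would drop them.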

Scoring

Submission

Since the submission needs tickers in the Bloomberg format, we need to map the filtered symbols back to the original Bloomberg tickers. Several Bloomberg tickers can map to the same symbol, so when there is a collision I use the first occurrence of the Bloomberg ticker.
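A sketch of the re-mapping with first-occurrence collision handling follows. The column names `bloomberg_ticker` and `signal`, and the rank-based rescaling into the open interval (0, 1), are assumptions about the submission format rather than details taken from the notebook.

```python
import pandas as pd


def build_submission(scores, universe):
    """Map bare-symbol scores back to Bloomberg tickers and rescale into (0, 1)."""
    mapping = pd.DataFrame({"bloomberg_ticker": universe})
    mapping["symbol"] = mapping["bloomberg_ticker"].str.split(" ").str[0]
    # Several Bloomberg tickers can share a bare symbol; keep the first one.
    mapping = mapping.drop_duplicates(subset="symbol", keep="first")
    sub = mapping[mapping["symbol"].isin(scores.index)].copy()
    ranks = sub["symbol"].map(scores).rank(method="first")
    sub["signal"] = ranks / (len(sub) + 1)  # strictly between 0 and 1
    return sub[["bloomberg_ticker", "signal"]].reset_index(drop=True)


scores = pd.Series({"TSLA": 0.5, "ABC": -0.2})
universe = ["TSLA US", "ABC US", "ABC CN", "XYZ US"]
submission = build_submission(scores, universe)
print(submission)
```

Calling `submission.to_csv("submission.csv", index=False)` would then write the file; the file name is arbitrary.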

Submission

As usual, this is a Run all Colab notebook once you set up PRAW. It will create submissions that the tournament accepts.

What’s next

  1. Try a different model
  2. Better cleaning of comments (use upvotes)
  3. Combine the daily sentiment scores with your current Numerai strategy.
  4. Predict for weeks in the validation data to see diagnostics.

Below is a plot of the Fourier transform of market sentiment vs. the SPY ticker. It suggests there is some predictive power in r/WSB.

Market sentiment vs. S&P500
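The post does not say how the Fourier-transformed curve was produced. One common way to use a Fourier transform on a noisy daily series is as a low-pass filter, sketched here purely as an assumption; `fft_lowpass` and `keep` are hypothetical names.

```python
import numpy as np


def fft_lowpass(series, keep=5):
    """Smooth a 1-D series by zeroing all but the `keep` lowest frequencies."""
    values = np.asarray(series, dtype=float)
    spectrum = np.fft.rfft(values)
    spectrum[keep:] = 0  # discard the high-frequency components
    return np.fft.irfft(spectrum, n=len(values))


t = np.arange(64)
slow = np.sin(2 * np.pi * t / 64)                     # slow "trend"
noisy = slow + 0.5 * np.sin(2 * np.pi * 16 * t / 64)  # plus fast oscillation
smooth = fft_lowpass(noisy, keep=3)
print(np.allclose(smooth, slow))
```

The smoothed sentiment series could then be plotted against SPY prices on a shared date axis.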

Written by

Solving problems, one sense, at a time. #ML , “The best way to learn is to teach.” parmarsuraj99@gmail.com About: https://parmarsuraj99.github.io/suraj-parmar/

Treating Punctuation restoration as translation with Transformers.

Illustration of seq2seq model for punctuation restoration

Task

The transcript we get from ASR is often not punctuated, and to use it in other tasks we need punctuated text. There are many approaches to this, but I wanted to explore seq2seq Transformers for it, and possibly for multilingual applications too.


Can you make unique and equally good predictions?

A still from signals.numer.ai film

If you think Numerai’s main tournament is hard, then you might want to take a look at Signals! It’s more ambitious and, of course, harder! Signals provides a platform to evaluate your financial models and earn some NMR cryptocurrency too!

“Beating the wisdom of the crowds is harder than recognizing faces or driving cars” — Marcos López de Prado

If you are new to Numerai main tournament, this might help.

Just give me the code

This notebook has taken inspiration from the example_model.py and Jason Rosenfeld’s notebook.

From a Numerai participant’s perspective 💡

From the tournament perspective, the main difference between these two is the data. Numerai main tournament provides you obfuscated, clean, and normalized data in a supervised learning manner i.e, features + targets. Signals on the other hand gives only a list of symbols or tickers in the Bloomberg universe and historical targets. That means, we have to collect all the data that can be a good feature for prediction. …


Don’t just submit and wait, evaluate!

Glowing numerai

Update — DEC 19, 2020: The notebook has been updated according to the new target “Nomi”. TARGET_NAME is now only “target” instead of “target_kazutsugi” .

Just Give me the code

Note: This isn't a 'Run all' and submit notebook. I have tried to make this flexible so feel free to experiment and customize according to your style and workflow.

See this post on Model Diagnostics, which also has links to community-written posts on the metrics.

Also, check out A guide to “The hardest data science tournament on the planet” if you want to get started with submitting your predictions for the tournament.

Background

Now, having already submitted your predictions, you might be wondering about improving your models to get better results in the tournament. For a typical machine learning problem you start with Exploratory Data Analysis, then you might want to build a validation pipeline along with a baseline model and then you optimize your model(s). …


Data science for fun? for crypto? Why not both? 😀

Source: Numerai blog

Update — DEC 01, 2020: The notebook has been updated according to the new target “Nomi”. TARGET_NAME is now “target” instead of “target_kazutsugi”

Just Give Me The Code:

Make sure you have signed up on numer.ai as you’ll need to set up your API keys to make submissions directly from colab.

💡 The Numerai tournament problem

The Numerai data science problem is like a typical supervised machine learning problem, where the data has several input features and corresponding labels (or targets). And our goal is to learn a mapping from input to targets using various techniques. We usually split data into training and validation parts. …


Classifying digits by training a model on the MNIST dataset is a really fun thing to do with the frameworks available, and putting it into production would be great.

Code: https://github.com/parmarsuraj99/Autoencoders

We know that neural networks can be seen as ‘Universal Function Estimators’, which means we can train them to map inputs to their correct labels. This is called the supervised learning approach.

What if we don’t have labels and are left with images only? What can we do with them? Now this is getting interesting. We can train a network to improve the resolution of an image, de-noise images, or even generate new samples. …