News Named Entity Extraction (NER) and Sentiment Analysis
Summary
This post illustrates a stylised news analysis pipeline built from publicly available APIs. It includes the following steps:
- Fetch news — read a news source via an RSS feed (feedparser)
- Extract entities — perform named entity recognition (NER) on the unstructured text (Thomson Reuters Intelligent Tagging (TRIT) / Refinitiv Open Calais)
- Filter on news and entities — filter on entities and events of interest
- Sentiment analysis — extract sentiment from each news item (TextBlob)
- Find signal — correlate sentiment with price movement
- Historical EOD prices — fetch historical prices (eodhistoricaldata.com)
- Backtesting — back-test for PnL performance (internal or Zipline)
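The steps above can be sketched end to end as a single driver function. The step functions below are hypothetical stubs standing in for the real library calls made later in the post (feedparser, TRIT, TextBlob), shown only to make the data flow concrete:

```python
# Hypothetical pipeline skeleton; every step is a stub, not a real API call.
def fetch_news():
    return ["ACME Corp faces securities class action"]

def extract_entities(headline):
    # A real implementation would call TRIT / Open Calais here.
    return [{"type": "Company", "name": "ACME Corp", "ric": "ACME.N"}]

def get_sentiment_label(text):
    # A real implementation would use TextBlob polarity here.
    return "negative"

def run_pipeline():
    results = []
    for headline in fetch_news():
        for entity in extract_entities(headline):
            results.append({
                "headline": headline,
                "entity": entity["name"],
                "ric": entity["ric"],
                "sentiment": get_sentiment_label(headline),
            })
    return results

print(run_pipeline())
```

Each stage hands a plain dict to the next, which is also roughly how the real pipeline below is stitched together.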
We first install and import the libraries required for our tasks, such as feedparser for parsing news feeds and TextBlob for sentiment analysis. Note that `re` is also imported, since the sentiment helper later cleans text with regular expressions.
!pip install psycopg2-binary
!pip install feedparser

import re
import pandas as pd
import numpy as np
from textblob import TextBlob
import feedparser
import requests
import json
import yaml
For this example, we use GlobeNewswire RSS feeds by topic.
# Dictionary of RSS feeds that we will fetch and combine
# GlobeNewswire / Europe - http://www.globenewswire.com/Rss/List
newsurls = {
    'globenewswire-us': 'http://www.globenewswire.com/RssFeed/country/United%20States/feedTitle/GlobeNewswire%20-%20News%20from%20United%20States',
}
Fetch News from RSS feed
Define some convenience functions to iterate through all news items for a given RSS url.
# Function to fetch the RSS feed and return the parsed result
def parse_rss(rss_url):
    return feedparser.parse(rss_url)

# Function grabs the RSS feed headlines (titles) and returns them as a list
def get_headlines(rss_url):
    headlines = []
    feed = parse_rss(rss_url)
    for newsitem in feed['items']:
        headlines.append(newsitem['title'])
    return headlines

# Function grabs the RSS feed summaries and returns them as a list
def get_summaries(rss_url):
    summaries = []
    feed = parse_rss(rss_url)
    for newsitem in feed['items']:
        summaries.append(newsitem['summary'])
    return summaries

# Function returns the keys available on each news item
def get_entries(rss_url):
    entries = []
    feed = parse_rss(rss_url)
    for newsitem in feed['items']:
        entries.append(newsitem.keys())
    return entries
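The three helpers above differ only in which field they collect. As a sketch (a hypothetical refactor, not part of the original post), they could be collapsed into one generic extractor that takes an already-parsed feed, which also makes it easy to exercise on a mock feed without a network call:

```python
# Generic field extractor over a parsed feed. `field=None` returns the raw
# key views, mirroring get_entries above; any feedparser-style dict with an
# 'items' list works.
def get_fields(feed, field=None):
    out = []
    for newsitem in feed['items']:
        out.append(newsitem[field] if field else newsitem.keys())
    return out

# Example on a mock parsed feed (invented data, no network needed):
mock_feed = {'items': [{'title': 'Headline A', 'summary': 'Summary A'},
                       {'title': 'Headline B', 'summary': 'Summary B'}]}
print(get_fields(mock_feed, 'title'))    # → ['Headline A', 'Headline B']
print(get_fields(mock_feed, 'summary'))  # → ['Summary A', 'Summary B']
```

Passing the parsed feed rather than the URL keeps the function pure and testable; the URL-fetching stays in `parse_rss`.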
Inspect entries available in news feed
# Inspect the entries available in the RSS feed
entries = []

# Iterate over the feed urls
for key, url in newsurls.items():
    # Combine the returned key sets across feeds
    entries.extend(get_entries(url))

print(entries[0])

dict_keys(['id', 'guidislink', 'link', 'links', 'tags', 'title', 'title_detail', 'summary', 'summary_detail', 'published', 'published_parsed', 'dc_identifier', 'language', 'publisher', 'publisher_detail', 'contributors', 'dc_modified', 'dc_keyword'])
Assign summaries and headlines for each news item.
# Lists to hold all headlines and summaries
allheadlines = []
summaries = []

# Iterate over the feed urls
for key, url in newsurls.items():
    # Combine the returned headlines and summaries across feeds
    allheadlines.extend(get_headlines(url))
    summaries.extend(get_summaries(url))
View headlines
Below is a sample list of headlines, such as securities class action lawsuits.
# Iterate over the allheadlines list and print each headline
for hl in allheadlines:
    print(hl)

Velocity (VEL) Alert: Johnson Fistel Investigates Velocity Financial, Inc.; Investors Suffering Losses Encouraged to Contact Firm
ALIGN DEADLINE ALERT: Faruqi & Faruqi, LLP Encourages Investors Who Suffered Losses Exceeding $100,000 In Align Technology, Inc. To Contact The Firm
ALLAKOS DEADLINE ALERT: Faruqi & Faruqi, LLP Encourages Investors Who Suffered Losses Exceeding $50,000 In Allakos Inc. To Contact The Firm
FUNKO LEAD PLAINTIFF DEADLINE ALERT: Faruqi & Faruqi, LLP Encourages Investors Who Suffered Losses Exceeding $50,000 In Funko, Inc. To Contact The Firm
WWE DEADLINE ALERT: Faruqi & Faruqi, LLP Encourages Investors Who Suffered Losses Exceeding $50,000 in World Wrestling Entertainment, Inc. to Contact the Firm
ROSEN, A GLOBALLY RECOGNIZED LAW FIRM, Reminds Golden Star Resources Ltd. Investors of Important Deadline in Securities Class Action – GSS
ROSEN, A GLOBALLY RECOGNIZED LAW FIRM, Reminds LogicBio Therapeutics, Inc. Investors of the Important Deadline in Securities Class Action First Filed by Firm – LOGC
...
Named Entity Extraction (NER) — make an API call to Thomson Reuters Intelligent Tagging (TRIT) with the news headline content
We next extract the entities and topics in the news using Thomson Reuters Intelligent Tagging (TRIT), also known as Open Calais.
Define the news item to be processed.
# Define sample content to be queried
contentText = allheadlines[1]
print(contentText)

Velocity (VEL) Alert: Johnson Fistel Investigates Velocity Financial, Inc.; Investors Suffering Losses Encouraged to Contact Firm
2.1 Query TRIT / OpenCalais JSON API
headType = "text/raw"
token = '<mytoken>'  # your TRIT / Open Calais access token
url = "https://api-eit.refinitiv.com/permid/calais"
payload = contentText.encode('utf8')
headers = {
    'Content-Type': headType,
    'X-AG-Access-Token': token,
    'outputformat': "application/json"
}

# The daily limit is 5,000 requests, and the concurrent limit varies by API from 1-4 calls per second.
TRITResponse = requests.request("POST", url, data=payload, headers=headers)

# Load content into JSON object
JSONResponse = json.loads(TRITResponse.text)
# print(json.dumps(JSONResponse, indent=4, sort_keys=True))
2.2 Get entities in news
# Get entities
print('====Entities====')
print('Type, Name')
for key in JSONResponse:
    if '_typeGroup' in JSONResponse[key]:
        if JSONResponse[key]['_typeGroup'] == 'entities':
            print(JSONResponse[key]['_type'] + ", " + JSONResponse[key]['name'])

====Entities====
Type, Name
Company, JOHNSON FISTEL
Company, velocity financial, inc.
2.3 Get RIC code for entity
# Get RIC code
print('====RIC====')
print('RIC')
for entity in JSONResponse:
    for info in JSONResponse[entity]:
        if info == 'resolutions':
            for companyinfo in JSONResponse[entity][info]:
                if 'primaryric' in companyinfo:
                    symbol = companyinfo['primaryric']
                    print(symbol)

====RIC====
RIC
VEL.N
2.4 Get topics for the news item
# Print header
print(symbol)
print('====Topics====')
print('Topics, Score')
for key in JSONResponse:
    if '_typeGroup' in JSONResponse[key]:
        if JSONResponse[key]['_typeGroup'] == 'topics':
            print(JSONResponse[key]['name'] + ", " + str(JSONResponse[key]['score']))

VEL.N
====Topics====
Topics, Score
Business_Finance, 1
Health_Medical_Pharma, 0.935
Disaster_Accident, 0.817
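The three loops above all walk the same response keyed by `_typeGroup`. As a sketch, they could be folded into one helper; the field names (`_typeGroup`, `_type`, `resolutions`, `primaryric`) match the Open Calais JSON used above, but the helper itself and the mock response are illustrative only:

```python
# Collect entities, topics, and primary RICs from a TRIT / Open Calais
# JSON response in a single pass over its top-level items.
def summarise_trit(response):
    entities, topics, rics = [], [], []
    for key, item in response.items():
        if not isinstance(item, dict):
            continue  # skip metadata values that are not item dicts
        group = item.get('_typeGroup')
        if group == 'entities':
            entities.append((item.get('_type'), item.get('name')))
            for res in item.get('resolutions', []):
                if 'primaryric' in res:
                    rics.append(res['primaryric'])
        elif group == 'topics':
            topics.append((item.get('name'), item.get('score')))
    return {'entities': entities, 'topics': topics, 'rics': rics}

# Mock response shaped like the output above, to avoid a live API call:
mock = {
    'e1': {'_typeGroup': 'entities', '_type': 'Company',
           'name': 'velocity financial, inc.',
           'resolutions': [{'primaryric': 'VEL.N'}]},
    't1': {'_typeGroup': 'topics', 'name': 'Business_Finance', 'score': 1},
}
print(summarise_trit(mock))
```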
4. Sentiment Analysis
# Define function to be used for text sentiment analysis
def get_sentiment(txt):
    '''
    Utility function to clean text by removing links, handles, and special
    characters using simple regex statements, then classify the sentiment
    of the cleaned text using TextBlob's sentiment method
    '''
    # Clean text (uses the `re` module imported above)
    clean_txt = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", txt).split())
    # Create TextBlob object of the cleaned text
    analysis = TextBlob(clean_txt)
    # Classify sentiment by polarity
    if analysis.sentiment.polarity > 0:
        return 'positive'
    elif analysis.sentiment.polarity == 0:
        return 'neutral'
    else:
        return 'negative'
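The cleaning step can be illustrated on its own with just the standard library; the sample strings below are invented for illustration:

```python
import re

def clean_text(txt):
    # Strip @handles, URLs, and any non-alphanumeric characters, then
    # collapse runs of whitespace -- the same regex as in get_sentiment.
    return ' '.join(
        re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", txt).split())

print(clean_text("Losses exceeding $50,000!"))           # → Losses exceeding 50 000
print(clean_text("@analyst see https://example.com now"))  # → see now
```

Note the dollar amount loses its punctuation, so a downstream model sees "50 000" rather than "$50,000"; whether that matters depends on the sentiment model used.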
When inspecting the sentiment from both headline and summary text, we see that the sentiment is negative as expected.
print('headline: ', allheadlines[1])
print('headline sentiment: ', get_sentiment(allheadlines[1]))
print('summary: ', summaries[1])
print('summary sentiment: ', get_sentiment(summaries[1]))

headline:  Velocity (VEL) Alert: Johnson Fistel Investigates Velocity Financial, Inc.; Investors Suffering Losses Encouraged to Contact Firm
headline sentiment:  negative
summary:  <p>SAN DIEGO, April 26, 2020 (GLOBE NEWSWIRE) -- Shareholder rights law firm Johnson Fistel, LLP is investigating potential violations of the federal securities laws by Velocity Financial, Inc. ("Velocity" or "the Company") (NYSE: VEL).<br></p>
summary sentiment:  negative
5. Get historical EOD price data
eod_api_token = '<mytoken>'  # your eodhistoricaldata.com API token
eod_symbol = symbol.replace('N', 'US')
eod_price_url = 'https://eodhistoricaldata.com/api/eod/' + eod_symbol + '?api_token=' + eod_api_token
price_df = pd.read_csv(eod_price_url)
price_df.sort_values(by=['Date'], inplace=True, ascending=False)
price_df.head()
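The `symbol.replace('N', 'US')` above happens to work for `VEL.N`, but a blind character replace would corrupt any RIC whose ticker itself contains an `N`. A safer sketch splits off the exchange suffix explicitly; the suffix map below is an assumption covering only the NYSE/Nasdaq cases, not a complete mapping:

```python
# Map a Reuters RIC to an eodhistoricaldata.com-style symbol by swapping
# the exchange suffix rather than replacing characters anywhere in the RIC.
RIC_SUFFIX_TO_EOD = {'N': 'US', 'O': 'US'}  # NYSE / Nasdaq -> US (assumed)

def ric_to_eod_symbol(ric):
    ticker, _, suffix = ric.rpartition('.')
    return ticker + '.' + RIC_SUFFIX_TO_EOD.get(suffix, suffix)

print(ric_to_eod_symbol('VEL.N'))   # → VEL.US
print(ric_to_eod_symbol('NVDA.O'))  # → NVDA.US
```

Unknown suffixes pass through unchanged, so unmapped exchanges fail loudly at the API call rather than silently fetching the wrong ticker.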
A future article will cover how the backtesting step could be performed.
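As a preview of that step, the simplest form of backtest here is an event study: go short at the close following a negative-sentiment headline and measure PnL after a fixed holding period. The function and the price series below are a toy sketch with invented numbers, not the post's eventual methodology:

```python
# Toy event study: short one unit at the close after a negative headline,
# cover `hold` days later. `closes` is ordered oldest to newest.
def short_pnl(closes, signal_day, hold=3):
    entry = closes[signal_day + 1]          # enter at the next day's close
    exit_ = closes[signal_day + 1 + hold]   # cover `hold` days after entry
    return round(entry - exit_, 2)          # short PnL = entry - exit

closes = [10.0, 9.6, 9.1, 8.8, 8.5, 8.7]   # invented EOD closes
print(short_pnl(closes, signal_day=0))     # → 1.1 (entered 9.6, covered 8.5)
```

A real backtest would also handle transaction costs, position sizing, and overlapping signals, which is where a framework like Zipline comes in.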