News Named Entity Extraction (NER) and Sentiment Analysis
Summary
This post illustrates a stylised news analysis pipeline built from publicly available APIs. It includes the following steps:
- Fetch news — read a news source via an RSS feed (feedparser)
- Extract entities — perform named entity recognition (NER) on the unstructured text (Thomson Reuters Intelligent Tagging (TRIT) / Refinitiv Open Calais)
- Filter on news and entities — filter on entities and events of interest
- Sentiment analysis — extract sentiment from each news item (TextBlob)
- Find signal — correlate sentiment with price movement
- Historical EOD prices — fetch historical prices (eodhistoricaldata.com)
- Backtesting — back-test for PnL performance (internal or Zipline)
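The steps above can be sketched end to end as a single driver function. The step functions below are hypothetical stubs standing in for the real library calls made later in the post (feedparser, TRIT, TextBlob), shown only to make the data flow concrete:

```python
# Hypothetical pipeline skeleton; every step is a stub, not a real API call.
def fetch_news():
    return ["ACME Corp faces securities class action"]

def extract_entities(headline):
    # A real implementation would call TRIT / Open Calais here.
    return [{"type": "Company", "name": "ACME Corp", "ric": "ACME.N"}]

def get_sentiment_label(text):
    # A real implementation would use TextBlob polarity here.
    return "negative"

def run_pipeline():
    results = []
    for headline in fetch_news():
        for entity in extract_entities(headline):
            results.append({
                "headline": headline,
                "entity": entity["name"],
                "ric": entity["ric"],
                "sentiment": get_sentiment_label(headline),
            })
    return results

print(run_pipeline())
```

Each stage hands a plain dict to the next, which is also roughly how the real pipeline below is stitched together.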
We first install and import the libraries required for our tasks, such as feedparser for parsing news feeds and TextBlob for sentiment analysis. Note that `re` is also imported, since the sentiment helper later cleans text with regular expressions.
!pip install psycopg2-binary
!pip install feedparser

import re
import pandas as pd
import numpy as np
from textblob import TextBlob
import feedparser
import requests
import json
import yaml
For this example, we use GlobeNewswire RSS feeds by topic.
# Dictionary of RSS feeds that we will fetch and combine
# GlobeNewswire / Europe - http://www.globenewswire.com/Rss/List
newsurls = {
    'globenewswire-us': 'http://www.globenewswire.com/RssFeed/country/United%20States/feedTitle/GlobeNewswire%20-%20News%20from%20United%20States',
}
Fetch News from RSS feed
Define some convenience functions to iterate through all news items for a given RSS url.
# Function to fetch the RSS feed and return the parsed result
def parse_rss(rss_url):
    return feedparser.parse(rss_url)

# Function grabs the RSS feed headlines (titles) and returns them as a list
def get_headlines(rss_url):
    headlines = []
    feed = parse_rss(rss_url)
    for newsitem in feed['items']:
        headlines.append(newsitem['title'])
    return headlines

# Function grabs the RSS feed summaries and returns them as a list
def get_summaries(rss_url):
    summaries = []
    feed = parse_rss(rss_url)
    for newsitem in feed['items']:
        summaries.append(newsitem['summary'])
    return summaries

# Function returns the keys available on each news item
def get_entries(rss_url):
    entries = []
    feed = parse_rss(rss_url)
    for newsitem in feed['items']:
        entries.append(newsitem.keys())
    return entries
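The three helpers above differ only in which field they collect. As a sketch (a hypothetical refactor, not part of the original post), they could be collapsed into one generic extractor that takes an already-parsed feed, which also makes it easy to exercise on a mock feed without a network call:

```python
# Generic field extractor over a parsed feed. `field=None` returns the raw
# key views, mirroring get_entries above; any feedparser-style dict with an
# 'items' list works.
def get_fields(feed, field=None):
    out = []
    for newsitem in feed['items']:
        out.append(newsitem[field] if field else newsitem.keys())
    return out

# Example on a mock parsed feed (invented data, no network needed):
mock_feed = {'items': [{'title': 'Headline A', 'summary': 'Summary A'},
                       {'title': 'Headline B', 'summary': 'Summary B'}]}
print(get_fields(mock_feed, 'title'))    # → ['Headline A', 'Headline B']
print(get_fields(mock_feed, 'summary'))  # → ['Summary A', 'Summary B']
```

Passing the parsed feed rather than the URL keeps the function pure and testable; the URL-fetching stays in `parse_rss`.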
Inspect entries available in news feed
# Inspect the entries available in the RSS feed
entries = []

# Iterate over the feed urls
for key, url in newsurls.items():
    # Combine the returned key sets across feeds
    entries.extend(get_entries(url))

print(entries[0])

dict_keys(['id', 'guidislink', 'link', 'links', 'tags', 'title', 'title_detail', 'summary', 'summary_detail', 'published', 'published_parsed', 'dc_identifier', 'language', 'publisher', 'publisher_detail', 'contributors', 'dc_modified', 'dc_keyword'])
Assign summaries and headlines for each news item.
# Lists to hold all headlines and summaries
allheadlines = []
summaries = []

# Iterate over the feed urls
for key, url in newsurls.items():
    # Combine the returned headlines and summaries across feeds
    allheadlines.extend(get_headlines(url))
    summaries.extend(get_summaries(url))
View headlines
Below is a sample list of headlines, such as securities class action lawsuits.
# Iterate over the allheadlines list and print each headline
for hl in allheadlines:
    print(hl)

Velocity (VEL) Alert: Johnson Fistel Investigates Velocity Financial, Inc.; Investors Suffering Losses Encouraged to Contact Firm
ALIGN DEADLINE ALERT: Faruqi & Faruqi, LLP Encourages Investors Who Suffered Losses Exceeding $100,000 In Align Technology, Inc. To Contact The Firm
ALLAKOS DEADLINE ALERT: Faruqi & Faruqi, LLP Encourages Investors Who Suffered Losses Exceeding $50,000 In Allakos Inc. To Contact The Firm
FUNKO LEAD PLAINTIFF DEADLINE ALERT: Faruqi & Faruqi, LLP Encourages Investors Who Suffered Losses Exceeding $50,000 In Funko, Inc. To Contact The Firm
WWE DEADLINE ALERT: Faruqi & Faruqi, LLP Encourages Investors Who Suffered Losses Exceeding $50,000 in World Wrestling Entertainment, Inc. to Contact the Firm
ROSEN, A GLOBALLY RECOGNIZED LAW FIRM, Reminds Golden Star Resources Ltd. Investors of Important Deadline in Securities Class Action – GSS
ROSEN, A GLOBALLY RECOGNIZED LAW FIRM, Reminds LogicBio Therapeutics, Inc. Investors of the Important Deadline in Securities Class Action First Filed by Firm – LOGC
...
Named Entity Extraction (NER) — make an API call to Thomson Reuters Intelligent Tagging (TRIT) with the news headline content
We next extract the entities and topics in the news using Thomson Reuters Intelligent Tagging (TRIT), also known as Open Calais.
Define the news item to be processed.
# Define sample content to be queried
contentText = allheadlines[1]
print(contentText)

Velocity (VEL) Alert: Johnson Fistel Investigates Velocity Financial, Inc.; Investors Suffering Losses Encouraged to Contact Firm
2.1 Query TRIT / OpenCalais JSON API
headType = "text/raw"
token = '<mytoken>'  # your TRIT / Open Calais access token
url = "https://api-eit.refinitiv.com/permid/calais"
payload = contentText.encode('utf8')
headers = {
    'Content-Type': headType,
    'X-AG-Access-Token': token,
    'outputformat': "application/json"
}

# The daily limit is 5,000 requests, and the concurrent limit varies by API from 1-4 calls per second.
TRITResponse = requests.request("POST", url, data=payload, headers=headers)

# Load content into JSON object
JSONResponse = json.loads(TRITResponse.text)
# print(json.dumps(JSONResponse, indent=4, sort_keys=True))
2.2 Get entities in news
# Get entities
print('====Entities====')
print('Type, Name')
for key in JSONResponse:
    if '_typeGroup' in JSONResponse[key]:
        if JSONResponse[key]['_typeGroup'] == 'entities':
            print(JSONResponse[key]['_type'] + ", " + JSONResponse[key]['name'])

====Entities====
Type, Name
Company, JOHNSON FISTEL
Company, velocity financial, inc.
2.3 Get RIC code for entity
# Get RIC code
print('====RIC====')
print('RIC')
for entity in JSONResponse:
    for info in JSONResponse[entity]:
        if info == 'resolutions':
            for companyinfo in JSONResponse[entity][info]:
                if 'primaryric' in companyinfo:
                    symbol = companyinfo['primaryric']
                    print(symbol)

====RIC====
RIC
VEL.N
2.4 Get topics for the news item
# Print header
print(symbol)
print('====Topics====')
print('Topics, Score')
for key in JSONResponse:
    if '_typeGroup' in JSONResponse[key]:
        if JSONResponse[key]['_typeGroup'] == 'topics':
            print(JSONResponse[key]['name'] + ", " + str(JSONResponse[key]['score']))

VEL.N
====Topics====
Topics, Score
Business_Finance, 1
Health_Medical_Pharma, 0.935
Disaster_Accident, 0.817
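The three loops above all walk the same response keyed by `_typeGroup`. As a sketch, they could be folded into one helper; the field names (`_typeGroup`, `_type`, `resolutions`, `primaryric`) match the Open Calais JSON used above, but the helper itself and the mock response are illustrative only:

```python
# Collect entities, topics, and primary RICs from a TRIT / Open Calais
# JSON response in a single pass over its top-level items.
def summarise_trit(response):
    entities, topics, rics = [], [], []
    for key, item in response.items():
        if not isinstance(item, dict):
            continue  # skip metadata values that are not item dicts
        group = item.get('_typeGroup')
        if group == 'entities':
            entities.append((item.get('_type'), item.get('name')))
            for res in item.get('resolutions', []):
                if 'primaryric' in res:
                    rics.append(res['primaryric'])
        elif group == 'topics':
            topics.append((item.get('name'), item.get('score')))
    return {'entities': entities, 'topics': topics, 'rics': rics}

# Mock response shaped like the output above, to avoid a live API call:
mock = {
    'e1': {'_typeGroup': 'entities', '_type': 'Company',
           'name': 'velocity financial, inc.',
           'resolutions': [{'primaryric': 'VEL.N'}]},
    't1': {'_typeGroup': 'topics', 'name': 'Business_Finance', 'score': 1},
}
print(summarise_trit(mock))
```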
4. Sentiment Analysis
# Define function to be used for text sentiment analysis
def get_sentiment(txt):
    '''
    Utility function to clean text by removing links, handles, and special
    characters using simple regex statements, then classify the sentiment
    of the cleaned text using TextBlob's sentiment method
    '''
    # Clean text (uses the `re` module imported above)
    clean_txt = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", txt).split())
    # Create TextBlob object of the cleaned text
    analysis = TextBlob(clean_txt)
    # Classify sentiment by polarity
    if analysis.sentiment.polarity > 0:
        return 'positive'
    elif analysis.sentiment.polarity == 0:
        return 'neutral'
    else:
        return 'negative'
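The cleaning step can be illustrated on its own with just the standard library; the sample strings below are invented for illustration:

```python
import re

def clean_text(txt):
    # Strip @handles, URLs, and any non-alphanumeric characters, then
    # collapse runs of whitespace -- the same regex as in get_sentiment.
    return ' '.join(
        re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", txt).split())

print(clean_text("Losses exceeding $50,000!"))           # → Losses exceeding 50 000
print(clean_text("@analyst see https://example.com now"))  # → see now
```

Note the dollar amount loses its punctuation, so a downstream model sees "50 000" rather than "$50,000"; whether that matters depends on the sentiment model used.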
When inspecting the sentiment from both headline and summary text, we see that the sentiment is negative as expected.
print('headline: ', allheadlines[1])
print('headline sentiment: ', get_sentiment(allheadlines[1]))
print('summary: ', summaries[1])
print('summary sentiment: ', get_sentiment(summaries[1]))

headline:  Velocity (VEL) Alert: Johnson Fistel Investigates Velocity Financial, Inc.; Investors Suffering Losses Encouraged to Contact Firm
headline sentiment:  negative
summary:  <p>SAN DIEGO, April 26, 2020 (GLOBE NEWSWIRE) -- Shareholder rights law firm Johnson Fistel, LLP is investigating potential violations of the federal securities laws by Velocity Financial, Inc. ("Velocity" or "the Company") (NYSE: VEL).<br></p>
summary sentiment:  negative
5. Get historical EOD price data
eod_api_token = '<mytoken>'  # your eodhistoricaldata.com API token
eod_symbol = symbol.replace('N', 'US')
eod_price_url = 'https://eodhistoricaldata.com/api/eod/' + eod_symbol + '?api_token=' + eod_api_token
price_df = pd.read_csv(eod_price_url)
price_df.sort_values(by=['Date'], inplace=True, ascending=False)
price_df.head()
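The `symbol.replace('N', 'US')` above happens to work for `VEL.N`, but a blind character replace would corrupt any RIC whose ticker itself contains an `N`. A safer sketch splits off the exchange suffix explicitly; the suffix map below is an assumption covering only the NYSE/Nasdaq cases, not a complete mapping:

```python
# Map a Reuters RIC to an eodhistoricaldata.com-style symbol by swapping
# the exchange suffix rather than replacing characters anywhere in the RIC.
RIC_SUFFIX_TO_EOD = {'N': 'US', 'O': 'US'}  # NYSE / Nasdaq -> US (assumed)

def ric_to_eod_symbol(ric):
    ticker, _, suffix = ric.rpartition('.')
    return ticker + '.' + RIC_SUFFIX_TO_EOD.get(suffix, suffix)

print(ric_to_eod_symbol('VEL.N'))   # → VEL.US
print(ric_to_eod_symbol('NVDA.O'))  # → NVDA.US
```

Unknown suffixes pass through unchanged, so unmapped exchanges fail loudly at the API call rather than silently fetching the wrong ticker.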
A future article will cover how the backtesting step could be performed.
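As a preview of that step, the simplest form of backtest here is an event study: go short at the close following a negative-sentiment headline and measure PnL after a fixed holding period. The function and the price series below are a toy sketch with invented numbers, not the post's eventual methodology:

```python
# Toy event study: short one unit at the close after a negative headline,
# cover `hold` days later. `closes` is ordered oldest to newest.
def short_pnl(closes, signal_day, hold=3):
    entry = closes[signal_day + 1]          # enter at the next day's close
    exit_ = closes[signal_day + 1 + hold]   # cover `hold` days after entry
    return round(entry - exit_, 2)          # short PnL = entry - exit

closes = [10.0, 9.6, 9.1, 8.8, 8.5, 8.7]   # invented EOD closes
print(short_pnl(closes, signal_day=0))     # → 1.1 (entered 9.6, covered 8.5)
```

A real backtest would also handle transaction costs, position sizing, and overlapping signals, which is where a framework like Zipline comes in.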