How to process textual data using TF-IDF in Python
Computers are good with numbers, but much less so with text. One of the most widely used techniques for processing textual data is TF-IDF. In this article, we will learn how it works and what makes it useful.
Intuitively, we might expect the words that appear most often to carry the greatest weight in textual data analysis, but that's not always the case. Words such as "the", "will", and "you" (called stopwords) appear the most in a corpus of text, yet they carry very little meaning. Instead, it is the rare words that actually help distinguish between documents, and they deserve more weight.
An introduction to TF-IDF
TF-IDF stands for "Term Frequency - Inverse Document Frequency". First, we will learn what each term means mathematically.
Term Frequency (tf): gives us the frequency of a word in each document in the corpus. It is the ratio of the number of times the word appears in a document to the total number of words in that document, so it increases as the word occurs more often within the document. Each document has its own tf.
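In symbols, for word i in document j:

tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}

where n_{i,j} is the number of times word i occurs in document j, and the denominator is the total number of words in document j.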
Inverse Document Frequency (idf): used to calculate the weight of rare words across all documents in the corpus. Words that occur rarely in the corpus have a high IDF score. It is given by the equation below.
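idf_i = \log\left(\frac{N}{df_i}\right)

where N is the total number of documents in the corpus and df_i is the number of documents containing word i.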
Combining these two, we come up with the TF-IDF score (w) for a word in a document in the corpus. It is the product of tf and idf:
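w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right)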
Let’s take an example to get a clearer understanding.
Sentence 1: The car is driven on the road.
Sentence 2: The truck is driven on the highway.
In this example, each sentence is a separate document.
We will now calculate the TF-IDF for the above two documents, which represent our corpus, using the base-10 logarithm. Each sentence contains seven words, with "the" appearing twice in each:
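| Word    | TF (Sentence 1) | TF (Sentence 2) | IDF              | TF-IDF (Sentence 1) | TF-IDF (Sentence 2) |
|---------|-----------------|-----------------|------------------|---------------------|---------------------|
| the     | 2/7             | 2/7             | log(2/2) = 0     | 0                   | 0                   |
| car     | 1/7             | 0               | log(2/1) ≈ 0.301 | 0.043               | 0                   |
| truck   | 0               | 1/7             | log(2/1) ≈ 0.301 | 0                   | 0.043               |
| is      | 1/7             | 1/7             | log(2/2) = 0     | 0                   | 0                   |
| driven  | 1/7             | 1/7             | log(2/2) = 0     | 0                   | 0                   |
| on      | 1/7             | 1/7             | log(2/2) = 0     | 0                   | 0                   |
| road    | 1/7             | 0               | log(2/1) ≈ 0.301 | 0.043               | 0                   |
| highway | 0               | 1/7             | log(2/1) ≈ 0.301 | 0                   | 0.043               |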
From the above table, we can see that the TF-IDF of common words is zero, which shows they are not significant. On the other hand, the TF-IDF of "car", "truck", "road", and "highway" is non-zero: these words carry more significance.
Using Python to calculate TF-IDF
Let's now code TF-IDF in Python from scratch. After that, we will see how we can use sklearn to automate the process.
The function computeTF computes the TF score for each word in the corpus, by document.
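A minimal sketch follows; it assumes each document is represented by a dictionary of word counts over the corpus vocabulary (wordDict) together with the document's list of words (bagOfWords); these names are illustrative, not the article's original listing:

```python
def computeTF(wordDict, bagOfWords):
    # Term frequency: each word's count divided by the total
    # number of words in the document.
    tfDict = {}
    bagOfWordsCount = len(bagOfWords)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bagOfWordsCount)
    return tfDict
```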
The function computeIDF computes the IDF score of every word in the corpus.
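Again as a sketch under the same assumptions, taking the corpus as a list of per-document word-count dictionaries that share the same vocabulary keys:

```python
import math

def computeIDF(documents):
    # documents: a list of per-document word-count dictionaries.
    N = len(documents)
    # Count how many documents each word appears in.
    idfDict = dict.fromkeys(documents[0].keys(), 0)
    for document in documents:
        for word, count in document.items():
            if count > 0:
                idfDict[word] += 1
    # idf = log10(N / document frequency)
    for word, df in idfDict.items():
        idfDict[word] = math.log10(N / float(df))
    return idfDict
```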
The function computeTFIDF below computes the TF-IDF score for each word, by multiplying the TF and IDF scores.
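A matching sketch:

```python
def computeTFIDF(tfDict, idfDict):
    # Multiply each word's TF by its IDF to get its TF-IDF score.
    tfidf = {}
    for word, tfValue in tfDict.items():
        tfidf[word] = tfValue * idfDict[word]
    return tfidf
```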
The output produced by the above code for our two documents is the same as what we calculated manually in the table above.
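Putting the pieces together (the variable names below are illustrative):

```python
documentA = "The car is driven on the road"
documentB = "The truck is driven on the highway"

bagOfWordsA = documentA.lower().split()
bagOfWordsB = documentB.lower().split()

# Build a shared vocabulary so both count dictionaries have the same keys.
uniqueWords = set(bagOfWordsA).union(set(bagOfWordsB))

wordCountA = dict.fromkeys(uniqueWords, 0)
for word in bagOfWordsA:
    wordCountA[word] += 1

wordCountB = dict.fromkeys(uniqueWords, 0)
for word in bagOfWordsB:
    wordCountB[word] += 1

tfA = computeTF(wordCountA, bagOfWordsA)
tfB = computeTF(wordCountB, bagOfWordsB)
idfs = computeIDF([wordCountA, wordCountB])

print(computeTFIDF(tfA, idfs))
print(computeTFIDF(tfB, idfs))
```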
You can refer to this link for the complete implementation.
sklearn
Now we will see how we can implement this using sklearn in Python.
First, we will import TfidfVectorizer from sklearn.feature_extraction.text:
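```python
from sklearn.feature_extraction.text import TfidfVectorizer
```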
Now we will initialise the vectorizer and then call fit and transform over it to calculate the TF-IDF score for the text.
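For our two example sentences (variable names are our own):

```python
vectorizer = TfidfVectorizer()
response = vectorizer.fit_transform([
    "The car is driven on the road",
    "The truck is driven on the highway",
])
```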
Under the hood, sklearn's fit_transform executes the vectorizer's fit and transform functions. These can be found in the official sklearn repository on GitHub.
One thing to notice is that sklearn does not use the plain log(N/df) formula. With the default smooth_idf=True, 1 is added to both the document count and the document frequencies (as if every term had been seen in one extra document), and 1 is added to the logarithm itself. That final +1 ensures that words occurring in every document, which would otherwise have an IDF of zero, don't get suppressed entirely.
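The smoothed formula sklearn uses is:

idf(t) = \ln\left(\frac{1 + n}{1 + df(t)}\right) + 1

where n is the number of documents and df(t) is the number of documents containing term t.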
The output is a sparse matrix, and each row is then L2-normalised to produce the final TF-IDF vectors.
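To inspect the result (get_feature_names_out is available from sklearn 1.0; older versions expose get_feature_names):

```python
print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(response.toarray())                  # dense, L2-normalised TF-IDF rows
```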
Thus we saw how easily we can code TF-IDF in just 4 lines using sklearn, and how powerful a tool it is for processing the text in a corpus. To learn more about sklearn's TF-IDF implementation, you can use this link.
Happy coding!
Thanks for reading! Be sure to clap and recommend this article if you found it helpful.
For more about programming, follow me, and you'll get notified whenever I publish a new post.
Cheers!
Also, let's connect on Twitter, LinkedIn, GitHub, and Facebook.