Applying unsupervised Machine Learning to classify an unlabeled text document
Text classification is a problem where we have a fixed set of classes/categories and any given text is assigned to one of these categories. In contrast, text clustering is the task of grouping a set of unlabeled texts in such a way that texts in the same group (called a cluster) are more similar to each other than to those in other clusters.
Here I have created a document of my own which contains two kinds of sentences, related to either cricket or travelling.
document = ["This is the most beautiful place in the world.",
            "This man has more skills to show in cricket than any other game.",
            "Hi there! how was your ladakh trip last month?",
            "There was a player who had scored 200+ runs in single cricket innings in his career.",
            "I have got the opportunity to travel to Paris next year for my internship.",
            "May be he is better than you in batting but you are much better than him in bowling.",
            "That was really a great day for me when I was there at Lavasa for the whole night.",
            "That's exactly I wanted to become, a highest ratting batsmen ever with top scores.",
            "Does it really matter whether you go to Thailand or Goa, its just you have spend your holidays.",
            "Why don't you go to Switzerland next year for your 25th Wedding anniversary?",
            "Travel is fatal to prejudice, bigotry, and narrow mindedness, and many of our people need it sorely on these accounts.",
            "Stop worrying about the potholes in the road and enjoy the journey.",
            "No cricket team in the world depends on one or two players. The team always plays to win.",
            "Cricket is a team game. If you want fame for yourself, go play an individual game.",
            "Because in the end, you won't remember the time you spent working in the office or mowing your lawn. Climb that goddamn mountain.",
            "Isn't cricket supposed to be a team sport? I feel people should decide first whether cricket is a team game or an individual sport."]
The above document was created to show how to group the sentences into two different classes, since there are only two kinds of sentences in it.
So, we now need to import the necessary libraries, and here we go:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd
Here we have used TfidfVectorizer. So what is TF-IDF?
In information retrieval and text mining, term frequency-inverse document frequency, also called tf-idf, is a well-known method to evaluate how important a word is in a document. tf-idf is also a very interesting way to convert the textual representation of information into a Vector Space Model (VSM).
Google has already been using TF*IDF (or TF-IDF, TFIDF, TF.IDF, the artist formerly known as Prince) as a ranking factor for content for a long time, as the search engine seems to focus more on term frequency than on simply counting keywords.
TF*IDF is an information retrieval technique that weighs a term’s frequency (TF) and its inverse document frequency (IDF). Each word or term has its respective TF and IDF score. The product of the TF and IDF scores of a term is called the TF*IDF weight of that term.
The TF*IDF algorithm is used to weigh a keyword in any piece of content and assign it importance based on the number of times it appears in the document. More importantly, it checks how relevant the keyword is throughout the collection of documents, which is referred to as the corpus.
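To make the idea concrete, here is a tiny hand-rolled sketch of the TF*IDF computation. The toy corpus and variable names are purely illustrative, and scikit-learn's TfidfVectorizer actually uses a smoothed and normalized variant of this formula, so the numbers will not match exactly.
import math

toy_corpus = ["cricket is a team game", "travel is fun"]
term = "cricket"

# term frequency: how often the term occurs in the first text
tf = toy_corpus[0].split().count(term) / len(toy_corpus[0].split())

# inverse document frequency: terms that appear in fewer texts score higher
docs_with_term = sum(1 for doc in toy_corpus if term in doc.split())
idf = math.log(len(toy_corpus) / docs_with_term)

print(tf * idf)  # the TF*IDF weight of "cricket" in the first text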
We have to create a vectorizer using the TfidfVectorizer class to fit and transform the document we created above:
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(document)
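If you are curious about what fit_transform produced, you can peek at the sparse TF-IDF matrix as an optional sanity check; the exact sizes depend on the vocabulary left after stop-word removal, and on newer scikit-learn versions the method is get_feature_names_out instead of get_feature_names.
print(X.shape)  # (number of sentences, number of unique terms kept)
print(vectorizer.get_feature_names()[:10])  # a few of the learned vocabulary terms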
We now need to understand the K-means algorithm, which we are going to apply to our vectorized document.
K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed in a cunning way, because different locations cause different results, so the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to the data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point we need to re-calculate k new centroids as barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new centroid. This loop repeats until the centroids stop moving, at which point the algorithm has converged.
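To make that description concrete, here is a rough NumPy sketch of the loop (my own illustrative helper, not the implementation scikit-learn uses; it skips details such as k-means++ seeding and empty-cluster handling):
import numpy as np

def simple_kmeans(points, k, n_iter=100):
    # start with k centroids picked at random from the data points
    rng = np.random.RandomState(0)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        # assign every point to its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # recompute each centroid as the barycenter (mean) of its cluster
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so we have converged
        centroids = new_centroids
    return labels, centroids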
We will now implement our k-means clustering algorithm in our vectorized document below:
true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
We will get the below output:
Out[5]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
n_clusters=2, n_init=1, n_jobs=1, precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)
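Before looking at the top terms, it can be handy to check which cluster each of our 16 sentences landed in. Note that because no random_state was fixed above, the 0/1 labels may be swapped between runs:
print(model.labels_)  # one cluster label per sentence, in the order of the document list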
Now execute the below code to get the centroids and the feature names:
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
Now we can print the top terms that define each cluster:
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
We will get the below output:
Cluster 0:
better
game
ladakh
month
hi
trip
stop
journey
worrying
highest
Cluster 1:
cricket
team
world
year
really
game
travel
place
beautiful
skills
We can now pass a new sentence to the model, and it will tell us which cluster that sentence belongs to:
print("\n")
print("Prediction")
X = vectorizer.transform(["Nothing is easy in cricket. Maybe when you watch it on TV, it looks easy. But it is not. You have to use your brain and time the ball."])
predicted = model.predict(X)
print(predicted)
This will give the following output:
Prediction
[1]
So, here we have got the prediction [1], which means the sentence belongs to cluster 1, the one related to cricket, and indeed our test sentence is talking about cricket, so our prediction is correct. We can also test it with other sentences and see if this works, as sketched below. It is not true that this model will always give accurate results; to get more accurate results we need more data or more text, which can improve the model and provide better results.
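For instance, a travel-related test could look like the sketch below; the sentence here is my own, and the cluster index it maps to may be 0 or 1 depending on how the centroids were initialised in your run:
Y = vectorizer.transform(["I am planning a trip to Switzerland next year for a holiday."])
print(model.predict(Y))  # expect the travel cluster (cluster 0 in the run shown above)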
Hence, we have built our first text classifier, which can predict the class/cluster a sentence belongs to. I hope you enjoyed this article, and please do not forget to leave your comments and queries in the comments section. Also, please let me know if you find any areas of improvement here; I would really appreciate it.
You can also send me an email on vishabh1010@gmail.com or call me at +919538160491. You can also contact me over linkedin.