Topic Discovery in python!

07/23/2017

So I still haven’t figured out whether I want to make one blog post a week or more than one, but I will try to post at least once a week on topics in computer science. We’ll see where it goes; it will be very exciting and most certainly worth the click.

This week I plan on exploring a data set of over 5,000 film entries scraped from IMDb in order to briefly discuss machine learning, particularly Latent Dirichlet Allocation. I will not go into the theory, because that is beyond the scope of this blog; these aren’t the droids you’re looking for.

However, nltk and gensim provide extensive APIs for processing human language. Everything from stemming words down to their roots to tokenizing a document for further analysis is made easy with these modules.

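For instance, a minimal sketch of both operations (the sentence and the printed outputs are just illustrative):

from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer

tokenizer = RegexpTokenizer(r'\w+')           # keep runs of word characters, drop punctuation
p_stemmer = PorterStemmer()

tokens = tokenizer.tokenize("running through the avalanches")
print(tokens)                                 # ['running', 'through', 'the', 'avalanches']
print([p_stemmer.stem(t) for t in tokens])    # e.g. ['run', 'through', 'the', 'avalanch']
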
import pandas as pd
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
import numpy as np
import matplotlib.pyplot as pp
import re

 

Let’s start by reading in the CSV file, movie_metadata.csv. A link to the Kaggle download is commented in the code below.

## https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset
movie=pd.read_csv("movie_metadata.csv")
movie.head()

[Output: the first few rows of the movie DataFrame]

 

Latent Dirichlet Allocation models each document as a mixture over a fixed number of topics, with each topic a distribution over words; fitting the model estimates which topic generated each observed word and how strongly each word contributes to its topic. Let’s assume each document exhibits multiple topics. Here, each film’s plot_keywords entry will serve as a document, and the genres column will let us repeat the analysis per genre.

movie['plot_keywords']

 

Next, let’s replace each pipe with a space using a list comprehension and check that it was successful.

keyword_strings=[str(d).replace("|"," ") for d in movie['plot_keywords']]
keyword_strings[1]

[Output: a single keyword string with the pipes replaced by spaces]

Good!

 

Stemming reduces each word down to its root form and is particularly useful in developing insightful NLP models.

 

# keep only entries with exactly five spaces (six single-word keywords)
docs=[d for d in keyword_strings if d.count(' ')==5]
len(docs)
texts=[]

# create English stop words list
en_stop= get_stop_words('en')

# create p_stemmer of class PorterStemmer
# the stemmer reduces each word in a topic to its root form
p_stemmer= PorterStemmer()

# init regex tokenizer
tokenizer= RegexpTokenizer(r'\w+')

# for each document: clean and tokenize the document string,
# remove stop words from the tokens, stem the tokens and add to the list
for doc in docs:
  raw=doc.lower()
  tokens=tokenizer.tokenize(raw)
  stopped_tokens=[t for t in tokens if t not in en_stop]
  stemmed_tokens=[p_stemmer.stem(t) for t in stopped_tokens]
  texts.append(stemmed_tokens)
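
As a quick sanity check, each processed document is now a list of stemmed tokens (the output shown is illustrative; it depends on the data):

print(len(texts))   # number of six-keyword documents that survived the filter
print(texts[0])     # e.g. ['futur', 'marin', 'nativ', ...] with stop words removed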

 

The next block of code maps the tokenized documents into gensim’s numeric representation. To do so, let’s create a dictionary that assigns each unique term an integer id, and a bag-of-words matrix recording how often each term appears in each document.

 

# map each unique token to an integer id
dictionary= corpora.Dictionary(texts)
# convert each tokenized doc into a bag-of-words vector of (token_id, count) pairs
corpus=[dictionary.doc2bow(text) for text in texts]
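
To see what this representation looks like, peek at both objects (the ids and counts shown are illustrative):

print(list(dictionary.token2id.items())[:5])   # e.g. [('alien', 0), ('battl', 1), ...]
print(corpus[0])                               # e.g. [(0, 1), (1, 1), ...] as (token_id, count) pairs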

 

The next line of code builds the Latent Dirichlet Allocation model, taking the corpus, the number of topics, the id-to-word mapping, and the number of training passes over the corpus. Printing the model shows, for each topic, the observed words assigned the highest estimated weights, effectively (or ineffectively) predicted.

 

ldamodel=gensim.models.ldamodel.LdaModel(corpus,num_topics=2,id2word=dictionary,passes=20)
print(ldamodel.print_topics(num_topics=2,num_words=5))

[Output: the two topics, each printed as a weighted sum of its top five stemmed keywords]
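
The model can also report the topic mixture of an individual document through get_document_topics (the proportions below are illustrative):

# estimated topic proportions for the first document
print(ldamodel.get_document_topics(corpus[0]))
# e.g. [(0, 0.93), (1, 0.07)], meaning this document leans heavily on topic 0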

Let’s parse this printed output into something we can handle: the top words for each topic and their weights. Then we’ll plot each topic’s words as a bar chart.

 

top=ldamodel.print_topics(num_topics=2,num_words=5)
topic_num=[]
topic_str=[]
topic_freq=[]

for a in top:
  topic_num.append(a[0])
  # pull the five quoted words out of the topic string
  topic_str.append(" ".join(re.findall(r'"([^"]*)"',a[1])))
  # pull the five weights out; this assumes the keywords themselves contain no digits
  w0,w1,w2,w3,w4=map(float, re.findall(r'[+-]?[0-9.]+', a[1]))
  topic_freq.append((w0,w1,w2,w3,w4))

words0=topic_str[0].split(" ")
words1=topic_str[1].split(" ")

worddict0=dict(zip(words0,topic_freq[0]))
worddict1=dict(zip(words1,topic_freq[1]))

sorted_list0 = [(k,v) for v,k in sorted([(v,k) for k,v in worddict0.items()])]
sorted_list1 = [(k,v) for v,k in sorted([(v,k) for k,v in worddict1.items()])]

y_pos = np.arange(5)

freqs=[a[1] for a in sorted_list0]
ws=[a[0] for a in sorted_list0]
freqs1=[a[1] for a in sorted_list1]
ws1=[a[0] for a in sorted_list1]

pp.bar(y_pos, freqs, align='center', alpha=0.5, color=['coral'])
pp.xticks(y_pos, ws)
pp.ylabel('word contributions')
pp.title('Predicted Topic 0 from IMDB Plot Keywords')
pp.show()

pp.bar(y_pos, freqs1, align='center', alpha=0.5, color=['coral'])
pp.xticks(y_pos, ws1)
pp.ylabel('word contributions')
pp.title('Predicted Topic 1 from IMDB Plot Keywords')
pp.show()

[Plots: bar charts of the top five word weights for predicted Topics 0 and 1]
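
As an aside, parsing the printed strings with regular expressions is brittle. gensim can hand back the word/weight pairs directly through show_topics with formatted=False; here is a sketch, noting that the exact return shape varies a little between gensim versions:

# (topic_id, [(word, weight), ...]) pairs, no string parsing needed
for topic_id, pairs in ldamodel.show_topics(num_topics=2, num_words=5, formatted=False):
    print(topic_id, [w for w, _ in pairs], [f for _, f in pairs])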

This process can then be repeated for any genre of film in the IMDb data set.
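
For example, a minimal sketch of restricting to science fiction before rebuilding the documents (this assumes the genres column uses pipe-delimited labels such as 'Sci-Fi'):

# keep only films whose genres include Sci-Fi, then rebuild the keyword strings
scifi = movie[movie['genres'].str.contains('Sci-Fi', na=False)]
scifi_strings = [str(d).replace("|", " ") for d in scifi['plot_keywords']]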

If you like these blog posts, or want to comment or share something, do so below and follow py-guy!

2 thoughts on “Topic Discovery in python!”

  1. This is a nice analysis. Being a data visualization nitpicker, I’d like to make three notes. First, the X axis label is misleading; it’s definitely not “Word Frequency”. Secondly, may I ask what you use the color encoding for? Your graphs will be much better if you use a single color for the bars. You can use horizontal bars for easier reading too. Lastly, bar charts are best viewed when sorted, unless there’s an intrinsic order in the bars, which doesn’t seem to be the case here.

    Great post, otherwise.


    1. Thanks for commenting! I appreciate the constructive criticism. There isn’t any particular color encoding in the graphs, and I see how that might be confusing now. As to the order of words, the first 5 words belong to the first topic and words 6-10 to the second topic (which is why you might see repeated words). For clarity, I made sorted plots for each topic to compare the two.

      ...
      worddict1=dict(zip(words1,topic_freq[1]))
      
      
      sorted_list0 = [(k,v) for v,k in sorted([(v,k) for k,v in worddict0.items()])]
      sorted_list1 = [(k,v) for v,k in sorted([(v,k) for k,v in worddict1.items()])]
      
      
      y_pos = np.arange(5)
      freqs=[a[1] for a in sorted_list0]
      ws=[a[0] for a in sorted_list0]
      freqs1=[a[1] for a in sorted_list1]
      ws1=[a[0] for a in sorted_list1]
      
      pp.bar(y_pos, freqs, align='center', alpha=0.5, color=['coral'])
      pp.xticks(y_pos, ws)
      pp.ylabel('word contributions')
      pp.title('Predicted Topic 0 from SciFi Plot Keywords')
      pp.show()
      
      pp.bar(y_pos, freqs1, align='center', alpha=0.5, color=['coral'])
      pp.xticks(y_pos, ws1)
      pp.ylabel('word contributions')
      pp.title('Predicted Topic 1 from SciFi Plot Keywords')
      pp.show()
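
      A horizontal-bar version, as suggested, only needs barh and swapped axis labels; a sketch for the first topic:

      # same data, horizontal bars for easier reading
      pp.barh(y_pos, freqs, align='center', alpha=0.5, color='coral')
      pp.yticks(y_pos, ws)
      pp.xlabel('word contributions')
      pp.title('Predicted Topic 0 from SciFi Plot Keywords')
      pp.show()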
      

