Topic Discovery in Python!


So I still haven’t figured out whether I want to make one blog post a week or more than one, but I will try to post at least once a week on topics in computer science. We’ll see where it goes; it will be very exciting and most certainly worth the click.

This week I plan on exploring a data set of over 5,000 film entries scraped from IMDb, in an effort to briefly discuss machine learning, particularly Latent Dirichlet Allocation (LDA). I will not go into any of the theory because that is beyond the scope of this blog; these aren’t the droids you’re looking for.

However, nltk and gensim provide extensive APIs for processing human language. Anything from stemming words down to their roots to tokenizing a document for further analysis is made easy with the modules imported below.
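As a quick taste of the two operations we’ll lean on, here is a minimal sketch of tokenizing and stemming a sentence (the sentence itself is made up for illustration):

```python
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer

tokenizer = RegexpTokenizer(r'\w+')   # keep runs of word characters, drop punctuation
stemmer = PorterStemmer()

tokens = tokenizer.tokenize("the marines stemmed the alien invasion")
stems = [stemmer.stem(t) for t in tokens]
print(stems)
```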


import pandas as pd
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
import numpy as np
import matplotlib.pyplot as pp
import re


Let’s start by reading in the csv file, movie_metadata.csv. A link to the Kaggle download is commented in the code below.



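The read looks something like this. Here I use a tiny inline stand-in for movie_metadata.csv so the snippet runs on its own; the real file comes from the Kaggle download and has these columns among many others:

```python
import io
import pandas as pd

# stand-in for movie_metadata.csv; the real file comes from the Kaggle
# download linked in the original code
csv_text = """movie_title,plot_keywords,genres
Avatar,avatar|future|marine|paraplegic|war,Action|Adventure|Sci-Fi
Spectre,bomb|spy|mi6|terrorist|agent,Action|Adventure|Thriller
"""

movie = pd.read_csv(io.StringIO(csv_text))
print(movie[['plot_keywords', 'genres']])
```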


Latent Dirichlet Allocation estimates, for a fixed number of topics, which topic each observed word is assigned to and how frequently those assignments occur across groups of words called documents. Let’s assume each document exhibits multiple topics. We will be looking at the columns plot_keywords and genres.




Next, let’s remove the pipe characters with a list comprehension and check that it worked.


keyword_strings = [str(d).replace("|", " ") for d in movie['plot_keywords']]

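The same sanity check on a couple of made-up keyword strings (in the post this is done on the real column):

```python
sample = ["avatar|future|marine|paraplegic|war", "bomb|spy|mi6"]
keyword_strings = [str(d).replace("|", " ") for d in sample]
print(keyword_strings)
# ['avatar future marine paraplegic war', 'bomb spy mi6']
```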



Stemming reduces words to their root form and is particularly useful when building insightful NLP models.


# keep only documents with exactly six keywords
docs = [d for d in keyword_strings if d.count(' ') == 5]

# create english stop words list
en_stop = get_stop_words('en')

# create p_stemmer of class PorterStemmer
# the stemmer reduces each word in a topic to its root word
p_stemmer = PorterStemmer()

# init regex tokenizer
tokenizer = RegexpTokenizer(r'\w+')

# for each document: clean and tokenize the document string,
# remove stop words from the tokens, stem the remaining tokens
# and collect them in texts
texts = []
for d in docs:
  tokens = tokenizer.tokenize(d.lower())
  stopped_tokens = [t for t in tokens if t not in en_stop]
  stemmed_tokens = [p_stemmer.stem(t) for t in stopped_tokens]
  texts.append(stemmed_tokens)


The next block of code transforms the tokenized documents into structures we can manipulate later: a dictionary mapping each term to an id, and a bag-of-words corpus recording each document-term relationship.


# turn our tokenized docs into a key-value dictionary
dictionary = corpora.Dictionary(texts)
# convert tokenized docs into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]


The immediate next line of code generates the Latent Dirichlet Allocation model, taking the corpus, the number of topics, and the number of training passes. Printing the model, we see an estimate of the observed words assigned to each topic, effectively (or ineffectively) predicted.




Let’s parse this output into something we can handle. We will pull the words and their weights out of each topic string, collect them per topic, and then plot the data.



# print_topics returns (topic_id, topic_string) pairs like
# (0, '0.045*"love" + 0.032*"friend" + ...')
top = ldamodel.print_topics(num_topics=2, num_words=5)

topic_str = []
for a in top:
  # pull the quoted words out of each topic string
  topic_str.append(" ".join(re.findall(r'"([^"]*)"', a[1])))

words0 = topic_str[0].split(" ")
words1 = topic_str[1].split(" ")

# pull the weights out of each topic string and pair them with their words
weights0 = list(map(float, re.findall(r'[+-]?[0-9.]+', top[0][1])))
weights1 = list(map(float, re.findall(r'[+-]?[0-9.]+', top[1][1])))
worddict0 = dict(zip(words0, weights0))
worddict1 = dict(zip(words1, weights1))

# sort each (word, weight) list by weight
sorted_list0 = [(k, v) for v, k in sorted([(v, k) for k, v in worddict0.items()])]
sorted_list1 = [(k, v) for v, k in sorted([(v, k) for k, v in worddict1.items()])]

y_pos = np.arange(5)

freqs = [a[1] for a in sorted_list0]
ws = [a[0] for a in sorted_list0]
freqs1 = [a[1] for a in sorted_list1]
ws1 = [a[0] for a in sorted_list1]

pp.bar(y_pos, freqs, align='center', alpha=0.5, color='coral')
pp.xticks(y_pos, ws)
pp.ylabel('word contributions')
pp.title('Predicted Topic 0 from IMDB Plot Keywords')
pp.show()

pp.bar(y_pos, freqs1, align='center', alpha=0.5, color='coral')
pp.xticks(y_pos, ws1)
pp.ylabel('word contributions')
pp.title('Predicted Topic 1 from IMDB Plot Keywords')
pp.show()



This process can then be repeated for any genre of film in the IMDb data set.

If you like these blog posts, or want to comment or share something, do so below and follow py-guy!