07/23/2017
So I still haven’t figured out whether I want to make one blog post a week or more than one, but I will try to post at least once a week on topics in computer science. We’ll see where it goes; it will be very exciting and most certainly worth the click.
This week I plan on exploring a data set of over 5,000 film entries scraped from IMDb in an effort to briefly discuss machine learning, particularly Latent Dirichlet Allocation (LDA). I won’t go into any of the theory because that is beyond the scope of this blog; these aren’t the droids you’re looking for.
However, nltk and gensim provide extensive APIs for processing human language. Anything from stemming words down to their roots to tokenizing a document for further analysis is made easy with these modules.
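As a quick taste of what these modules do (just a sketch on a made-up sentence, not part of the analysis below), here is tokenization followed by stop-word removal:

from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words

tokenizer = RegexpTokenizer(r'\w+')   # split on runs of word characters
tokens = tokenizer.tokenize("The Force is strong with this one".lower())
print(tokens)   # ['the', 'force', 'is', 'strong', 'with', 'this', 'one']

en_stop = get_stop_words('en')        # common English words to discard
print([t for t in tokens if t not in en_stop])   # keeps content words like 'force' and 'strong'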
import pandas as pd
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
import numpy as np
import matplotlib.pyplot as pp
import re
Let’s start by reading in the CSV file, movie_metadata.csv. A link to the Kaggle download is commented in the code below.
## https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset
movie = pd.read_csv("movie_metadata.csv")
movie.head()
Latent Dirichlet Allocation estimates topic assignments for the words in a collection of documents, along with the frequency of those assignments, for a fixed number of topics. We assume each document exhibits multiple topics. Here each film’s keyword list will act as a document, so we will be looking at the plot_keywords and genres columns.
movie['plot_keywords']
Next let’s remove the pipes with some list comprehension and check that it was successful.
keyword_strings = [str(d).replace("|", " ") for d in movie['plot_keywords']]
keyword_strings[1]
Good!
Stemming reduces words down to their root form and is particularly useful in developing insightful NLP models.
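For example (again just a sketch), the Porter stemmer collapses inflected forms onto a common root:

from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()
print([p_stemmer.stem(w) for w in ['murder', 'murdered', 'murderer']])
# ['murder', 'murder', 'murder']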
# keep documents with exactly six keywords (five spaces)
docs = [d for d in keyword_strings if d.count(' ') == 5]
len(docs)

texts = []

# create English stop words list
en_stop = get_stop_words('en')

# create p_stemmer of class PorterStemmer;
# the stemmer reduces each word to its root form
p_stemmer = PorterStemmer()

# init regex tokenizer
tokenizer = RegexpTokenizer(r'\w+')

# for each document: clean and tokenize the document string,
# remove stop words from the tokens, stem the remaining tokens,
# and add them to the list
for i in docs:
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)
    stopped_tokens = [t for t in tokens if t not in en_stop]
    stemmed_tokens = [p_stemmer.stem(t) for t in stopped_tokens]
    texts.append(stemmed_tokens)
The next block of code turns the tokenized documents into structures we can manipulate later. To do so, let’s create a dictionary mapping each term to an integer id, and a document-term matrix recording which terms appear in each document and how often.
# turn our tokenized docs into an (id, term) dictionary
dictionary = corpora.Dictionary(texts)

# convert tokenized docs into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]
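It’s worth peeking at what we just built; each corpus entry is a document as a list of (token_id, count) pairs, and the dictionary maps the ids back to terms (the exact output depends on your copy of the data):

print(texts[0])                      # stemmed tokens of the first document
print(corpus[0])                     # same document as (token_id, count) pairs
print(dictionary[corpus[0][0][0]])   # map the first id back to its term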
The next line of code generates the Latent Dirichlet Allocation model, taking the corpus, the number of topics, and the number of training passes. Printing the model, we see an estimate of the observed words assigned to each topic, effectively (or ineffectively) predicted.
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)
print(ldamodel.print_topics(num_topics=2, num_words=5))
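Once trained, the model can also infer a topic mixture for a document it has never seen. A quick sketch, using a made-up keyword string and the tokenizer, stemmer, and stop list from above:

new_doc = "police murder detective revenge city love"   # hypothetical keyword string
new_tokens = [p_stemmer.stem(t) for t in tokenizer.tokenize(new_doc.lower())
              if t not in en_stop]
new_bow = dictionary.doc2bow(new_tokens)
print(ldamodel.get_document_topics(new_bow))
# e.g. [(0, 0.63), (1, 0.37)] -- a mixture over our two topics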
Let’s parse this output into something we can handle. We will pull each topic’s words and weights into arrays to get a nice plot, and then plot the data.
top = ldamodel.print_topics(num_topics=2, num_words=5)

topic_num = []
topic_str = []
topic_freq = []
for a in top:
    topic_num.append(a[0])
    # pull the quoted words out of the topic string
    topic_str.append(" ".join(re.findall(r'"([^"]*)"', a[1])))
    # pull the five weights out of the topic string
    w0, w1, w2, w3, w4 = map(float, re.findall(r'[+-]?[0-9.]+', a[1]))
    topic_freq.append((w0, w1, w2, w3, w4))

words0 = topic_str[0].split(" ")
words1 = topic_str[1].split(" ")
words = words0 + words1

worddict0 = dict(zip(words0, topic_freq[0]))
worddict1 = dict(zip(words1, topic_freq[1]))

# sort each topic's words by weight
sorted_list0 = [(k, v) for v, k in sorted([(v, k) for k, v in worddict0.items()])]
sorted_list1 = [(k, v) for v, k in sorted([(v, k) for k, v in worddict1.items()])]

y_pos = np.arange(5)
freqs = [a[1] for a in sorted_list0]
ws = [a[0] for a in sorted_list0]
freqs1 = [a[1] for a in sorted_list1]
ws1 = [a[0] for a in sorted_list1]

pp.bar(y_pos, freqs, align='center', alpha=0.5, color=['coral'])
pp.xticks(y_pos, ws)
pp.ylabel('word contributions')
pp.title('Predicted Topic 0 from IMDB Plot Keywords')
pp.show()

pp.bar(y_pos, freqs1, align='center', alpha=0.5, color=['coral'])
pp.xticks(y_pos, ws1)
pp.ylabel('word contributions')
pp.title('Predicted Topic 1 from IMDB Plot Keywords')
pp.show()
This process can then be repeated for any genre of film in the IMDb data set.
If you like these blog posts or want to comment and/or share something, do so below and follow py-guy!