Topic Discovery in Python!


So I still haven’t figured out whether I want to make one blog post a week or more than one, but I will try to post at least once a week on topics in computer science. We’ll see where it goes; it will be very exciting and most certainly worth the click.

This week I plan on exploring a data set of over 5,000 film entries scraped from IMDb in an effort to briefly discuss machine learning, particularly Latent Dirichlet Allocation (LDA). I will not go into any of the theory, because that is beyond the scope of this blog; these aren’t the droids you’re looking for.

However, nltk and gensim provide extensive APIs for processing human language. Everything from stemming words down to their roots to tokenizing a document for further analysis is made easy with these modules.


import pandas as pd
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
import numpy as np
import matplotlib.pyplot as pp
import re


Let’s start by reading in the CSV file, movie_metadata.csv. A link to the Kaggle download is commented in the code below.



[screenshot: reading movie_metadata.csv into a pandas DataFrame]


Latent Dirichlet Allocation is used to estimate word-topic assignments and the frequency of those assignments across collections of words called documents. We will assume each document exhibits multiple topics. Here we will be looking at the plot_keywords and genres columns.




Next, let’s remove the pipe separators with a list comprehension and check that it worked.


keyword_strings = [str(d).replace("|", " ") for d in movie['plot_keywords']]

[screenshot: keyword_strings with pipes replaced by spaces]



Stemming reduces words down to their root form and is particularly useful in developing insightful NLP models.
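For instance, nltk’s PorterStemmer maps inflected forms onto a shared stem (note the stem need not be a dictionary word):

```python
from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()
# inflected forms collapse to a common root
print(p_stemmer.stem('running'))  # run
print(p_stemmer.stem('movies'))   # movi
```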


docs = [d for d in keyword_strings if d.count(' ') == 5]

# create English stop words list
en_stop = get_stop_words('en')

# create p_stemmer of class PorterStemmer
# the stemmer reduces each word in a topic to its root word
p_stemmer = PorterStemmer()

# init regex tokenizer
tokenizer = RegexpTokenizer(r'\w+')

# for each document: lowercase and tokenize the document string,
# remove stop words from the tokens, stem the remaining tokens,
# and collect the result in texts
texts = []
for i in docs:
    tokens = tokenizer.tokenize(str(i).lower())
    stopped_tokens = [t for t in tokens if t not in en_stop]
    stemmed_tokens = [p_stemmer.stem(t) for t in stopped_tokens]
    texts.append(stemmed_tokens)


The next block of code transforms the granular data into sets of identifiable tokens to manipulate later. To do so, let’s create a dictionary mapping each term to an integer id and a bag-of-words matrix capturing each document-term relationship.


# map each unique token in our tokenized docs to an integer id
dictionary = corpora.Dictionary(texts)
# convert each tokenized doc into a bag-of-words vector
corpus = [dictionary.doc2bow(text) for text in texts]


The next line of code generates the Latent Dirichlet Allocation model, taking the corpus, the number of topics, and the number of training passes. Printing the model, we see an estimate of the observed words assigned to each topic, effectively (or ineffectively) predicted.




Let’s parse this output into something we can handle. We will pull each topic’s words and weights into one structure to get a nice plot, and then plot the data.



# ldamodel is the LDA model trained above
top = ldamodel.print_topics(num_topics=2, num_words=5)

topic_str = []
topic_weights = []
for a in top:
    # each topic string looks like '0.045*"war" + 0.032*"love" + ...'
    topic_str.append(" ".join(re.findall(r'"([^"]*)"', a[1])))
    topic_weights.append([float(w) for w in re.findall(r'[+-]?[0-9]+\.[0-9]+', a[1])])

words0 = topic_str[0].split(" ")
words1 = topic_str[1].split(" ")

worddict0 = dict(zip(words0, topic_weights[0]))
worddict1 = dict(zip(words1, topic_weights[1]))

# sort each topic's words by weight
sorted_list0 = [(k, v) for v, k in sorted([(v, k) for k, v in worddict0.items()])]
sorted_list1 = [(k, v) for v, k in sorted([(v, k) for k, v in worddict1.items()])]

y_pos = np.arange(5)

freqs = [a[1] for a in sorted_list0]
ws = [a[0] for a in sorted_list0]
freqs1 = [a[1] for a in sorted_list1]
ws1 = [a[0] for a in sorted_list1], freqs, align='center', alpha=0.5, color='coral')
pp.xticks(y_pos, ws)
pp.ylabel('word contributions')
pp.title('Predicted Topic 0 from IMDB Plot Keywords')

pp.figure(), freqs1, align='center', alpha=0.5, color='coral')
pp.xticks(y_pos, ws1)
pp.ylabel('word contributions')
pp.title('Predicted Topic 1 from IMDB Plot Keywords')



This process can then be repeated for any genre of film in the IMDb data set.

If you like these blog posts, or want to comment or share something, do so below and follow py-guy!

Solar Radiation Prediction


scikit-learn is a fantastic set of tools for machine learning in Python. It is built on numpy, scipy, and matplotlib, introduced in the first py-guy post, and makes data analysis and visualization simple and intuitive. scikit-learn provides classification, regression, clustering, dimensionality reduction, model selection, and preprocessing algorithms, making data analysis in Python accessible to everyone. We will cover an example of linear regression in this week’s post, exploring solar radiation data from a NASA hackathon.

First, after importing packages, let’s read in the SolarPrediction.csv data set. The link to the data set is commented in the code block.


Taking a first look at the data set, specifically the UNIXTime and Data (date) columns, note that they are not stored as datetime types, so we will come back to this later.
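A quick way to confirm this is to inspect the dtypes; the row below is a hypothetical stand-in for the real file:

```python
import pandas as pd

# illustrative stand-in row; the real data set stores these as raw values
df = pd.DataFrame({'UNIXTime': [1475229326],
                   'Data': ['9/29/2016 12:00:00 AM'],
                   'Time': ['23:55:26']})
# UNIXTime is a plain integer; Data and Time are plain strings,
# none of them a datetime64 dtype yet
print(df.dtypes)
```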




Calling the describe method on the data frame returns descriptive statistics for the data set and suggests there might be a relationship between radiation, humidity, and/or temperature.


So let’s look at a correlation plot to get a better feel for any possible relationships.

import seaborn as sns

truthmat = df.corr()
sns.heatmap(truthmat, vmax=.8, square=True)


There is a strong relationship between radiation and temperature (unsurprisingly or surprisingly), so let’s choose two features with some ambiguity. Pressure and Temperature will do fine. We will use seaborn, a statistical visualization library built on matplotlib, to explore the relationship between the two features.

p = sns.jointplot(x="Pressure", y="Temperature", data=df)
p.fig.suptitle('Temperature vs. Pressure')



There is a clear positive trend, albeit a noisy one because of the low pressure gradient. Let’s do some quick feature engineering to get a better look at the trend.


#Convert time to_datetime
df['Time_conv'] = pd.to_datetime(df['Time'], format='%H:%M:%S')

#Add column 'hour'
df['hour'] = pd.to_datetime(df['Time_conv'], format='%H:%M:%S').dt.hour

#Add column 'month'
df['month'] = pd.to_datetime(df['UNIXTime'].astype(int), unit='s').dt.month

#Add column 'year'
df['year'] = pd.to_datetime(df['UNIXTime'].astype(int), unit='s').dt.year

#Duration of Day
df['total_time'] = pd.to_datetime(df['TimeSunSet'], format='%H:%M:%S').dt.hour - pd.to_datetime(df['TimeSunRise'], format='%H:%M:%S').dt.hour

First we convert the time to a datetime so we can manipulate it later, then add hour, month, and year columns for a more granular scope. Much better!


With sklearn’s linear regression we can train a model on the data and then test it for accuracy. We will drop the Temperature column from the feature matrix because that is the variable we want to predict.


y = df['Temperature']
X = df.drop(['Temperature', 'Data', 'Time', 'TimeSunRise', 'TimeSunSet','Time_conv',], axis=1)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
from sklearn.linear_model import LinearRegression
lm = LinearRegression(), y_train)

Now let’s predict the temperature given the features.


predictions = lm.predict(X_test)
pp.scatter(y_test, predictions)
pp.xlabel('Temperature Test')
pp.ylabel('Predicted Temperature')


The MSE and RMSE values tell us the model performed well, and as you can see there is a positive upward trend centered around the mean.

from sklearn import metrics

print(metrics.mean_squared_error(y_test, predictions))
print(np.sqrt(metrics.mean_squared_error(y_test, predictions)))

[screenshot: printed MSE and RMSE values]

If you like these blog posts, or want to comment or share something, do so below and follow py-guy!

Note: I referenced kaggler Sarah VCH’s notebook in making today’s blog post, specifically the feature engineering code in the fifth code block. If you want to see her notebook, I’ve listed the link below.