Pypack – compact packaging and reusable configuration

10-15-2017

In this post I will talk about how to use pypack to write clean, reusable Python code. In larger, more complex applications, import statements tend to clutter the code and aren't practical to reuse across different projects. Say you want to code a data science app: you have your "go-to" packages like numpy, matplotlib, and math; for a web crawler it might be selenium, Beautiful Soup, and requests. With compact packaging and a reusable configuration, programming is streamlined.

First, specify the packages used in your program in a configuration file named 'config', declaring the imports and their 'as' statements as key-value lines separated by a blank line.

# config file
imports: 'math','json','collections','itertools','numpy','pandas','matplotlib.pyplot',''

statements: '','','','','np','pd','pp'

This specifies the list of imports pypack will pull into the development environment for your project.

# new.py
# packages from config file

import math
import json
import collections
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as pp

 

The code snippet above is the result of the configuration file contents listed at the beginning of the post. pypack is a simple program written in Python, in fewer than 37 lines of code, that reads the specified packages from the config file and writes them as import statements to a new Python file for your project.

import sys
# config file should be in same folder as pypack
# if not, specify
f=open('config','r')
s=f.read()

First, pypack opens the config file and reads its contents into memory.

# parse config file
s1=s.split('imports:')
s2=''.join(s1)
s3=s2.split('statements:')
s4=''.join(s3)
arr= s4.split(',')

 

Python syntax is such that assigning elements is as simple as wrapping a loop in brackets. The first four lines of the snippet above strip the 'imports:' and 'statements:' labels from the config text, and the final comma split turns what remains into one flat list; the next snippet separates that list into imports and their aliases.

# list comprehension of imports and statements
arr=[a for a in arr[:7]]
st=arr[-1].split('\n\n')[0]
arr[-1]=st
arr1= s4.split(',')[7:]
arr1.insert(0,' ')

Next, the imports and statements are separated: the first seven entries become the imports (with the trailing blank lines stripped from the last one), and the remaining entries become the 'as' aliases, with a blank placeholder inserted at the front to keep the two lists aligned.

# open a new, writable python file named by the command line argument
pyfile = open(sys.argv[1], 'w')
for i in range(len(arr)):
    if arr1[i] == ' ':
        pyfile.write('import ' + arr[i] + '\n')
    else:
        pyfile.write('import ' + arr[i] + ' as ' + arr1[i] + '\n')
pyfile.close()

Finally, pypack opens a new writable Python file and iterates through the two arrays, writing the import statements to the new file.
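As a usage sketch (the post does not name the script file, so pypack.py below is an assumed name), you would run pypack from the command line and pass the name of the Python file you want it to create as the first argument:

# pypack.py is a hypothetical name for the script above;
# new.py is the file that will be created with the imports from config
python pypack.py new.py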

If you like these blog posts, or want to comment and/or share something, do so below and follow py-guy!

 

Topic Discovery in python!

07/23/2017

So I still haven't figured out whether I want to make one blog post a week or more than one, but I will try to post at least once a week on topics in computer science. We'll see where it goes; it will be very exciting and most certainly worth the click.

This week I plan on exploring a data set of over 5,000 film entries scraped from IMDB in an effort to briefly discuss machine learning, particularly Latent Dirichlet Allocation. I will not go into any of the theory because that is beyond the scope of this blog; these aren't the droids you're looking for.

However, nltk and gensim provide extensive APIs for processing human language. Anything from stemming words down to their roots to tokenizing a document for further analysis is made easy with these modules.

 


import pandas as pd
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
import numpy as np
import matplotlib.pyplot as pp
import re

 

Let's start by reading in the CSV file, movie_metadata.csv. A link to the Kaggle download is commented in the code below.

 


## https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset
movie=pd.read_csv("movie_metadata.csv")
movie.head()

[Output of movie.head(): the first five rows of movie_metadata.csv]

 

Latent Dirichlet Allocation is used to estimate word-topic assignments, and how frequently words are assigned to each topic, across collections of words called documents, for a fixed number of topics. Let's assume each document exhibits multiple topics. So we will be looking at the plot_keywords and genres columns.

 


movie['plot_keywords']

 

Next, let's remove the pipe delimiters with some list comprehension and check that it worked.

 


keyword_strings=[str(d).replace("|"," ") for d in movie['plot_keywords']]
keyword_strings[1]

[Output: keyword_strings[1] with the pipes replaced by spaces]

Good!

 

Stemming reduces words to their root form and is particularly useful when developing insightful NLP models.
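For a quick feel of what the stemmer does before we apply it to the keywords, here is a tiny sketch (the example words are arbitrary and not from the data set):

from nltk.stem.porter import PorterStemmer

p = PorterStemmer()
print(p.stem('running'))        # run
print(p.stem('relationships'))  # relationship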

 

docs=[d for d in keyword_strings if d.count(' ')==5]
len(docs)
texts=[]

#create english stop words list
en_stop= get_stop_words('en')

# create p_stemmer of class PorterStemmer
# stemmer reduces words in a topic to its root word
p_stemmer= PorterStemmer()

# init regex tokenizer
tokenizer= RegexpTokenizer(r'\w+')

# for each document clean and tokenize document string,
# remove stop words from tokens, stem tokens and add to list
for i in docs:
  raw=i.lower()
  tokens=tokenizer.tokenize(raw)
  stopped_tokens=[i for i in tokens if not i in en_stop]
  stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
  texts.append(stemmed_tokens)

 

The next block of code transforms the token lists into structures we can manipulate later: a dictionary mapping each unique term to an integer id, and a bag-of-words matrix recording which terms appear in each document and how often.

 

# turn our tokenized docs into a key value dict
dictionary= corpora.Dictionary(texts)
# convert tokenized docs into a doc matrix
corpus=[dictionary.doc2bow(text) for text in texts]
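For intuition, doc2bow gives each document a sparse bag-of-words representation, a list of (token_id, count) pairs; the ids and counts in the comments below are only illustrative, since the real values depend on the data:

# a few token -> id mappings, e.g. ('love', 0), ('murder', 1), ...
print(list(dictionary.token2id.items())[:5])
# the first document as (token_id, count) pairs, e.g. [(0, 1), (1, 1), ...]
print(corpus[0])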

 

The next line of code trains the Latent Dirichlet Allocation model, taking the corpus, the number of topics, the id-to-word dictionary, and the number of training passes. Printing the model, we see an estimate of how strongly the most probable words contribute to each topic, effectively (or ineffectively) predicted.

 

ldamodel=gensim.models.ldamodel.LdaModel(corpus,num_topics=2,id2word=dictionary,passes=20)
print(ldamodel.print_topics(num_topics=2,num_words=5))

[Output of print_topics: the top five words and their weights for each of the two topics]

Let's parse this output into something we can handle. Each entry returned by print_topics is a (topic number, topic string) pair in which the topic string lists the top words and their weights, so below we pull the words and weights out with regular expressions, sort them for each topic, and then plot the data.

 

top=ldamodel.print_topics(num_topics=2,num_words=5)
topic_num=[]
topic_str=[]
topic_freq=[]

for a in top:
  topic_num.append(a[0])
  topic_str.append(" ".join(re.findall(r'"([^"]*)"',a[1])))
  w0,w1,w2,w3,w4=map(float, re.findall(r'[+-]?[0-9.]+', a[1]))
  tup=(w0,w1,w2,w3,w4)
  topic_freq.append(tup)

words0=topic_str[0].split(" ")
words1=topic_str[1].split(" ")
words=words0+words1

worddict0=dict(zip(words0,topic_freq[0]))
worddict1=dict(zip(words1,topic_freq[1]))

sorted_list0 = [(k,v) for v,k in sorted([(v,k) for k,v in worddict0.items()])]
sorted_list1 = [(k,v) for v,k in sorted([(v,k) for k,v in worddict1.items()])]
y_pos = np.arange(5)

freqs=[a[1] for a in sorted_list0]
ws=[a[0] for a in sorted_list0]
freqs1=[a[1] for a in sorted_list1]
ws1=[a[0] for a in sorted_list1]

pp.bar(y_pos, freqs, align='center', alpha=0.5, color=['coral'])
pp.xticks(y_pos, ws)
pp.ylabel('word contributions')
pp.title('Predicted Topic 0 from IMDB Plot Keywords')
pp.show()

pp.bar(y_pos, freqs1, align='center', alpha=0.5, color=['coral'])
pp.xticks(y_pos, ws1)
pp.ylabel('word contributions')
pp.title('Predicted Topic 1 from IMDB Plot Keywords')
pp.show()

[Figure: bar chart of word contributions for Predicted Topic 0 from IMDB Plot Keywords]

[Figure: bar chart of word contributions for Predicted Topic 1 from IMDB Plot Keywords]

This process can then be repeated for any genre of film in the IMDB data set.

If you like these blog posts, or want to comment and/or share something, do so below and follow py-guy!

Solar Radiation Prediction

07-21-2017

Scikit-learn is a fantastic set of tools for machine learning in Python. It is built on numpy, scipy, and matplotlib, introduced in the first py-guy post, and makes data analysis and visualization simple and intuitive. scikit-learn provides classification, regression, clustering, dimensionality reduction, model selection, and preprocessing algorithms, making data analysis in Python accessible to everyone. In this week's post we will cover an example of linear regression, exploring solar radiation data from a NASA hackathon.

First, after importing packages, let's read in the SolarPrediction.csv data set, which is available on Kaggle.
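Here is a minimal sketch of the setup, based on the aliases used later in the post (pd, np, pp for matplotlib.pyplot, sns for seaborn) and the SolarPrediction.csv filename:

## SolarPrediction.csv: Solar Radiation Prediction data set on Kaggle
import pandas as pd
import numpy as np
import matplotlib.pyplot as pp
import seaborn as sns

df = pd.read_csv("SolarPrediction.csv")
df.head()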


 

Taking a first look at the data set, specifically the UNIXTime and date columns, note that they are not formatted to a particular type, so we will come back to this later.

[Output of df.head() and df.shape]

 

df.shape
df.describe()

Calling the describe method on the data frame returns some descriptive statistics on the data set and suggests there might be a relationship between radiation, humidity, and/or temperature.

[Output of df.describe()]

So let’s look at a correlation plot to get a better feel for any possible relationships.

truthmat= df.corr()
sns.heatmap(truthmat, vmax=.8, square=True)

[Figure: correlation heatmap of the data frame]

There is a strong relationship between radiation and temperature (unsurprisingly, or surprisingly), so let's choose two features with some ambiguity. Pressure and Temperature will do fine. We will use seaborn, a statistical visualization library built on matplotlib, to explore the relationship between the two features.

p = sns.jointplot(x="Pressure", y="Temperature", data=df)
pp.subplots_adjust(top=.9)
p.fig.suptitle('Temperature vs. Pressure')

 

[Figure: Temperature vs. Pressure jointplot]

There is a clear positive trend, albeit a noisy one because of the small pressure gradient. Let's do some quick feature engineering to get a better look at the trend.

 

#Convert time to_datetime
df['Time_conv'] = pd.to_datetime(df['Time'], format='%H:%M:%S')

#Add column 'hour'
df['hour'] = pd.to_datetime(df['Time_conv'], format='%H:%M:%S').dt.hour

#Add column 'month'
df['month'] = pd.to_datetime(df['UNIXTime'].astype(int), unit='s').dt.month

#Add column 'year'
df['year'] = pd.to_datetime(df['UNIXTime'].astype(int), unit='s').dt.year

#Duration of Day
df['total_time'] = pd.to_datetime(df['TimeSunSet'], format='%H:%M:%S').dt.hour - pd.to_datetime(df['TimeSunRise'], format='%H:%M:%S').dt.hour
df.head()

First we convert the time column to datetime so it can be manipulated later, then add hour, month, and year columns for a more granular view. Much better!

[Output of df.head() with the new hour, month, year, and total_time columns]

With sklearn's linear regression we can train a model on the data and then test it for accuracy. We will drop the Temperature column from the feature matrix because it is the dependent variable we want to predict.

 

y = df['Temperature']
X = df.drop(['Temperature', 'Data', 'Time', 'TimeSunRise', 'TimeSunSet','Time_conv',], axis=1)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
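As an optional aside (not a step from the original walkthrough), you could also inspect the fitted coefficients to see how each remaining feature contributes to the predicted temperature:

# one coefficient per feature column in X, plus the intercept
coef_df = pd.DataFrame(lm.coef_, X.columns, columns=['coefficient'])
print(coef_df)
print(lm.intercept_)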

Now let’s predict the temperature given the features.

 

X.head()
predictions = lm.predict( X_test)
pp.scatter(y_test,predictions)
pp.xlabel('Temperature Test')
pp.ylabel('Predicted Temperature')

[Figure: scatter plot of Temperature Test vs. Predicted Temperature]

The MSE and RMSE values tell us the model performed well, and as you can see in the scatter plot there is a positive upward trend centered around the mean.

from sklearn import metrics

print(metrics.mean_squared_error(y_test, predictions))
print(np.sqrt(metrics.mean_squared_error(y_test, predictions)))

[Output: the MSE and RMSE values]

If you like these blog posts, or want to comment and/or share something, do so below and follow py-guy!

Note: I referenced Kaggler Sarah VCH's notebook in making today's blog post, specifically the feature engineering code in the fifth code block. If you want to see her notebook, I've listed the link below.

https://www.kaggle.com/sarahvch/investigating-solar-radiation