First blog post ~ python packages

07-14-2017

Welcome to py-guy! The py-guy blog explores science, culture and technology with simple examples and thoughtful discussions. For this first post I will talk about why Python is a useful programming language and show some nifty things it can do while exploring the MoMA data set. The Museum of Modern Art collection data set lists the title, artist, date, medium, etc. of every artwork in the museum, which makes it a good fit for the scope of this post. To download the data set and run your own analysis, use the link below.

https://www.kaggle.com/momanyc/museum-collection

Python supports every stage of data manipulation, and the matplotlib, numpy, and pandas packages streamline data analysis. At first I felt cheated that I could just import a package to run all the calculations without knowing what was going on under the covers, but after my first few modules I can say these packages are powerful components in the py-guy toolbox.

import collections
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib.pyplot as pp

# column labels used throughout this post, in file order
cols=['id','title','artist-id','name','date','medium','dimensions','aquisition-date','credit','catalogue','department','classification','object-number','diameter','circumference','height', 'length', 'width', 'depth', 'weight', 'duration']
# header=0 assumes the first row of artworks.csv is a header row and replaces it with cols
arts=pd.read_csv("artworks.csv",names=cols,header=0,dtype='str')
arts.head()
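When `names` is passed to `read_csv`, whether the file's own header row is skipped depends on the `header` argument. Here is a minimal sketch on a made-up three-column sample (the sample rows and the variable names `sample_csv` and `sample` are assumptions for illustration, not taken from the real file):

```python
import io
import pandas as pd

# Hypothetical sample standing in for artworks.csv
sample_csv = io.StringIO(
    "Artwork ID,Title,Date\n"
    "1,Untitled,1896\n"
    "2,Study,c. 1900\n"
)

# header=0 marks the first row as a header and swaps in our own labels;
# without it, that header row would be read as an ordinary data row.
sample = pd.read_csv(sample_csv, names=['id', 'title', 'date'], header=0, dtype=str)

print(sample.shape)
print(sample.columns.tolist())
```

Reading everything as strings keeps values such as "c. 1900" intact until you decide how to convert them.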

Pandas data frames have a sort_values method that sorts rows in ascending or descending order. Pandas builds on numpy by adding data labels with descriptive indices, robust handling of common data formats and missing data, and relational database operations.
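As a quick sketch of sorting in both directions, here is a tiny made-up frame with one missing date (the frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical mini data set with one missing date
frame = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'date': [1950.0, None, 1896.0],
})

# sort_values returns a new, sorted frame and leaves `frame` untouched
oldest_first = frame.sort_values('date')
newest_first = frame.sort_values('date', ascending=False)

# Missing dates go to the end in both directions (na_position='last' by default)
print(oldest_first['title'].tolist())
print(newest_first['title'].tolist())
```

Because the result is a new frame, assign it to a name if you want to keep the sorted order.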


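# arts is already a DataFrame, so this wrapper is optional
df=pd.DataFrame(arts)
# the dates are still strings here, so this sort is alphabetical
df.sort_values('date')
# convert to numbers; anything that is not a plain number becomes NaN
df['date']=pd.to_numeric(df['date'], errors='coerce')
# now the sort is chronological (sort_values returns a new frame)
df.sort_values('date')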

romanticism= df[(df['date']>=1790) & (df['date']<=1880)]
modern= df[(df['date']>=1860) & (df['date']<=1945)]
contemporary= df[(df['date']>=1946) & (df['date']<=2017)]

df1=romanticism.sort_values('date')
df1[-5:] # check if successful
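To see what `errors='coerce'` and the boolean masks are doing, here is a small sketch on made-up date strings (the sample values are assumptions about the kind of entries a museum record might hold):

```python
import pandas as pd

# Hypothetical date strings, including some that are not plain years
raw = pd.Series(['1896', 'c. 1900', '1976-77', '1950', None])

# Entries that cannot be parsed as a number become NaN
years = pd.to_numeric(raw, errors='coerce')

# Comparisons against NaN are False, so those rows drop out of the mask
modern_mask = (years >= 1860) & (years <= 1945)
modern_years = years[modern_mask].tolist()
print(modern_years)

# Series.between is an inclusive shorthand for the same range test
same_mask = years.between(1860, 1945)
```

Note that coercion silently discards approximate dates like "c. 1900", so the period counts only cover works with a clean numeric year.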

Next, use matplotlib to plot a histogram of the dates, with one bin per year across the range of the art periods.

 

# drop the NaN dates left behind by to_numeric
dat=df['date'].dropna()

# set plot: one bin per year
pp.hist(dat,bins=range(1790,2017))
pp.ylabel('Number of Artworks')
pp.xlabel('Year')
pp.title('Artworks per Year')
# show this figure now so the next chart starts on fresh axes
pp.show()

[Figure: histogram of artworks per year]
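If you want the numbers behind a histogram rather than the picture, `np.histogram` does the same binning without plotting. A rough sketch with made-up years (the sample array is hypothetical):

```python
import numpy as np

# Hypothetical years, with NaN marking an unparseable date
years = np.array([1896.0, 1896.0, 1900.0, np.nan, 1950.0])

# Drop NaN before binning
valid = years[~np.isnan(years)]

# One bin per year: edges 1895, 1896, ..., 1951
counts, edges = np.histogram(valid, bins=range(1895, 1952))

# Look up the bin that starts at 1896
in_1896 = counts[edges[:-1] == 1896][0]
print(in_1896, counts.sum())
```

Passing a `range` as `bins` gives explicit bin edges, which is the same idea as the `bins=range(...)` argument to `pp.hist`.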

Python is expressive in its readability and simplicity. In only a few lines of code you can read, manipulate and plot data.

 

# according to Wikipedia, art periods are defined by the development
# of the work of an artist, a group of artists or an art movement
# Romanticism: 1790 - 1880
# Modern art: 1860 - 1945
# Contemporary art: 1946 - present
# (Romanticism and Modern overlap, so works dated 1860-1880 are counted in both)

periods = ('Romanticism','Modern','Contemporary')
y_pos = np.arange(len(periods))
# len() counts rows (artworks); .size would count rows times columns
counts = [len(romanticism),len(modern),len(contemporary)]

pp.bar(y_pos, counts, align='center', alpha=0.5, color=['coral','yellow','teal'])
pp.xticks(y_pos, periods)
pp.ylabel('Artworks')
pp.title('Pieces per Movement')

pp.show()

[Figure: bar chart of pieces per movement]
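One thing worth knowing when counting rows: `DataFrame.size` is the number of cells (rows times columns), while `len()` and `shape[0]` give the number of rows. A tiny sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical frame: 3 artworks, 2 columns
frame = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'date': [1896, 1900, 1950],
})

n_cells = frame.size       # rows * columns
n_rows = len(frame)        # number of artworks
n_rows_alt = frame.shape[0]

print(n_cells, n_rows, n_rows_alt)
```

For a chart of pieces per movement, the row count is the figure that matches the axis label.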

Collections and list comprehensions are another powerful part of what Python has to offer. I will write another post on them, but for now here is a quick example illustrating their utility.


# collect the artist names with a list comprehension
nam=[n for n in df['name']]

# Counter was imported above with: from collections import Counter
name_art=Counter(nam)
# above line is equivalent to collections.Counter(nam)

# the 10 artists with the most artworks, as (name, count) pairs
mc=name_art.most_common(10)

artists=[pair[0] for pair in mc]
common_arts=[pair[1] for pair in mc]
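To make the shape of `most_common` concrete, here is a small sketch on a made-up list of names (the names are hypothetical):

```python
from collections import Counter

# Hypothetical artist names, one entry per artwork
names = ['Ada', 'Ben', 'Ada', 'Cy', 'Ada', 'Ben']

# most_common(n) returns the n highest counts as (item, count) pairs
top = Counter(names).most_common(2)
print(top)  # [('Ada', 3), ('Ben', 2)]

# zip(*pairs) splits the pairs into two parallel tuples
artists, totals = zip(*top)
print(artists, totals)
```

The `zip(*top)` idiom is an alternative to the two list comprehensions; both give one sequence of labels and one of counts, ready for plotting.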

Let’s try a horizontal bar chart with ‘barh.’

y_pos = np.arange(len(common_arts))
pp.figure(figsize=(10, 3))
pp.barh(y_pos, common_arts, align='center', alpha=0.5)
pp.yticks(y_pos, artists)
pp.xlabel('Number of Artworks')
pp.title('Top 10 Artists with the Most Pieces in MoMA')
pp.show()

 

[Figure: horizontal bar chart of the top ten artists by number of pieces]

The same process can be repeated for other variables and scopes, with some interesting results.

# reload the data; header=0 assumes the first row of artworks.csv is a header row
arts=pd.read_csv("artworks.csv",names=['id','title','artist-id','name','date','medium','dimensions','aquisition-date','credit','catalogue','department','classification','object-number','diameter','circumference','height', 'length', 'width', 'depth', 'weight', 'duration'],header=0,dtype='str')
df=pd.DataFrame(arts)

cls=[c for c in df['classification']]
cls_count=collections.Counter(cls)

clsCol=cls_count.most_common()
clsArr= [c[0] for c in clsCol]
numCls=[c[1] for c in clsCol]
y_pos = np.arange(len(clsArr))

pp.figure(figsize=(10, 20))
pp.barh(y_pos, numCls, align='center', alpha=0.5)
pp.yticks(y_pos,clsArr)
pp.ylabel('Classification')
pp.xlabel('Number of Artworks')
pp.title('Classification of Artworks')
pp.show()

[Figure: horizontal bar chart of artworks per classification]
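Pandas also has a built-in shortcut for this kind of tally. A minimal sketch using `value_counts` on a made-up column (the classification values here are hypothetical):

```python
import pandas as pd

# Hypothetical classification column with one missing entry
classes = pd.Series(['Print', 'Photograph', 'Print', None, 'Drawing', 'Print'])

# value_counts sorts by count, largest first, and skips missing values by default
tally = classes.value_counts()

top_label = tally.index[0]
top_count = tally.iloc[0]
print(top_label, top_count, tally.sum())
```

This is an alternative to `collections.Counter`, not what the code above uses; the index holds the labels and the values hold the counts.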
