07-21-2017
Scikit-learn is a fantastic set of tools for machine learning in Python. It is built on NumPy, SciPy, and matplotlib, which were introduced in the first py-guy post, and it makes data analysis and visualization simple and intuitive. Scikit-learn provides algorithms for classification, regression, clustering, dimensionality reduction, model selection, and preprocessing, making data analysis in Python accessible to everyone. In this week's post we will cover an example of linear regression, exploring solar radiation data from a NASA hackathon.
First, after importing the packages, let’s read in the SolarPrediction.csv data set. The link to the data set is commented in the code block.
Taking a first look at the data set, note that UNIXTime and the date column (named Data in the file) are not parsed as datetime types, so we will come back to them later.
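The loading step might look roughly like the sketch below. The file name comes from this post; the two sample rows are made up so the snippet runs without the download, the helper name load_solar_data is hypothetical, and the Radiation and Humidity column spellings and the date layout are assumptions. The later snippets rely on the usual aliases: pd (pandas), np (numpy), sns (seaborn), pp (matplotlib.pyplot) and metrics (sklearn.metrics).

```python
import io

import pandas as pd

# Made-up rows for illustration only; the real file is SolarPrediction.csv.
# Column names follow the ones used in this post; the Radiation and Humidity
# spellings and the date layout are assumptions.
SAMPLE_CSV = """UNIXTime,Data,Time,Radiation,Temperature,Pressure,Humidity,TimeSunRise,TimeSunSet
1480000000,11/24/2016,12:00:00,500.0,55,30.40,60,06:30:00,17:45:00
1480000300,11/24/2016,12:05:00,510.0,56,30.41,59,06:30:00,17:45:00
"""


def load_solar_data(source):
    """Read the solar data from a file path or any file-like object."""
    return pd.read_csv(source)


# With the real download: df = load_solar_data("SolarPrediction.csv")
df = load_solar_data(io.StringIO(SAMPLE_CSV))

print(df.head())
# UNIXTime is read as a plain integer and Data/Time as generic strings,
# so none of them is a proper datetime yet.
print(df.dtypes)
```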
df.shape
df.describe()
Calling the describe method on the data frame returns descriptive statistics for the data set and suggests there might be a relationship between radiation, humidity and/or temperature.
So let’s look at a correlation plot to get a better feel for any possible relationships.
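# Correlate the numeric columns only; the date/time columns are still strings
truthmat = df.select_dtypes(include='number').corr()
sns.heatmap(truthmat, vmax=.8, square=True)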
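To read the same information off as numbers rather than colours, one option is to rank the correlations against a single column. This is a minimal sketch on made-up data; the values and the helper name rank_correlations are illustrative and not taken from the data set.

```python
import pandas as pd


def rank_correlations(frame, target):
    """Return Pearson correlations of every numeric column with `target`,
    strongest (by absolute value) first, excluding the target itself."""
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Made-up values purely for illustration.
toy = pd.DataFrame({
    "Radiation": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Temperature": [50.0, 52.0, 54.0, 56.0, 58.0],  # rises with Radiation
    "Humidity": [80.0, 60.0, 75.0, 55.0, 70.0],     # weaker, negative link
    "Time": ["06:00:00", "07:00:00", "08:00:00", "09:00:00", "10:00:00"],
})

ranked = rank_correlations(toy, "Radiation")
print(ranked)
```

The string column is skipped automatically, which is the same reason the heatmap is built from numeric columns only.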
There is a strong relationship between radiation and temperature (unsurprisingly or surprisingly), so let’s choose two features with some ambiguity. Pressure and Temperature will do fine. We will use seaborn, a statistical visualization library based on matplotlib, to explore the relationship between the two features.
p = sns.jointplot(x="Pressure", y="Temperature", data=df)
pp.subplots_adjust(top=.9)
p.fig.suptitle('Temperature vs. Pressure')
There is a clear positive trend, albeit a noisy one because of the low pressure gradient. Let’s do some quick feature engineering to get a better look at the trend.
# Convert time to datetime
df['Time_conv'] = pd.to_datetime(df['Time'], format='%H:%M:%S')
# Add column 'hour'
df['hour'] = pd.to_datetime(df['Time_conv'], format='%H:%M:%S').dt.hour
# Add column 'month'
df['month'] = pd.to_datetime(df['UNIXTime'].astype(int), unit='s').dt.month
# Add column 'year'
df['year'] = pd.to_datetime(df['UNIXTime'].astype(int), unit='s').dt.year
# Duration of day (whole hours between sunrise and sunset)
df['total_time'] = pd.to_datetime(df['TimeSunSet'], format='%H:%M:%S').dt.hour - pd.to_datetime(df['TimeSunRise'], format='%H:%M:%S').dt.hour
df.head()
Here we convert the time column to datetime so we can manipulate it later, then add hour, month and year columns for a more granular view. Much better!
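One thing to be aware of: subtracting .dt.hour values compares whole hours only, so the minutes of sunrise and sunset are ignored. If finer resolution matters, a possible alternative (not what the referenced notebook does) is to subtract the parsed timestamps and convert the resulting timedelta to hours. The column name day_length_hours and the sample times below are made up for illustration.

```python
import pandas as pd

# Made-up sunrise/sunset times for illustration.
toy = pd.DataFrame({
    "TimeSunRise": ["06:13:00", "06:45:00"],
    "TimeSunSet": ["18:13:00", "18:15:00"],
})

sunrise = pd.to_datetime(toy["TimeSunRise"], format="%H:%M:%S")
sunset = pd.to_datetime(toy["TimeSunSet"], format="%H:%M:%S")

# Whole-hour difference, as in the feature engineering above.
toy["total_time"] = sunset.dt.hour - sunrise.dt.hour

# Timedelta-based difference keeps the minutes (hypothetical column name).
toy["day_length_hours"] = (sunset - sunrise).dt.total_seconds() / 3600

print(toy)
```

In the second row the whole-hour version reports 12 while the actual gap is 11.5 hours, which shows how much detail the coarser feature can hide.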
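With sklearn’s linear regression we can fit a model to the data and then test the model for its accuracy. We will drop the Temperature column from the independent variables (the features) because it is the dependent variable we want to predict. The raw date and time columns are dropped as well, since they are not numeric.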
y = df['Temperature']
X = df.drop(['Temperature', 'Data', 'Time', 'TimeSunRise', 'TimeSunSet', 'Time_conv'], axis=1)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
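As a sanity check on what fit learns, the same split-and-fit steps can be run on synthetic data where the true relationship is known. This is a sketch with made-up data (y = 2x + 1 exactly), not the solar data set, and the variable names are illustrative; the fitted coefficient and intercept should come back as roughly 2 and 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, noise-free data: y = 2x + 1.
X_toy = np.arange(20, dtype=float).reshape(-1, 1)
y_toy = 2.0 * X_toy.ravel() + 1.0

# Hold out 30% of the rows for testing, as in the post.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=101
)

toy_model = LinearRegression()
toy_model.fit(X_tr, y_tr)

print(X_tr.shape, X_te.shape)
print(toy_model.coef_, toy_model.intercept_)
print(toy_model.score(X_te, y_te))  # R^2 on the held-out rows
```

On real data the coefficients will not be this clean, but lm.coef_ and lm.intercept_ can be inspected the same way after fitting.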
Now let’s predict the temperature given the features.
X.head()
predictions = lm.predict(X_test)
pp.scatter(y_test, predictions)
pp.xlabel('Temperature Test')
pp.ylabel('Predicted Temperature')
The MSE and RMSE values measure the size of the prediction error (the RMSE in the same units as temperature) and indicate that the model performed well. As you can see in the scatter plot, there is a positive upward trend centered around the mean.
from sklearn import metrics
print(metrics.mean_squared_error(y_test, predictions))           # MSE
print(np.sqrt(metrics.mean_squared_error(y_test, predictions)))  # RMSE
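For reference, both numbers can be reproduced by hand, and comparing them with a naive baseline that always predicts the mean gives a sense of scale. This is a sketch with made-up values; the arrays and the baseline comparison are illustrative additions, not results from the solar data.

```python
import numpy as np
from sklearn import metrics

# Made-up observed and predicted temperatures for illustration.
y_true = np.array([50.0, 52.0, 48.0, 55.0])
y_pred = np.array([51.0, 51.0, 49.0, 53.0])

# MSE is the mean of the squared errors; RMSE is its square root and is
# expressed in the same units as the target.
mse_manual = np.mean((y_true - y_pred) ** 2)
rmse_manual = np.sqrt(mse_manual)

# The library call should agree with the manual calculation.
mse_sklearn = metrics.mean_squared_error(y_true, y_pred)

# Baseline: always predict the mean of the observed values.
baseline = np.full_like(y_true, y_true.mean())
rmse_baseline = np.sqrt(metrics.mean_squared_error(y_true, baseline))

print(mse_manual, mse_sklearn)
print(rmse_manual, rmse_baseline)
```

An RMSE well below the baseline's suggests the features carry real information about the target; an RMSE close to it suggests the model is doing little better than guessing the average.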
If you like these blog posts or want to comment and/or share something, do so below and follow py-guy!
Note: I referenced Kaggler Sarah VCH’s notebook in making today’s blog post, specifically the feature engineering code in the fifth code block. If you want to see her notebook, I’ve listed the link below.
https://www.kaggle.com/sarahvch/investigating-solar-radiation