Sunday, September 27, 2015

Python and Linear Regression

To do my first project for the Udacity Data Analyst Nanodegree, I need to use Python to perform some statistical analysis and make a conclusion to accept or reject a null hypothesis.

Having been a Java developer for years, I could not really get used to Python's indentation-based structure. It took me some time to look for a good IDE. After some googling, I chose Sublime Text as my Python editor, which is fast and lightweight enough for my project.

Python is an interpreted language, and execution speed has always been one of its drawbacks. However, it is a very popular programming language now. I have not figured out all the reasons even after finishing this project, as I only needed to write several basic functions with some statistical packages (e.g. numpy, pandas, scipy, statsmodels, etc.). Well, maybe, as others have said, corporate backing (e.g. Google) is the most critical reason for its popularity.

As a tool for statistical analysis, I agree that Python is very convenient and handy. To read a .csv file into a DataFrame object, I can just write:

import pandas
data = pandas.read_csv('turnstile_weather_v2.csv') 
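After loading, a quick inspection helps confirm that the columns came in as expected. A minimal sketch, using a small synthetic DataFrame as a stand-in for the project's dataset (the column names below are borrowed from the code later in this post):

```python
import pandas

# A tiny stand-in for turnstile_weather_v2.csv, using the same column names
data = pandas.DataFrame({
    'UNIT': ['R003', 'R004', 'R003'],
    'hour': [0, 4, 8],
    'rain': [0, 1, 0],
    'ENTRIESn_hourly': [1200.0, 350.0, 980.0],
})

print(data.shape)          # (3, 4)
print(list(data.columns))  # ['UNIT', 'hour', 'rain', 'ENTRIESn_hourly']
print(data.describe())     # summary statistics of the numeric columns
```

`data.head()` and `data.dtypes` are also handy here for spotting columns that were parsed with the wrong type.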

Besides some inferential statistics, one task of this project is to build a linear regression model from a set of selected features. For example, given a dataset of New York subway ridership (i.e. entries per hour) and the corresponding features (i.e. variables like day of week, rain or not, wind speed, etc.), how can we predict the ridership from those input variables?

In the project I used the Ordinary Least Squares method provided by the statsmodels module. First I had to select a list of features which I assumed were related to the ridership. Then I also needed to add some dummy variables to take categorical data into consideration. The selected features and the dummy variables together form the full feature array that I pass to the statsmodels.api.OLS() method.

import numpy
import pandas
import statsmodels.api as sm

# Select features data
features = data[['rain', 'precipi', 'hour', 'weekday']]

# Add dummy variables (in this case, the UNIT variable)
dummy_units = pandas.get_dummies(data['UNIT'], prefix='unit')
features = features.join(dummy_units).values

# The dependent variable is ENTRIESn_hourly here
samples = data['ENTRIESn_hourly'].values

model = sm.OLS(samples, sm.add_constant(features))
result = model.fit()

# Get the intercept (a scalar) and the coefficients of the features
intercept = result.params[0]
params = result.params[1:]

# Finally we can get the predictions using the parameters
predictions = intercept + numpy.dot(features, params)

Basically this computes the predictions from the data using the fitted OLS parameters. Normally we also need to analyze the residuals to check whether our model is acceptable. A basic check is to look at a histogram of the residuals and a probability plot (Q-Q plot) of the residuals.

By working on this first Nanodegree project, I got a glimpse of the Python language and learnt a bit of data analysis. I think it will be a useful tool for my other projects in the future.
