Monday, September 28, 2015

The Second Nanodegree Project

The second project is about using MongoDB, and the objective is to clean up part of the OpenStreetMap (OSM) data. Students can freely choose any part of the world for their project. Naturally, I selected my home town - Hong Kong.

Again, the programming language used in this project is Python, with MongoDB as the tool for analyzing the map dataset. Students should demonstrate the ability to assess the quality of the data and to update it where possible.

Besides some programming techniques, one interesting thing I have learnt from this project is the vision of OSM. While I always regarded Google Maps as a very open platform, frankly I had never thought about the copyright. In fact, Google decides what can be shown in the map. For example, if you write an Android application and use the Google Maps API for Android, you can display a map fragment on your screen and maybe add some overlays of your own. However, you can never access the raw data of the map. So, in short, you can use Google Maps data, probably for free, but with some restrictions imposed by Google. The blog post by OSM advocate Serge Wroclawski is an interesting read on this topic.

As far as I know, there are two ways to get the OSM data for a city. The first is through mapzen.com, which lets you select a city and download the corresponding OSM extract. The second is through the Overpass API, which requires you to enter the latitude and longitude of a rectangular area. Either way, the OSM data you download should be an XML file like the following:

<?xml version="1.0" encoding="UTF-8"?>
<osm>
  <node changeset="19883770" id="274901" lat="22.3460512" lon="114.1811521" ... />
  <way changeset="25914878" id="4187007" timestamp="2014-10-07T10:55:01Z" ...>
    <nd ref="3049050712" />
    <nd ref="3049050713" />
    <nd ref="2481700725" />
  </way>
  ...
</osm>
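
For example, the Overpass API provides a map call that takes a bounding box. Here is a minimal sketch of downloading such a file (the coordinates below only roughly cover Hong Kong and are illustrative, not the exact ones I used):

import urllib

# Bounding box in min-longitude,min-latitude,max-longitude,max-latitude order
# (illustrative coordinates roughly covering Hong Kong)
bbox = "113.8,22.1,114.4,22.6"
urllib.urlretrieve("http://overpass-api.de/api/map?bbox=" + bbox, "hong-kong.osm")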

Walk through the OSM XML elements

To handle the project requirements, I am only interested in the <node> and <way> elements of the OSM data. The Python xml module can be used to walk through the XML tree quite easily. In the following snippet, I only need to process the "start" event:

import xml.etree.cElementTree as ET

osmfile = "hong-kong.osm"  # path to the downloaded OSM XML file
osm_file = open(osmfile, "r")
for event, elem in ET.iterparse(osm_file, events=("start",)):
    if elem.tag == "node" or elem.tag == "way":
        pass  # process the <node> or <way> element here

In the project, I need to read the data and write it into a JSON file, which can then be imported into a MongoDB database for further queries.
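
As a rough sketch of that step, the shaping could look like the following (the doc_type field and the created sub-document match the queries shown later; the helper names and the "end"-event processing are my own choices):

import json
import xml.etree.cElementTree as ET

CREATED_ATTRS = ["user", "uid", "version", "changeset", "timestamp"]

def shape_element(elem):
    # Turn a <node> or <way> element into a plain dictionary
    doc = {"doc_type": elem.tag, "created": {}}
    for key, value in elem.attrib.items():
        if key in CREATED_ATTRS:
            doc["created"][key] = value
        else:
            doc[key] = value
    if elem.tag == "way":
        # Collect the ids of the nodes that make up the way
        doc["node_refs"] = [nd.attrib["ref"] for nd in elem.iter("nd")]
    return doc

def process_map(osmfile, jsonfile):
    # Use the "end" event so that child elements (e.g. <nd>) are fully
    # parsed before the element is shaped
    with open(jsonfile, "w") as out:
        for event, elem in ET.iterparse(osmfile, events=("end",)):
            if elem.tag in ("node", "way"):
                out.write(json.dumps(shape_element(elem)) + "\n")
                elem.clear()  # free memory when processing large files

The resulting file, with one JSON document per line, can then be loaded with mongoimport, e.g. mongoimport -d osm -c hongkong --file hongkong.json (the database and collection names here are my assumption).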

Examine the MongoDB data

I think the most useful knowledge I gained from this project is related to MongoDB. It includes some basic operations like query and update, as well as some usage of the MongoDB aggregation pipeline. The pymongo module is a MongoDB client for Python. To connect to a locally running MongoDB instance, I just need to:

from pymongo import MongoClient

# Get the MongoDB database instance by name
def get_db(db_name):
    client = MongoClient('localhost:27017')
    db = client[db_name]
    return db
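
With this helper, the db object used in the queries below is obtained like this (the database name osm is my assumption; hongkong is the collection holding the imported documents):

db = get_db('osm')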

I can then run some queries and aggregations like the following:

# Number of nodes
> db.hongkong.find({'doc_type':'node'}).count()
877075

# Number of ways
> db.hongkong.find({'doc_type':'way'}).count()
93051

# Number of unique users
> len(db.hongkong.find().distinct('created.user'))
704

# Top 10 contributing users
> list(db.hongkong.aggregate([
    {'$group': {'_id': '$created.user', 'count': {'$sum': 1}}},
    {'$sort': {'count': -1}},
    {'$limit': 10}]))

[{u'_id': u'xxxx', u'count': 510158},
{u'_id': u'yyyy', u'count': 77302},
...]
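
Basic updates work in a similar way. For example, to fix an inconsistent street-name abbreviation across the collection (a minimal sketch using pymongo 3's update_many; the address.street field and the values here are illustrative, not actual findings from my audit):

# Replace an abbreviated street type with its full form
db.hongkong.update_many(
    {'address.street': 'Nathan Rd'},
    {'$set': {'address.street': 'Nathan Road'}})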

Conclusion

The second project of the Data Analyst programme is not difficult. All it requires is some basic skill in examining XML data with Python and running queries against a MongoDB database. It also asks you to suggest some improvements to OSM, but that is not difficult either, in my opinion. Just like any consultancy work, when you make a suggestion, don't forget to consider the potential issues and the trade-offs too.

Sunday, September 27, 2015

Python and Linear Regression

To do my first project for the Udacity Data Analyst Nanodegree, I need to use Python to perform some statistical analysis and draw a conclusion on whether to reject a null hypothesis.

As a Java developer for years, I cannot really get used to Python's indentation-based structure. It took me some time to look for a good IDE. After some googling, I chose Sublime Text as my Python editor, which is fast and lightweight enough for my project.

Python is an interpreted language, and execution speed has always been one of its drawbacks. However, it is a very popular programming language now. I have not yet figured out the reasons even after finishing this project, as I only needed to write several basic functions with some statistical packages (e.g. numpy, pandas, scipy, statsmodels, etc.). Well, maybe, as others have said, corporate backing (i.e. Google) is the most critical reason for its popularity.

As a tool for statistical analysis, I agree that Python is very convenient and handy. To read a .csv file into a DataFrame object, I can just:

import pandas
data = pandas.read_csv('turnstile_weather_v2.csv') 

Other than some inferential statistics work, one task of this project is to create a linear regression model based on a set of selected features. For example, if we have a dataset of New York subway ridership (i.e. entries per hour) and the corresponding features (i.e. variables like day of week, rain or not, wind speed, etc.), how can we predict the ridership given the input variables?

In the project I used the Ordinary Least Squares (OLS) method provided by the statsmodels module. First I had to select a list of features which I assumed to be related to the ridership. Then I also needed to add some dummy variables in order to take categorical data into consideration. The selected features and the dummy variables together form the full feature array which I pass to the statsmodels.api.OLS() method.

import numpy
import pandas
import statsmodels.api as sm

# Select features data
features = data[['rain', 'precipi', 'hour', 'weekday']]

# Add dummy variables (in this case, the UNIT variable)
dummy_units = pandas.get_dummies(data['UNIT'], prefix='unit')
features = features.join(dummy_units).values

# The dependent variable is ENTRIESn_hourly here
samples = data['ENTRIESn_hourly'].values

model = sm.OLS(samples, sm.add_constant(features))
result = model.fit()

# Get the intercept and the coefficients of the features
intercept = result.params[0]
params = result.params[1:]

# Finally we can get the predictions using the parameters
predictions = intercept + numpy.dot(features, params)

Basically it computes the predictions using the OLS parameters. Normally we need to analyze the residuals to check whether our model is acceptable. A basic check is to look at a histogram of the residuals and a probability plot of the residuals. For details, please refer to Q-Q plot.
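
As a minimal sketch of that residual check (using matplotlib and scipy, which are not otherwise used in this project's code):

import matplotlib.pyplot as plt
import scipy.stats

# Residuals are the differences between observed and predicted values
residuals = samples - predictions

# Histogram of the residuals
plt.figure()
plt.hist(residuals, bins=50)
plt.title('Histogram of residuals')

# Probability (Q-Q) plot of the residuals against the normal distribution
plt.figure()
scipy.stats.probplot(residuals, dist='norm', plot=plt)
plt.show()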

By working on this first Nanodegree project, I got a glimpse of the Python language and learnt a bit about data analysis. I think it will be a useful tool for my other projects in the future.