Monday, September 28, 2015

The Second Nanodegree Project

The second project is about using MongoDB, and the objective is to clean up part of the OpenStreetMap (OSM) data. Students are free to choose any part of the world for their project. Naturally, I picked my home town, Hong Kong.

Again, the programming language used in this project is Python, with MongoDB used to analyze the map dataset. Students should demonstrate that they can assess the quality of the data and update it where possible.

Besides some programming techniques, one interesting thing I learnt from this project is the vision behind OSM. While I always regarded Google Maps as a very open platform, frankly I had never thought about the copyright. In fact, Google decides what can be shown on the map. For example, if you write an Android application and use the Google Maps API for Android, you can display a map fragment on your screen and maybe add some overlays of your own. However, you can never access the raw data of the map. In short, you can use Google Maps data, probably for free, but only under restrictions imposed by Google. The blog post by the OSM contributor Serge Wroclawski is an interesting read on this topic.

As far as I know, there are two ways to get the OSM data for a city. The first is through mapzen.com, which lets you select a city and download the corresponding OSM data. The second is through the Overpass API, which requires you to enter the latitudes and longitudes that bound a rectangular area. Whichever way you choose, the OSM data you download should be an XML file like the following:

<?xml version="1.0" encoding="UTF-8"?>
<osm>
  <node changeset="19883770" id="274901" lat="22.3460512" lon="114.1811521" ... />
  <way changeset="25914878" id="4187007" timestamp="2014-10-07T10:55:01Z" ...>
    <nd ref="3049050712" />
    <nd ref="3049050713" />
    <nd ref="2481700725" />
  </way>
  ...
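
For the Overpass route, the bounding box has to be turned into a request. The following is only a sketch: it assumes the public endpoint https://overpass-api.de/api/map, which takes a bbox parameter ordered as west, south, east, north. The helper name bbox_url and the rough Hong Kong coordinates are my own illustrations, not something the project prescribes.

```python
from urllib.parse import urlencode

# Assumed public Overpass endpoint that returns raw OSM XML for a bounding box.
OVERPASS_MAP_URL = "https://overpass-api.de/api/map"


def bbox_url(min_lat, min_lon, max_lat, max_lon):
    """Build the download URL for a rectangular area (hypothetical helper)."""
    if min_lat >= max_lat or min_lon >= max_lon:
        raise ValueError("bounding box corners are in the wrong order")
    # The endpoint expects longitude first: west, south, east, north.
    bbox = ",".join(str(v) for v in (min_lon, min_lat, max_lon, max_lat))
    return OVERPASS_MAP_URL + "?" + urlencode({"bbox": bbox}, safe=",")


# Rough, illustrative bounds around Hong Kong.
url = bbox_url(22.15, 113.83, 22.56, 114.44)
print(url)
```

The URL can then be fetched with any HTTP client. A large area may be rejected or take a long time, so a smaller box is a safer first try.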

Walk through the OSM XML elements

I am only interested in the <node> and <way> elements of the OSM data, since they are all the project requires. Python's built-in xml package can walk through the XML tree quite easily. In the following snippet I only need to process the "start" event:

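import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

# osmfile is the path to the downloaded OSM XML file
osm_file = open(osmfile, "rb")
for event, elem in ET.iterparse(osm_file, events=("start",)):
  if elem.tag == "node" or elem.tag == "way":
    pass  # do something with elem.attrib
osm_file.close()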
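
As one possible "do something", here is a minimal sketch that counts the <node> and <way> elements. The inline sample document and the count_tags name are made up for illustration, and a real extract would be read from a file instead. Note that on a "start" event only the element's attributes are guaranteed to be available, not its children.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# A tiny stand-in for a real OSM extract (values are illustrative only).
SAMPLE_OSM = b"""<?xml version="1.0" encoding="UTF-8"?>
<osm>
  <node id="1" lat="22.3460512" lon="114.1811521" />
  <node id="2" lat="22.3460600" lon="114.1811600" />
  <way id="10">
    <nd ref="1" />
    <nd ref="2" />
  </way>
</osm>
"""


def count_tags(source):
    """Count <node> and <way> elements while streaming through the XML."""
    counts = Counter()
    for event, elem in ET.iterparse(source, events=("start",)):
        if elem.tag in ("node", "way"):
            counts[elem.tag] += 1
    return counts


tag_counts = count_tags(io.BytesIO(SAMPLE_OSM))
print(tag_counts)
```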

In the project, I need to read the data and write it into a JSON file, which can then be imported into a MongoDB database for further queries.
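
The conversion step could look roughly like the sketch below. Only the doc_type and created.user field names come from the queries later in this post. The rest of the document shape (pos, node_refs, the list of "created" attributes) and the shape_element name are assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Attributes grouped under "created" (an assumed split, not a fixed rule).
CREATED_KEYS = ("changeset", "timestamp", "user", "uid", "version")


def shape_element(elem):
    """Turn a <node> or <way> element into a JSON-friendly dict."""
    if elem.tag not in ("node", "way"):
        return None
    doc = {"doc_type": elem.tag, "id": elem.get("id"), "created": {}}
    for key in CREATED_KEYS:
        if key in elem.attrib:
            doc["created"][key] = elem.attrib[key]
    if "lat" in elem.attrib and "lon" in elem.attrib:
        doc["pos"] = [float(elem.attrib["lat"]), float(elem.attrib["lon"])]
    # Ways list the nodes they pass through as <nd ref="..."/> children.
    refs = [nd.get("ref") for nd in elem.findall("nd")]
    if refs:
        doc["node_refs"] = refs
    return doc


# A fully parsed element is used here so that the <nd> children are present.
way = ET.fromstring(
    '<way changeset="25914878" id="4187007" user="example_user">'
    '<nd ref="3049050712" /><nd ref="3049050713" /></way>'
)
shaped = shape_element(way)
line = json.dumps(shaped)  # one JSON document per line suits mongoimport
print(line)
```

Because the children are needed, a streaming version would have to shape each element on the "end" event rather than the "start" event.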

Examine the MongoDB data

I think the most useful knowledge I gained from this project is about MongoDB. It covers basic operations such as queries and updates, plus some use of the MongoDB aggregation pipeline. The pymongo module is a MongoDB client for Python. To connect to a MongoDB instance running locally, I just need the following:

from pymongo import MongoClient

# Get the MongoDB database instance by name
def get_db(db_name):
    client = MongoClient('localhost:27017')
    db = client[db_name]
    return db

I can then run some queries and aggregations like the following:

# Number of nodes
# (count_documents replaces the cursor count() that was removed in PyMongo 4)
> db.hongkong.count_documents({'doc_type':'node'})
877075

# Number of ways
> db.hongkong.count_documents({'doc_type':'way'})
93051

# Number of unique users
> len(db.hongkong.find().distinct('created.user'))
704

# Top 10 contributing users
> list(db.hongkong.aggregate([
{'$group':{'_id': '$created.user', 'count':{ '$sum':1}}},
{'$sort':{'count':-1}}, {'$limit':10}]))

[{u'_id': u'xxxx', u'count': 510158},
{u'_id': u'yyyy', u'count': 77302},
...]
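
To make the pipeline easier to follow, here is a rough pure-Python equivalent of the $group, $sort and $limit stages. The sample documents and the top_contributors name are invented for illustration, and the real work is still done by MongoDB on the server.

```python
from collections import Counter

# Invented sample documents that mimic the shape queried above.
docs = [
    {"doc_type": "node", "created": {"user": "alice"}},
    {"doc_type": "node", "created": {"user": "bob"}},
    {"doc_type": "way", "created": {"user": "alice"}},
    {"doc_type": "node", "created": {"user": "carol"}},
    {"doc_type": "way", "created": {"user": "bob"}},
    {"doc_type": "node", "created": {"user": "alice"}},
]


def top_contributors(documents, limit=10):
    """Group by created.user, sort by count descending, keep the first few."""
    counts = Counter(doc["created"]["user"] for doc in documents)  # $group
    ranked = counts.most_common(limit)                             # $sort + $limit
    return [{"_id": user, "count": n} for user, n in ranked]


top = top_contributors(docs, limit=2)
print(top)
```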

Conclusion

The second project of the Data Analyst programme is not difficult. All it requires is some basic skill in examining XML data with Python and running queries against a MongoDB database. It also asks you to provide some suggestions for improving OSM, but that is not difficult either, in my opinion. Just like any consultancy work, when you make a suggestion, don't forget to consider the potential issues and the trade-offs too.
