Monday, April 17, 2017

A Simple Template for Machine Learning in Python


The following shows a simple workflow for doing machine learning in Python:

  1. Load the dataset
  2. Split the dataset into train and test subsets
  3. Create a classifier for the classification task
  4. Fit the classifier on the training subset
  5. Predict labels for the test subset
  6. Evaluate the accuracy

from sklearn import datasets
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def train():
    # Load your data set, e.g. the sklearn digits dataset
    digits = datasets.load_digits()

    # Split the data set into random train and test subsets
    features_train, features_test, labels_train, labels_test = \
        train_test_split(digits.data, digits.target, test_size=0.3, random_state=42)

    # Create a classifier, e.g. a DecisionTree classifier
    classifier = DecisionTreeClassifier(random_state=11)

    # Fit the train dataset in the classifier
    classifier.fit(features_train, labels_train)

    # Use the trained model to make predictions against the test dataset
    predictions = classifier.predict(features_test)

    # Calculate the prediction accuracy
    f1_score = metrics.f1_score(labels_test, predictions, average="macro")
    accuracy = metrics.accuracy_score(labels_test, predictions)

    print "F1 score = ", f1_score
    print "Accuracy = ", accuracy

Monday, September 28, 2015

The Second Nanodegree Project

The second project is about the use of MongoDB, and the objective is to clean up a part of the OpenStreetMap (OSM) data. Students can freely choose a part of the world for their project. For me, of course, I selected my home town, Hong Kong.

Again, the programming language used in this project is Python, with MongoDB used to analyze the map dataset. Students should demonstrate the ability to assess the quality of the data and to update it where possible.

Besides some programming techniques, one interesting thing I have learnt from this project is the vision of OSM. While I had always regarded Google Maps as a very open platform, frankly I had never thought about the copyright. In fact, Google decides what can be shown on the map. For example, if you write an Android application and use the Google Maps API for Android, you can display a map fragment on your screen and maybe add some of your own overlays. However, you can never access the raw data of the map. So, in short, you can use Google Maps data, probably for free, but that comes with restrictions imposed by Google. The blog post by OSM advocate Serge Wroclawski is an interesting read on this topic.

As far as I know, there are two ways to get the OSM data for a city. The first way is through mapzen.com, which allows you to select a city and download the corresponding OSM data. The second way is through the Overpass API, which requires you to enter the latitude and longitude bounds of a rectangular area.
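
For example, a map request to the Overpass API looks something like the following (the coordinates are only a rough illustration of a Hong Kong bounding box, given as minimum longitude, minimum latitude, maximum longitude, maximum latitude):

    http://overpass-api.de/api/map?bbox=114.10,22.25,114.35,22.45

Whichever way you choose, the OSM data you download should be an XML file like the following: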

<?xml version="1.0" encoding="UTF-8"?>
<osm>
  <node changeset="19883770" id="274901" lat="22.3460512" lon="114.1811521" ... />
  <way changeset="25914878" id="4187007" timestamp="2014-10-07T10:55:01Z" ...>
    <nd ref="3049050712" />
    <nd ref="3049050713" />
    <nd ref="2481700725" />
  </way>
  ...

Walk through the OSM XML elements

To handle the project requirements, I am only interested in the <node> and <way> elements of the OSM data. The Python xml module can be used to walk through the XML tree quite easily. In the following snippet I only need to process the "start" events:

import xml.etree.cElementTree as ET

osm_file = open(osmfile, "r")  # osmfile is the path to the downloaded OSM file
for event, elem in ET.iterparse(osm_file, events=("start",)):
  if elem.tag == "node" or elem.tag == "way":
    pass  # process the element here

In the project, I need to read the data and write it into a JSON file, which can be imported to a MongoDB database for further queries.
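
The exact document layout is project specific and not shown in this post, so the fields below (e.g. doc_type and the created sub-document) are only my assumption, but a minimal sketch that converts the elements and writes one JSON document per line could look like this:

import json
import xml.etree.cElementTree as ET


def shape_element(elem):
    # Keep only the attributes needed for the MongoDB import
    doc = {"doc_type": elem.tag,
           "id": elem.get("id"),
           "created": {"user": elem.get("user"),
                       "changeset": elem.get("changeset"),
                       "timestamp": elem.get("timestamp")}}
    if elem.tag == "node":
        doc["pos"] = [float(elem.get("lat")), float(elem.get("lon"))]
    return doc


def osm_to_json(osmfile, jsonfile):
    with open(jsonfile, "w") as out:
        for _, elem in ET.iterparse(osmfile, events=("start",)):
            if elem.tag in ("node", "way"):
                # One JSON document per line, ready for mongoimport
                out.write(json.dumps(shape_element(elem)) + "\n")

The resulting file can then be imported with something like mongoimport -d osm -c hongkong --file hongkong.json (the database and collection names here are just examples).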

Examine the MongoDB data

I think the most useful knowledge I got from this project is related to MongoDB. It includes basic operations like query and update, as well as some usage of the MongoDB aggregation pipeline. The pymongo module is a MongoDB client for Python. To connect to a locally running MongoDB instance, I just need to:

from pymongo import MongoClient

# Get the MongoDB database instance by name
def get_db(db_name):
    client = MongoClient('localhost:27017')
    db = client[db_name]
    return db

I can then run some queries and aggregation like the following:

# Number of nodes
> db.hongkong.find({'doc_type':'node'}).count()
877075

# Number of ways
> db.hongkong.find({'doc_type':'way'}).count()
93051

# Number of unique users
> len(db.hongkong.find().distinct('created.user'))
704

# Top 10 contributing users
> list(db.hongkong.aggregate([
{'$group':{'_id': '$created.user', 'count':{ '$sum':1}}},
{'$sort':{'count':-1}}, {'$limit':10}]))

[{u'_id': u'xxxx', u'count': 510158},
{u'_id': u'yyyy', u'count': 77302},
...]

Conclusion

The second project of the Data Analyst programme is not difficult. All it requires is some basic skill in examining XML data with Python and running queries against a MongoDB database. It also asks you to provide some suggestions to improve OSM, but that is not difficult either, in my opinion. Just like any consultancy work, when you give suggestions, don't forget to consider the potential issues and trade-offs too.

Sunday, September 27, 2015

Python and Linear Regression

To do my first project for the Udacity Data Analyst Nanodegree, I need to use Python to perform some statistical analysis and draw a conclusion on whether to reject a null hypothesis.

As a Java developer for years, I cannot really get used to Python's indentation-based structure. It took me some time to look for a good IDE. After some googling, I chose Sublime Text as my Python editor, which is fast and lightweight enough for my project.

Python is an interpreted language, and execution speed has always been one of its drawbacks. However, it is a very popular programming language now. I have not yet figured out the reasons even after finishing this project, as I only needed to write several basic functions with some statistical packages (e.g. numpy, pandas, scipy, statsmodels, etc.). Well, maybe as others have said, corporate backing (e.g. by Google) is the most critical reason for its popularity.

As a tool for statistical analysis, I agree that Python is very convenient and handy. To read a .csv file into a DataFrame object, I can just:

import pandas
data = pandas.read_csv('turnstile_weather_v2.csv') 

Other than some inferential statistics work, one task of this project is to create a linear regression model based on a set of selected features. For example, if we have a dataset of New York subway ridership (i.e. entries per hour) and the corresponding features (variables like day of week, rain or not, wind speed, etc.), how can we predict the ridership given the input variables?

In the project I used the Ordinary Least Squares (OLS) method provided by the statsmodels module. First I had to select a list of features which I assumed are related to the ridership. Then I added some dummy variables in order to take the categorical data into account. The selected features and the dummy variables together form the full feature array that is passed to the statsmodels.api.OLS() method.

import numpy
import pandas
import statsmodels.api as sm

# Select features data
features = data[['rain', 'precipi', 'hour', 'weekday']]

# Add dummy variables (in this case, the UNIT variable)
dummy_units = pandas.get_dummies(data['UNIT'], prefix='unit')
features = features.join(dummy_units).values

# The dependent variable is ENTRIESn_hourly here
samples = data['ENTRIESn_hourly'].values

model = sm.OLS(samples, sm.add_constant(features))
result = model.fit()

# Get the intercept vector and parameters of features
intercept = result.params[0]
params = result.params[1:]

# Finally we can get the predictions using the parameters
predictions = intercept + numpy.dot(features, params)

Basically this computes the predictions using the fitted OLS parameters. Normally we need to analyze the residuals to check whether our model is acceptable. A basic check is to look at a histogram of the residuals and at a probability plot of the residuals. For details, please refer to Q-Q plots.
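
As a rough illustration (this is not part of the project code), both checks can be done with matplotlib and scipy:

import matplotlib.pyplot as plt
import scipy.stats

# Residuals are the differences between the observed and predicted values
residuals = samples - predictions

# Histogram of the residuals
plt.figure()
plt.hist(residuals, bins=50)
plt.title("Residuals")

# Probability (Q-Q) plot of the residuals against a normal distribution
plt.figure()
scipy.stats.probplot(residuals, dist="norm", plot=plt)
plt.show()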

By working on this first Nanodegree project I got a glimpse of the Python language and learnt a bit of data analysis skill, which I think will be a useful tool for my other projects in the future.

Wednesday, July 9, 2014

Angular + PaperJS

In this sample I would like to demonstrate how to write an AngularJS directive which integrates with the PaperJS library.

One very powerful feature of AngularJS is its capability to extend the HTML vocabulary through directives. Once the Angular library has been loaded in a web page, it walks through the page and compiles the embedded directives into the DOM. Usually a directive is implemented either as an HTML element or as an attribute, for example,

<my-directive />
<div my-directive />  

Of course directives can also be declared in other forms, such as class and comment, but I won't go into the details here. In the following sample, I will implement my directive as an attribute.

At the time of writing, I cannot find any official Angular directive for drawing on the HTML canvas. Besides, I would like to make use of a JavaScript graphics library (e.g. PaperJS, KineticJS, etc.) so that I do not need to deal with low-level graphics details. There are some good hints on handling a canvas inside a directive on Stack Overflow. Here I chose PaperJS for my experiment.

1. The HTML content


In the HTML page we need to specify the Angular application name. In the body section, I specify a controller and include a canvas for our drawing.

<html ng-app="angularPaper">
   <head>
      <link href="css/bootstrap.min.css" rel="stylesheet" type="text/css"></link>
      <link href="css/stdtheme.css" rel="stylesheet" type="text/css"></link>
      <script src="js/angular.min.js"></script>
      <script src="js/ui-bootstrap-tpls-0.11.0.min.js"></script>
      <script src="js/paper.js"></script>
      <script src="js/goserver.js"></script>
      <script src="js/app.js"></script>
   </head>
   <body>
      <div role="main">
      <div class="container" ng-controller="BoardCtrl">
      <canvas board class="board-frame" height="500" width="500"></canvas>
      </div>
      </div>
   </body>
</html>

My application is called angularPaper, and later we need to configure that in our Angular code in app.js. goserver.js is another JavaScript file which contains the code related to the Go board. Yes, I would like to draw a Go board. If you have no idea what this great game is, you should go and find out about it on Wikipedia now!

In the above HTML, you will see that we enhance the canvas with a board attribute. This is not a standard attribute of the canvas element; we will implement it with an Angular directive.

Basically a Go board is a 19x19 grid (see the image on Wikipedia). There are also 13x13 and 9x9 boards for beginners.

2. The Angular code

Next we need to write our Angular module. The following shows a simple configuration for this sample:

'use strict';

var myApp = angular.module('angularPaper', ['ui.bootstrap']);

myApp.config(function($locationProvider) {
  $locationProvider.html5Mode(false);
});

myApp.controller('BoardCtrl', function($scope) {
  $scope.boardSize = 19;
  $scope.boardControl = {};
}); 

The above configuration defines a controller named BoardCtrl, which has a variable boardSize defining the size of the board. Our directive will need this variable later to create the board. The boardControl object is used to indirectly expose some public operations defined inside the directive. It may not be the best way for a controller to change the directive's state, but it is certainly a simple approach. Another solution is to create a service that acts as a communication bridge between the controller and the directive (see this).


3. The Directive

The final bit is to actually write our directive, which uses PaperJS internally to draw the Go board. According to Angular guidelines, all code related to the "view" or DOM manipulation should be written inside a directive. Therefore, it is the place where we do all the drawing and event handling (e.g. a mouse click on the board).

First, I define the directive placeholder like this:

myApp.directive('board', ['$timeout', function(timer) {
  return {
    restrict: 'A',
    scope: true,
    link: function(scope, element) {
    }
  };
}]);

This is an attribute directive, so I specify 'A' for the restrict property. Also, I intend to use a scope inherited from the controller, so I set scope to true. The link function will contain the main logic for drawing and mouse event handling.

A drawEmptyBoard function is created to draw the board on the provided canvas element:
      /*
       * Function to draw an empty board. This function will be called whenever
       * user reset the board.
       */
      function drawEmptyBoard(element) {

        // create a new empty board
        scope.board = goServer.createBoard(scope.boardSize);

        // setup the paper scope on canvas
        if (!scope.paper) {
          scope.paper = new paper.PaperScope();
          scope.paper.setup(element);
        }

        // clear all drawing items on active layer
        scope.paper.project.activeLayer.removeChildren();

        // draw the board
        var size = parseInt(scope.boardSize);
        var width = element.offsetWidth;
        var margin;
        switch (size) {
        case 9:
          margin = width * 0.08;
          break;
        case 13:
          margin = width * 0.05;
          break;
        default:
          margin = width * 0.033;
        }
        scope.interval = (width - 2 * margin) / (size - 1);

        // store the coordinates for mouse event detection
        scope.coord = [];
        var x = margin;
        for (var i = 0; i < size; i++) {
          scope.coord[i] = x;
          x += scope.interval;
        }

        // assign the global paper object
        paper = scope.paper;

        // draw the board grid
        for (var i = 0; i < size; i++) {

          // draw x axis
          var from = new paper.Point(scope.coord[0], scope.coord[i]);
          var to = new paper.Point(scope.coord[size - 1], scope.coord[i]);
          var path = new paper.Path();
          path.strokeColor = 'black';
          path.moveTo(from);
          path.lineTo(to);

          // draw y axis
          from = new paper.Point(scope.coord[i], scope.coord[0]);
          to = new paper.Point(scope.coord[i], scope.coord[size - 1]);
          path = new paper.Path();
          path.strokeColor = 'black';
          path.moveTo(from);
          path.lineTo(to);
        }

        paper.view.draw();
      }


A tricky point here is that we cannot simply call the board drawing function right away inside the link function; invoking it there directly has no effect, and we cannot see our board. A workaround is to defer the call with a zero-delay timer at the end of the link function, so that it runs after the element has been rendered. That is why I inject the $timeout service into the directive:

myApp.directive('board', ['$timeout', function(timer) {
  return {
    restrict: 'A',
    scope: true,
    link: function(scope, element) {
      function drawEmptyBoard(element) {
        // the code draws the board
      }

      // defer the initial drawing until the element has been rendered
      timer(function() {
        drawEmptyBoard(element[0]);
      }, 0);
    }
  };
}]);

Finally, I added a mousedown event handler to detect mouse clicks and draw the black and white stones alternately:

myApp.directive('board', ['$timeout', function(timer) {

  return {
    restrict: 'A',
    scope: true,
    link: function(scope, element) {

      /*
       * Use an internalControl object to expose operations
       */
      scope.internalControl = scope.boardControl || {};
      scope.internalControl.resetBoard = function() {
        drawEmptyBoard(element[0]);
      };

      /*
       * Function to draw an empty board. This function will be called whenever
       * user reset the board.
       */
      function drawEmptyBoard(element) {
        // the code draws the board
      }

      // bind the mouse down event to the element
      element.bind('mousedown', function(event) {
        var x, y;
        if (event.offsetX !== undefined) {
          x = event.offsetX;
          y = event.offsetY;
        } else { // Firefox compatibility
          x = event.layerX - event.currentTarget.offsetLeft;
          y = event.layerY - event.currentTarget.offsetTop;
        }

        // find the stone position
        x = Math.round((x - scope.coord[0]) / scope.interval);
        y = Math.round((y - scope.coord[0]) / scope.interval);

        if (!scope.board.isAllowed(x, y)) { return; }

        // draw the stone
        paper = scope.paper;
        var center = new paper.Point(scope.coord[x], scope.coord[y]);
        var circle = new paper.Path.Circle(center, scope.interval / 2 - 1);
        if (scope.board.nextColor() == goServer.moveType.BLACK) {
          circle.fillColor = 'black';
        } else {
          circle.fillColor = 'white';
          circle.strokeColor = 'black';
          circle.strokeWidth = 1;
        }
        paper.view.draw();

        scope.board.addMove(x, y);
      });

      // defer the initial drawing until the element has been rendered
      timer(function() {
        drawEmptyBoard(element[0]);
      }, 0);
    }
  };

}]);


An internalControl object is added to the inherited scope so that external code can invoke the exposed functions indirectly through the controller's boardControl variable.
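
For example (the reset function and the button below are not part of the original code, just an illustration of the idea), the controller can expose a method that delegates to the directive through boardControl, and the template can call it with ng-click:

myApp.controller('BoardCtrl', function($scope) {
  $scope.boardSize = 19;
  $scope.boardControl = {};

  // Delegate to the resetBoard function exposed by the board directive
  $scope.reset = function() {
    if ($scope.boardControl.resetBoard) {
      $scope.boardControl.resetBoard();
    }
  };
});

together with a button in the template such as <button ng-click="reset()">Reset board</button>.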

The code has been put on GitHub. A running sample is also available on Plunker: http://plnkr.co/edit/m6rq2o.


Tuesday, February 11, 2014

Spring Data on GAE - Part 3 - Custom Repository

In Part 2, we created a Player entity with a parent in GAE to guarantee the transactionality of updates to Player instances. When we query the player instances via the Spring MVC controller, we get a JSON response like this:

[{"id":"ahJhbmd1bGFyLXNwcmluZy1nYWVyJgsSBlBhcmVudBiAgICAgICACgwLEgZQbGF5ZXIYgICAgICAkAgM", "parentKey":"ahJhbmd1bGFyLXNwcmluZy1nYWVyEwsSBlBhcmVudBiAgICAgICACgw", "name":"Sally","rank":"5d"}]

The id property is actually the encoded form of the com.google.appengine.api.datastore.Key class.

1. Customize the JSON output


Now I only want to include the long id part of the key. First I would like to exclude the id and parentKey properties from the JSON output, so I mark these properties with the @JsonIgnore annotation.

To expose the id part of the encoded Key, I added a transient property called entityId. The property is populated by the entity listener's @PostLoad callback.


Player.java
@Entity
@EntityListeners({ MyEntityListener.class })
public class Player {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    @Extension(vendorName = "datanucleus", key = "gae.encoded-pk", value = "true")
    private String id;

    @Basic
    @Extension(vendorName = "datanucleus", key = "gae.parent-pk", value = "true")
    private String parentKey;

    @Transient
    private Long entityId;

    @JsonIgnore
    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    @JsonIgnore
    public String getParentKey() {
        return parentKey;
    }

    public void setParentKey(String parentKey) {
        this.parentKey = parentKey;
    }

    public Long getEntityId() {
        return entityId;
    }

    public void setEntityId(Long entityId) {
        this.entityId = entityId;
    }

    // other properties and setters/getters skipped
}

MyEntityListener.java
public class MyEntityListener {
    @PostLoad
    public void postLoad(AbstractEntity entity) {
        entity.setEntityId(KeyFactory.stringToKey(entity.getId()).getId());
    }
}

Now we should get a response like the following if we submit the query again:

[{"entityId":4978588650569728,"name":"Sally","rank":"5d"}]

The response looks much better now, with a Long entity id instead of the 80-character datastore key.

2. Implement a custom Spring Data Repository


Remember that we have a Spring Data repository that implements the Player DAO:

import org.springframework.data.jpa.repository.JpaRepository;
import com.angularspring.sample.domain.Player;

public interface PlayerRepository extends JpaRepository<Player, String> {
}

By default the repository has a findOne(String) method returning a Player object. This method is inherited from the Spring CrudRepository interface, with String as the primary key type and Player as the entity type. What if we want to add a findOne(Long) method that finds the entity by its entityId?

To do that, I define a custom repository interface which contains this method:

@NoRepositoryBean
public interface CustomRepository<T, ID extends Serializable> extends JpaRepository<T, ID> {

    public T findOne(Long id);
}

Here is the implementation:

@NoRepositoryBean
public class CustomRepositoryImpl<T, ID extends Serializable> extends SimpleJpaRepository<T, ID>
        implements CustomRepository<T, ID> {

    @PersistenceContext
    private EntityManager em;

    private Class<T> domainClass;

    private Key parentKey;

    public CustomRepositoryImpl(Class<T> domainClass, EntityManager entityManager) {
        super(domainClass, entityManager);

        this.em = entityManager;
        this.domainClass = domainClass;
    }

    private Key getParentKey() {
        if (parentKey != null) {
            return parentKey;
        }

        List<Parent> parents = em.createQuery("SELECT p FROM Parent p", Parent.class)
                .getResultList();
        if (parents == null || parents.size() == 0) {
            return null;
        }

        Parent parent = parents.get(0);
        parentKey = KeyFactory.stringToKey(parent.getKey());
        return parentKey;
    }

    @Override
    public T findOne(Long id) {
        return em.find(domainClass,
                KeyUtils.toKeyString(getParentKey(), domainClass.getSimpleName(), id));
    }
}


As there will be other JPA entity classes too, I use a custom repository factory so that the above implementation can be applied to all user-defined repositories that extend the CustomRepository interface. Note that I add the @NoRepositoryBean annotation to the interface and the implementation above; otherwise Spring Data would try to create repository instances for them on the fly when they are discovered. The KeyUtils class is a utility class which uses the Google API to convert the Long id into the GAE key string.
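
KeyUtils itself is not shown in this post; a minimal sketch of what it might look like, assuming the toKeyString(parentKey, kind, id) signature used in findOne above, is:

import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;

public class KeyUtils {

    // Build the encoded datastore key string for an entity of the given kind
    // with the given Long id, under the supplied parent key
    public static String toKeyString(Key parentKey, String kind, Long id) {
        Key key = KeyFactory.createKey(parentKey, kind, id);
        return KeyFactory.keyToString(key);
    }
}

Here is our custom repository factory: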

public class CustomRepositoryFactoryBean<R extends JpaRepository<T, I>, T, I extends Serializable>
        extends JpaRepositoryFactoryBean<R, T, I> {

    protected RepositoryFactorySupport createRepositoryFactory(EntityManager entityManager) {
        return new CustomRepositoryFactory(entityManager);
    }

    private static class CustomRepositoryFactory<T, I extends Serializable> extends
            JpaRepositoryFactory {

        private EntityManager entityManager;

        public CustomRepositoryFactory(EntityManager entityManager) {
            super(entityManager);

            this.entityManager = entityManager;
        }

        protected Object getTargetRepository(RepositoryMetadata metadata) {
            return new CustomRepositoryImpl<T, I>((Class<T>) metadata.getDomainType(),
                    entityManager);
        }

        protected Class<?> getRepositoryBaseClass(RepositoryMetadata metadata) {
            return CustomRepository.class;
        }
    }
}

The factory simply creates the repository implementation for us when our user-defined repositories are found. To enable this factory we just need to specify it in our repository configuration.

<jpa:repositories base-package="com.angularspring.sample.repository"
        factory-class="com.angularspring.sample.repository.gae.CustomRepositoryFactoryBean" />

Finally, all we need to do is change the parent interface of our repositories. For example, PlayerRepository will now extend our CustomRepository instead of JpaRepository.

import com.angularspring.sample.domain.Player;

public interface PlayerRepository extends CustomRepository<Player, String> {
}

3. Summary


In this blog post we have shown how to use a transient entity id of Long type to query the GAE datastore for a particular entity type. By using a custom Spring Data repository and configuring a custom repository factory, we can apply this id lookup behaviour to all defined Spring Data JPA repositories.

The source can be found at GitHub (tagged v0.3).

Thursday, February 6, 2014

Spring Data on GAE - Part 2 - Datastore Key

In the last blog post we saw that it is straightforward to use Spring Data on the GAE platform. However, due to the design of the GAE datastore, we may see unexpected behaviour around transactionality.

Assume that we define a simple entity with a primary key of Long type:

@Entity
public class Player {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id; 

    // other properties skipped

    public Player() {
    }

    // setters and getters skipped
}

If we update multiple Player instances in the same transaction and immediately query the result, we may see that the expected data changes do not all become visible at the same time. For example, we may see that Player A's data has been updated but not Player B's. However, if we keep querying the database, eventually we can see all the data changes.

This is because the GAE datastore gives each entity an optional ancestor path. Entities with the same ancestor path are placed in the same entity group, and in the GAE datastore an entity group is the unit within which transactionality can be guaranteed.

1. GAE Datastore Entity Key


According to the datastore design, each entity has a primary key composed of the following three elements (see here):
  • The entity's kind
  • An identifier, which can be either
    • a key name string
    • an integer ID
  • An optional ancestor path locating the entity within the Datastore hierarchy
When we use DataNucleus JPA to persist our entities to the datastore, it assigns the simple class name (i.e. "Player") as the entity's kind. For the previous POJO definition, it also generates a Long identifier for us. However, the ancestor path will be empty.

Therefore, each Player instance is in its own entity group, and updates to multiple entity groups may not be effective at the same time.

2. Use GAE Primary Key Type


One way to overcome the issue is to group all Player entities in the same entity group by defining a common ancestor path. This can be done by using the com.google.appengine.api.datastore.Key class as the primary key type instead of Long. However, I prefer not to do this, as it makes the entity class very Google-specific.

Instead, I use the DataNucleus extension as advised in the book Programming Google App Engine.

Firstly, I need to change the primary key of the Player entity from Long to String. I also need to add the DataNucleus annotation (@org.datanucleus.api.jpa.annotations.Extension) so that an encoded GAE primary key is generated for me. I know this still depends on DataNucleus, but I think it is better than depending on a specific Google class.

@Id
@GeneratedValue(strategy = GenerationType.IDENTITY)
@Extension(vendorName = "datanucleus", key = "gae.encoded-pk", value = "true")
private String id;

I also need to add a new property to the entity to indicate the ancestor, for which I created a Parent class.

@Basic
@Extension(vendorName = "datanucleus", key = "gae.parent-pk", value = "true")
private String parentKey;

@ManyToOne
private Parent parent;

public String getParentKey() {
    return parentKey;
}

public void setParentKey(String parentKey) {
    this.parentKey = parentKey;
}

@com.fasterxml.jackson.annotation.JsonIgnore
public Parent getParent() {
    return parent;
}

public void setParent(Parent parent) {
    this.parent = parent;
}


Parent.java
@Entity
public class Parent {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    @Extension(vendorName = "datanucleus", key = "gae.encoded-pk", value = "true")
    String key;

    @OneToMany(cascade = CascadeType.ALL, mappedBy = "parent")
    private List<Player> players;

    // setters and getters skipped 
} 


Note that the @JsonIgnore annotation has been added to the parent property to avoid potential recursive references when creating the JSON string.

3. Save the Parent and Players objects


To create a parent, we can either use the EntityManager API or define a Spring Data repository to implement the DAO for us. Here I use a Spring Data repository.

public interface ParentRepository extends JpaRepository<Parent, String> {
}


The common parent object can be created in a utility method like the following. It returns the Parent object that we need when creating Player objects.

public Parent getParent() {
    List<Parent> parents = parentDao.findAll();
    if (parents == null || parents.size() == 0) {
        parent = new Parent();
        parentDao.save(parent);
    } else {
        parent = parents.get(0);
    }

    parentKey = KeyFactory.stringToKey(parent.getKey());

    return parent;
}


In the previous blog post we created a Spring MVC controller class that maps the URL /service/players/initDB to a method:

@RequestMapping(value = "/initDB", method = RequestMethod.GET)

@Transactional(readOnly = false, isolation = Isolation.READ_COMMITTED)

public ResponseEntity<String> initDB() {

    dataService.initDB();

    return new ResponseEntity<String>("Players inserted to database", HttpStatus.OK);

}


The following method creates the players. Note that we need to set the parent property so that DataNucleus can set the parentKey string for us.

public ResponseEntity<String> initDB() {
    // get entity group parent
    Parent p = getParent();

    // insert testing player data

    List<Player> players = new ArrayList<Player>();
    players.add(new Player("Snoopy", "9p", p));
    players.add(new Player("Wookstock", "9p", p));
    players.add(new Player("Charlie", "1d", p));
    players.add(new Player("Lucy", "4d", p));
    players.add(new Player("Sally", "5d", p));
    playerRepository.save(players);
    return new ResponseEntity<String>("5 players inserted into database", HttpStatus.OK);
 }



That's it. If we initialize the database using the URL, we will see that the complete list of players is shown immediately when we submit the query URL /service/players (see the last blog post for the Spring MVC implementation).

[{"id":"ahJhbmd1bGFyLXNwcmluZy1nYWVyJgsSBlBhcmVudBiAgICAgICACgwLEgZQbGF5ZXIYgICAgICAkAgM","parentKey":"ahJhbmd1bGFyLXNwcmluZy1nYWVyEwsSBlBhcmVudBiAgICAgICACgw","name":"Sally","rank":"5d"},{"id":"ahJhbmd1bGFyLXNwcmluZy1nYWVyJgsSBlBhcmVudBiAgICAgICACgwLEgZQbGF5ZXIYgICAgICA4AgM","parentKey":"ahJhbmd1bGFyLXNwcmluZy1nYWVyEwsSBlBhcmVudBiAgICAgICACgw","name":"Snoopy","rank":"9p"},{"id":"ahJhbmd1bGFyLXNwcmluZy1nYWVyJgsSBlBhcmVudBiAgICAgICACgwLEgZQbGF5ZXIYgICAgICA4AkM","parentKey":"ahJhbmd1bGFyLXNwcmluZy1nYWVyEwsSBlBhcmVudBiAgICAgICACgw","name":"Charlie","rank":"1d"},{"id":"ahJhbmd1bGFyLXNwcmluZy1nYWVyJgsSBlBhcmVudBiAgICAgICACgwLEgZQbGF5ZXIYgICAgICA4AoM","parentKey":"ahJhbmd1bGFyLXNwcmluZy1nYWVyEwsSBlBhcmVudBiAgICAgICACgw","name":"Wookstock","rank":"9p"},{"id":"ahJhbmd1bGFyLXNwcmluZy1nYWVyJgsSBlBhcmVudBiAgICAgICACgwLEgZQbGF5ZXIYgICAgICA4AsM","parentKey":"ahJhbmd1bGFyLXNwcmluZy1nYWVyEwsSBlBhcmVudBiAgICAgICACgw","name":"Lucy","rank":"4d"}]

We can also query a particular player via the URL /service/players/{id}, with the id string shown above. The id string is in fact the encoded primary key, which can be converted to a Google Key object by the com.google.appengine.api.datastore.KeyFactory.stringToKey(String) method.

Now the primary key is about 80 characters long and can locate ANY datastore entity, because it encodes the full key (kind, id, ancestor path). That is not necessary if we already know it is used to locate a Player object. Preferably, I would like to query a player using only the Long id part of the key. I will discuss how to do that with a custom Spring Data repository in the next part.
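
As a small illustration (this snippet is not part of the project code), the encoded id string can be decoded to see the parts it contains:

import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;

public class KeyInspector {

    // Decode an encoded datastore key string and print its components
    public static void inspect(String encodedId) {
        Key key = KeyFactory.stringToKey(encodedId);
        System.out.println("kind   = " + key.getKind());    // e.g. "Player"
        System.out.println("id     = " + key.getId());      // the numeric (Long) id part
        System.out.println("parent = " + key.getParent());  // the ancestor (Parent) key
    }
}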

The source can be found at GitHub (tagged v0.2).



Friday, January 31, 2014

Spring Data on GAE - Part 1 - Basic JPA

I would like to create an application using AngularJS as the front-end and Spring as the back-end. The data access layer will be implemented with Spring Data JPA and the DataNucleus JPA provider. The whole application will run on Google App Engine.

There has been debate over JPA vs JDO on App Engine (see here). Personally I prefer JPA due to its wide adoption and its support by Spring Data. However, we will see some design issues later due to the App Engine Datastore owned-relationship feature.

1. Data model


Firstly I need to define a simple data model for this experiment. Here I created a POJO named Player to represent the players on the game server:

@Entity
public class Player {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    private String name;
    private String rank;

    public Player() {
    }

    public Player(String name, String rank) {
        super();
        this.name = name;
        this.rank = rank;
    }

    // setters and getters skipped
}

Due to the design of the Datastore, we will soon see a problem with using Long as the primary key. However, for simplicity we just use Long here for now.

2. JPA Configurations


As I use Spring MVC and Spring Data to implement the server-side presentation and data layers, I add the following dependencies in build.gradle:

build.gradle
dependencies {
    compile "org.springframework:spring-webmvc:3.2.6.RELEASE"
    compile "org.springframework.data:spring-data-jpa:1.3.0.RELEASE"
    compile "com.fasterxml.jackson.core:jackson-databind:2.3.1"
    compile "org.datanucleus:datanucleus-enhancer:3.1.1"
    compile "org.codehaus.jackson:jackson-mapper-asl:1.9.9"
    compile "org.slf4j:slf4j-simple:1.7.5"
    ...
}

Note that the older spring-data-jpa 1.3.0 is used instead of the latest release because of a bug in DataNucleus (http://www.datanucleus.org/servlet/jira/browse/NUCJPA-250).

To specify DataNucleus as the JPA provider, the persistence unit is set as follows:

META-INF/persistence.xml
<persistence version="2.0" 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    xmlns="http://java.sun.com/xml/ns/persistence" 
    xsi:schemaLocation="http://java.sun.com/xml/ns/persistence
    http://java.sun.com/xml/ns/persistence/persistence_2_0.xsd">
    <persistence-unit name="jpa.unit">
        <provider>org.datanucleus.api.jpa.PersistenceProviderImpl</provider>
        <class>com.angularspring.sample.domain.Player</class>
        <exclude-unlisted-classes />
        <properties>
            <property name="datanucleus.ConnectionURL" value="appengine" />
            <property name="datanucleus.appengine.ignorableMetaDataBehavior" value="NONE" />
        </properties>
    </persistence-unit>
</persistence>

3. Spring Data Repository


We need something to actually interact with the database. Here I do not use the native JPA API (i.e. EntityManager) for database operations; instead I use Spring Data JPA. Spring Data can greatly reduce the boilerplate code of a DAO implementation. We just need to define a player repository interface:

package com.angularspring.sample.repository;
public interface PlayerRepository extends JpaRepository<Player, Long> {
}

To complete the JPA repository configuration, we need to specify the following in a Spring configuration:

applicationContext.xml
<jpa:repositories base-package="com.angularspring.sample.repository" />

<tx:annotation-driven transaction-manager="transactionManager" />

<bean class="org.springframework.orm.jpa.JpaTransactionManager" id="transactionManager">
    <constructor-arg ref="entityManagerFactory" />
</bean>

<bean class="org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean" id="entityManagerFactory">
    <property name="persistenceUnitName" value="jpa.unit" />
    <property name="packagesToScan" value="com.angularspring.sample.domain" />
    <property name="loadTimeWeaver">
        <bean class="org.springframework.instrument.classloading.SimpleLoadTimeWeaver" />
    </property>
</bean>

4. Spring MVC Restful Service


Spring MVC is used to implement the RESTful service. Firstly, web.xml is set up to include the DispatcherServlet and the context loader (which loads the Spring configuration from the previous section):

web.xml
<servlet>
  <servlet-name>dispatcher</servlet-name>
  <servlet-class>org.springframework.web.servlet.DispatcherServlet</servlet-class>
  <load-on-startup>1</load-on-startup>
</servlet>
<servlet-mapping>
  <servlet-name>dispatcher</servlet-name>
  <url-pattern>/service/*</url-pattern>
</servlet-mapping>
<listener>
  <listener-class>org.springframework.web.context.ContextLoaderListener</listener-class>
</listener>


And the following configuration for the dispatcher:

dispatcher-servlet.xml
<mvc:annotation-driven />
<context:component-scan base-package="com.angularspring.sample" />
<context:annotation-config /> 

5. Spring MVC Controller


Finally, we need to add a Spring MVC controller to test the application from the browser. A PlayerController class is configured to map the path "/service/players" to a service returning all players, and the path "/service/players/{id}" to return a particular player, both in JSON format. The object-to-JSON conversion is handled automatically by the jackson-mapper-asl library if we put it on the classpath.

@Controller
@RequestMapping("/players")
public class PlayerController {
    private static Logger logger = LoggerFactory.getLogger(PlayerController.class);

    @Autowired
    PlayerRepository playerRepository;

    @RequestMapping(method = RequestMethod.GET)
    public @ResponseBody List<Player> getAllPlayers() {
        return playerRepository.findAll();
    }

    @RequestMapping(value = "/{id}", method = RequestMethod.GET)
    public @ResponseBody Player getPlayer(@PathVariable("id") Long id) {
        return playerRepository.findOne(id);
    }

    @RequestMapping(value = "/initDB", method = RequestMethod.GET)
    @Transactional(readOnly = false, isolation = Isolation.READ_COMMITTED)
    public ResponseEntity<String> initDB() {
        List<Player> players = new ArrayList<Player>();
        players.add(new Player("Snoopy", "9p"));
        players.add(new Player("Wookstock", "9p"));
        players.add(new Player("Charlie", "1d"));
        players.add(new Player("Lucy", "4d"));
        players.add(new Player("Sally", "5d"));
        playerRepository.save(players);
        return new ResponseEntity<String>("5 players inserted into database", HttpStatus.OK);
    }
}

I have also added an initDB() method to initialize the datastore via an HTTP request. The above will add 5 players to the database.

6. Test the application


We can test the application by using the Google plugin for Eclipse. Start the Google web application and try the following in your browser (assuming the port is 8888). I used Firefox instead of IE because Firefox displays JSON directly.

    http://localhost:8888/service/players

It should return nothing, as expected, because we have no records in the database. So what we see is an empty array in JSON format (i.e. [ ]).

We can then initialize the database by:

    http://localhost:8888/service/players/initDB

Query the players again. It should give us something like the following:

[{"id":4573968371548160,"name":"Sally","rank":"5d"},{"id":4925812092436480,"name":"Snoopy","rank":"9p"},{"id":5488762045857792,"name":"Charlie","rank":"1d"},{"id":6051711999279104,"name":"Wookstock","rank":"9p"},{"id":6614661952700416,"name":"Lucy","rank":"4d"}]

7. Problem


It looks simple to use JPA with GAE.

However, when we test the application in the browser, we can easily discover something unusual: not all entities are displayed immediately after the database initialization! It may display 2 records the first time, 3 records the second time, and finally all 5 records if we submit the HTTP request enough times. The reason is the eventual consistency characteristic of Google App Engine. All those entities are unowned entities, so they do not belong to the same entity group, and in the App Engine Datastore architecture transactionality and strong consistency only apply at the entity group level.

For details, please read about the owned relationships of Datastore entities. Also note the unsupported features of JPA on the Datastore (e.g. we cannot do many-to-many mappings) when you do the data modelling.

The handling of owned relationships and transactionality on GAE will be discussed in the next part.

The source can be found at GitHub (tagged v0.1).