Category Archives: software

Pedestrian Safety in Manhattan

For the final project in my Realtime and Big Data Analytics class at NYU, I worked on an analysis of the effectiveness of pedestrian safety measures in Manhattan with fellow students Rui Shen and Fei Guan. The main idea behind this project was to look at the number of accidents occurring within a fixed distance of an intersection in Manhattan and determine if the accident rate correlated with any features of the intersection, such as the presence of traffic signals or high traffic volume. We used a number of big data tools and techniques (like Apache Hadoop and MapReduce) to analyze this data and found some rather interesting results.

The first step was to collect data about intersections, accidents, and various features of the intersections. To do this, we relied heavily on open data sets. We extracted the locations of intersections, speed bumps, and traffic signals from OpenStreetMap. We used NYC Department of Transportation data for traffic volume information, traffic signal locations, and traffic camera locations. Finally, we used NYC Open Data for information on accident counts and traffic volume, as well as the locations of speed bumps, arterial slow zones, and neighborhood slow zones. Some of the data could be used mostly off the shelf, but other datasets required further processing, such as normalizing traffic volume over time and geocoding the street addresses of traffic camera locations.
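As an illustration of that last step, geocoding an address takes only a few lines of Python. Here is a minimal sketch using geopy's Nominatim geocoder; the sample address and user agent are made up, and this is not necessarily the exact tool we used:

    # Minimal geocoding sketch; the sample address and user agent are illustrative.
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent="pedestrian-safety-project")  # hypothetical user agent
    location = geolocator.geocode("Broadway and W 42nd St, Manhattan, NY")
    if location is not None:
        print(location.latitude, location.longitude)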

The next step was to merge the feature and accident data with the relevant intersections. To do this, we used big data tools to assign intersection identifiers to every corresponding feature and accident record. As Hadoop can’t natively handle spatial data, we needed some additional tools to help us determine which features existed within an intersection. There were three distinct types of spatial data that we needed to process: point data (such as accidents), line data (such as traffic volume) and polygon data (such as neighborhood slow zones). Fortunately, GIS Tools for Hadoop helped us solve this problem. The GIS Tools implement many spatial operations on top of Hadoop, such as finding spatial geometry intersections, overlaps, and inclusions. This toolkit also includes User Defined Functions (UDFs) that can be used with Hive. For this task, we used Hive and the UDFs to associate the feature and accident data with the appropriate intersections. We experimented with different sizes of spatial buffers around an intersection and decided that a twenty-meter radius captured most of the related data points without overlapping with other intersections.
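To give a flavor of that spatial join, here is a small Python sketch of the same buffer-based test using shapely rather than the Hive UDFs we actually ran. The identifiers and coordinates are made up, and the sketch assumes everything has already been projected to a metric coordinate system:

    # Buffer-based spatial join sketch (shapely stand-in for the GIS Tools UDFs).
    # Assumes coordinates are already projected so distances are in meters;
    # the intersection and accident records below are made up.
    from shapely.geometry import Point

    BUFFER_METERS = 20  # the radius we settled on

    intersections = {"bway_w42": (300101.0, 65020.0)}   # id -> (x, y)
    accidents = [("acc_1", 300110.0, 65012.0)]          # (id, x, y)

    def assign_intersections(intersections, accidents):
        """Map each accident to the intersection whose buffer contains it."""
        buffers = {iid: Point(xy).buffer(BUFFER_METERS) for iid, xy in intersections.items()}
        assignments = {}
        for acc_id, x, y in accidents:
            point = Point(x, y)
            for iid, buf in buffers.items():
                if buf.contains(point):
                    assignments[acc_id] = iid
                    break
        return assignments

    print(assign_intersections(intersections, accidents))   # {'acc_1': 'bway_w42'}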

Examples of the different types of spatial data that could exist within an intersection: area data (blue), point data (purple) and line data (green).

Once all of the relevant data had an intersection identifier assigned to it, we wrote a MapReduce job to aggregate all of the distinct data sets into one dataset that had all of the intersection feature information in a single record. In the reduce stage, we examined all of the data for a given intersection and did some further reduction, such as normalizing the traffic volume value for the intersection or calculating the sum of all of the accidents occurring within the intersection buffer.
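A stripped-down, Hadoop Streaming-style version of that reduce step might look like the sketch below; the record types and fields are illustrative, and the real job handled many more features:

    #!/usr/bin/env python
    # Streaming-style reducer sketch: collapse all records sharing an intersection id
    # into a single feature record. Input lines: <intersection_id>\t<record_type>\t<value>
    import sys
    from collections import defaultdict

    def emit(intersection_id, feats):
        print("\t".join([intersection_id,
                         str(int(feats["accidents"])),
                         str(int(feats["traffic_signals"])),
                         str(feats["traffic_volume"])]))

    current_id = None
    features = defaultdict(float)

    for line in sys.stdin:
        intersection_id, record_type, value = line.rstrip("\n").split("\t")
        if current_id is not None and intersection_id != current_id:
            emit(current_id, features)              # intersection boundary reached
            features = defaultdict(float)
        current_id = intersection_id
        if record_type == "accident":
            features["accidents"] += 1              # sum accidents within the buffer
        elif record_type == "traffic_signal":
            features["traffic_signals"] += 1
        elif record_type == "traffic_volume":
            features["traffic_volume"] = float(value)   # already normalized upstream

    if current_id is not None:
        emit(current_id, features)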

The last step was to calculate correlation metrics on the data. To do this, we used Apache Spark. We segmented the data set into thirds by traffic volume, giving us low, moderate, and high traffic volume data sets. We then calculated Spearman and Pearson correlation coefficients between the accident rate and the individual features and analyzed the results. Although most features showed very little correlation with the accident rate, a few features produced a moderate level of correlation. First, we found a moderate positive correlation between accidents and the presence of traffic lights. This seemed odd at first, but on second consideration it made sense: I have seen many random acts of bravery at traffic signals, where people try to cross the street just as the light is changing. Second, we found a moderate negative correlation between high traffic volume and accidents. Again, this was not immediately intuitive, but our speculation was that drivers and pedestrians are more cautious at busy intersections.
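For reference, the correlation computation itself is only a few lines of PySpark. The sketch below assumes an aggregated, tab-separated file; the path and column positions are illustrative:

    # Minimal PySpark sketch of the correlation step; path and columns are illustrative.
    from pyspark import SparkContext
    from pyspark.mllib.stat import Statistics

    sc = SparkContext(appName="intersection-correlations")

    # Each line: intersection_id \t accident_count \t has_traffic_signal \t traffic_volume
    records = sc.textFile("hdfs:///project/intersections.tsv").map(lambda l: l.split("\t"))

    accidents = records.map(lambda r: float(r[1]))
    signals = records.map(lambda r: float(r[2]))

    print("Pearson: ", Statistics.corr(accidents, signals, method="pearson"))
    print("Spearman:", Statistics.corr(accidents, signals, method="spearman"))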

As this project was only a few weeks long, we didn’t have time to do a more in-depth analysis. I think we would have found even more interesting results had we done a proper multivariate analysis, which would have allowed us to calculate correlation metrics across all variables instead of just examining single-variable correlations. One observation we made was that intersections in high-traffic business or tourist areas have different accident profiles than intersections in residential areas. Therefore, it would be wise to include more socio-economic information for each intersection, such as land-use and population data.

Despite the time constraints, the small amount of analysis we did was very interesting and made me look at something as simple as crossing the street in a whole new light.

Super Mario Clouds

I took the Super Mario Clouds class this weekend at NYC Resistor. It was taught by Jonathan Dahan and David Huerta. The goal of the class was to recreate Cory Arcangel’s Super Mario Clouds project. This involved hacking a Super Mario Brothers cartridge to show just the clouds. Here is a video of his finished project:

The first step was to take apart a Super Mario Brothers cartridge. Interestingly enough, the actual board inside was much smaller than the cartridge. The game was divided into two chips: character/sprite data on the left chip and program data on the right chip. Since we were going to use the existing Super Mario sprite data, we only needed to remove the program chip. After carefully desoldering the program chip, we replaced it with a socket and a 27C256 EEPROM.

Next, we talked about how the Nintendo works. There were some very helpful explanations in the Nerdy Nights Tutorials. The Nintendo uses a custom 6502 processor which has the audio processing built into the chip. There is a separate picture processing unit used to display graphics. The program ROM is limited to a mere 32 KB, so Nintendo had to do some clever graphics manipulation in order to create a rather seamless side-scrolling experience.

The next step was to download Cory Arcangel’s Super Mario Clouds code. There was quite a bit of software setup required to compile and run the code from scratch. As a word of warning: some of these tools work best in Windows. As I am a Mac user, I used a Windows virtual machine on VirtualBox. To run the code, we used the FCEUX Nintendo emulator. Here is a screenshot of the original code as rendered by the emulator:

But what if we wanted to modify the code? The original Super Mario Clouds code was written in nbasic. To create a new binary, we first had to compile the nbasic source into 6502 assembly and then assemble it into a .nes file with nesasm. Those tools can be found here. Finally, an NES splitter is needed to split the resulting .nes file into .chr and .prg files (for the respective character and program chips).
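To get a feel for what the splitter is doing, here is a short Python sketch that carves an iNES-format .nes file into its .prg and .chr parts based on the sizes recorded in the 16-byte header. The class used an existing splitter tool, so this is just an illustration:

    # Illustrative iNES splitter. Header layout: bytes 0-3 = "NES\x1a",
    # byte 4 = PRG ROM size in 16 KB units, byte 5 = CHR ROM size in 8 KB units,
    # byte 6 = flags (bit 2 set means a 512-byte trainer follows the header).
    import sys

    def split_nes(path):
        with open(path, "rb") as f:
            data = f.read()
        assert data[:4] == b"NES\x1a", "not an iNES file"
        prg_size = data[4] * 16 * 1024
        chr_size = data[5] * 8 * 1024
        offset = 16 + (512 if data[6] & 0x04 else 0)   # skip header (and trainer)
        with open(path + ".prg", "wb") as f:
            f.write(data[offset:offset + prg_size])
        with open(path + ".chr", "wb") as f:
            f.write(data[offset + prg_size:offset + prg_size + chr_size])

    if __name__ == "__main__":
        split_nes(sys.argv[1])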

Once we had the .prg file, it was time to burn our programs to the chip. This was done using an EEPROM programmer. We simply selected the chip type, uploaded our binary and let the programmer software do the rest.

After uploading our programs to the cartridge, it was time for the moment of truth. We plugged our cartridges into an old Nintendo and hoped for the best. Fortunately, most of the cartridges worked on the first try!

This project was a great way to spend an afternoon with my head in the (Super Mario) clouds. Now I really want to create my own Super Mario piece. I never thought I would be so excited to write assembly code…

Probabilistic Postal Address Elementalization

I haven’t posted in a few months because most of my time has been consumed with work and school. With respect to school, I have been taking Statistical Natural Language Processing at New York University. As my final project for this class, I have been working on something which I have been curious about for quite some time – probabilistic postal address elementalization.

What exactly does “postal address elementalization” mean? This is the process of breaking a postal address into tokens and classifying the function of each token, such as house number, street suffix or zip code. For my experiment, I created twelve distinct classes: five for administrative areas (country, state, county, city and neighborhood), one for postal codes and six for street address components (house number, prefix, pre-directional, base street name, post-directional and suffix). An example of an elementalized address is below:
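For instance, a hypothetical US address might break down into the following (token, class) pairs:

    # Hypothetical example of an elementalized US address, token by token.
    elementalized = [
        ("123",      "house_number"),
        ("N",        "pre_directional"),
        ("Main",     "base_street_name"),
        ("St",       "suffix"),
        ("Brooklyn", "city"),
        ("NY",       "state"),
        ("11201",    "postal_code"),
        ("USA",      "country"),
    ]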

Although this seems like a straightforward problem, it is complicated by the fact that many countries have different languages and address formats. For example, in Ghana and Cameroon, there is no standard postal code system. In the Netherlands and Ireland, there is no province or state in the address. In some countries, such as France and Mexico, the postal code is placed before the city, whereas other countries place it after the city. Furthermore, tokens can legitimately belong to more than one class, so disambiguation can be tricky: in Avenue N in Brooklyn, “N” could be a post-directional or part of the street name, and in N Broadway in St. Louis, “St.” could be a street suffix or part of the city name.

Traditionally, address elementalization has been implemented by building rule-based systems to handle each individual address format. This makes the process of building international address parsers very time consuming as one would need to implement a new rule-based parser for every single country. By building a statistical model to do this, implementing a new international format is as simple as training the model with a new data set.

Some interesting characteristics of postal addresses are that they have a grammar (as defined by the address format) and that the elements are contextually dependent. These traits make the problem well-suited for natural language processing. In fact, natural language processing already has techniques for similar problems, namely part-of-speech taggers, which classify the function of each word in a sentence.

For my final project, I looked at four different techniques of statistical part-of-speech tagging and applied them to the problem of postal address elementalization. The code is here.

The first strategy was a Hidden Markov model (HMM) tagger. HMMs are statistical models that can be used to find the most likely sequence of states for an input. This is done by using transition probabilities (the probability of a specific state given the previous state) and emission probabilities (the probability of the proposed state given the current input token). These probabilities are learned by observing a training set. The model then classifies the input left to right, calculating the probability of a proposed state as the product of the transition probability, the emission probability, and the maximum probability from the previous state. The model tries different state combinations over the input and returns the sequence of states with the highest overall probability. A variation of this is the trigram HMM tagger, which conditions the transition probability on the two previous states.
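The core of the HMM tagger is the Viterbi search. The compact Python sketch below shows the idea; the probability tables (start_p, trans_p, emit_p) are assumed to have been estimated from a training set, and the unseen-token fallback of 1e-8 is just a placeholder rather than real smoothing:

    # Compact Viterbi sketch for a bigram HMM tagger. start_p, trans_p and emit_p
    # are dictionaries of probabilities estimated from training data.
    def viterbi(tokens, states, start_p, trans_p, emit_p):
        """Return the most likely state sequence for the observed tokens."""
        V = [{s: start_p[s] * emit_p[s].get(tokens[0], 1e-8) for s in states}]
        back = [{s: None for s in states}]
        for i in range(1, len(tokens)):
            V.append({})
            back.append({})
            for s in states:
                best_prev, best_prob = max(
                    ((prev, V[i - 1][prev] * trans_p[prev][s] * emit_p[s].get(tokens[i], 1e-8))
                     for prev in states),
                    key=lambda pair: pair[1],
                )
                V[i][s] = best_prob
                back[i][s] = best_prev
        # Trace back from the best final state.
        state = max(states, key=lambda s: V[-1][s])
        path = [state]
        for i in range(len(tokens) - 1, 0, -1):
            state = back[i][state]
            path.append(state)
        return list(reversed(path))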

The second strategy was a Maximum-Entropy Markov model (MEMM) tagger. MEMMs are similar to HMMs in that they try to find the sequence of states that has the maximum total probability for an input. However, instead of just using observed counts for the transition and emission probabilities, we train a maximum-entropy distribution to calculate the probabilities. The main advantage of doing this is that we can have custom features factor into the probability, such as the current token length or whether the token contains a number. This allows for greater flexibility in tuning the tagger, but it makes classification slower.
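The feature functions I am describing look roughly like the sketch below; the exact feature set in my project differed, so treat these as examples:

    # Example feature extractor for an MEMM tagger; the features shown are
    # illustrative rather than the exact set used in the project.
    def address_features(tokens, i, prev_tag):
        token = tokens[i]
        return {
            "token.lower": token.lower(),
            "prev_tag": prev_tag,                    # Markov dependence on the previous state
            "token.length": len(token),
            "token.is_numeric": token.isdigit(),
            "token.has_digit": any(c.isdigit() for c in token),
            "is_first": i == 0,
            "is_last": i == len(tokens) - 1,
        }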

The third strategy was a Transformation-Based Learning (TBL) tagger, also known as a Brill tagger. The TBL tagger generates rules based on observations in training; each rule describes a condition under which one tag should be swapped for another. During classification, initial tags are assigned to the terms based on their observed probabilities. The tagger then iterates through the rules learned in training, swapping tags as necessary, until there are no more rules to apply or a given score threshold is met.
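Conceptually, applying the learned transformations is straightforward. The toy sketch below uses rules that only look at the previous tag, whereas real TBL templates allow richer contexts; the example rule and tags are made up:

    # Toy sketch of Brill-style rule application. Each rule: change `from_tag`
    # to `to_tag` when the previous token's tag is `prev_tag`.
    from collections import namedtuple

    Rule = namedtuple("Rule", ["from_tag", "to_tag", "prev_tag"])

    def apply_rules(tokens, initial_tags, rules):
        tags = list(initial_tags)        # start from the most-likely-tag baseline
        for rule in rules:               # rules are applied in the order learned
            for i in range(1, len(tokens)):
                if tags[i] == rule.from_tag and tags[i - 1] == rule.prev_tag:
                    tags[i] = rule.to_tag
        return tags

    # Example: retag "St" as a suffix when it follows a base street name.
    rules = [Rule(from_tag="city", to_tag="suffix", prev_tag="base_street_name")]
    print(apply_rules(["Main", "St"], ["base_street_name", "city"], rules))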

The fourth strategy was a Conditional Random Field (CRF) tagger. CRFs use a number of feature functions which take the proposed states and the input observation and return a value (often just 0 or 1). These features, like the features in MEMMs, can be just about anything, such as whether the current token is capitalized or whether it ends in -ed. Each feature function has a weight learned from observations in training. The score for a sequence of states is calculated by summing the weighted feature functions over all tokens and then normalizing. Just like with HMMs and MEMMs, we find the sequence that maximizes the overall probability.
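Off-the-shelf CRF libraries make this easy to try. The sketch below uses sklearn-crfsuite, which is not necessarily the implementation my project used, with a single made-up training address:

    # Minimal CRF training sketch with sklearn-crfsuite; the library choice,
    # feature set and training example are illustrative.
    import sklearn_crfsuite

    def token_features(tokens, i):
        token = tokens[i]
        return {
            "token.lower": token.lower(),
            "token.is_numeric": token.isdigit(),
            "prev.lower": tokens[i - 1].lower() if i > 0 else "<START>",
            "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
        }

    train_addresses = [["123", "Main", "St", "Brooklyn", "NY", "11201"]]
    train_labels = [["house_number", "base_street_name", "suffix", "city", "state", "postal_code"]]

    X = [[token_features(a, i) for i in range(len(a))] for a in train_addresses]
    y = train_labels

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
    crf.fit(X, y)
    print(crf.predict(X))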

From my initial work, I found that the Maximum-Entropy Markov model and the Conditional Random Field taggers consistently had the highest overall accuracy of the group. Both consistently had accuracies over 98%, even on partial addresses. For full format American addresses, these taggers had an accuracy around 99.7%. If I had to choose between them, I would go with the Conditional Random Field tagger as it was considerably faster in tagging the addresses.

I didn’t have enough time to finish everything that I wanted to implement, so this is still a work in progress. I still want to implement some smoothing techniques for unseen tokens and transitions, such as Katz backoff and Kneser-Ney interpolation. There are also a few more part-of-speech tagging techniques I would like to experiment with. I guess that old adage is true in this case – time flies when you are having fun… 🙂