
Reducing Spatial Data Set Size with Douglas-Peucker

In a previous post I discussed how to reduce the size of a spatial data set by clustering. Too many data points in a visualization can overwhelm the user and bog down on-the-fly client-side map rendering (for example, with a JavaScript tool like Leaflet). So, I used the DBSCAN clustering algorithm to reduce my data set from 1,759 rows to 158 spatially representative points. This series of posts discusses this data set in depth.

The Douglas-Peucker algorithm

Today I’ll show another method for reducing the size of a spatial data set in Python using the shapely package’s implementation of the Douglas-Peucker algorithm. Douglas-Peucker reduces the number of points in a curve composed of line segments by recursively dividing the line. The final result is a subset of points from the original curve.
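Conceptually, the algorithm draws a segment between a curve’s first and last points, finds the intermediate point farthest from that segment, keeps it (and recurses on the two halves) if its distance exceeds a tolerance, and discards all the intermediate points otherwise. To make that concrete, here is a minimal, illustrative sketch in pure Python/NumPy – this is not shapely’s implementation, just a demonstration of the idea:

import numpy as np

def perpendicular_distance(point, start, end):
    # distance from `point` to the straight line through `start` and `end`
    if np.allclose(start, end):
        return float(np.linalg.norm(point - start))
    dx, dy = end[0] - start[0], end[1] - start[1]
    num = abs(dy * point[0] - dx * point[1] + end[0] * start[1] - end[1] * start[0])
    return float(num / np.hypot(dx, dy))

def douglas_peucker(points, tolerance):
    # find the intermediate point farthest from the segment joining the endpoints
    start, end = points[0], points[-1]
    distances = [perpendicular_distance(p, start, end) for p in points[1:-1]]
    if not distances or max(distances) <= tolerance:
        # every intermediate point is within tolerance: keep only the endpoints
        return [start, end]
    # otherwise keep the farthest point and recursively simplify each half
    split = int(np.argmax(distances)) + 1
    left = douglas_peucker(points[:split + 1], tolerance)
    right = douglas_peucker(points[split:], tolerance)
    return left[:-1] + right

Calling douglas_peucker(coordinates, tolerance) on an array of lat-lon pairs returns the retained subset of points; shapely’s simplify method does the same kind of work in optimized compiled code.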

All of my code is available in this GitHub repo, particularly this IPython notebook. Let’s begin. First I import the necessary modules and load the full data set:

import pandas as pd, numpy as np, matplotlib.pyplot as plt
from shapely.geometry import LineString
from time import time

# load the full data set and grab the lat-lon columns as a numpy array
df = pd.read_csv('summer-travel-gps-full.csv')
coordinates = df[['lat', 'lon']].values

Next, I create a shapely LineString from the coordinate data. This curve is composed of line segments between consecutive coordinate points. I will then simplify the geometry of this curve with shapely’s simplify method.

If preserve_topology is set to False, the method uses the Douglas-Peucker algorithm, which is fine – I don’t need to preserve topology here because I just need a reduced set of points. I also specify a tolerance value: every point in the simplified geometry will be within the tolerance distance of the original geometry. After running the algorithm, I’ll print a couple of descriptive stats to see what I ended up with:

# create a shapely line from the coordinates, then simplify it
line = LineString(coordinates)
tolerance = 0.015
simplified_line = line.simplify(tolerance, preserve_topology=False)

print(len(line.coords), 'coordinate pairs in full data set')
print(len(simplified_line.coords), 'coordinate pairs in simplified data set')
print(round((1 - len(simplified_line.coords) / len(line.coords)) * 100, 1), 'percent compressed')
1759 coordinate pairs in full data set
178 coordinate pairs in simplified data set
89.9 percent compressed
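The tolerance is expressed in the same units as the coordinates – decimal degrees here – so it can be worth checking a few values to see how aggressively each one prunes the line. A quick, illustrative sweep (not part of the original workflow) might look like this:

# sketch: see how different tolerance values (in decimal degrees, the same
# units as the coordinates) affect the size of the simplified line
for tol in [0.001, 0.005, 0.015, 0.05]:
    n = len(line.simplify(tol, preserve_topology=False).coords)
    print('tolerance {}: {} coordinate pairs retained'.format(tol, n))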

The Douglas-Peucker algorithm reduced the size of the data set by about 90%, from 1,759 data points in the original full set to 178 points in the new reduced data set. That’s not bad – these stats are comparable to results from clustering. Now let’s save the simplified set of coordinates as a new pandas dataframe:

# the x and y sequences of the simplified line hold the lats and lons,
# because the LineString was constructed with lat as x and lon as y
lat, lon = simplified_line.coords.xy
si = pd.DataFrame({'lat': np.asarray(lat), 'lon': np.asarray(lon)})
si.tail()
           lat        lon
173  41.044556  28.983285
174  41.008992  28.968268
175  41.043487  28.985488
176  40.977637  28.823879
177  48.357110  11.791346

Find matching points in the original data set

The original full data set was reverse-geocoded and included city/country data and timestamps. The simplified set, however, only contains lat-long coordinate data. So, I’ll write a short routine that, for each row in the simplified set, finds the label of the row in the original full set with matching lat-long coordinates.

First I add a new column to the simplified set to contain the index of the matching row from the original full data set. Then for each row in the simplified set I loop through each row in the original full set, comparing tuples of coordinates. If the points match, I save this full set row’s index as the match in the new column I added to the simplified set. It’s simpler than it sounds:

start_time = time()

# for each row in the simplified set, find the index of the row in the
# original full set with identical lat-lon coordinates
si['df_index'] = None
for si_i, si_row in si.iterrows():
    si_coords = (si_row['lat'], si_row['lon'])
    for df_i, df_row in df.iterrows():
        if si_coords == (df_row['lat'], df_row['lon']):
            si.loc[si_i, 'df_index'] = df_i
            break

print('process took {} seconds'.format(round(time() - start_time, 2)))
process took 5.37 seconds
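As an aside, this nested loop scales poorly for larger data sets. A vectorized alternative – a sketch, not part of the original workflow – would be to merge the two dataframes on their coordinate columns, keeping only the first full-set match per coordinate pair to mimic the loop’s break:

# sketch of a vectorized alternative: merge on the coordinate columns instead
# of looping; like the loop above, this relies on exact floating-point equality
first_match = df.reset_index().drop_duplicates(subset=['lat', 'lon'], keep='first')
si['df_index'] = si.merge(first_match, on=['lat', 'lon'], how='left')['index'].values

This produces the same df_index column in a single pass.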

Now I select the rows from the original full data set whose indices appear in the df_index column of the simplified data set. Then I save the data set to CSV and examine its tail:

# retain the full-set rows whose labels appear in the simplified set's df_index column
rs = df.loc[si['df_index'].dropna().astype(int)]
rs.to_csv('summer-travel-gps-simplified.csv', index=False)
rs.tail()
            lat        lon              date      city  country
1730  41.044556  28.983285  07/08/2014 16:44  Istanbul   Turkey
1739  41.008992  28.968268  07/08/2014 20:03  Istanbul   Turkey
1745  41.043487  28.985488  07/08/2014 22:18  Istanbul   Turkey
1751  40.977637  28.823879  07/09/2014 09:03  Istanbul   Turkey
1758  48.357110  11.791346  07/09/2014 13:20    Munich  Germany

Looks good! I now have 178 rows of the original full data set, with city/country data and timestamps.

Comparing the full vs reduced data sets

Let’s plot the new simplified set of GPS coordinate data against the original full set:

plt.figure(figsize=(10, 6), dpi=100)
rs_scatter = plt.scatter(rs['lon'], rs['lat'], c='m', alpha=.4, s=150)
df_scatter = plt.scatter(df['lon'], df['lat'], c='k', alpha=.3, s=10)
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.title('simplified set of coordinate points vs original full set')
plt.legend((rs_scatter, df_scatter), ('simplified', 'original'), loc='upper left')
plt.show()

[Figure: simplified set of coordinate points vs the original full set]

The new reduced data set closely approximates the spatial distribution of the original full data set. You can compare it to the results I got by clustering: the Douglas-Peucker algorithm gave us a better reduced data set than the k-means clustering algorithm did, and one comparable to the result of the DBSCAN clustering algorithm.

Now you can visualize the data set in more interesting ways:

[Figure: most-isolated-projected]
