# Reducing Spatial Data Set Size with Douglas-Peucker

In a previous post I discussed how to reduce the size of a spatial data set by clustering. Too many data points in a visualization can overwhelm the user and bog down on-the-fly client-side map rendering (for example, with a JavaScript tool like Leaflet). So, I used the DBSCAN clustering algorithm to reduce my data set from 1,759 rows to 158 spatially representative points. This series of posts discusses this data set in depth.

## The Douglas-Peucker algorithm

Today I’ll show another method for reducing the size of a spatial data set in Python using the shapely package’s implementation of the Douglas-Peucker algorithm. Douglas-Peucker reduces the number of points in a curve composed of line segments by recursively dividing the line. The final result is a subset of points from the original curve.
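To make the recursion concrete, here is a minimal pure-Python sketch of the idea. Note that shapely’s simplify method delegates to the GEOS library under the hood, so this illustrative douglas_peucker function is not the code that actually runs below:

```
import math

def point_line_distance(point, start, end):
    # perpendicular distance from point to the line through start and end
    if start == end:
        return math.hypot(point[0] - start[0], point[1] - start[1])
    num = abs((end[0] - start[0]) * (start[1] - point[1]) -
              (start[0] - point[0]) * (end[1] - start[1]))
    return num / math.hypot(end[0] - start[0], end[1] - start[1])

def douglas_peucker(points, tolerance):
    # find the point farthest from the chord joining the two endpoints
    dmax, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = point_line_distance(points[i], points[0], points[-1])
        if d > dmax:
            dmax, index = d, i
    if dmax > tolerance:
        # the farthest point must be kept: recurse on both halves
        left = douglas_peucker(points[:index + 1], tolerance)
        right = douglas_peucker(points[index:], tolerance)
        return left[:-1] + right  # drop the duplicated split point
    # everything between the endpoints lies within tolerance of the chord
    return [points[0], points[-1]]
```

Every point the algorithm keeps is one of the original input points, which is exactly the property I need here.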

All of my code is available in this GitHub repo, particularly this IPython notebook. Let’s begin. First I import the necessary modules and load the full data set:

```
import pandas as pd, numpy as np, matplotlib.pyplot as plt
from shapely.geometry import LineString
from time import time

# load the full data set (file name assumed here; adjust to your data)
df = pd.read_csv('summer-travel-gps-full.csv')
coordinates = df.as_matrix(columns=['lat', 'lon'])
```

Next, I create a shapely LineString from the coordinate data. This is the curve composed of line segments between successive coordinate points. I will then simplify the geometry of this curve using shapely’s simplify method.

If preserve_topology is set to False, the method uses the Douglas-Peucker algorithm, which is fine: I don’t need to preserve topology here, I just need a reduced set of points. I also specify a tolerance value: every point in the simplified geometry will be within the tolerance distance of the original geometry. After running the algorithm, I’ll print a couple of descriptive stats to see what I ended up with:

```
line = LineString(coordinates)
tolerance = 0.015
simplified_line = line.simplify(tolerance, preserve_topology=False)

print len(line.coords), 'coordinate pairs in full data set'
print len(simplified_line.coords), 'coordinate pairs in simplified data set'
print round(((1 - float(len(simplified_line.coords)) / float(len(line.coords))) * 100), 1), 'percent compressed'
```
```
1759 coordinate pairs in full data set
178 coordinate pairs in simplified data set
89.9 percent compressed
```
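The tolerance value drives this trade-off: the larger the tolerance, the more aggressively the curve is simplified and the fewer points survive. A quick sweep (these particular tolerance values are just illustrative) shows the effect:

```
# how many points survive at each tolerance? larger tolerance = fewer points
for tol in [0.001, 0.005, 0.01, 0.015, 0.05]:
    count = len(line.simplify(tol, preserve_topology=False).coords)
    print 'tolerance', tol, 'retains', count, 'points'
```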

The Douglas-Peucker algorithm reduced the size of the data set by about 90%, from 1,759 data points in the original full set to 178 points in the new reduced data set. That’s not bad: these stats are comparable to the results from clustering. Now let’s save the simplified set of coordinates as a new pandas dataframe:

```
# the LineString was built from (lat, lon) pairs, so x holds lat and y holds lon
lat, lon = simplified_line.coords.xy
si = pd.DataFrame({'lat': pd.Series(lat), 'lon': pd.Series(lon)})
si.tail()
```
```
           lat        lon
173  41.044556  28.983285
174  41.008992  28.968268
175  41.043487  28.985488
176  40.977637  28.823879
177  48.357110  11.791346
```

## Find matching points in the original data set

The original full data set was reverse-geocoded and included city/country data and timestamps. The simplified set, however, only contains lat-long coordinate data. So, I’ll write a short routine that, for each row in the simplified set, finds the label of the row in the original full set that contains its matching lat-long coordinates.

First I add a new column to the simplified set to contain the index of the matching row from the original full data set. Then for each row in the simplified set I loop through each row in the original full set, comparing tuples of coordinates. If the points match, I save this full set row’s index as the match in the new column I added to the simplified set. It’s simpler than it sounds:

```
start_time = time()
si['df_index'] = None

# for each row in the simplified set, scan the full set for the first row
# with identical coordinates, then save that row's label in the new column
for si_i, si_row in si.iterrows():
    si_coords = (si_row['lat'], si_row['lon'])
    for df_i, df_row in df.iterrows():
        if si_coords == (df_row['lat'], df_row['lon']):
            si.loc[si_i, 'df_index'] = df_i
            break

print 'process took %s seconds' % round(time() - start_time, 2)
```
```
process took 5.37 seconds
```
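As an aside, this nested loop does an O(n × m) scan. That’s fast enough for 178 × 1,759 comparisons, but for bigger data sets the same first-match lookup could be vectorized with a pandas merge, sketched here under the same assumption the loop makes, namely that matching coordinates are exactly equal as floats:

```
# keep only the first full-set row per coordinate pair (mirrors the loop's break),
# then left-merge its row label into the simplified set
first_matches = df.reset_index().drop_duplicates(subset=['lat', 'lon'])
si = si.merge(first_matches[['index', 'lat', 'lon']], on=['lat', 'lon'], how='left')
si = si.rename(columns={'index': 'df_index'})
```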

Now I select the rows from the original full data set whose indices appear in the df_index column of the simplified data set. Then I save the data set to CSV and examine its tail:

```
rs = df.loc[si['df_index'].dropna()]
rs.to_csv('summer-travel-gps-simplified.csv', index=False)
rs.tail()
```
```
            lat        lon              date      city  country
1730  41.044556  28.983285  07/08/2014 16:44  Istanbul   Turkey
1739  41.008992  28.968268  07/08/2014 20:03  Istanbul   Turkey
1745  41.043487  28.985488  07/08/2014 22:18  Istanbul   Turkey
1751  40.977637  28.823879  07/09/2014 09:03  Istanbul   Turkey
1758  48.357110  11.791346  07/09/2014 13:20    Munich  Germany
```

Looks good! I now have 178 rows of the original full data set, with city/country data and timestamps.

## Comparing the full vs reduced data sets

Let’s plot the new simplified set of GPS coordinate data against the original full set:

```
plt.figure(figsize=(10, 6), dpi=100)
rs_scatter = plt.scatter(rs['lon'], rs['lat'], c='m', alpha=.4, s=150)
df_scatter = plt.scatter(df['lon'], df['lat'], c='k', alpha=.3, s=10)
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.title('simplified set of coordinate points vs original full set')
plt.legend((rs_scatter, df_scatter), ('simplified', 'original'), loc='upper left')
plt.show()
```

The new reduced data set closely approximates the spatial distribution of the original full data set. Compare it to the results I got from clustering: the Douglas-Peucker algorithm gave us a better reduced data set than the k-means clustering algorithm did, and one comparable to the results of the DBSCAN clustering algorithm.

Now you can visualize the data set in more interesting ways.