In a previous post I discussed how to reduce the size of a spatial data set by clustering. Too many data points in a visualization can overwhelm the user and bog down on-the-fly client-side map rendering (for example, with a JavaScript tool like Leaflet). So, I used the DBSCAN clustering algorithm to reduce my data set from 1,759 rows to 158 spatially representative points. This series of posts discusses this data set in depth.
The Douglas-Peucker algorithm
Today I’ll show another method for reducing the size of a spatial data set in Python using the shapely package’s implementation of the Douglas-Peucker algorithm. Douglas-Peucker reduces the number of points in a curve composed of line segments by recursively dividing the line. The final result is a subset of points from the original curve.
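To make the recursion concrete, here is a minimal pure-Python sketch of the idea (my own illustration, not shapely's implementation; the function and helper names are hypothetical):

import math

def point_to_line_distance(point, start, end):
    # perpendicular distance from point to the line through start and end
    (x, y), (x1, y1), (x2, y2) = point, start, end
    if (x1, y1) == (x2, y2):
        return math.hypot(x - x1, y - y1)
    numerator = abs((y2 - y1) * x - (x2 - x1) * y + x2 * y1 - y2 * x1)
    denominator = math.hypot(y2 - y1, x2 - x1)
    return numerator / denominator

def douglas_peucker(points, tolerance):
    # find the point farthest from the chord joining the first and last points
    dmax, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = point_to_line_distance(points[i], points[0], points[-1])
        if d > dmax:
            dmax, index = d, i
    # if it deviates by more than the tolerance, keep it and recurse on both halves;
    # otherwise the two endpoints alone represent this stretch of the curve
    if dmax > tolerance:
        left = douglas_peucker(points[:index + 1], tolerance)
        right = douglas_peucker(points[index:], tolerance)
        return left[:-1] + right
    return [points[0], points[-1]]

Shapely's compiled implementation performs this same divide-and-keep step far faster, which is why I use it below.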
All of my code is available in this GitHub repo, particularly this IPython notebook. Let’s begin. First I import the necessary modules and load the full data set:
import pandas as pd, numpy as np, matplotlib.pyplot as plt
from shapely.geometry import LineString
from time import time

df = pd.read_csv('summer-travel-gps-full.csv')
coordinates = df.as_matrix(columns=['lat', 'lon'])
Next, I create a shapely LineString from the coordinate data. This curve consists of the line segments connecting the coordinate points in order. I will then simplify its geometry using shapely’s simplify method.
If preserve_topology is set to False, the method uses the Douglas-Peucker algorithm, which is fine: I don’t need to preserve topology because I just need a reduced set of points. I also specify a tolerance value: every point in the simplified geometry will be within the tolerance distance of the original geometry. After running the algorithm, I’ll print out a couple of descriptive stats to see what I ended up with:
line = LineString(coordinates)
tolerance = 0.015
simplified_line = line.simplify(tolerance, preserve_topology=False)

print len(line.coords), 'coordinate pairs in full data set'
print len(simplified_line.coords), 'coordinate pairs in simplified data set'
print round(((1 - float(len(simplified_line.coords)) / float(len(line.coords))) * 100), 1), 'percent compressed'
1759 coordinate pairs in full data set
178 coordinate pairs in simplified data set
89.9 percent compressed
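How much the data set shrinks depends directly on the tolerance: a larger tolerance discards more points. As a quick exploratory sketch (my addition, reusing the line object from the code above rather than anything in the original workflow), you could compare a few tolerance values before settling on one:

# re-simplify the full curve at several tolerances and count the surviving points
for tol in [0.001, 0.005, 0.01, 0.015, 0.05, 0.1]:
    print tol, len(line.simplify(tol, preserve_topology=False).coords), 'points retained'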
The Douglas-Peucker algorithm reduced the size of the data set by about 90%, from 1,759 data points in the original full set to 178 points in the new reduced data set. That’s not bad – these stats are comparable to results from clustering. Now let’s save the simplified set of coordinates as a new pandas dataframe:
# the LineString was built as (lat, lon) pairs, so xy[0] holds latitudes and xy[1] holds longitudes
lon = pd.Series(pd.Series(simplified_line.coords.xy)[1])
lat = pd.Series(pd.Series(simplified_line.coords.xy)[0])
si = pd.DataFrame({'lon':lon, 'lat':lat})
si.tail()
            lat        lon
173   41.044556  28.983285
174   41.008992  28.968268
175   41.043487  28.985488
176   40.977637  28.823879
177   48.357110  11.791346
Find matching points in the original data set
The original full data set was reverse-geocoded and includes city/country data and timestamps. The simplified set, however, contains only lat-long coordinate data. So, I’ll write a short routine that, for each row in the simplified set, finds the label of the row in the original full set that contains the matching lat-long coordinates.
First I add a new column to the simplified set to hold the index of the matching row from the original full data set. Then, for each row in the simplified set, I loop through each row in the original full set, comparing tuples of coordinates. If the points match, I save the full set row’s index in the new column of the simplified set. It’s simpler than it sounds:
start_time = time()

# add a new column to hold the index of the matching row from the original full set
si['df_index'] = None

# for each row in the simplified set, find the first row in the full set
# with identical lat-long coordinates, then stop searching
for si_i, si_row in si.iterrows():
    si_coords = (si_row['lat'], si_row['lon'])
    for df_i, df_row in df.iterrows():
        if si_coords == (df_row['lat'], df_row['lon']):
            si.loc[si_i, 'df_index'] = df_i
            break

print 'process took %s seconds' % round(time() - start_time, 2)
process took 5.37 seconds
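As an aside, the nested iterrows loops are easy to read but scale quadratically; a vectorized pandas merge on the coordinate columns would likely be much faster on larger data sets. Here is a rough sketch of that alternative (my own, not part of the original workflow; like the loop above, it relies on exact floating-point equality of the coordinates):

# keep the full set's row labels as a column, then drop duplicate coordinates
# so each simplified point matches only its first occurrence (like the loop's break)
df_indexed = df.reset_index().drop_duplicates(subset=['lat', 'lon'])
merged = si.merge(df_indexed[['index', 'lat', 'lon']], on=['lat', 'lon'], how='left')
si['df_index'] = merged['index'].values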
Now I select the rows from the original full data set whose indices appear in the df_index column of the simplified data set. Then I save the data set to CSV and examine its tail:
rs = df.ix[si['df_index'].dropna()]
rs.to_csv('summer-travel-gps-simplified.csv', index=False)
rs.tail()
            lat        lon              date      city  country
1730  41.044556  28.983285  07/08/2014 16:44  Istanbul   Turkey
1739  41.008992  28.968268  07/08/2014 20:03  Istanbul   Turkey
1745  41.043487  28.985488  07/08/2014 22:18  Istanbul   Turkey
1751  40.977637  28.823879  07/09/2014 09:03  Istanbul   Turkey
1758  48.357110  11.791346  07/09/2014 13:20    Munich  Germany
Looks good! I now have 178 rows of the original full data set, with city/country data and timestamps.
Comparing the full vs reduced data sets
Let’s plot the new simplified set of GPS coordinate data against the original full set:
plt.figure(figsize=(10, 6), dpi=100)

rs_scatter = plt.scatter(rs['lon'], rs['lat'], c='m', alpha=.4, s=150)
df_scatter = plt.scatter(df['lon'], df['lat'], c='k', alpha=.3, s=10)

plt.xlabel('longitude')
plt.ylabel('latitude')
plt.title('simplified set of coordinate points vs original full set')
plt.legend((rs_scatter, df_scatter), ('simplified', 'original'), loc='upper left')
plt.show()
The new reduced data set closely approximates the spatial distribution of the original full data set. You can compare it to the results I got by clustering: the Douglas-Peucker algorithm gave us a better reduced data set than the k-means clustering algorithm did, and one comparable to the result of the DBSCAN clustering algorithm.
Now you can visualize the data set in more interesting ways: