A spatial index such as R-tree can drastically speed up GIS operations like intersections and joins. Spatial indices are key features of spatial databases like PostGIS, but they’re also available for DIY coding in Python. I’ll introduce how R-trees work and how to use them in Python and its geopandas library. All of my code is in this notebook in this urban data science GitHub repo.

An R-tree represents individual objects and their bounding boxes (the “R” is for “Rectangle”) as the lowest level of the spatial index. It then aggregates nearby objects and represents them with their aggregate bounding box in the next higher level of the index. At yet higher levels, the R-tree aggregates bounding boxes and represents them by *their* bounding box, iteratively, until everything is nested into one top-level bounding box.

To search, the R-tree takes a query box and, starting at the top level, sees which (if any) bounding boxes intersect it. It then expands each intersecting bounding box and sees which of the child bounding boxes inside it intersect the query box. This proceeds recursively until all intersecting boxes are searched down to the lowest level, and returns the matching objects from the lowest level.

But what if the two sets of features (say, a polygon and a set of points) that you’re intersecting have approximately the same bounding boxes? Because the polygon’s minimum bounding box is approximately the same as the set of points’ minimum bounding box, the R-tree intersects *every* nested bounding box and considers *every* point to be a possible match. Thus, using an R-tree spatial index makes the operation run no faster than it would without the spatial index!

Let’s look at how to use R-trees in Python and how to solve this limitation.

## Simple example: R-tree spatial index

Python’s geopandas offers an implementation of R-tree to speed up spatial queries. Let’s say we have a polygon representing the city boundary of Walnut Creek, California:

And we also have a geopandas GeoDataFrame of lat-long points representing street intersections in the vicinity of this city. Some of these points are within the city’s borders, but others are outside of them:

We can use geopandas’s R-tree spatial index to find which street intersections lie within the boundaries of the city:

spatial_index = gdf.sindex possible_matches_index = list(spatial_index.intersection(polygon.bounds)) possible_matches = gdf.iloc[possible_matches_index] precise_matches = possible_matches[possible_matches.intersects(polygon)]

First we have our GeoDataFrame (called *gdf*) create an R-tree spatial index for its geometry. Then we intersect this index with the bounding box of our polygon. This returns a set of possible matches. That is, there are no false negatives but there *may* be some false positives if an R-tree rectangle within the city border’s bounding box contains some points outside the city border. Finally, to identify the *precise* matches (those points exactly within our polygon), we intersect the *possible* matches with the polygon itself:

Here we can see all of the street intersections within the city of Walnut Creek in blue, and all those outside of it in red.

## Advanced example: same bounding boxes

How can we use an R-tree spatial index to find the points within a polygon, if the points and the polygon have identical bounding boxes?

The R-tree index works great when the two sets of features you are intersecting or joining have different bounding boxes (such as the polygon and the points in the preceding example). However, an R-tree provides no speed up when the features’ bounding boxes are identical, because it identifies *every* point as a possible match because the bounding box of the polygon intersects *every* nested rectangle inside the index. This is a limitation of R-trees themselves.

Fortunately, we can work around this limitation in Python. Let’s say we have a polygon representing the borders of the city of Los Angeles, and a GeoDataFrame of approximately one million street intersections in and around LA:

We want to find which street intersections are within LA’s city boundary. Notice that our polygon and points have the same minimum bounding boxes, so an R-tree would offer no speed up because rectangle expansion would identify every point as a possible match.

But, we could sub-divide our polygon into smaller sub-polygons with smaller minimum bounding boxes, using shapely. I simply overlay my polygon with evenly-spaced quadrat lines, then split it into separate polygons along these lines:

Now we can just iterate through these small sub-polygons to quickly identify which points lie within each, using the R-tree spatial index (as demonstrated in the code snippet earlier):

This spatial intersection now can take full advantage of the R-tree index and reduces the computation time from 20+ minutes down to just a few seconds. Here we can see all of the street intersections within the city of Los Angeles in blue, and all those outside of it in red:

## Conclusion: R-trees and Python

Long story short, if you’re doing spatial intersections or joins with Python, you should use geopandas and its implementation of the R-tree spatial index. If you’re intersecting lots of points with a polygon – and the points and polygon have identical minimum bounding boxes – you can subdivide the polygon then intersect each sub-polygon with the points, using the index. This yields a drastic decrease in computation time.

All of my code is in this notebook in this urban data science GitHub repo.

veramente bello

Hey Geoff,

While I don’t feel as poetic as Leonardo, I really dig your tutorials and your site. It is the best resource I’ve come across for trying to work with the nascent geopandas. I tried implementing an rtree spatial index–copying what you did here–but am not having any luck.

I am trying to get the intersection area of parcels with different spatial layers. I will use CPAD holdings as an example here. I then want to get the proportion that cpad overlaps each parcel (like 0.26). This works fine, but slowly, without an index. Here’s my code:

spatial_index = parcels.sindex

possible_matches_index = list(spatial_index.intersection(df_union.bounds))

possible_matches = parcels.iloc[possible_matches_index]

precise_matches = possible_matches[possible_matches.intersects(df_union)]

And then to get the pct overlap:

constraints[‘cpad_pct’] = precise_matches.area / parcels.area

df_union is the unary_union of cpad holdings, which I had to do to get the parcel.intersection(df) to work properly.

It breaks at: (spatial_index.intersection(df_union.bounds))

THE ERROR:

AttributeError: ‘NoneType’ object has no attribute ‘intersection’

Do you know why it might be getting returned as a nonetype? It takes 30 sec to build the sindex, so that seems to work, but when I print spatial_index, it returns ‘None’.

Any thoughts?

Sounds like the spatial index’s constructor is returning None… in which case it would be an issue with geopandas or rtree. Or your data if there’s something about it that prevents index construction. Might try with a small sample of your data to see if you can get the index to build. Or try with some other dummy data to make sure a trivial example is working on your machine. If all else fails, might want to open a new issue at the geopandas or rtree GitHub repos.

Great, thanks for the info. I will try it out with a few different test datasets to make sure its not a data issue. If its not, Ill open a new github issue. Thanks!

Check out this issue on OSMnx that references the problem you’re experiencing – it might be an rtree bug that was recently fixed.

Hey Geoff,

Let’s say we want to map points to TAZs. We have one file with many taz polygons.

Is the fastest approach to simply loop overall the TAZs and then run the r-tree query for all points on each individual TAZ? The following example works, but I wonder if it can be faster.

val_gdf is the GeoDataFrame of points. taz_gdf is the GeoDataFrame with TAZ polygons.

sindex = val_gdf.sindex

val_gdf[‘taz_key’] = np.nan

for i, bounds in enumerate(taz_gdf.bounds.iterrows()):

possible_matches_index = list(sindex.intersection(bounds[1]))

possible_matches = val_gdf.iloc[possible_matches_index]

precise_matches_index = possible_matches.intersects(taz_gdf.iloc[i].geometry)

precise_matches_index = precise_matches_index[precise_matches_index].index

val_gdf.loc[precise_matches_index, ‘taz_key’] = taz_gdf.iloc[i][‘taz_key’]

Hey Andrew — fastest way to get the TAZ that each point falls within would be to perform a spatial join:

`geopandas.sjoin(gdf_points, gdf_polygons, how='left', op='within')`

The resulting GeoDataFrame will contain all of the points, with new columns for the TAZ it falls within. I have a demo of geopandas spatial joins here.

So funny story. I was finding that sjoin was taking roughly 10x longer than looping w/ r-tree index. It turns out I was using v0.1.0 of Geopandas. I had manually added it to my Anaconda install before there was a conda install available. That version has been overriding the updated one. Anyway, after some digging and deleting I am using v0.2.1 and the join method is by far the fastest option.