Categories
Data

R-tree Spatial Indexing with Python

r-tree spatial index with python geopandas: Thumbnail of Walnut Creek, California city boundary and street intersections inside and outside city limits Check out the journal article about OSMnx, which implements this technique.

A spatial index such as R-tree can drastically speed up GIS operations like intersections and joins. Spatial indices are key features of spatial databases like PostGIS, but they’re also available for DIY coding in Python. I’ll introduce how R-trees work and how to use them in Python and its geopandas library. All of my code is in this notebook in this urban data science GitHub repo.

An R-tree represents individual objects and their bounding boxes (the “R” is for “Rectangle”) as the lowest level of the spatial index. It then aggregates nearby objects and represents them with their aggregate bounding box in the next higher level of the index. At yet higher levels, the R-tree aggregates bounding boxes and represents them by their bounding box, iteratively, until everything is nested into one top-level bounding box.

To search, the R-tree takes a query box and, starting at the top level, sees which (if any) bounding boxes intersect it. It then expands each intersecting bounding box and sees which of the child bounding boxes inside it intersect the query box. This proceeds recursively until all intersecting boxes are searched down to the lowest level, and returns the matching objects from the lowest level.

But what if the two sets of features (say, a polygon and a set of points) that you’re intersecting have approximately the same bounding boxes? Because the polygon’s minimum bounding box is approximately the same as the set of points’ minimum bounding box, the R-tree intersects every nested bounding box and considers every point to be a possible match. Thus, using an R-tree spatial index makes the operation run no faster than it would without the spatial index!

Let’s look at how to use R-trees in Python and how to solve this limitation.

Simple example: R-tree spatial index

Python’s geopandas offers an implementation of R-tree to speed up spatial queries. Let’s say we have a polygon representing the city boundary of Walnut Creek, California:

Walnut Creek, California city boundary

And we also have a geopandas GeoDataFrame of lat-long points representing street intersections in the vicinity of this city. Some of these points are within the city’s borders, but others are outside of them:

Walnut Creek, California city boundary and street intersections

We can use geopandas’s R-tree spatial index to find which street intersections lie within the boundaries of the city:

spatial_index = gdf.sindex
possible_matches_index = list(spatial_index.intersection(polygon.bounds))
possible_matches = gdf.iloc[possible_matches_index]
precise_matches = possible_matches[possible_matches.intersects(polygon)]

First we have our GeoDataFrame (called gdf) create an R-tree spatial index for its geometry. Then we intersect this index with the bounding box of our polygon. This returns a set of possible matches. That is, there are no false negatives but there may be some false positives if an R-tree rectangle within the city border’s bounding box contains some points outside the city border. Finally, to identify the precise matches (those points exactly within our polygon), we intersect the possible matches with the polygon itself:

r-tree spatial index with python geopandas: Walnut Creek, California city boundary and street intersections inside and outside city limits

Here we can see all of the street intersections within the city of Walnut Creek in blue, and all those outside of it in red.

Advanced example: same bounding boxes

How can we use an R-tree spatial index to find the points within a polygon, if the points and the polygon have identical bounding boxes?

The R-tree index works great when the two sets of features you are intersecting or joining have different bounding boxes (such as the polygon and the points in the preceding example). However, an R-tree provides no speed up when the features’ bounding boxes are identical, because it identifies every point as a possible match because the bounding box of the polygon intersects every nested rectangle inside the index. This is a limitation of R-trees themselves.

Fortunately, we can work around this limitation in Python. Let’s say we have a polygon representing the borders of the city of Los Angeles, and a GeoDataFrame of approximately one million street intersections in and around LA:

Los Angeles, California city boundary and street intersections

We want to find which street intersections are within LA’s city boundary. Notice that our polygon and points have the same minimum bounding boxes, so an R-tree would offer no speed up because rectangle expansion would identify every point as a possible match.

But, we could sub-divide our polygon into smaller sub-polygons with smaller minimum bounding boxes, using shapely. I simply overlay my polygon with evenly-spaced quadrat lines, then split it into separate polygons along these lines:

Los Angeles, California city boundary subdivided by quadrats

Now we can just iterate through these small sub-polygons to quickly identify which points lie within each, using the R-tree spatial index (as demonstrated in the code snippet earlier):

r-tree spatial index with python geopandas: Los Angeles, California city boundary subdivided and with street intersections

This spatial intersection now can take full advantage of the R-tree index and reduces the computation time from 20+ minutes down to just a few seconds. Here we can see all of the street intersections within the city of Los Angeles in blue, and all those outside of it in red:

r-tree spatial index with python geopandas: Los Angeles, California city boundary and street intersections inside and outside city limits

Conclusion: R-trees and Python

Long story short, if you’re doing spatial intersections or joins with Python, you should use geopandas and its implementation of the R-tree spatial index. If you’re intersecting lots of points with a polygon – and the points and polygon have identical minimum bounding boxes – you can subdivide the polygon then intersect each sub-polygon with the points, using the index. This yields a drastic decrease in computation time.

All of my code is in this notebook in this urban data science GitHub repo.

21 replies on “R-tree Spatial Indexing with Python”

Hey Geoff,

While I don’t feel as poetic as Leonardo, I really dig your tutorials and your site. It is the best resource I’ve come across for trying to work with the nascent geopandas. I tried implementing an rtree spatial index–copying what you did here–but am not having any luck.

I am trying to get the intersection area of parcels with different spatial layers. I will use CPAD holdings as an example here. I then want to get the proportion that cpad overlaps each parcel (like 0.26). This works fine, but slowly, without an index. Here’s my code:

spatial_index = parcels.sindex
possible_matches_index = list(spatial_index.intersection(df_union.bounds))
possible_matches = parcels.iloc[possible_matches_index]
precise_matches = possible_matches[possible_matches.intersects(df_union)]

And then to get the pct overlap:
constraints[‘cpad_pct’] = precise_matches.area / parcels.area

df_union is the unary_union of cpad holdings, which I had to do to get the parcel.intersection(df) to work properly.

It breaks at: (spatial_index.intersection(df_union.bounds))
THE ERROR:
AttributeError: ‘NoneType’ object has no attribute ‘intersection’

Do you know why it might be getting returned as a nonetype? It takes 30 sec to build the sindex, so that seems to work, but when I print spatial_index, it returns ‘None’.

Any thoughts?

Sounds like the spatial index’s constructor is returning None… in which case it would be an issue with geopandas or rtree. Or your data if there’s something about it that prevents index construction. Might try with a small sample of your data to see if you can get the index to build. Or try with some other dummy data to make sure a trivial example is working on your machine. If all else fails, might want to open a new issue at the geopandas or rtree GitHub repos.

Great, thanks for the info. I will try it out with a few different test datasets to make sure its not a data issue. If its not, Ill open a new github issue. Thanks!

Hey Geoff,

Let’s say we want to map points to TAZs. We have one file with many taz polygons.

Is the fastest approach to simply loop overall the TAZs and then run the r-tree query for all points on each individual TAZ? The following example works, but I wonder if it can be faster.

val_gdf is the GeoDataFrame of points. taz_gdf is the GeoDataFrame with TAZ polygons.

sindex = val_gdf.sindex
val_gdf[‘taz_key’] = np.nan
for i, bounds in enumerate(taz_gdf.bounds.iterrows()):
possible_matches_index = list(sindex.intersection(bounds[1]))
possible_matches = val_gdf.iloc[possible_matches_index]
precise_matches_index = possible_matches.intersects(taz_gdf.iloc[i].geometry)
precise_matches_index = precise_matches_index[precise_matches_index].index
val_gdf.loc[precise_matches_index, ‘taz_key’] = taz_gdf.iloc[i][‘taz_key’]

Hey Andrew — fastest way to get the TAZ that each point falls within would be to perform a spatial join:

geopandas.sjoin(gdf_points, gdf_polygons, how='left', op='within')

The resulting GeoDataFrame will contain all of the points, with new columns for the TAZ it falls within. I have a demo of geopandas spatial joins here.

So funny story. I was finding that sjoin was taking roughly 10x longer than looping w/ r-tree index. It turns out I was using v0.1.0 of Geopandas. I had manually added it to my Anaconda install before there was a conda install available. That version has been overriding the updated one. Anyway, after some digging and deleting I am using v0.2.1 and the join method is by far the fastest option.

Curious about the referenced .ipynb at the end of the article. In it, the OMNX function `quadrat_cut_geometry` is called. If I look at the source code, this function has 2 required arguments (`geometry` and `quadrat_width`).

in the Notebook, only the first argument is provided:

“`
geometry_cut = ox.quadrat_cut_geometry(geometry)
“`

Am I missing something? Is the second value inferred somewhere? Is there a method that attempts to guess what the optimal subdivision of the geometry would be?

Sorry about that. I had recently updated the code locally to work with OSMnx v0.3, but I forgot to push it to GitHub. Just pushed it now, so the notebook should all be working. Thanks for the heads up.

Hey Geoff,

How would R-trees work for spatial data which is continuous? For example, land-use data?

Is the dataset in question raster or vector-based? It doesn’t necessarily make sense to spatially-index a raster. If it’s vector-based land use data, then you can just build an r-tree as per usual.

Thank you. It is vector-based. But, when we define the MBRs, is there is a limit on how small the spatial resolutions should be? For example, a census block, or a building?

What about intersecting lines with a polygon? I tried to use a gdf of lines where you use the point gdf but the result is just returning the lines that intersect with the polygon without intersecting their geometry (the parts of lines that fall out of the polygon are not clipped).

Hi Geoff, great tutorial. I’ve read a ton of this stuff and it’s really helped me out.

Stupid question; where did you find documentation about the .sindex property of a GeoDataFrame? I’m not seeing it on the official docs page, but maybe I’m not looking in the right place.

Thanks,
Quin

@Geoff Boeing : I really liked your approach and it really exposed me to spatial indexes. But I am facing an issue while adapting your code into my problem. I have a vector file (geojson) containing a lot of polygons and in this file , there are many small polygons which are inside bigger polygons, I have to remove these smaller polygons. One possible solution can be looping over each geometry and checking using (shapely contains method to check whether it lies inside bigger polygon or not, but this is very computationally costly). Therefore, can you please guide how can we utilize spatial indexing in this case ?

Leave a Comment