Check out the journal article about OSMnx, which implements this technique.
A spatial index such as R-tree can drastically speed up GIS operations like intersections and joins. Spatial indices are key features of spatial databases like PostGIS, but they’re also available for DIY coding in Python. I’ll introduce how R-trees work and how to use them in Python and its geopandas library. All of my code is in this notebook in this urban data science GitHub repo.
An R-tree represents individual objects and their bounding boxes (the “R” is for “Rectangle”) as the lowest level of the spatial index. It then aggregates nearby objects and represents them with their aggregate bounding box in the next higher level of the index. At yet higher levels, the R-tree aggregates bounding boxes and represents them by their bounding box, iteratively, until everything is nested into one top-level bounding box.
To search, the R-tree takes a query box and, starting at the top level, sees which (if any) bounding boxes intersect it. It then expands each intersecting bounding box and sees which of the child bounding boxes inside it intersect the query box. This proceeds recursively until all intersecting boxes are searched down to the lowest level, and returns the matching objects from the lowest level.
But what if the two sets of features (say, a polygon and a set of points) that you’re intersecting have approximately the same bounding boxes? Because the polygon’s minimum bounding box is approximately the same as the set of points’ minimum bounding box, the R-tree intersects every nested bounding box and considers every point to be a possible match. Thus, using an R-tree spatial index makes the operation run no faster than it would without the spatial index!
Let’s look at how to use R-trees in Python and how to solve this limitation.
Simple example: R-tree spatial index
Python’s geopandas offers an implementation of R-tree to speed up spatial queries. Let’s say we have a polygon representing the city boundary of Walnut Creek, California:
And we also have a geopandas GeoDataFrame of lat-long points representing street intersections in the vicinity of this city. Some of these points are within the city’s borders, but others are outside of them:
We can use geopandas’s R-tree spatial index to find which street intersections lie within the boundaries of the city:
spatial_index = gdf.sindex possible_matches_index = list(spatial_index.intersection(polygon.bounds)) possible_matches = gdf.iloc[possible_matches_index] precise_matches = possible_matches[possible_matches.intersects(polygon)]
First we have our GeoDataFrame (called gdf) create an R-tree spatial index for its geometry. Then we intersect this index with the bounding box of our polygon. This returns a set of possible matches. That is, there are no false negatives but there may be some false positives if an R-tree rectangle within the city border’s bounding box contains some points outside the city border. Finally, to identify the precise matches (those points exactly within our polygon), we intersect the possible matches with the polygon itself:
Here we can see all of the street intersections within the city of Walnut Creek in blue, and all those outside of it in red.
Advanced example: same bounding boxes
How can we use an R-tree spatial index to find the points within a polygon, if the points and the polygon have identical bounding boxes?
The R-tree index works great when the two sets of features you are intersecting or joining have different bounding boxes (such as the polygon and the points in the preceding example). However, an R-tree provides no speed up when the features’ bounding boxes are identical, because it identifies every point as a possible match because the bounding box of the polygon intersects every nested rectangle inside the index. This is a limitation of R-trees themselves.
Fortunately, we can work around this limitation in Python. Let’s say we have a polygon representing the borders of the city of Los Angeles, and a GeoDataFrame of approximately one million street intersections in and around LA:
We want to find which street intersections are within LA’s city boundary. Notice that our polygon and points have the same minimum bounding boxes, so an R-tree would offer no speed up because rectangle expansion would identify every point as a possible match.
But, we could sub-divide our polygon into smaller sub-polygons with smaller minimum bounding boxes, using shapely. I simply overlay my polygon with evenly-spaced quadrat lines, then split it into separate polygons along these lines:
Now we can just iterate through these small sub-polygons to quickly identify which points lie within each, using the R-tree spatial index (as demonstrated in the code snippet earlier):
This spatial intersection now can take full advantage of the R-tree index and reduces the computation time from 20+ minutes down to just a few seconds. Here we can see all of the street intersections within the city of Los Angeles in blue, and all those outside of it in red:
Conclusion: R-trees and Python
Long story short, if you’re doing spatial intersections or joins with Python, you should use geopandas and its implementation of the R-tree spatial index. If you’re intersecting lots of points with a polygon – and the points and polygon have identical minimum bounding boxes – you can subdivide the polygon then intersect each sub-polygon with the points, using the index. This yields a drastic decrease in computation time.
All of my code is in this notebook in this urban data science GitHub repo.
21 replies on “R-tree Spatial Indexing with Python”
veramente bello
Hey Geoff,
While I don’t feel as poetic as Leonardo, I really dig your tutorials and your site. It is the best resource I’ve come across for trying to work with the nascent geopandas. I tried implementing an rtree spatial index–copying what you did here–but am not having any luck.
I am trying to get the intersection area of parcels with different spatial layers. I will use CPAD holdings as an example here. I then want to get the proportion that cpad overlaps each parcel (like 0.26). This works fine, but slowly, without an index. Here’s my code:
spatial_index = parcels.sindex
possible_matches_index = list(spatial_index.intersection(df_union.bounds))
possible_matches = parcels.iloc[possible_matches_index]
precise_matches = possible_matches[possible_matches.intersects(df_union)]
And then to get the pct overlap:
constraints[‘cpad_pct’] = precise_matches.area / parcels.area
df_union is the unary_union of cpad holdings, which I had to do to get the parcel.intersection(df) to work properly.
It breaks at: (spatial_index.intersection(df_union.bounds))
THE ERROR:
AttributeError: ‘NoneType’ object has no attribute ‘intersection’
Do you know why it might be getting returned as a nonetype? It takes 30 sec to build the sindex, so that seems to work, but when I print spatial_index, it returns ‘None’.
Any thoughts?
Sounds like the spatial index’s constructor is returning None… in which case it would be an issue with geopandas or rtree. Or your data if there’s something about it that prevents index construction. Might try with a small sample of your data to see if you can get the index to build. Or try with some other dummy data to make sure a trivial example is working on your machine. If all else fails, might want to open a new issue at the geopandas or rtree GitHub repos.
Great, thanks for the info. I will try it out with a few different test datasets to make sure its not a data issue. If its not, Ill open a new github issue. Thanks!
Check out this issue on OSMnx that references the problem you’re experiencing – it might be an rtree bug that was recently fixed.
Hey Geoff,
Let’s say we want to map points to TAZs. We have one file with many taz polygons.
Is the fastest approach to simply loop overall the TAZs and then run the r-tree query for all points on each individual TAZ? The following example works, but I wonder if it can be faster.
val_gdf is the GeoDataFrame of points. taz_gdf is the GeoDataFrame with TAZ polygons.
sindex = val_gdf.sindex
val_gdf[‘taz_key’] = np.nan
for i, bounds in enumerate(taz_gdf.bounds.iterrows()):
possible_matches_index = list(sindex.intersection(bounds[1]))
possible_matches = val_gdf.iloc[possible_matches_index]
precise_matches_index = possible_matches.intersects(taz_gdf.iloc[i].geometry)
precise_matches_index = precise_matches_index[precise_matches_index].index
val_gdf.loc[precise_matches_index, ‘taz_key’] = taz_gdf.iloc[i][‘taz_key’]
Hey Andrew — fastest way to get the TAZ that each point falls within would be to perform a spatial join:
geopandas.sjoin(gdf_points, gdf_polygons, how='left', op='within')
The resulting GeoDataFrame will contain all of the points, with new columns for the TAZ it falls within. I have a demo of geopandas spatial joins here.
So funny story. I was finding that sjoin was taking roughly 10x longer than looping w/ r-tree index. It turns out I was using v0.1.0 of Geopandas. I had manually added it to my Anaconda install before there was a conda install available. That version has been overriding the updated one. Anyway, after some digging and deleting I am using v0.2.1 and the join method is by far the fastest option.
Curious about the referenced .ipynb at the end of the article. In it, the OMNX function `quadrat_cut_geometry` is called. If I look at the source code, this function has 2 required arguments (`geometry` and `quadrat_width`).
in the Notebook, only the first argument is provided:
“`
geometry_cut = ox.quadrat_cut_geometry(geometry)
“`
Am I missing something? Is the second value inferred somewhere? Is there a method that attempts to guess what the optimal subdivision of the geometry would be?
Sorry about that. I had recently updated the code locally to work with OSMnx v0.3, but I forgot to push it to GitHub. Just pushed it now, so the notebook should all be working. Thanks for the heads up.
Hi Jeff, Can you provide an implementation of KNN in this framework? Thanks!
Hey Geoff,
How would R-trees work for spatial data which is continuous? For example, land-use data?
Is the dataset in question raster or vector-based? It doesn’t necessarily make sense to spatially-index a raster. If it’s vector-based land use data, then you can just build an r-tree as per usual.
Thank you. It is vector-based. But, when we define the MBRs, is there is a limit on how small the spatial resolutions should be? For example, a census block, or a building?
[…] now, everybody has read (multiple times) Geoff Boeing’s excellent post on the importance of using spatial indices when performing spatial operations and the impact that […]
What about intersecting lines with a polygon? I tried to use a gdf of lines where you use the point gdf but the result is just returning the lines that intersect with the polygon without intersecting their geometry (the parts of lines that fall out of the polygon are not clipped).
[…] 该代码参考网址。同时文章中包含几何面较复杂的时候一些简单的处理方法。 […]
Hi Geoff, great tutorial. I’ve read a ton of this stuff and it’s really helped me out.
Stupid question; where did you find documentation about the .sindex property of a GeoDataFrame? I’m not seeing it on the official docs page, but maybe I’m not looking in the right place.
Thanks,
Quin
[…] With the spatial index, the process takes 2 minutes 19 seconds which is a little improvement in speed compared to non-spatial index processing. Spatial Indexes do not always help, especially where points data and polygons have the same extent. You can read a detailed explanation on when spatial indexes work and when they don’t from this blog post. […]
[…] est extrêmement rapide car il s’agit seulement de deux comparaisons). Voir l’excellent blog de Geoff Beoing sur le sujet. […]
@Geoff Boeing : I really liked your approach and it really exposed me to spatial indexes. But I am facing an issue while adapting your code into my problem. I have a vector file (geojson) containing a lot of polygons and in this file , there are many small polygons which are inside bigger polygons, I have to remove these smaller polygons. One possible solution can be looping over each geometry and checking using (shapely contains method to check whether it lies inside bigger polygon or not, but this is very computationally costly). Therefore, can you please guide how can we utilize spatial indexing in this case ?