Describing Cities with Computer Vision

What does artificial intelligence see when it looks at your city? I recently created a Twitter bot in Python called CityDescriber that takes popular photos of cities from Reddit and describes them using Microsoft’s computer vision AI. The bot typically does pretty well with straightforward images of city skylines and street scenes:
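The post doesn't include the bot's source code, but the core loop is easy to picture: fetch a photo URL, send it to Microsoft's Computer Vision "describe" endpoint, and tweet the highest-confidence caption. Here is a minimal sketch of that middle step, assuming the Azure Cognitive Services REST API; the endpoint URL, region, and helper names here are my own illustrative assumptions, not the bot's actual code.

```python
# Sketch: caption an image URL with Microsoft's Computer Vision API.
# The endpoint/region and function names are assumptions for illustration.
import json
import urllib.request

ENDPOINT = "https://westus.api.cognitive.microsoft.com/vision/v1.0/describe"

def build_request(image_url: str, api_key: str) -> urllib.request.Request:
    """Build the POST request asking the API to describe an image by URL."""
    body = json.dumps({"url": image_url}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": api_key,  # Azure subscription key
        },
    )

def best_caption(api_response: dict) -> str:
    """Pick the highest-confidence caption from the API's JSON response."""
    captions = api_response["description"]["captions"]
    return max(captions, key=lambda c: c["confidence"])["text"]
```

A bot would then pass `best_caption`'s result to the Twitter API along with the photo. The JSON shape parsed by `best_caption` (a `description.captions` list of `{"text", "confidence"}` objects) matches the v1.0 describe response, but the caller should handle empty caption lists and HTTP errors in practice.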

Some are even kind of wryly poetic, such as this description of Los Angeles:

Or this description of San Francisco:

But the AI sometimes struggles with other photos. And when it’s wrong, it’s often hilariously far-off:

There has been much discussion recently (example) about the impact that computer vision — and machine learning more generally — could have on urban studies and urban planning… for better or for worse. On one hand, we can develop and train better models for more accurate insights into urban patterns and urban change. Modeling has always been a useful tool in the planning toolkit, and new data science methods might be able to make planners more efficient and accurate.

On the other hand, planners should be cautious and critical of claims about using AI to “solve” cities. Machine learning models are only as good as the data they are trained on, and biases in training data and among researchers can produce biased estimates and predictions. Despite some popular accounts, AI and big data do not spell the end of theory.

The CityDescriber bot illustrates one aspect of this in a light-hearted way. I don’t mean to broadly mock Microsoft’s algorithm: in fact, it describes most of these photos in a literal, accurate, and mundane way, which is a substantial accomplishment. But what about the descriptions that are bafflingly incorrect? The AI saw something that triggered a completely wrong prediction, even though a child could recognize the photo’s contents in an instant. In particular, it seems not to have been well trained on aerial shots looking down on cities.

As planners and researchers, we need to approach artificial intelligence and machine learning with both enthusiasm and skepticism. What exactly are the models telling us? Why? What are their biases? How do they reinforce entrenched biases built into their training data? What do they “see”… and what do they not see? Researchers may strive to build objective models, but models usually reflect our own experiences and points of view. As planners, we need to be cognizant of this as we increasingly use machine learning over the next decade to better understand cities and their citizens.

