TLDR – Designing a Geospatial Search Service
Geospatial search (aka proximity search) powers many apps like Google Maps, Tinder, and Uber by helping users find nearby locations. Designing such a system involves balancing efficiency, scalability, and accuracy. Bounding Box Optimization: Approximate circular search areas with rectangles to reduce search overhead.
Geohashing: Converts lat/lon into a compact string for fast prefix-based lookup; efficient, but requires checking neighbors to cover edge cases.
Quadtree: A recursive, memory-efficient structure that adapts to data density; more precise but operationally heavier (batch updates, memory rebuilds).
System Architecture: Stateless, horizontally scalable location service using read replicas; a separate write-light service manages POI data.
Data Modeling: Can store geohashes in normalized or denormalized formats; tradeoffs between simplicity, speed, and edit performance.
Next-Level Considerations: Handle moving users, caching hot zones, dynamic radius expansion, or explore alternatives like Google’s S2 Geometry.
This foundational system design pattern applies broadly in ride-sharing, dating apps, navigation, and IoT. Choose tools (Geohash vs. Quadtree) based on accuracy needs, query volume, and update frequency.
Behind many apps is a geospatial search service . Also known as proximity search, this allows users to find nearby shops, landmarks, people, and other points of interest, often displayed as pins on a map. Google, Yelp, WhatsApp, Uber, and Tinder all rely on geospatial search to power their map navigation features. Discovering nearby spots is one of the most common use cases for location-based apps.
How might we go about implementing a proximity service? That is, given a pair of coordinates and a radius, how do we implement a service to find all nearby points of interest? And in doing so, how do we keep our search both efficient and scalable? What are our options and what are the tradeoffs between them?
Let's explore what it takes to design a geospatial search service.
One way to simplify the complexity of circular radius searches is by using a bounding box , a rectangular area that encompasses the circle. The bounding box approach can speed up the search by narrowing down the relevant geographical area before filtering out points outside the circular radius.
While we will focus on geospatial queries, we should be mindful of the point-of-interest management service we will work in tandem with. Non-functionally, our geospatial search must offer: Speed. The search service should return nearby points of interest quickly.
Availability. Our architecture should not hinge on a single point of failure. Scalability. We should accommodate traffic spikes during peak hours. A starting point for our design might look like this:
The term "point of interest" (PoI) refers to any place of relevance to users, such as restaurants and hotels, based on our business needs. The management service is tasked with mutating point-of-interest data, which is unlikely to change frequently, so this is a write-only service expecting a low request volume.
By contrast, the location service is read-heavy and expects a high request volume, especially during peak hours in densely populated areas. For higher scalability and availability, our location service... will remain stateless to allow for horizontal scaling, adding and removing servers in line with traffic load, and
will interface with a database cluster containing a primary database for writes and multiple read replicas .
The location service has a single responsibility: finding points of interest within a radius of the user's position. Making this search efficient will be the crux of our discussion. To find points of interest within a radius, our first instinct might be:
B-trees in relational DBs are hierarchical—each level narrows down the search space. For range queries, especially over multiple columns, we must traverse several levels to find all values in the range, meaning that this lookup will lead to a full table scan . Even with indices on latitude and longitude, we cannot speed up this query by much, as we still need to intersect two sets of results.
Our lookup is conditioned by our two-dimensional data . We could keep pushing in this direction for incremental improvements, e.g. with a composite index, but this might also be a good time to rethink our approach. How can we simplify this lookup? Is there a solution to combine our two columns into one?
One such solution is geohashing , which encodes latitude and longitude into a single value. To achieve this, we first split the world along the prime meridian, i.e. into the Western and Eastern hemispheres, labeling each half in binary:
Next we split each hemisphere into halves and label them, prefixing each label with that of the original hemisphere, starting from the bottom:
This process is recursive: we keep subdividing each labeled quadrant into smaller and smaller ones. Each of these labels is known as a geohash —the longer the geohash, the smaller the section of the world it refers to. We are naming increasingly granular sections in a grid of the world, instead of relying on a pair of coordinates.
As grid sections become smaller and smaller, labels become longer and longer. For example, the label 11110000110001101101011000000111011 refers to a public square in the heart of Berlin. For convenience, we typically encode geohashes in base32, which compresses this binary label into a handier u33dc1r . Geohash precision can range from the size of the world to the order of centimeters.
How does this relate to our problem? We need to find all points of interest within a radius, i.e. we need to find a geohash that labels an area that covers the entire circle drawn by the radius. This means that the radius will dictate our geohash precision, which in turn means that only a subset of all these precisions is practical for our purposes. Realistically, this is likely to be a geohash of length four, five, or six.
Is geohashing the solution? Suppose we were to store a geohash for each point of interest. Given the geohash for the user's location, could we then query by matching it to the initial characters in the geohash for each point of interest?
Not quite. Geohashes have a useful property: the longer the shared prefix between two geohashes, the closer their locations. For instance, the geohashes 8n8zn3 and 8n8zn9 are adjacent, sharing all characters but the last.
But the opposite is not always true: close locations do not necessarily have similar prefixes. Two locations that are close to each other may have no shared prefix at all, e.g. adjacent locations on each side of the equator. Further, two locations can be right next to each other and still be assigned different geohashes.
To address these edge cases, we can fetch all points of interest in the user's geohash and also in its eight neighbors. Since a geohash can be computed in constant time, the retrieval will remain efficient, but pay attention to the tradeoff between inclusion and precision. We will be querying for slightly more than the rectangular area that covers the size of the circle around the user's location.
In essence, a quadtree is a tree where the root represents the entire world and each non-leaf node has four children. To find all locations within a radius, we start at the root and traverse the tree until we reach the target leaf node holding the user's location. If the target leaf node has the desired number of points of interest, we return the node, if not, else we add locations from its siblings until reaching the threshold.
The implementation details of the quadtree are beyond the scope of this discussion, but we should keep its operational implications in mind. Notably, a quadtree is an in‑memory data structure, built during server startup. While its memory consumption is low enough, it will take a few minutes to build—during this time, a server cannot serve traffic, so we should only rebuild a subset of quadtrees at a time.
Geohashes and quadtrees are both widely adopted, but the quadtree's operational implications might lead us to lean on geohashing as the simpler of the two. Assuming we have settled on geohashing, how should we model our data? The points_of_interest table should be straightforward:
The opposite approach would be to normalize : ensuring each row holds a single geohash and a single point of interest, with a composite key made up of both. Zooming back out to our high-level design, let's trace the lifecycle of a request:
The user navigates to a map tile and triggers a request for nearby points of interest, informing the user's coordinates and a search radius. Our load balancer receives and forwards the request to the location service. The location service converts the user's coordinates into a geohash.
The location service queries the geohashes table (on a read-only replica) for all IDs of points of interest matching as much of the geohash as required by the search radius. Alternatively, we match on the entire geohash if we have laid out one column per geohash length.
The geohash service then queries the points_of_interest table for all entries matching the IDs obtained from the query to the geohashes table. The location service finally returns the nearby points of interest, to be displayed to the client as pins on a map.
To implement this lifecycle, we have discussed two algorithms fundamental to geospatial search, i.e. geohashing and quadtrees, digging into their working principles as well as some of their strengths, limitations, operational implications, and data modeling considerations as they apply to our design. To go further, consider:
how to optimize geospatial searches for a user in motion, how to devise a caching strategy for densely populated areas, how to fetch all neighbors of a geohash to account for its edge cases, how to auto-expand the search radius if no points of interest are found, and
when to bring in a modern option like Google's S2 geometry library. Designing a Geospatial Search Service: Next Steps
Geospatial search is an area of very active research, with many practical applications beyond the discovery of nearby locations—such as geofencing, route planning, mobility prediction, and real-time navigation. Designing a location search service can be an accessible point of entry to this rapidly evolving field. We'll write more about some of the specifics covered here in future articles, so stay tuned!
More on Designing A Geospatial Search Service
Full-text search is a search technique that scans entire documents or data entries for keywords, returning results that match the query. Unlike traditional searches that look for exact values in specific fields, full-text search analyzes large blocks of text and ranks results based on relevance. It's commonly used in search engines and content management systems to allow users to find information using natural language or incomplete queries.
What is the difference between Quadtree and Geohash system design? What's the difference between quadtree and -tree indexes for geospatial data?
