June 2012
Volume 27 Number 06
SQL Server - Custom Indexing for Latitude-Longitude Data
By James McCaffrey | June 2012
In this article I describe a technique to create custom indexes for geographical data that contains latitude and longitude location information. Custom indexes allow quick, real-time retrieval of data. For example, consider the following scenario. Suppose you have a very large (several million items) set of data that looks something like this:
0001 47.610 -122.330
0002 36.175 -115.136
0003 47.613 -122.332
...
Here the first column is a user ID (or the ID of any object associated with a location), the second column is the user’s latitude and the third column is the user’s longitude. Latitudes range from -90.0 degrees to +90.0 degrees and represent the up-down coordinate. Longitudes range from -180.0 degrees to +180.0 degrees and represent the left-right coordinate. The locations of the first and third users are near Seattle. A latitude value of zero is the equator. A longitude of zero is the Prime Meridian, which runs through Greenwich, England. Now suppose your data is stored in a SQL table and you want to find all the users who are located in the same 1-degree-by-1-degree rectangle as user 0001. You could do so with a simple SQL statement like this:
SELECT userID
FROM tblUsers
WHERE latitude >= 47.0 AND latitude < 48.0
AND longitude >= -123.0 AND longitude < -122.0
In many situations this approach works well. If you require real-time performance (perhaps a response time of less than 3 seconds) when the size of your data exceeds some threshold (depending on system resources), however, the response time starts to degrade dramatically.
One approach for getting real-time performance in situations like this is to assign each location to a sector ID. So suppose you’re able to generate an auxiliary table from the initial data that looks something like this:
0001 49377
0002 45424
0003 49377
...
Here, the first column is the user ID and the second column is a sector ID. Now if the sector ID is indexed, a SQL statement like the following will be very fast even if your data set has hundreds of millions of items:
SELECT userID
FROM tblSectors
WHERE sector = 49377
The sector ID is acting as a custom index into the main data set. So the real challenge is to write a function that accepts a latitude and longitude and returns a sector.
To better understand the concept of custom indexes, take a look at the sample Windows Presentation Foundation (WPF) application in Figure 1. The application has an embedded Bing Maps map control. After centering the map to a location in Redmond, Wash., and zooming in, I double-clicked on the map. The double-click event was wired to an event handler that determined the latitude and longitude of the click and placed a red marker dot at the click location. Next, the app computed the sector ID of the click and then searched a back-end SQL database with 100 million random user-location items and found eight users in the same sector. The search time was 0.36 seconds—pretty impressive performance.
Figure 1 Custom Spatial Indexing in Action
Microsoft SQL Server 2008 has a sophisticated spatial index feature you can use in the situation just described. And, in many cases, using a SQL Server spatial index is your best option. In at least two scenarios, however, creating custom indexes is a better approach. First, spatial indexes can be created only on a column of SQL type geometry or geography. If you have a legacy database or data source where latitudes and longitudes are stored as plain numeric values, converting those values to type geography might not be feasible. Second, even though spatial indexes in SQL Server 2008 are extremely powerful and quite customizable, in some scenarios you might need complete control over your indexing scheme.
Although creating custom indexes isn’t a new concept, few concrete examples are available. In the remainder of this article, I’ll present a complete example of how to create and use custom indexing. To fully understand this article, you need to have at least intermediate-level C# and T-SQL coding skills. You can get the key C# code for the example presented here from archive.msdn.microsoft.com/mag201206Indexing.
Mapping Latitude-Longitude to a Sector ID
There are many ways to map a latitude-longitude pair to a sector ID. The approach I describe here is the simplest, and is best explained with the diagram shown in Figure 2. The first decision in a custom indexing scheme is to choose how to segment the globe. In Figure2 I’ve divided each degree of latitude (the row values) and each degree of longitude (the columns) into two parts, as indicated by the fraction = 0.5 in the upper-left corner of the figure. This results in 360 latitude intervals, from [-90.0, -89.5) to [+89.5, +90.0], and 720 longitude intervals, from [-180.0, -179.5) to [+179.5, +180.0]. I use a square bracket to mean inclusive and a parenthesis to mean exclusive. With a fraction value of 0.5, there are 360 * 720 = 259,200 sectors numbered from 0 to 259,199. I order sectors from left to right and top to bottom, as shown in Figure 2.
Figure 2 Custom Indexing Design
fraction = 0.5 |
[-180.0, -179.5) 0 |
[-179.5, -170.0) 1 |
[-170.0, -169.5) 2 |
[179.5, 180.0] 719 |
|||
[-90.0, -89.5) -> 0 | 0 | 1 | 2 | ... | 719 | ||
[-89.5, -80.0) -> 1 | 720 | 721 | 722 | ... | 1439 | ||
[-80.0, -79.5) -> 2 | 1440 | 1441 | 1442 | ... | |||
... | |||||||
[89.5, 90.0] -> 359 | 258,480 | ... | 259,199 |
Smaller values of the fraction parameter create more intervals per degree and generate more sectors. In this example I use the same value of fraction for both latitude and longitude intervals, but you can use different values if that better suits your data distribution. The sectors are not all the same area, however, as I explain shortly. If you have a row (latitude) index ri and a column (longitude) index ci, the corresponding sector ID sid can be computed as follows:
sid = (ri * 360 * 1/fraction) + ci
For example, in Figure 2, sector 1441 is associated with row index 2 and column index 1 and is computed as (2 * 360 * 1/0.5) + 1 = (2 * 720) + 1 = 1441. The term 360 * 1/fraction determines how many column intervals are in every row, and then multiplying by ri gives the first sector ID in the appropriate row. Adding the ci term essentially moves ci columns to the right from the beginning of the appropriate row and gives the final sector ID result.
The final part of the puzzle is to map a latitude value to a row index and a longitude value to a column index. A naive approach would be to do a simple linear search. Painful past experience with very large data sets has taught me that in this case, a binary search is a better approach. The idea is to start with a low index of zero, the middle index and a high index equal to the last possible index. Next, determine the corresponding latitude (or longitude) interval of the middle index. If the target latitude (or longitude) falls in the interval, the middle index is the correct return value. If the target latitude (or longitude) is smaller than the current interval, move the new middle index to halfway between the low index and the old middle index. If the target latitude (or longitude) is larger than the current interval, move the new middle index to halfway between the old middle index and the high index. Repeat the process until the correct middle index is found.
A C# implementation of a method LatIndex that accepts a latitude value and a fraction and returns the corresponding row index is presented in Figure3. The partner LonIndex method that returns a column index for a longitude is completely symmetric except that the constant 180 is replaced by 360, and the constants -90.0 and 90.0 are replaced by -180.0 and 180.0.
Figure 3 The Row/Latitude Index of a Latitude
static int LatIndex(double latitude, double fraction)
{
latitude = Math.Round(latitude, 8);
int lastIndex = (int)(180 * (1.0 / fraction) - 1);
double firstLat = -90.0;
if (latitude == -90.0) return 0;
if (latitude == 90.0) return lastIndex;
int lo = 0;
int hi = lastIndex;
int mid = (lo + hi) / 2;
int ct = 0; // To prevent infinite loop
while (ct < 10000)
{
++ct;
double left = firstLat + (fraction * mid); // Left part interval
left = Math.Round(left, 8);
double right = left + fraction; // Right part interval
right = Math.Round(right, 8);
if (latitude >= left && latitude < right)
return mid;
else if (latitude < left)
{
hi = mid - 1; mid = (lo + hi) / 2;
}
else
{
lo = mid + 1; mid = (lo + hi) / 2;
}
}
throw new Exception("LatIndex no return for input = " + latitude);
}
Because the method in Figure 3 works with latitudes as type double, the input latitude parameter is rounded to eight decimal places to avoid nasty equality-of-two-type-double errors. I use a somewhat-hacky ct variable to prevent an infinite loop. To keep the code short, I’ve removed most normal error checking—such as validating the input parameters—that you would use in a production scenario. Notice that the main loop checks for intervals in the form [a,b), so I must explicitly check for the last 90.0 value.
With the sector design and helper methods in place, I can map a latitude-longitude pair to a sector with a method like this:
static int LatLonToSector(double latitude, double longitude,
double fraction)
{
int latIndex = LatIndex(latitude, fraction); // row
int lonIndex = LonIndex(longitude, fraction); // col
return (latIndex * 360 * (int)(1.0 / fraction)) + lonIndex;
}
Sector Size
The custom indexing scheme I describe in this article divides the globe into numerically equal intervals based on the value of the fraction parameter. For example, if fraction is 0.5, each interval is 0.5 of a degree of latitude or longitude. However, this doesn’t mean that all sectors have the same geographical area. Take a look at Figure 4. Latitude lines run parallel to each other and are all the same distance apart, about 111 kilometers. So the distance indicated by label A in Figure 4 is the same as the distance indicated by B. But lines of longitude get closer together as they near each pole. So the distance indicated by label C is less than the distance indicated by D.
Figure 4 Sectors Have Different Geographical Areas
The distance, in kilometers, between any two points on the globe can be estimated using the following method:
public static double Distance(double lat1, double lon1,
double lat2, double lon2)
{
double r = 6371.0; // approx. radius of earth in km
double lat1Radians = (lat1 * Math.PI) / 180.0;
double lon1Radians = (lon1 * Math.PI) / 180.0;
double lat2Radians = (lat2 * Math.PI) / 180.0;
double lon2Radians = (lon2 * Math.PI) / 180.0;
double d = r * Math.Acos((Math.Cos(lat1Radians) *
Math.Cos(lat2Radians) *
Math.Cos(lon2Radians - lon1Radians)) +
Math.Sin(lat1Radians) * Math.Sin(lat2Radians)));
return d;
}
So if you know the corresponding latitude and longitude of a given sector, you can compute its approximate width and height, and then its approximate area. To determine the left endpoint of the interval containing a sector, you can use this method:
static double SectorToLat(int sector,
double fraction)
{
int divisor = 360 * (int)(1.0 / fraction);
int row = sector / divisor;
return -90.0 + (fraction * row);
}
This method essentially performs the reverse process of method LatLonToSector. For example, if the input sector is 1441 and the fraction parameter has the value 0.5, as shown in Figure 2, the local divisor variable is 360 * 1.0 / 0.5 = 720, which is the number of longitude intervals. Then the value of variable row is 1441 / 720 = 2, which is the row index (note the integer division). Finally, -90.0 + 0.5 * 2 = -90.0 + 1.0 = -89.0, which is the left part of the [-89.0, -79.5) interval associated with sector 1441.
The method to determine the left part of the longitude interval of a sector is similar but not completely symmetric:
static double SectorToLon(int sector, double fraction)
{
int divisor = 360 * (int)(1.0 / fraction);
int col = sector % divisor;
return -180.0 + (fraction * col);
}
With these two helper methods, you can determine the approximate area, in square kilometers, of a sector that has been defined by a fraction parameter:
static double Area(int sector, double fraction)
{
double lat1 = SectorToLat(sector, fraction);
double lon1 = SectorToLon(sector, fraction);
double lat2 = lat1 + fraction;
double lon2 = lon1 + fraction;
double width = Distance(lat1, lon1, lat1, lon2);
double height = Distance(lat1, lon1, lat2, lon1);
return width * height;
}
Adjacent Sectors
In some development scenarios, you might need to determine which sectors are adjacent to a given sector. For example, in Figure 2, if a user’s location is in sector 721, you might want to determine which users are in adjacent sectors 0, 1, 2, 720, 722, 1440, 1441 and 1442. Notice that left-right sectors differ by plus or minus one, except for sectors on the left and right edges of the sector mapping. And up-down sectors, except for sectors on the top and bottom rows, differ by plus or minus the number of column intervals, which is 360 * (1 / fraction). To write a method AdjacentSectors that accepts a sector (and a generating fraction), it’s useful to know whether a sector is on the left or right mapping edge, or on the top or bottom row. These four methods are presented in Figure 5.
Figure 5 Helper Methods for the AdjacentSectors Method
static bool IsLeftEdge(int sector, double fraction)
{
int numColumns = (int)(1.0 / fraction) * 360;
if (sector % numColumns == 0) return true;
else return false;
}
static bool IsRightEdge(int sector, double fraction)
{
if (IsLeftEdge(sector + 1, fraction) == true) return true;
else return false;
}
static bool IsTopRow(int sector, double fraction)
{
int numColumns = (int)(1.0 / fraction) * 360;
if (sector >= 0 && sector <= numColumns - 1) return true;
else return false;
}
static bool IsBottomRow(int sector, double fraction)
{
int numColumns = (int)(1.0 / fraction) * 360;
int numRows = (int)(1.0 / fraction) * 180;
int firstValueInLastRow = numColumns * (numRows - 1);
int lastValueInLastRow = numColumns * numRows - 1;
if (sector >= firstValueInLastRow && sector <= lastValueInLastRow)
return true;
else
return false;
}
You can write the method AdjacentSectors in many ways that trade off clarity and code size. One approach is to return an array of sector values where the [0] return value is the adjacent sector that’s to the upper left of the input sector, [1] is directly above, [2] is above and to the right, [3] is to the left, [4] is to the right, [5] is below and to the left, [6] is directly below, and [7] is below and to the right. The code in Figure 6 outlines one way to implement AdjacentSectors. Although the general case where the input sector isn’t on a mapping edge is straightforward, some of the special cases, including the four-corner sectors, are tricky. See the code download that accompanies this article for the complete implementation.
Figure 6 An Implementation of the AdjacentSectors Method
static int[] AdjacentSectors(int sector, double fraction)
{
int[] result = new int[8]; // Clockwise from upper-left
int numCols = (int)(1.0 / fraction) * 360;
int numRows = (int)(1.0 / fraction) * 180;
int firstValueInLastRow = numCols * (numRows - 1);
int lastValueInLastRow = numCols * numRows - 1;
// General case
if (IsLeftEdge(sector, fraction) == false &&
IsRightEdge(sector, fraction) == false &&
IsTopRow(sector, fraction) == false &&
IsBottomRow(sector, fraction) == false)
{
result[0] = sector - numCols - 1;
result[1] = sector - numCols;
result[2] = sector - numCols + 1;
result[3] = sector - 1;
result[4] = sector + 1;
result[5] = sector + numCols - 1;
result[6] = sector + numCols;
result[7] = sector + numCols + 1;
return result;
}
// Deal with special cases here. See code download.
throw new Exception("Unexpected logic path in AdjacentSectors");
}
Putting Together a Geographical Indexing Scheme
A good way to understand the custom geographical indexing scheme described in this article is to examine a simple but complete end-to-end example. First create a dummy SQL database using T-SQL code similar to this:
use master
if exists(select * from sys.sysdatabases where name='dbGeoData')
drop database dbGeoData
create database dbGeoData on
(
name=dbGeoData,
filename='D:\SomePlace\dbGeoData.mdf'
)
Next, create a table to hold some dummy data:
use dbGeoData
create table tblUsers
(
userID int primary key,
latitude real not null,
longitude real not null
)
Then write some C# code to generate dummy data:
Random r = new Random(0);
string initialDataFile = "..\\..\\UserIDLatLon.txt";
FileStream ofs = new FileStream(initialDataFile, FileMode.Create);
StreamWriter sw = new StreamWriter(ofs);
for (int i = 0; i < 1000000; ++i)
{
double lat = (90.0 - (-90.0)) * r.NextDouble() + (-90.0);
double lon = (180.0 - (-180.0)) * r.NextDouble() + (-180.0);
sw.WriteLine(i.ToString().PadLeft(6,'0') + "," +
lat.ToString("F8") + "," + lon.ToString("F8"));
}
sw.Close(); ofs.Close();
Here I’m creating 1 million lines of comma-separated data. To generate random latitudes and longitudes in the range [hi,lo), I’m using the standard pattern (hi - lo) * random.NextDouble() + lo. Next, copy the dummy data into the SQL table using the command-line bcp.exe program:
> bcp.exe dbGeoData..tblUsers in UserIDLatLon.txt -c -t , -r \n -S(local) -T
This command means, “Copy into table tblUsers, which is in database dbGeoData, the data in file UserIDLatLon.txt, which is character data (that is, a text file) where each column is separated-terminated by the comma character and each line is return-terminated by a newline character. The database server is on the local machine, and Trusted (Windows) authentication is used to connect.”
Then create the auxiliary custom indexing sector data using C# code like this:
string initialDataFile = "..\\..\\UserIDLatLon.txt";
string sectorDataFile = "..\\..\\ UserIDSector.txt";
FileStream ifs = new FileStream(initialDataFile, FileMode.Open);
StreamReader sr = new StreamReader(ifs);
FileStream ofs = new FileStream(sectorDataFile, FileMode.Create);
StreamWriter sw = new StreamWriter(ofs);
string line = "";
string[] tokens = null;
while ((line = sr.ReadLine()) != null)
{
tokens = line.Split(',');
int userID = int.Parse(tokens[0]);
double lat = double.Parse(tokens[1]);
double lon = double.Parse(tokens[2]);
int sector = LatLonToSector(lat, lon, 0.5);
sw.WriteLine(userID.ToString().PadLeft(6, '0') + "," + sector);
}
sw.Close(); ofs.Close(); sr.Close(); ifs.Close();
The code loops through the dummy data file one line at a time; parses out the user ID, latitude and longitude; computes the associated sector with a fraction parameter of 0.5; and writes the user ID and sector in comma-separated form to a text file.
Next, create a SQL table to hold the sector data:
create table tblSectors
(
userID int primary key,
sector int
)
Then copy the sector data into the table:
> bcp.exe dbGeoData..tblSectors in UserIDSector.txt -c -t , -r \n -S(local) -T
Finally, index the sector data:
if exists(select name from sys.sysindexes where name='ndxSector')
drop index ndxSector on tblSectors
create nonclustered index ndxSector on tblSectors(sector)
At this point you can experiment in several ways. For example, you can select all the users who are in the same sector as user 3 with T-SQL code, like so:
declare @userID int
set @userID = 3
declare @sector int
select @sector = sector from tblSectors where userID=@userID
print @sector
select userID from tblSectors where sector = @sector
Of course, if you’re accessing your SQL data from an application, you can use ADO.NET or LINQ equivalents to this kind of SQL code.
Get Real-Time Performance
You can use custom indexing in many application scenarios. For example, suppose you have an ASP.NET Web application or a Silverlight application with a Bing Maps map control. You could create a system that allows the user to click on the map. The click event is associated with code that determines the latitude and longitude of the click, computes the associated sector and then retrieves from a back-end SQL database all users in the sector. As illustrated in Figure 1, even with several hundred million rows you can get real-time performance.
In some situations you might want to create several custom indexes. For example, you might want a low-granularity index with a fraction = 1.0 (so each sector is 1 degree by 1 degree), a mid-granularity index with fraction = 0.25 and a high-granularity index with fraction = 0.05. With multiple indexes you can successively filter your SQL data. First determine how many users are in the low-granularity sector. If you have too many users for your purposes, determine how many users are in the mid-granularity sector. If you have too few users, determine the seven adjacent sectors and the users in those sectors. And so on.
Let me reiterate that SQL Server 2008 has powerful built-in spatial indexing you should consider using before creating custom indexes. Custom indexes require explicit, additional SQL tables that impose an increased management burden on your system. That said, however, in some scenarios custom indexes can lead to blazingly fast performance. With the increasing use of GPS-enabled mobile devices, the ability to gather huge amounts of geographical data is only going to expand. The ability to create custom indexing schemes can be an important skill to help you analyze this data.
Dr. James McCaffrey works for Volt Information Sciences, where he manages technical training for software engineers working at the Microsoft Redmond, Wash., campus. He has worked on several Microsoft products, including Internet Explorer and MSN Search. McCaffrey is also the author of “.NET Test Automation Recipes” (Apress, 2006). He can be reached jammc@microsoft.com.
Thanks to the following technical experts for reviewing this article: Isaac Kunen and Dan Liebling