February 2009
Volume 24 Number 02
Under The Table - Spatial Data Support In SQL Server 2008
By Bob Beauchemin | February 2009
Contents
Using the Geography Data Type
Spatial Indexes
Visualizers
The Geometry Data Type
Spatial Visualization
The SQL Server Spatial Data Library
Wrapping Up
SQL Server 2008 offers a number of enhancements, including new data types, new T-SQL statements, and incremental changes to Microsoft .NET Framework support, XML support, and the Service Broker feature. By far, the biggest and most exciting change for developers is the SQL Server 2008 support for spatial data—a powerful addition to the SQL Server programmer's toolbox. In this column, I'm going to present a short overview of spatial data and illustrate some interesting uses.
Spatial data is defined as data that is represented by 2D or 3D images. SQL Server 2008 supports 2D images, and 3D may be added in a future release. Spatial data can be further subdivided into geometric data (data that can use calculations involving Euclidian geometry) and geographic data (data that identifies geographic locations and boundaries on the earth).
Using the Geography Data Type
Let's start with a simple application of spatial data, which could be useful to most traditional online transaction processing (OLTP) systems that involve customers. Spatial data is useful in this case because every customer has an address. You usually think of addresses as street, city, state, country, and ZIP code, but an address is a point on the earth and can also be represented as a latitude/longitude pair. Technically, an address can comprise a parcel of land that can be represented as a polygon, but let's stick to a single point in this case, just for simplicity. Knowing where a customer lives as latitude/longitude allows you to answer questions such as:
- What are the three closest bank branches for a specific customer?
- Which salesperson lives closest to the customer?
- How many customers does your company have within a 10 mile radius of, for instance, Seattle, Washington?
- How many customers live more than 2 miles from your nearest branch location?
The process of converting a street address to a latitude/longitude pair is called address geocoding. A number of online services (including MapPoint, Virtual Earth, and Google Earth) provide geocoding as a service. To convert a street address in the U.S. to a point, you can encapsulate a call to the MapPoint geocoding Web service in a SQLCLR function. This is illustrated in the code sample. But where should you store the latitude/longitude inside your SQL Server database?
SQL Server 2008 ships with the two necessary data types for storing spatial data: geometry and geography. Both data types are implemented using the .NET architecture for user-defined types, meaning that they can have properties and methods. Let's stick with the geography data type for now because this maps more closely to the problem at hand.
You can define an instance of a geography type with a simple SQL variable declaration (DECLARE @g geography) or as a column in a table, and you can initialize this data type in a variety of ways. For a geography type representing an instance of a point, the simplest way is to use the STPointFromText static method of the geography type. The STPointFromText method requires not only a textual representation of a point in the Open Geospatial Consortium's well-known text format (WKT)—that is, POINT(x,y)—but also a spatial reference identifier (SRID). The SRID identifies the spatial reference system used for either a round-earth or flat-earth mapping and, for now, it's enough to know that the MapPoint geocoder Web service uses GPS coordinates that correspond to SRID 4326. This represents the World Geodetic System 1984 (WGS 84). (To find all the SRIDs that SQL Server 2008 supports, simply query the metadata table sys.spatial_reference_systems.)
So your code to initialize the address geography would look something like this:
DECLARE @addr nvarchar(256) = 'Some sample address, City, State, Zip'; DECLARE @addr_as_xy nvarchar(30); DECLARE @g geography; SET @addr_as_xy = dbo.Geocoder(@addr); SET @g = geography::STPointFromText(@addr_as_xy, 4326);
Note that because geography is implemented as a .NET-based data type, there are some interesting points to mention about the code. First, STPointFromText is a static method, so it must be called with the datatype::method syntax. In addition, the method names of .NET-based data types are case sensitive so, STPointFromText must use exact capitalization.
There's one more important thing to point out: the geocoder function must be coded to return the location as "xy" rather than latitude/longitude. The Open Geospatial Consortium defines their WKB and well-known binary (WKB) formats as using x,y coordinate pairs. Therefore, what's being passed in to STPointFromText must be POINT(Longitude/Latitude), not POINT(Latitude/Longitude). If you'd rather use latitude/longitude, SQL Server will construct geography instances using Geographic Markup Language (GML), an XML vocabulary that uses Latitude/Longitude. Or, if you have a point, there is a special static method on the geometry data type that takes three parameters: latitude, longitude, and SRID.
Now that you've geocoded your customer information, you can also go ahead and geocode your bank branch locations (or other branch office information, warehouse information, or anything related to location) and your salesperson locations, as well. Suppose you finish with tables named customer, salesperson, and branch. Each one would have a column (let's call it geog for simplicity) of type geography that indicates its location.
You can use this information to answer the questions mentioned earlier. For this, you need to use methods on the geography data type. The method for calculating the distance between two geography objects is, not surprisingly, called STDistance. The methods you could use to answer the third question (how many customers there are within a 10-mile radius of Seattle) would be STBuffer and STIntersects. SQL Server also has a method that's used to calculate a more approximate buffer named BufferWithTolerance (it's a little faster than calculating an exact buffer).
To answer the first question, you would use a query that looks like this:
SELECT TOP(3) b.name, c.geog.STDistance(b.geog)/1000 [Distance in km] FROM customer c, branch b WHERE c.customerid = '12345' -- this query looks at customer 12345 ORDER BY c.geog.STDistance(b.geog)
The query to calculate the closest salesperson would be done in approximately the same way.
How about the people who live within a 10-mile radius of Seattle? For this query, you would put a buffer around the city limits of Seattle (represented as either a polygon of the city limits or a point that you designate as downtown Seattle) and select all points (customers) that intersect that buffer. Represented in code, this is:
-- or declare POINT for "downtown Seattle" -- 1609.344 meters per mile DECLARE @Seattle geography = 'POLYGON(....)'; SELECT c.customerid FROM customer c WHERE c.geog.STIntersects(@Seattle.STBuffer(10 * 1609.344));
These examples illustrate more interesting points about how you interact with instances of the geography type. When you use STBuffer, for instance, which is an instance method on the geography data type, you use the instance.method syntax rather than the type::method syntax you would use for static methods like STPointFromText. In addition, notice that when you're calculating the distance between two geographies or determining whether one geography instance intersects another, you use the syntax Instance1.method(Instance2).
To answer the question about the number of customers who live more than two miles from the nearest bank branch, you must be able to aggregate the geographies of all the bank branches, with a buffer of two miles around each one, and then look for customers that don't intersect with this collection of geographies. The geography data type does have a method named STUnion, but it's specified as a method on a geography instance that returns the union between that instance and another instance.
This is not the set-based aggregation I had in mind. Luckily, SQL Server 2005 introduced user-defined aggregates that can be written in .NET code, and you can use one here to get an aggregate union of a set of values. Without going too far into the specifics of coding user-defined aggregates, they require four method initializations: Init (answer initialization), Terminate (return answer), Accumulate (called for each row), and Merge (called when merging processing from multiple threads).
In the aggregate, the Accumulate method can simply unite the current row geography with the answer geography each time it's called. In fact, you don't have to write this aggregate yourself; it's part of the SQL Server Spatial Tools project on CodePlex. This aggregate will even allow you to pass in a buffer for each row, which solves the two-mile buffer problem. The query can then be written like this:
SELECT COUNT(*) FROM customer c WHERE c.geog.STIntersects( SELECT dbo.GeographyUnionAggregate(b.geog,2*1609.344) FROM branch b)=0
As you can see by these examples, geocoding addresses and storing them along with the post office's form of the address can bring business value to an application.
Spatial Indexes
Once your business gets bigger, calculating the distance between every customer and every salesperson or every customer and every bank branch might become too slow. SQL Server 2008 spatial data support includes spatial indexing. The spatial indexes are ordinary B-tree indexes that are meant to make spatial queries run faster, just like the relational indexes in SQL Server make relational queries run faster.
Mapping two-dimensional spatial data to a one-dimensional B-tree is accomplished by means of tessellation; that is, dividing the area up into small subareas and recording the subareas that intersect each spatial instance. For the geography data type, that means dividing the entire globe into hemispheres and projecting each hemisphere onto a plane. Each geography instance then covers one or more subsections (tiles) of this plane; the spatial index would contain a row for each tile that an instance covers. For the geometry type, because you're specifying your own rectangular coordinate system, you can specify the boundaries (bounding box) that your spatial index covers.
There are a number of rules that govern how spatial instances that intersect tiles are mapped to rows in a spatial index, and you can define how fine the grid system will be for your spatial index at multiple levels. For more information about the specifics of spatial indexes, SQL Server Books Onlineis your best reference.
To return to the customer system, you could define a spatial index on the geog column in your customer table with the following data definition language (DDL):
CREATE SPATIAL INDEX cust_geog_idx ON dbo.customer(geog) GRIDS =(LEVEL1=HIGH,LEVEL2=HIGH,LEVEL3=HIGH,LEVEL4=HIGH));
Defining this index with a high-granularity grid would be best for your index because each customer address is a point, which would only appear in one tile, as opposed to lines or polygons, which could intersect many tiles.
You should note that spatial indexes are only used with certain spatial methods. Currently, they are used with the following geography predicates:
instance1.STIntersects(instance2) = 1 instance1.STEquals(instance2) = 1 instance1.STDistance(instance2) < number instance1.STDistance(instance2) <= number
When you're dealing with geometry instead of geography, the spatial index also works with the STContains, STOverlaps, STTouches, and STWithin methods, but only when checking for 1 (True). Based on the way the spatial indexes work, you would really want to rephrase your spatial query about "customers who live more than two miles from a bank branch" to count the customers where STIntersects returns True and subtract that number from the total number of customers.
Visualizers
Notice that in the few examples I've worked through so far, it would be really helpful to be able to see the spatial data in a form other than rows and columns. In fact, you might have noticed that selecting the spatial data type itself returns its binary representation. Using the ToString method or the STAsText method returns the WKB format. That's a little better, but it still doesn't give you any idea about geographic locations unless you can project longitude/latitude pairs onto a map in your head. Spatial data almost always needs a graphic visualizer to make it more useful, and when you're talking about geographic data, that visualizer would normally include a map.
SQL Server Management Studio 2008 includes a simple visualizer in the form of a Spatial results tab in the Query results window. This visualizer works with a geography or geometry column in the query results and plots the spatial data types on a grid. If multiple spatial columns appear in the query results, you can choose the one to visualize. The column to be displayed must be in the SQL Server binary format; using ToString() or STAsText() won't work with the visualizer.
For a geography data type, you even get a choice of map projections such as Mercator or Equirectangular, but the data is not overlaid on a map by default. One cheap and easy way to overlay a map if you have a table that contains map outline data is to do a UNION ALL between your rowset and a rowset that contains the map geographies. Figure 1shows points representing a set of more than 700 cities from the Mondial database in the SQL Server Management Studio visualizer. The SQL Server query that generated it was:
SELECT geog, name FROM Mondial.dbo.city WHERE geog IS NOT NULL
Figure 1 A Rowset of Points from the Mondial Database City Table
An even better choice is to use a commercial or shareware visualizer that provides map overlay by default, as shown in Figure 2. You'll note that the Spatial Results tab in SQL Server Management Studio shows a rowset of more than 700 points with a map overlay. The SQL Server query that produced it looks like this:
SELECT geog, name FROM Mondial.dbo.city WHERE geog IS NOT NULL UNION ALL SELECT geog, cntry_name FROM SpatialSamples.dbo.cntry00
Figure 2 A Rowset of Points with a Map Overlay
One visualizer you might use is Craig Dunn's Geoquery 2008 program. This is a free visualizer for SQL Server 2008 query results that includes map projections and also lets you choose the color and line thickness of your spatial data objects. You could also write code to display your data in Microsoft Virtual Earth or Google Earth, but that's beyond the scope of this column.
The Geometry Data Type
Let's get back to the other spatial data type that I touched on briefly, the geometry data type, and discuss where you would use geometry as opposed to geography. The geometry data type represents a flat plane with x and y coordinates, as opposed to geography that represents latitude and longitude, angles from the center to the surface of the earth (ellipsoidal coordinates). The geometry data type could be used for problems that don't need to take the earth's shape into consideration, relatively small planar surfaces such as office cubical layouts or warehouses. When you're trying to locate a coworker's rectangular office in cubicle land, the curvature of the earth doesn't matter, so using the geometry data type and linear calculations will suffice. When you're dealing with geometry, you set your unit of measure and your point of origin (let 0,0 be the bottom-left corner of your warehouse, for example).
Although sometimes you can use the geometry type for spatial locations on the earth, the process is a little bit more complicated. Because there are a number of different ways to map the earth to a coordinate system, and because these affect calculation methods such as STArea and STDistance, your spatial data instances need to specify an SRID.
Whether you are using geometry or geography, an SRID is required. For the geometry data type, if you've mandated the coordinate system and unit of measure yourself (for example, when mapping your office layout) you would choose SRID 0, which specifies an unknown or local spatial reference system. SRID 0 is the default for the geometry data type in SQL Server. For geography, it is important that you choose one of the SRIDs that SQL Server is able to use. SRID 4326 is the default and the one that is used by GPS systems.
The geographic methods built into SQL Server can use any of the 390 spatial reference systems that are enumerated in the system metadata table, sys.spatial_reference_systems. These spatial reference systems come from the European Petroleum Survey Group's (EPSG)geodetic parameter registry. EPSG no longer exists—its functions have been absorbed into the European Petroleum Survey Group. You can assume that an association of oil and gas producers would want to make sure that they have the most accurate positions on the earth when they go exploring.
Each SRID not only names a unit of measure (most use meters) but also include a string that specifies a datum (set of reference points on the earth), geoid, coordinate system, and map projection. For more information about coordinate systems, you should read Isaac Kunen's white paper " Introduction to Spatial Coordinate Systems: Flat Maps for a Round Planet."
Because the SQL Server geography data type's calculation is sensitive to the SRID that is used, attempting to use spatial methods (for example, Distance and Intersects) between instances of spatial types that do not have the same SRID will return the value NULL. I'll talk more about converting between SRIDs later.
Spatial Visualization
I've covered some of the most obvious applications for spatial data and the fact that many of them use visualizers to provide a pictorial view of the data. Suppose that, instead of just using a map of the countries of the world or the states in the U.S., I wanted to visualize my data by projecting it against county boundaries or correlate it with congressional districts or with census data? In this case, I'd have to acquire what I call "spatial reference data." Spatial reference data can either be downloaded from public Web sites or purchased from companies such as Environmental Systems Research Institute (ESRI). Note that the availability of free public spatial reference data varies widely depending on local custom. In the U.S., most of the data I've talked about, such as county boundaries and census data, is more freely available than in many other countries.
So you've gone out to your favorite public Web site, for example the U.S. Census Bureau's Tiger Web site, and downloaded geographic reference files. But the files are in ESRI Shapefile format, not SQL Server geography columns in tables.
How do you import these files into SQL Server? And many publicly available files use SRID 4269, not the SRID 4326 that your geocoder emitted. How do you convert between different SRIDs so that all of your spatial methods won't return NULL? The problem that needs to be solved here is more of an extract-transform-load (ETL) problem than a simple conversion problem. You do want to bring along non-spatial data like census track numbers or population figures. And, although you could (in theory) transform your data between SRIDs on the fly, this would slow your queries down considerably. This is definitely an ETL problem.
SQL Server 2008 doesn't really ship anything in the box to help you here. There are programs such as Morten Nielsen's Shape2SQL tool that loads Shapefiles into SQL Server tables. But Shapefiles aren't the only kind of third-party spatial file, and you may want to perform transformations between SRIDs or other more specialized transformations. There are a variety of third-party commercial products to specialize in spatial data transformations and bulk loading. They include SAFE Software's FME for SQL Server, Manifold's Spatial Extender for SQL Server, and, of course, ESRI's line of geographic information system (GIS) products. These can move data into SQL Server or move data from SQL Server into a full-fledged GIS. I've successfully used SAFE Software's product called FME for SQL Server, which includes a library of transformations, supports almost every spatial data format, and even provides a series of components for the SQL Server ETL system: SQL Server Integration Services.
The SQL Server Spatial Data Library
The format that you will choose to use to move data in and out of SQL Server will depend on the specific needs of your organization and also on the needs of your specific application. SQL Server data types intrinsically support the WKB and WKT formats, as well as geographic markup language (GML) format. SQL Server exposes the geometry and geography types and the associated methods in a .NET-based library named Microsoft.SqlServer.Types.dll. Not only does this library ship with SQL Server, but because it's based on .NET, you can download it as part of the SQL Server 2008 Feature Pack and move the computations and transformations to the middle-tier or client if necessary. Bear in mind that only the server side provides spatial indexes to optimize your spatial data queries.
SQL Server is really storing instances of the .NET classes SqlGeometry and SqlGeography; the classes and methods are intrinsic to the library. You can use spatial methods in SQL Server queries or encapsulate your operations in T-SQL or SQLCLR stored procedures, user-defined functions, and triggers. You can extend the base functionality by using SqlGeometry and SqlGeography as members in .NET user-defined types and user-defined aggregates, like the GeographyUnionAggregate that I used earlier.
The spatial data library also includes a builder API that you can use to optimize loading and for custom transformations and aggregations. This consists of a SqlGeometryBuilder and SqlGeographyBuilder and sink interfaces (IGeometrySink and IGeographySink) that you can use to do optimized construction of spatial data instances. You implement one of the sink interfaces, which allow you to begin or end figures, add lines, and set the appropriate SRID. Once you have the data in place, you simply call the Populate method to create the instances that you've set up.
Wrapping Up
The elements I've covered here represent just a little taste of the functionality that's been added to SQL Server with the inclusion of spatial data types and an industry standard spatial library. Data that you gather through geocoding, public reference data, or your own spatial data for warehouses and offices can be stored in SQL Server, managed with the same administrative tools that you use for other kinds of business data, indexed, queried, and analyzed to bring a whole new dimension to your business.
To get you started, remember that any sort of data that you will be able to capture with a GPS system can be imported and can be used with the rest of your database data. With GPS capabilities becoming more commonplace in phones, cameras, automobiles, and other devices and appliances, and with so many information systems now containing address data, the potential for integrating GPS data into interesting features—those that really meet the needs of the user—is truly unlimited. So stay tuned for some exciting developments.
Bob Beauchemin is a database-centric application practitioner and architect, course author and instructor, writer, and Developer Skills Partner at SQLskills. He's written books and articles on SQL Server, data access and integration technologies, and database security. You can reach him at bobb@sqlskills.com.