Using hierarchyid Data Types (Database Engine)
The hierarchyid data type is system-provided. Use hierarchyid as a data type to create tables with a hierarchical structure, or to reference the hierarchical structure of data in another location. Use hierarchyid functions to query and perform work with hierarchical data by using Transact-SQL.
Hierarchical data is defined as a set of data items that are related to each other by hierarchical relationships. Hierarchical relationships are where one item of data is the parent of another item. Hierarchical data is common in databases. Examples include the following:
An organizational structure
A file system
A set of tasks in a project
A taxonomy of language terms
A graph of links between Web pages
New in SQL Server 2008, the hierarchyid type makes it easier to store and query hierarchical data. hierarchyid is optimized for representing trees, which are the most common type of hierarchical data.
Key Properties of hierarchyid
A value of the hierarchyid data type represents a position in a tree hierarchy. Values for hierarchyid have the following properties:
Extremely compact
The average number of bits that are required to represent a node in a tree with n nodes depends on the average fanout (the average number of children of a node). For small fanouts, (0-7) the size is about 6*logAn bits, where A is the average fanout. A node in an organizational hierarchy of 100,000 people with an average fanout of 6 levels takes about 38 bits. This is rounded up to 40 bits, or 5 bytes, for storage.
Comparison is in depth-first order
Given two hierarchyid values a and b, a<b means a comes before b in a depth-first traversal of the tree. Indexes on hierarchyid data types are in depth-first order, and nodes close to each other in a depth-first traversal are stored near each other. For example, the children of a record are stored adjacent to that record.
Support for arbitrary insertions and deletions
By using the GetDescendant method, it is always possible to generate a sibling to the right of any given node, to the left of any given node, or between any two siblings. The comparison property is maintained when an arbitrary number of nodes is inserted or deleted from the hierarchy. Most insertions and deletions preserve the compactness property. However, insertions between two nodes will produce hierarchyid values with a slightly less compact representation.
Limitations of hierarchyid
The hierarchyid data type has the following limitations:
A column of type hierarchyid does not automatically represent a tree. It is up to the application to generate and assign hierarchyid values in such a way that the desired relationship between rows is reflected in the values. Some applications might not even want to have a column of type hierarchyid represent a tree. Perhaps the values are references to location in a hierarchy defined in another table.
It is up to the application to manage concurrency in generating and assigning hierarchyid values. There is no guarantee that hierarchyid values in a column are unique unless the application uses a unique key constraint or enforces uniqueness itself through its own logic.
Hierarchical relationships represented by hierarchyid values are not enforced like a foreign key relationship. It is possible and sometimes appropriate to have a hierarchical relationship where A has a child B, and then A is deleted leaving B with a relationship to a nonexistent record. If this behavior is unacceptable, the application must query for descendants before deleting parents.
Indexing Strategies
There are two strategies for indexing hierarchical data:
Depth-first
A depth-first index, rows in a subtree are stored near each other. For example, all employees that report through a manager are stored near their managers' record.
Breadth-first
A breadth-first stores the rows each level of the hierarchy together. For example, the records of employees who directly report to the same manager are stored near each other.
Examples
The GetLevel() method can be used to create a breadth first ordering. In the following example, both breadth-first and depth-first indexes are created:
USE AdventureWorks ;
GO
CREATE TABLE Organization
(
EmployeeID hierarchyid,
OrgLevel as EmployeeID.GetLevel(),
EmployeeName nvarchar(50) NOT NULL
) ;
GO
In a depth-first index all nodes in the subtree of a node are co-located. Depth-first indexes are therefore efficient for answering queries about subtrees, such as "Find all files in this folder and its subfolders".
CREATE CLUSTERED INDEX Org_Breadth_First
ON Organization(OrgLevel,EmployeeID) ;
GO
CREATE UNIQUE INDEX Org_Depth_First
ON Organization(EmployeeID) ;
GO
In a breadth-first index all direct children of a node are co-located. Breadth-first indexes are therefore efficient for answering queries about immediate children, such as "Find all employees who report directly to this manager".
Whether to have depth-first, breadth-first, or both, and which to make the clustering key (if any), depends on the relative importance of the above types of queries, and the relative importance of SELECT vs. DML operations. For a detailed example of indexing strategies, see Tutorial: Using the hierarchyid Data Type.
When to Use Alternatives to hierarchyid
Two alternatives to hierarchyid for representing hierarchical data are:
Parent/Child
XML
hierarchyid is generally superior to these alternatives. However, there are specific situations detailed below where the alternatives are likely superior.
Parent/Child
When using the Parent/Child approach, each row contains a reference to the parent. The following table defines a typical table used to contain the parent and the child rows in a Parent/Child relationship:
USE AdventureWorks ;
GO
CREATE TABLE ParentChildOrg
(
EmployeeID int PRIMARY KEY,
ManagerId int REFERENCES ParentChildOrg(EmployeeID),
EmployeeName nvarchar(50)
) ;
GO
Comparing Parent/Child and hierarchyid for Common Operations
Subtree queries are significantly faster with hierarchyid.
Direct descendant queries are slightly slower with hierarchyid.
Moving non-leaf nodes is slower with hierarchyid. Inserting non-leaf nodes and inserting or moving leaf nodes has the same complexity with hierarchyid.
Parent/Child might be superior when the following conditions exist:
The size of the key is very critical. For the same number of nodes, a hierarchyid value is equal to or larger than an integer-family (smallint, int, bigint) value. This is only a reason to use Parent/Child in rare cases, because hierarchyid has significantly better locality of I/O and CPU complexity than the common table expressions required when you are using a Parent/Child structure.
Queries rarely query across sections of the hierarchy. In other words if queries usually address only a single point in the hierarchy. In these cases co-location is not important. For example, Parent/Child is superior if the organization table is only used for running payroll for individual employees.
Non-leaf subtrees move frequently and performance is very important. In a parent/child representation changing the location of a row in a hierarchy affects a single row. Changing the location of a row in a hierarchyid usage affects n rows, where n is number of nodes in the sub-tree being moved.
If this the non-leaf subtrees move frequently and performance is very important, but most of the moves are at a well-defined level of the hierarchy, consider splitting the higher and lower levels into two hierarchies. This makes all moves into leaf-levels of the higher hierarchy. For instance, consider a hierarchy of Web sites hosted by a service. Sites contain many pages arranged in a hierarchical manner. Hosted sites might be moved to other locations in the site hierarchy, but the subordinate pages are rarely re-arranged. This could be represented via:
CREATE TABLE HostedSites ( SiteId hierarchyid, PageId hierarchyid ) ; GO
XML
An XML document is a tree, and therefore a single XML data type instance can represent a complete hierarchy. In SQL Server when an XML index is created, hierarchyid values are used internally to represent the position in the hierarchy.
Using XML data type can be superior when all the following are true:
The complete hierarchy is always stored and retrieved.
The data is consumed in XML format by the application.
Predicate searches are extremely limited and not performance critical.
For example, if an application tracks multiple organizations, always stores and retrieves the complete organizational hierarchy, and does not query into a single organization, a table of the following form might make sense:
CREATE TABLE XMLOrg
(
Orgid int,
Orgdata xml
) ;
GO
Migrating from Parent/Child to hierarchyid
Most trees are represented today using Parent/Child. The easiest way to migrate from a Parent/Child structure to a table using hierarchyid, is to use a temporary column or a temporary table to keep track of the number of nodes at each level of the hierarchy. For an example of migrating a Parent/Child table, see lesson 1 of Tutorial: Using the hierarchyid Data Type.
Query Transforms for hierarchyid
To maximize performance of querying hierarchies, SQL Server automatically performs three transformations of queries involving hierarchyid. The result of these transforms can be seen in the showplan output for transformed queries.
IsDescendantOf is transformed to a range seek
Given E as a column or a variable, E.IsDescendantOf(c) is transformed into a range seek. This significantly reduces the cost of finding descendants. If there is a depth-first index on E, this transformation helps because all descendants of E are co-located. For example, the code snippet EmployeeId.IsDescendantOf(@value) is executed as EmployeeId >= @Value AND EmployeeId <= @Value.DescendantLimit(). DescendantLimit is an internal method that determines the least upper bound of all possible descendants of a node. Notice that @value does not have to be a parameter. It could be a column, perhaps from a join condition.
GetAncestor is transformed to a range scan and residual predicate
GetAncestor(n) gives the nth ancestor of a node. This is useful when the precise relationship (parent, child, grandparent, etc.) between two nodes is needed, in contrast to the more general IsDescendantOf.
For example, execute the following query to find all employees whose direct manager is @value:
SELECT * FROM Employees WHERE EmployeeId.GetAncestor(1) = @value
This is transformed into a range scan for descendants of @value, with the original predicate as a residual. The code is transformed into the following:
SELECT * FROM Employees
WHERE
EmployeeId >= @Value AND EmployeeId <= @value.DescendantLimit()
AND EmployeeId.GetAncestor(1) = @value
The effect of this is to limit the scan to the subtree of @value.
GetAncestor is transformed to an index lookup using the breadth-first-index
In the above query, if @value is in the upper levels of the tree, the optimization above does not significantly reduce the number of rows scanned. When questions about nth ancestor are common, applications should create a breadth-first-index as previously described.
When a breadth-first index is present, the above query is further transformed to the following:
SELECT * FROM Employees
WHERE
EmployeeId >=@value AND EmployeeId <= @Value.DescendantLimit()
AND @value.GetLevel()+1 = EmployeeId.GetLevel()
The last line (containing the GetLevel methods) becomes an index lookup in the breadth-first-index. If EmployeeId is the clustering key, then it is the second column in the breadth-first index, and the two predicates become an index lookup that specifies exactly the n co-located direct reports of @value.
The GetAncestor transforms are not limited to queries for direct parents. The argument to GetAncestor can be any variable or constant.