Shredding XML

JCE H 41 Reputation points
2020-09-02T15:27:21.717+00:00

Kind of an optimisation question and kind of sharing in the hope it might be of use to someone else.

PROBLEM

I needed to work out where the nodes were in an XML download so that I knew what I was looking for. Maybe there's a T-SQL solution to this already but on a brief Google search it wasn't obvious so I had to come up with something. Inadvertently I ended up shredding the XML at the same time but at least I got the data I was after. Ideally, I would have kept it to varchar(8000) and 1 page but I cannot do that because the XML is a blob so it had to go varchar(max). As you can imagine that is VERY dangerous performance-wise and being a table function that essentially is a loop within a loop scanning sequentially through each cell that can go parallel is a recipe for disaster. So, the function is to be used with caution, preferably only on 1 row of data or a few rows of data at a time to get an idea of what nodes are present. If you can keep the total data length under a page (8000 bytes) then so much the better. The function is inspired by the split-string function from the legend that is Jeff Moden - thanks, Jeff! If there's a more efficient way of doing this from T-SQL then I'm all ears.

MY SOLUTION / BODGIT.

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE FUNCTION [dbo].[tfnFindXMLnodes] (@XML xml)

-- Table-valued user-defined function - TVF
/*
AUTHOR JCEH
CREDIT TO JEFF MODEN FOR HIS INSPIRATIONAL SPLITSTRING FUNCTION. I just jazzed it up a little.
*/

RETURNS @TableList TABLE(Seq int,ElementOrAttribute char(1) COLLATE Latin1_General_CS_AS,NodeLevel int,NodePath varchar(1000),Element varchar(255),AttributeOrValue varchar(255),StringX varchar(255))

BEGIN
DECLARE @XMLstring varchar(max) = CAST(@XML as varchar(max)) -- XML input as string to analyse
DECLARE @XMLattrib varchar(8000)
DECLARE @xcl int = 0,@NodePath varchar(1024)='',@StringNode varchar(255),@type char(1),@StringAttribute varchar(255),@StringElement varchar(255),@level int=0, @smallint smallint = 1-- Outputs
DECLARE @StartString varchar(5)='<',@EndString varchar(5)='>' ,@StartPointer int,@EndPointer int, @EndSpace int -- Start and end points
DECLARE @StartStringA varchar(5)=' ',@EndStringA varchar(5)='="',@StartPointerA int,@EndPointerA int -- Start and end points of attributes

--- Find starting point  
SELECT @StartPointer = CHARINDEX(@StartString, @XMLstring) --- Find first opening bracket  
	  ,@EndPointer = CHARINDEX(@EndString, @XMLstring,CHARINDEX(@StartString, @XMLstring)) -- Find first closing bracket  
	  ,@EndSpace = CHARINDEX(@EndStringA, @XMLstring,CHARINDEX(@StartString, @XMLstring)) -- Find first =" after opening in case of attributes present  

---------------------------------------------------------------------------------------------------  

WHILE (@StartPointer < LEN(@XMLstring) + 1)   
BEGIN -- while loop 1  
	SET @int = @int+1  
    IF @EndPointer = 0 SET @EndPointer = LEN(@XMLstring)+1  
    IF @EndSpace   = 0 SET @EndSpace   = LEN(@XMLstring)+1  
	--- Are we looking at an element on its own or are there also attributes?  
	SET @type = CASE   
					WHEN @EndPointer<@EndSpace   
						THEN 'E'   
					ELSE 'a'   
					END  
	SET @StringNode   = LTRIM(RTRIM(SUBSTRING(@XMLstring, @StartPointer, @EndPointer-@StartPointer+1)))  
	SET @level        = CASE WHEN @StringNode LIKE '%</%' THEN @level-1 ELSE @level+1 END  
	SET @StringElement= CASE WHEN @type='E' THEN SUBSTRING(@StringNode,CHARINDEX(@StartString,@StringNode,1),255) ELSE SUBSTRING(@StringNode,CHARINDEX(@StartString,@StringNode,1),CHARINDEX(' ',@StringNode,1)) END  
	SET @StringAttribute = CASE WHEN @StringNode LIKE '%</%' THEN LEFT(@StringNode,CHARINDEX('</',@StringNode,1)-1) ELSE '' END  
    SET @NodePath	 = CASE   
						WHEN @StringNode LIKE '%</%' THEN LEFT(@NodePath,LEN(@NodePath)-CHARINDEX('/',REVERSE(@NodePath),1)) -- Closing part  
						ELSE @NodePath+CASE WHEN @level=1 THEN '' ELSE '/'END+CASE WHEN @StringElement LIKE '<%/>' THEN SUBSTRING(@StringElement,2,LEN(@StringElement)-3) WHEN @StringElement LIKE '<%>' THEN RTRIM(SUBSTRING(@StringElement,2,LEN(@StringElement)-2)) ELSE RTRIM(SUBSTRING(@StringElement,2,255)) END   
						END  

	INSERT INTO @TableList(Seq,ElementOrAttribute,NodeLevel,NodePath,Element,AttributeOrValue,StringX)   
					VALUES(@int,CASE WHEN @type='a' THEN 'e' WHEN @StringAttribute='' THEN 'E' ELSE 'V' END,@level  
					--,@NodePath on lines below!  
					,CASE   
						WHEN @StringElement LIKE '</%>'  AND RIGHT(@NodePath,CHARINDEX('/',REVERSE(@NodePath),1))<>'/'+CASE WHEN @StringElement LIKE '</%>' THEN SUBSTRING(@StringElement,3,LEN(@StringElement)-3) WHEN @StringElement LIKE '<%>' THEN RTRIM(SUBSTRING(@StringElement,2,LEN(@StringElement)-2)) ELSE RTRIM(SUBSTRING(@StringElement,2,255)) END THEN @NodePath+'/'+CASE WHEN @StringElement LIKE '</%>' THEN SUBSTRING(@StringElement,3,LEN(@StringElement)-3) WHEN @StringElement LIKE '<%>' THEN RTRIM(SUBSTRING(@StringElement,2,LEN(@StringElement)-2)) ELSE RTRIM(SUBSTRING(@StringElement,2,255)) END   
						ELSE @NodePath END  
														/*Element*/,CASE WHEN @StringElement LIKE '</%>' THEN SUBSTRING(@StringElement,3,LEN(@StringElement)-3) WHEN @StringElement LIKE '<%>' THEN RTRIM(SUBSTRING(@StringElement,2,LEN(@StringElement)-2)) ELSE RTRIM(SUBSTRING(@StringElement,2,255)) END  
																		,@StringAttribute,@StringNode)  

	---------------------------------------------------------------------------------------------------  
	--- If attributes present then setup sub-loop to obtain each attribute in turn  
	IF(@type='a')  
	BEGIN -- Attributes present  
	SET @XMLattrib	      = SUBSTRING(@XMLstring --- Need substring to split with all attributes  
									 ,CHARINDEX(@StartString, @XMLstring,@StartPointer)+1 -- starting at opening bracket  
									 ,CHARINDEX(@EndString, @XMLstring,@StartPointer)-CHARINDEX(@StartString, @XMLstring,@StartPointer) -- ending where closing angle bracket.  
									 )  
	SET @StartStringA = ' '  
	SET @EndStringA   = '="'  
    SELECT @StartPointerA = CHARINDEX(@StartStringA,@XMLattrib,1) --- Find first attribute start (assume space before it)  
	      ,@EndPointerA   = CHARINDEX(@EndStringA,@XMLattrib,CHARINDEX(@StartStringA, @XMLattrib,1)+1) -- Find first attribute value start after the space  

		  WHILE (@StartPointerA<LEN(@XMLattrib))  
		  BEGIN -- while loop 2  
			SET @int = @int+1 -- For seq number  
			IF @EndPointerA = 0 SET @EndPointerA = LEN(@XMLattrib)+1 -- if no end then use whole string  

			SET @StringAttribute = SUBSTRING(@XMLattrib,@StartPointerA,@EndPointerA-@StartPointerA)  

			INSERT INTO @TableList(Seq,ElementOrAttribute,NodeLevel,NodePath,Element,AttributeOrValue,StringX)   
			VALUES(@int,CASE WHEN @StringAttribute LIKE '%"%"%' THEN 'v' ELSE @type END,@level,@NodePath,RTRIM(SUBSTRING(@StringElement,2,255)),CASE	WHEN @StringAttribute LIKE '"%"/>' THEN SUBSTRING(@StringAttribute,2,LEN(@StringAttribute)-4) WHEN @StringAttribute LIKE '"%">' THEN SUBSTRING(@StringAttribute,2,LEN(@StringAttribute)-3) WHEN @StringAttribute LIKE '"%"' THEN SUBSTRING(@StringAttribute,2,LEN(@StringAttribute)-2) ELSE @StringAttribute END,@XMLattrib)  

			--- Go to next value  
			IF(@smallint>0)  
			BEGIN -- >0 then space is currently starting point and needs switching  
			SET @StartPointerA = @EndPointerA + 1  
			SET @EndPointerA = CHARINDEX(@StartStringA, @XMLattrib, @StartPointerA)  
			END   -- >0 then space is currently starting point and needs switching  

			--- Go to next attribute  
			IF(@smallint<0)  
			BEGIN -- <0 then =" is currently starting point and needs switching  
			SET @StartPointerA = @EndPointerA + 1  
			SET @EndPointerA = CHARINDEX(@EndStringA, @XMLattrib, @StartPointerA)  
			END   -- <0 then space is currently starting point and needs switching  

			SET @smallint=@smallint*CAST(-1 as smallint) -- Flip start and end delimiters  

		  END   -- while loop 2  
	  
	SET @level		= CASE WHEN @StringAttribute LIKE '%/>' THEN @level-1 ELSE @level END  
    SET @NodePath	= CASE WHEN @StringAttribute LIKE '%/>' THEN LEFT(@NodePath,LEN(@NodePath)-CHARINDEX('/',REVERSE(@NodePath),1)) ELSE @NodePath END  

	END   -- Attributes present  
	---------------------------------------------------------------------------------------------------  

	SET @StartPointer = @EndPointer + 1  
    SET @EndPointer = CHARINDEX(@EndString, @XMLstring, @StartPointer)  
	SET @EndSpace = CHARINDEX(@EndStringA, @XMLstring,@StartPointer)  

END  -- while loop 1  

---------------------------------------------------------------------------------------------------  

    RETURN  

---------------------------------------------------------------------------------------------------  

END -- FUNCTION
GO

Developer technologies Transact-SQL
{count} votes

Accepted answer
  1. Yitzhak Khabinsky 26,586 Reputation points
    2020-09-02T15:49:59.533+00:00

    Hi @JCE H ,

    Thanks for sharing your method.

    Please find below XML way to do the same. It is one of the rare cases where OPENXML is handy.
    Otherwise, XQuery rules the world.

    Get every single leaf node from XML:

    DECLARE @xml XML =  
    N'<Parent>  
     <DataExtensions>  
     <DataItem key="Key1" type="String">  
     <value>12345</value>  
     </DataItem>  
     <DataItem key="Key2" type="String">  
     <value>6789</value>  
     </DataItem>  
     </DataExtensions>  
     <TestKey1>12345</TestKey1>  
     <Answers>  
     <AnswersList>  
     <Entry key="Key2" type="Answer">  
     <value ID="Key1ID" Text="12"/>  
     </Entry>  
     <Entry key="Key3" type="Answer">  
     <value ID="Key3ID" Text="12345"/>  
     </Entry>  
     </AnswersList>  
     </Answers>  
     <TestKey2>12345</TestKey2>  
     <TestKey3>67890</TestKey3>  
     <TestKey4>12345</TestKey4>  
    </Parent>';  
      
    DECLARE @DocHandle INT;  
    EXEC sp_xml_preparedocument @DocHandle OUTPUT, @xml;  
      
    ;WITH rs AS  
    (  
       SELECT * FROM OPENXML(@DocHandle,'/*')  
    ), cte AS  
    (  
       -- anchor  
        SELECT id  
          ,ParentID  
          , nodetype  
          , [text]  
          ,CAST(id AS VARCHAR(100)) AS [Path]  
          ,CAST('/' + rs.localname AS VARCHAR(1000))   
       + N'['   
            + CAST(ROW_NUMBER() OVER(ORDER BY (SELECT 1)) AS NVARCHAR)    
            + N']' AS [XPath]  
        FROM rs  
        WHERE ParentID IS NULL  
        UNION ALL  
        --recursive member  
        SELECT t.id  
          ,t.ParentID  
          , nodetype = (SELECT nodetype FROM rs WHERE id = t.ParentID)  
          , t.[text]  
          , CAST(a.[Path] + ',' + CAST( t.ID AS VARCHAR(100)) AS VARCHAR(100)) AS [Path]  
          , CAST(a.[XPath] + '/' + IIF(t.nodetype = 2, '@', '')  
             + t.localname AS VARCHAR(1000))   
      + N'['  
     + CAST(ROW_NUMBER() OVER(PARTITION BY t.localname ORDER BY (SELECT 1)) AS NVARCHAR)  
     + N']' AS [XPath]  
        FROM rs AS t  
          INNER JOIN cte AS a ON t.ParentId = a.id  
    )  
    SELECT ID, ParentID, nodetype, [Path]  
       , REPLACE([XPath],'[1]/#text[1]','') AS XPath, [text] AS [Value]  
    FROM cte  
    WHERE [text] IS NOT NULL  
    ORDER BY [Path];  
      
    EXEC sp_xml_removedocument @DocHandle;  
    

    Output
    ID ParentID nodetype XPath Value
    35 11 1 /Parent[1]/TestKey1 12345
    36 15 2 /Parent[1]/Answers[1]/AnswersList[1]/Entry[1]/@Key Key2
    37 16 2 /Parent[1]/Answers[1]/AnswersList[1]/Entry[1]/@type Answer
    38 18 2 /Parent[1]/Answers[1]/AnswersList[1]/Entry[1]/value[1]/@ID Key1ID
    39 19 2 /Parent[1]/Answers[1]/AnswersList[1]/Entry[1]/value[1]/@Text 12
    40 21 2 /Parent[1]/Answers[1]/AnswersList[1]/Entry[2]/@Key Key3
    41 22 2 /Parent[1]/Answers[1]/AnswersList[1]/Entry[2]/@type Answer
    42 24 2 /Parent[1]/Answers[1]/AnswersList[1]/Entry[2]/value[1]/@ID Key3ID
    43 25 2 /Parent[1]/Answers[1]/AnswersList[1]/Entry[2]/value[1]/@Text 12345
    29 4 2 /Parent[1]/DataExtensions[1]/DataItem[1]/@Key Key1
    30 5 2 /Parent[1]/DataExtensions[1]/DataItem[1]/@type String
    31 6 1 /Parent[1]/DataExtensions[1]/DataItem[1]/value 12345
    34 10 1 /Parent[1]/DataExtensions[1]/DataItem[2]/value 6789
    32 8 2 /Parent[1]/DataExtensions[1]/DataItem[2]/@Key Key2
    33 9 2 /Parent[1]/DataExtensions[1]/DataItem[2]/@type String
    44 26 1 /Parent[1]/TestKey2 12345
    45 27 1 /Parent[1]/TestKey3 67890
    46 28 1 /Parent[1]/TestKey4 12345

    1 person found this answer helpful.

3 additional answers

Sort by: Most helpful
  1. MelissaMa-MSFT 24,221 Reputation points
    2020-09-03T02:57:39.997+00:00

    Hi @JCE H ,

    Thanks very much for your sharing.

    I also did some research from myside.

    1. Normally we could use OpenXML or XQuery to shred XML. XQuery is usually faster than OpenXML when shredding small XML documents, but slower when shredding large documents. XQuery is much more flexible than OpenXML. You could refer Basics of XML and SQL Server, Part 2: Shredding XML for more details.
    2. I also found one expert mentioned that having typed the XML using a schema and created an XML Index which would have a big performance improvement. Please refer the answer in The Best Way to shred XML data into SQL Server database columns.
    3. You could also refer one lastest and good article about shreding XML in Let's XML Shred!.
    4. You could also consider to use SSIS: Shred data from XML file in folder into columns of a table .

    If the response is helpful, please click "Accept Answer" and upvote it.

    Best regards
    Melissa


  2. Jeff Moden 11 Reputation points
    2020-09-08T14:16:57.527+00:00

    I appreciate the honorable mention but, apparently, I didn't inspire the right way. The first thing I see in the function is that not only is it an mTVF instead of a high performance iTVF but it also has a WHILE loop in it. I've not done a deep dive on what has been done but it's seriously likely that I'd create an iTVF instead of an mTVF and the chances of me using a WHILE loop for such a thing are virtually nonexistent.


  3. JCE H 41 Reputation points
    2020-09-02T15:33:23.997+00:00

    USAGE

    DECLARE @XML TABLE(LocalIndex_xml int IDENTITY(1000,1) NOT NULL,XMLreceived varchar(8000) NOT NULL)
    INSERT INTO @XML (XMLreceived) VALUES('<Organisation orgRecordClass="RC1"><Name>NHS CHILTERN CCG</Name><Date><Type value="Operational"/><Start value="2011-12-13"/></Date><Date><Type value="Legal"/><Start value="2013-04-01"/><End value="2018-03-31"/></Date><OrgId root="2.16.840.1.113883.2.1.3.2.4.18.48" assigningAuthorityName="HSCIC" extension="10H"/><Status value="Active"/><LastChangeDate value="2019-03-28"/><GeoLoc><Location><AddrLn1>ROOM XR37, 2ND FLOOR</AddrLn1><AddrLn2>AMERSHAM HOSPITAL</AddrLn2><AddrLn3>WHIELDEN STREET</AddrLn3><Town>AMERSHAM</Town><County>BUCKINGHAMSHIRE</County><PostCode>HP7 0JD</PostCode><Country>ENGLAND</Country></Location></GeoLoc><Roles><Role id="RO98" uniqueRoleId="161501" primaryRole="true"><Date><Type value="Operational"/><Start value="2011-12-13"/></Date><Date><Type value="Legal"/><Start value="2013-04-01"/><End value="2018-03-31"/></Date><Status value="Active"/></Role></Roles><Rels><Rel id="RE5" uniqueRelId="526349"><Date><Type value="Operational"/><Start value="2017-04-01"/><End value="2019-03-28"/></Date><Status value="Inactive"/><Target><OrgId root="2.16.840.1.113883.2.1.3.2.4.18.48" assigningAuthorityName="HSCIC" extension="69JAV"/><PrimaryRoleId id="RO173" uniqueRoleId="295013"/></Target></Rel><Rel id="RE5" uniqueRelId="372161"><Date><Type value="Operational"/><Start value="2015-04-01"/></Date><Date><Type value="Legal"/><Start value="2015-04-01"/><End value="2018-03-31"/></Date><Status value="Active"/><Target><OrgId root="2.16.840.1.113883.2.1.3.2.4.18.48" assigningAuthorityName="HSCIC" extension="Q82"/><PrimaryRoleId id="RO210" uniqueRoleId="202026"/></Target></Rel><Rel id="RE5" uniqueRelId="186985"><Date><Type value="Operational"/><Start value="2012-10-01"/><End value="2015-03-31"/></Date><Date><Type value="Legal"/><Start value="2013-04-01"/><End value="2015-03-31"/></Date><Status value="Inactive"/><Target><OrgId root="2.16.840.1.113883.2.1.3.2.4.18.48" assigningAuthorityName="HSCIC" extension="Q69"/><PrimaryRoleId id="RO210" uniqueRoleId="163564"/></Target></Rel></Rels><Succs><Succ uniqueSuccId="37464"><Date><Type value="Legal"/><Start value="2018-04-01"/></Date><Type>Successor</Type><Target><OrgId root="2.16.840.1.113883.2.1.3.2.4.18.48" assigningAuthorityName="HSCIC" extension="14Y"/><PrimaryRoleId id="RO98" uniqueRoleId="297653"/></Target></Succ></Succs></Organisation>')

    SELECT LocalIndex_xml+ROW_NUMBER() OVER(ORDER BY Seq) as LocalIndex_XML3,LocalIndex_xml as LocalIndex_XML2,i.NodeLevel,i.NodePath,i.ElementName,i.AttributeName,i.AttributeValue
    FROM (--i
    SELECT
    x.LocalIndex_xml
    ,MIN(Seq) as Seq
    ,MAX(s.NodeLevel) as NodeLevel
    ,s.NodePath
    ,s.Element as ElementName
    ,MAX(CASE WHEN s.ElementOrAttribute='a' THEN s.AttributeOrValue END) as AttributeName
    ,MAX(CASE WHEN s.ElementOrAttribute IN('v','V') THEN s.AttributeOrValue END) as AttributeValue
    FROM @XML as x
    CROSS APPLY dbo.tfnFindXMLnodes(x.XMLreceived) as s
    GROUP BY x.LocalIndex_xml
    ,s.NodePath
    ,s.Element --as ElementName
    ) as i
    --WHERE i.AttributeValue IS NOT NULL

    0 comments No comments

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.