How good is T-SQL at generating XML for multiple rows?

Naomi 7,361 Reputation points
2022-02-13T20:17:16.777+00:00

Hi everybody,

We have a complex extract process which first selects rows of potential updates (say, 100K rows of Ids), then makes a loop over these 100K rows and calls a stored procedure that returns XML through an output parameter; this XML is inserted into another table and also a SHA value is calculated using a custom UDF. The procedure that generates the XML is quite complex: it creates a table variable with supporting data first, then uses a very complex query with many FOR XML PATH clauses inside, then removes empty nodes with

`SET @XML.modify('delete //*[not(node()) and not(./@*)]');`

before finally returning this as a result through the output parameter.

My question is: do you think it's possible to change that procedure into a table-valued function (and if possible an inline one, although I don't see how to remove the empty nodes then, unless there is a way to run the statement above on multiple rows after returning them; if yes, how)? What is a reasonable number of rows (in your experience) for generating such XML (in SQL 2016)? Assuming we can change that procedure into a table-valued function, we could call it for, say, 500 rows at once.
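
For what it's worth, .modify() is also allowed in the SET clause of an UPDATE against an xml column, so the delete above can be applied to a whole batch of rows in a single statement instead of once per row. A minimal sketch, assuming a hypothetical staging table that already holds the generated XML:

SQL

    -- @Extract stands in for whatever table receives the generated XML
    DECLARE @Extract TABLE (Id INT PRIMARY KEY, Doc XML);

    INSERT INTO @Extract (Id, Doc)
    VALUES (1, N'<row><a>1</a><b/></row>'),
           (2, N'<row><a/><b>2</b></row>');

    -- Delete every element with no child nodes and no attributes, in all rows at once
    UPDATE @Extract
    SET Doc.modify('delete //*[not(node()) and not(./@*)]');

    SELECT Id, Doc FROM @Extract;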

Thanks in advance.


6 answers

  1. Ronen Ariely 15,096 Reputation points
    2022-02-14T03:25:22.353+00:00

    Hi,

    Without information to reproduce the specific scenario, I don't think we can say what the best way is, but we can discuss some relevant golden rules (which can help a lot)...

    then makes a loop over these 100K rows

    (1) In general, loops should be avoided in tabular databases like SQL Server, which are designed to work on SETs of rows; a T-SQL loop will probably perform worse than looping over a collection (like an array) in a language such as C++.

    When working with SQL Server, you should not reach for a T-SQL loop. Try to solve the need without a loop, and compare set-based T-SQL against, for example, a loop in C# on the client.

    For another extract I changed the XML generating process to do it as a single step for all rows and this cut the time from 70 min. to 2 min.

    This makes sense, as I mentioned above.
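
    To make this concrete, here is a minimal sketch of the set-based shape (the source table and columns are hypothetical): one query produces the per-row XML for the whole batch, with no loop and no per-row procedure call.

    SQL

    DECLARE @Ids TABLE (Id INT PRIMARY KEY);
    INSERT INTO @Ids (Id) VALUES (1), (2), (3);

    -- One statement generates the XML for every Id in the batch
    SELECT i.Id,
           (SELECT d.Col1, d.Col2
            FROM dbo.SourceData AS d   -- hypothetical source table
            WHERE d.Id = i.Id
            FOR XML PATH('row'), TYPE) AS Doc
    FROM @Ids AS i;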

    calls a stored procedure

    (2) Do you mean that you call the SP for each row?!?

    Or once for the entire SET, which probably makes more sense (per-row calls add overhead)?

    this XML is inserted into another table and also a SHA value is calculated using a custom UDF

    (3) So... if you meant that the SP is called for each row, does that also mean the UDF is executed for each row inside the loop?

    That sounds like a bad design.

    Note! DDL + DML + SP code could be very helpful to the discussion!

    The procedure that generates the XML is quite complex: it creates a table variable with supporting data first, then uses a very complex query with many FOR XML PATH clauses inside

    (4) This by itself might mean that it would be much better to run this on the client side, which adds to the reasons not to use T-SQL.

    @XML.modify

    (5) This is a CLR method that uses XQuery; in general it is the same as executing a SQLCLR UDF that runs the XQuery, and it can be done on the client side (using XQSharp, for example), which might provide better performance.

    Note!!! SQL Server uses an implementation of a VERY old (outdated) version of XQuery 1.0 (with XPath 1.0), aligned with the July 2004 working draft of XQuery.

    (6) The current W3C standard is XQuery 3.1 (with XPath 3.1). XQuery 3.1 extends XQuery to support JSON as well as XML, adding maps and arrays to the data model and supporting them with new expressions and new functions in the language. It can provide much better performance and is more flexible.

    In your client app you can use XQuery 3.1, which was released in 2017. You might gain some performance improvement simply by using XQuery 3.1 instead of version 1.0 and by using the new features.

    Note: as far as I remember, a feedback item asking Microsoft to update the version of XQuery has been open in every version of the Microsoft feedback system. In the current feedback system it is not well formatted (probably poorly copied over automatically from the old system), but you can still find the two related open feedback items.

    (7) Note! SQL Server supports both XQuery and OpenXML (using sp_xml_preparedocument and sp_xml_removedocument). As a golden rule, OpenXML provides better performance for shredding large documents, while XQuery is better for small documents.

    Think about using OpenXML instead of XQuery.
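
    As a minimal OpenXML sketch (the document and column names here are made up), the pattern is: prepare the document once, shred it with OPENXML, and release the handle.

    SQL

    DECLARE @doc XML = N'<rows><row id="1" name="a"/><row id="2" name="b"/></rows>';
    DECLARE @hdoc INT;

    -- Parse the document once and get a handle to it
    EXEC sp_xml_preparedocument @hdoc OUTPUT, @doc;

    -- Shred it into rows; flag 1 = attribute-centric mapping
    SELECT id, name
    FROM OPENXML(@hdoc, '/rows/row', 1)
    WITH (id INT '@id', name NVARCHAR(50) '@name');

    -- Always release the handle, or the parsed document stays in memory
    EXEC sp_xml_removedocument @hdoc;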

    ----------

    Ronen Ariely
    Personal Site | Blog | Facebook | Linkedin

    1 person found this answer helpful.

  2. Erland Sommarskog 101K Reputation points MVP
    2022-02-13T20:41:23.513+00:00

    It's impossible to give a good answer based on that small amount of information. We would need to see the existing code. And we might still give the wrong answer, because we don't understand the overall process. Why is this XML generated in the first place?


  3. Naomi 7,361 Reputation points
    2022-02-13T22:17:24.707+00:00

    The final output is supposed to be the concatenated XML of all changed rows (the final output is generated by an SSIS package). The first step in the process is to generate XML for all rows identified as possibly changed, store the latest XML as well as a calculated SHA value (using a special algorithm), and then send only the rows where the SHA value doesn't match the last one. The process of generating the final output currently takes from 20 min. to ~1 hour or more, and I'm thinking about which step is most likely to benefit from optimization. For another extract I changed the XML generating process to do it as a single step for all rows and this cut the time from 70 min. to 2 min., but that was a much simpler process with only 2 procedures. I want to try to optimize the current process for the more complex output...


  4. Yitzhak Khabinsky 24,946 Reputation points
    2022-02-14T19:02:45.793+00:00

    Hi @Naomi ,

    As usual, without a minimal reproducible example, it is impossible to help you.

    Here is a conceptual example for you. It covers two methods.
    The first method calculates a [Checksum] out of the XML representation of the data. I am guessing this is exactly what you currently have.
    The second method uses the set-based relational INTERSECT operator to do the same. Obviously, it is simpler and 'better' than Method #1.

    You would need to comment out one of the methods while testing.

    As usual, the answer follows the minimal reproducible example paradigm: you copy it into SSMS as-is, run it, and it works as the doctor ordered.

    SQL

    -- DDL and sample data population, start
    DECLARE @Source TABLE (APK INT IDENTITY PRIMARY KEY, ID_NUMBER INT, UpdatedOn DATETIMEOFFSET(3));
    DECLARE @Target TABLE (APK INT IDENTITY PRIMARY KEY, ID_NUMBER INT, UpdatedOn DATETIMEOFFSET(3));

    INSERT INTO @Source (ID_NUMBER)
      VALUES (NULL), (NULL), (7), (7), (5);

    INSERT INTO @Target (ID_NUMBER)
      VALUES (NULL), (7), (NULL), (7), (4);
    -- DDL and sample data population, end

    SELECT * FROM @Source;
    SELECT * FROM @Target;

    -- Method #1
    WITH source AS
    (
        SELECT sp.*, HASHBYTES('sha2_256', xmlcol) AS [Checksum]
        FROM @Source AS sp
        CROSS APPLY (SELECT sp.* FOR XML RAW) AS x(xmlcol)
    ), target AS
    (
        SELECT sp.*, HASHBYTES('sha2_256', xmlcol) AS [Checksum]
        FROM @Target AS sp
        CROSS APPLY (SELECT sp.* FOR XML RAW) AS x(xmlcol)
    )
    UPDATE T
    SET T.ID_NUMBER = S.ID_NUMBER
      , T.UpdatedOn = SYSDATETIMEOFFSET()
    FROM target AS T
        INNER JOIN source AS S
            ON T.APK = S.APK
    WHERE T.[Checksum] <> S.[Checksum];

    -- Method #2
    UPDATE T
    SET T.ID_NUMBER = S.ID_NUMBER
      , T.UpdatedOn = SYSDATETIMEOFFSET()
    FROM @Target AS T
        INNER JOIN @Source AS S
            ON T.APK = S.APK
    WHERE NOT EXISTS (SELECT S.* INTERSECT SELECT T.*);

    -- test
    SELECT * FROM @Target;
    

  5. Naomi 7,361 Reputation points
    2022-02-14T20:02:31.407+00:00

    Thanks, we're only interested in Method #1 (there are too many tables involved, and we need to generate the final output as XML anyway). I did rewrite the procedure into a function that currently returns XML. My question is: how can I incorporate the original approach of removing empty tags into that function, and if that's impossible, would it still help to produce the XML for, say, 500 rows at once and then run a loop to remove the empty tags?
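
    One possible shape for this, as a sketch only (the function, table, and column names are hypothetical): .modify() is a mutator and cannot be called inside a function, so the inline table-valued function returns the raw XML, the batch is landed in a staging table, and a single set-based UPDATE then removes the empty tags from all rows, with no loop.

    SQL

    CREATE FUNCTION dbo.GetRowXml (@Id INT)
    RETURNS TABLE
    AS
    RETURN
        SELECT (SELECT d.Col1, d.Col2
                FROM dbo.SourceData AS d   -- hypothetical source table
                WHERE d.Id = @Id
                FOR XML PATH('row'), TYPE) AS Doc;
    GO

    -- Generate the XML for a batch of Ids and land it in a staging table...
    INSERT INTO dbo.ExtractStaging (Id, Doc)
    SELECT i.Id, f.Doc
    FROM dbo.IdsToProcess AS i
    CROSS APPLY dbo.GetRowXml(i.Id) AS f;

    -- ...then strip the empty tags from every row in one statement (no loop)
    UPDATE dbo.ExtractStaging
    SET Doc.modify('delete //*[not(node()) and not(./@*)]')
    WHERE Doc IS NOT NULL;   -- modify() raises an error on NULL values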