Best way to export data with LOB columns.

chrisrdba 361 Reputation points
2021-05-22T17:13:26.337+00:00

Greetings. I need to export several tables with LOB columns into new tables for long winded reasons. My current example has two XML columns on it, is 10 million records, and took about 12 hours to load through a job that calls SSIS.

A few fun facts:

  1. Nobody else is on the box, resources are fine, no blocking, etc.
  2. The source and destination are on the same box, the only real difference is they're on different disks, but this is a VM anyways, so not sure that really matters.
  3. From googling I'm under the impression this is totally common. I've seen a couple suggestions, but not really clear on a common solution.

Questions:

  1. Are there any settings within SSIS itself I can tweak to speed this up?
  2. Is there an better option than SSIS -- Bulk Insert, Insert...Select, etc.?
  3. Is there any reason to think something goofy like export all columns but LOB columns first, then do an update for LOB columns only would be faster?
  4. Because resources are fine, is there any reason to not export several tables at the same time, as there won't be resource contention?
  5. Editing the source to be an actual query instead of just relying on the GUI, I can convert one of the columns as such. If I use 1024, I get a warning about the column being too small for XML, but if I stick to 2048 I don't. Is this safe? Will it perform better? convert(varchar(2048), [RequestHashInfo]) as RequestHashInfo,
  6. My wait types are constantly ASYNC_NETWORK_IO -- why would this be under these circumstances?

Thanks!

SQL Server
SQL Server
A family of Microsoft relational database management and analysis systems for e-commerce, line-of-business, and data warehousing solutions.
13,052 questions
{count} votes

Accepted answer
  1. Erland Sommarskog 103.2K Reputation points MVP
    2021-05-22T18:41:15.59+00:00

    My wait types are constantly ASYNC_NETWORK_IO -- why would this be under these circumstances?

    ASYNC_NETWORK_IO means that SQL Server is waiting for the client to pick up data. It could be due to slow network, but more likely the client is slow with processing the data.

    I'm not sure why you brought in SSIS here. It sounds like you are copying the data to tables in the same database. In that case, it is probably not a good idea to extract the data to a client.

    I would do this with INSERT SELECT, having only the clustered in place, adding non-clustered indexes once data has been copied.

    I would also considering inserting the rows in batches to keep down the strain on the transaction log. But it is important that the batches are defined through the clustered index, since you will repeat a scan again and again. How many rows there should be per batch, depends on the size of those LOBs. Look at the total size of the table with sp_spaceused (the reserved column). Then compute the batch size, so that a batch is 500 MB in size on average. (There is no science behind that number; but I picked something which is not too big, nor too small.)

    0 comments No comments

3 additional answers

Sort by: Most helpful
  1. chrisrdba 361 Reputation points
    2021-05-22T19:09:53.293+00:00

    Thanks Erland -- I was wondering if this would be faster and coded it right after I posted the question. It's not done yet but looking very hopeful.

    Question just to satisfy my curiosity -- how/ where is ASYNC_NETWORK_IO coming into play? Literally the only piece besides the source/ destination DB is where the SSIS Package exists and is running the job from, but I can't imagine SSIS causes this overhead itself, does it?


  2. chrisrdba 361 Reputation points
    2021-05-22T22:26:31.9+00:00

    My 12 hour SSIS Package went down to a 1 hour INSERT...SELECT.

    I had no idea it would make that much of a difference.

    I'm off to enjoy the remainder of the weekend -- you do the same, and thanks!


  3. chrisrdba 361 Reputation points
    2021-05-23T20:39:11.773+00:00

    I started wondering more about this and did the following:

    1. Modified the SSIS Package to be an Execute SQL Task, containing the same INSERT...SELECT statement I was already using.
    2. Deployed the Package to the same server.
    3. Ran the job.

    It ran in 1 hour. Like the INSERT...SELECT from a job directly on the server, the plan had parallelism, and the wait types were mainly pageLatchIO_% and MEMORY_ALLOCATION_EXT.

    It would appear this isn't an issue with SSIS in general, but instead just a Data Flow Task issue.

    I have no idea how/ why doing the same export from a different Task produced such a radically different outcome, but it did.