Developer Friendly XML Extraction From SQL Server
This article explains an approach which can used to extract data for Ad Hoc purposes by regular developers who are more used to .Net than SQL Server.
Target Audience
This article is aimed at regular developers who find themselves needing to extract some XML to produce a report, provide test data or hand to a third party but who have only a passing familiarity with SQL and SQL Server. Maybe you're working in a small business. Maybe your DBA is off on his holidays and some big cheese decided he wants that data today.
Maybe you're a contractor who just got handed the hot potato... on the day a dozen of these exports NEED delivering.
Sounds like quite a specific scenario?
These are the particular events which inspired this article.
Whatever the reason, you might find yourself in a similar bind one day and if you don't happen to have the notes from the last time you did this then it can be quite a challenge.
If you spend all day every day doing ETL with SQL server databases then this is not really aimed at you. Not that the following is particularly inefficient or bad practice, you just probably already know the magic combination of words. On the other hand, maybe you're a BI developer but you don't do XML exports and you just got given this requirement which needs doing yesterday.
Overview
You could do this using an SSIS package but that means the non SQL developer suddenly has to get to grips with SSIS. You'd also need to run the package on the server and you might have authority issues there.
If this is purely a one off requirement and QUICK AND DIRTY will do the job you could use MS Access, skip to the end of the article for more about this.
The approach used is to collect the data using FOR XML, output this to a flat file using BCP. This is run from a bat file which appends the result to another file in order to prefix the result with the xml header that will make it rather more programme - usable.
There are some pitfalls along the way and you need quite a particular combination of "modes" on that for xml, a particular set of switches on the bcp.
More about this follows in the details, so let's get started.
Requirements
You will need SQL Server Management Studio and BCP. The versions you get with SQL Express will do. Alternatively, you can remote onto the sql server box to do this.
Read the next section very carefully!
This script uses AdventureWorks2012, which you could download and install from Microsoft here or you could just follow along on a table of your own and adapt the script to select a few fields out of that,
There is a potential issue with this approach. It works well for plain ansii characters but accented characters are "escaped"
You will find that the "for xml" converts any characters with accents or that are generally not plain ansii. This can possibly mean later processing has issues. It could conceivably be a major headache if your data is not stored in Latin characters.
Precautions
If you're not used to this sort of thing then be aware that extracting huge amounts of data from a busy live database can impact performance.
At the very least you should minimise the number of records you will be reading by using "Select Top 2".
Think about how many records you'll be reading. If you don't know how many records here is an efficient query which returns an approximate number of rows in a table:
use [AdventureWorks2012]
go
SELECT CONVERT(bigint, rows)
FROM sysindexes
WHERE id = OBJECT_ID('[HumanResources].[Employee]')
AND indid < 2
Obviously, substitute your database name at the top there and your table name for the OBJECT_ID.
Should you need to know more about this particular row count subject, there is an explanation here.
Writing the Extract SQL
An assumption here is that you know enough about SQL to write whatever extract you're trying to do. Your manager would be CRAZY to give you a task which was just completely impossible. Right ?
If you aren't very confident writing SQL you can use the Query Editor. Once you select a database there's the obvious "New Query" button which gives you a query pane to type in. What isn't particularly obvious is that you can right click and choose "Design Query in Editor" which allows you to get started by choosing tables and fields to include.
Baby Steps
Fire up SQL Server Management Studio or Visual Studio if you have SQL Server Data Tools installed.
Put a simple select version of your query together.
Here's an example with some added bits to bear in mind.
use [AdventureWorks2012]
go
SET NoCount on
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
select top 2
LoginID
,JobTitle
,BirthDate
,MaritalStatus
from AdventureWorks2012.HumanResources.Employee
The Use database, go... at the top sets the database you will be using.
This is good practice since it's all too easy to open up management studio and try and query Master.
Set NoCount on stops your query returning the row count. It's included here since (again) it's good practice generally.
The isolation level read uncommitted stops the query from applying read locks. This is generally a good thing if you're reading a load of data but you might consider leaving this off if you know you're only reading a few records on a lightly used database.
Because you're probably going to have a few goes with this before you use it for the full extract, there is no point in selecting more than a couple of trial rows so that "top 2" keeps this a very low impact query.
Notice that each of the fields extracted goes on a separate line and those after the first have a comma in front of them. It's a good idea to work this way because you will probably want to manipulate some of these fields, eg use coalesce/isnull to give default values where null, or format field types like dates,
By prefixing the field name with a comma you are safe to just delete a line (other than the first one) without leaving an extra comma trailing to cause a problem.
You remembered to include that "Top 2" clause to minimise the number of records you select?
OK, let's spin her up then.
Hit f5 ( yes, rather like debugging an app in Visual Studio ).
Assuming this works you get your results back in the Result Pane underneath your query.
Fix up any problems and repeat until your output is looking good.
You're ready to produce some XML now.
For XML Auto
Append " for xml Auto" to the end of your query. As it pretty obviously suggests, this is going to give you some xml. The Auto refers to the sort of "shape" to that xml. Once it runs you get a slightly odd output with a header XML_ then some sort of Guid like string. Click the link like line below to expand it.
You will then see a result like this:
(Note that the above image is truncated because wide files are scaled down in this site and the full width would end up too small.)
As you look at that picture, you probably notice the rather obvious red squiggly line.
Yes, this is a bad thing and it means there's an error.
The error message there says "XML document cannot contain multiple root level elements".
That's because there is no Root element. It depends on your purposes but you probably want one of those things.
The fix for that is to tell it to add a root to the xml, which you do by adding
, root('OptionalRootName')
As the parameter implies, you can give the root a name by including it quoted in brackets or miss the brackets and contents out entirely and just go with the default of "Root".
Those two lines we have there are an XML Node per row with Attributes for our fields.
Most requirements are for nested Nodes per field. You get that by appending
, Elements
It's also pretty unlikely that you want to a name for those record nodes of "AdventureWorks2012.HumanResources.Employee" and the simplest way round that is just to alias your results.
Select .....
From ....
As Employee
If we put all that together our script now looks like:
use [AdventureWorks2012]
go
SET NoCount on
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
select top 2
LoginID
,JobTitle
,BirthDate
,MaritalStatus
from AdventureWorks2012.HumanResources.Employee
for xml Auto, Elements, root('Employees')
Run that, expand the output and you will see:
Well that looks pretty good.
Most of the boxes are ticked, let's move on from the T-SQL to making this output an xml file.
Export to File
It would be great if at this point you could just select an option and that xml would be output to an xml file.
There is no such easy option though.
You can right click the underlined output and choose "Copy" to paste this into notepad.
That's fine for small pieces of data but result sets of any substantial size will be truncated. It's also manual which will interfere with your sleep if you need to run this thing live at 4am in the morning.
The way to get this output into a file is to use BCP. That allows you to redirect output conveniently and doesn't truncate you xml.
Introduction to BCP
BCP stands for Bulk Copy Programme ( program if you're American ). This is an "old school" utility in that there's none of that fancy graphical user interface to make things easy. This thing is controlled using command line switches which can look pretty scary to the uninitiated.
In order to run a query and redirect the output to an xml file you use the following pattern:
bcp "your script" queryout xmlFileName -w -T
Where your script is the sql server select script, xmlFileName will be something like c:\myextract.xml.
The switches -w -T are very important and specify
-w = Unicode - other options will munge your output
-T =Trusted connection using integrated security
That's fine if you're running on the server against the default instance.
If you're running this on your PC against a server database, you also need to tell it to use a different server using -S. As the above link explains the syntax for this is -S<server_name\instance_name>. Thus -S SomeOtherServer\NameOfInstance. In case you're thinking "What's instance name?" - you can run the database engine (sqlservr.exe) more than once on a machine. Each will be for a different "instance". You can name each instance, even if you just have the one.
Note that the S is a capital S, you can have problems where it doesn't find the instance otherwise.
Before you start making your bat file, there's something you need to know though. You will be running your query from BCP in one line. Anything but quite a simple select statement is going to get very unwieldy pretty quick here so your best bet is to turn your statement into a Stored Procedure. You can then "exec" that stored procedure.
Making a Stored Procedure
It's pretty easy to make a stored procedure out of your script. Before you do though, bear in mind you are possibly about to stray into someone else's territory here. If you are not a DBA and your company has someone else who usually does this sort of database work then you want to try and conform to any naming conventions they use.
In SQL Server Management Studio, expand the "Programmability" node under your database. Expand the Store Procedures node revealed. Take a look at how the stored procedures are named. If there is a pattern, adopt it. Otherwise, give your stored procedures some sort of a prefix which will group them together and a meaningful but short indication about what they're doing.
An example would be "XML_Extract_Employees".
Once you decide what you're going to call it, the simplest approach is to insert "Create Procedure dbo,yourName as" straight after you specify the database and add a "go" on the end.
In our example that then looks like:
use [AdventureWorks2012]
go
CREATE PROCEDURE dbo.XML_Extract_Employees
as
SET NoCount on
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
select top 2
LoginID
,JobTitle
,BirthDate
,MaritalStatus
from AdventureWorks2012.HumanResources.Employee as Employee
for xml Auto, Elements, root('Employees')
go
Hit f5 to run it, fix and repeat until you see "Command(s) completed successfully".
Your new stored procedure will not appear under the Stored Procedures node immediately.
Right click that node and choose Refresh to force it to read the list again.
You can do this next step manually pretty easily in a new query pane once you know how.
This first time let's go through the longer easier process first.
Assuming your stored procedure appears, you can right click it and select "Execute Stored Procedure" from the list.
That creates a new query pane.
You'll notice you have a return value set up, you can lose that, specify the Database name in the exec and reduce it to one line:
EXEC [AdventureWorks2012].[dbo].[XML_Extract_Employees]
Hit f5 and you will see than now familiar output in the Results pane below.
Note
The query is still limited to two records - we'll right click and use alter query to fix that later.
You will usually see a spurious error. Your stored procedure will be underlined with a red squiggly but still work. You can resolve that easily, choose Edit > Intellisense > Refresh local cache.
Bat File
You now have a stored procedure which will output your xml and it's time for command line fun.
You could type all this stuff in manually but that's error prone and these sort of extracts have a habit of turning into a regular requirement so you want some way to re-run easily. Preferably a way you could schedule.
The way to do this is to build a bat file, which you do in notepad.
You do that by starting off with a regular txt file, type/paste your commands in it and save. Set File Manager so your file extensions aren't hidden (if you haven't already). Then rename the file extension from txt to bat.
Double clicking a bat file runs the commands so remember to right click in order to edit it.
Start up Notepad and type in:
bcp "your stored procedure name" queryout c:\text.xml -w -T
With our example and a different drive that would look like:
bcp "EXEC [AdventureWorks2012].[dbo].[XML_Extract_Employees]" queryout D:\test.xml -w -T
Save and rename as explained above.
Then double click it to run.,
You should see a black "dos box" appear for a short while, bcp does it's stuff and the dos box then disappears.
Find your test.xml file and double click that.
Depending on your file extension associations, it will probably open in Internet Explorer and you can check it looks OK. (See below).
There is one little fly in the ointment though.
That file has no xml heading, that "<?``xml
version``=``"1.0"
?>" which you see on the first line is automatically added by Internet Explorer.You could do this:
- Manually - by appending your output in a file.
- In T-SQL - but this is rather tricky for the intended audience
- In the bat file - which is chosen here because it can be scheduled and is the easier option.
Create another txt file and paste in <?``xml
version``=``"1.0"
?> before saving it.
Append two "type" command lines to your bat file so it looks something like:
bcp "EXEC [AdventureWorks2012].[dbo].[XML_Extract_Employees]" queryout test.xml -w -T
type xmlHeader.txt > xmlEmployees.xml
type test.xml >> xmlEmployees.xml
Here, xmlHeader is your txt file with that xml version... line in it.
Type can copy and overwrite when used with a single > sign or append when used with two.
That header file is first copied to xmlEmployees.xml so this process and the file itself is re-usable.
The output from bcp is then appended.
Save that, double click to run and double click your xmlEmployees ( or equivalent ) to check it's OK in Internet Explorer.
One last step.
Fix the Record Limit
OK, you've proven this thing works end to end with 2 records. Go back into SQL Server Management Studio. Find your stored procedure, right click it in the Object Explorer window and choose Modify.
Strip out the "top 2" and hit f5 to apply the changes using the Alter Procedure script sql server gave you.
You can now run this thing.
Only 11 more to go ;^)
Checking Output
The output of this is one huge long continuous string which is spectacularly unfriendly to the developer trying to read it raw.
For small xml files you can just double click them and Internet Explorer will open the file up all nicely formatted. Old versions of Internet Explorer used to cope with large xml files but it seems something has crept in over the years. You will find large files display in a rather alarmingly unformatted manner which will probably make you think they're broken.
Before you start tearing your hair out trying to find any bad data, check the file using Visual Studio.
Drag your file over onto the edit pane and it'll open up as just one line.
From the menu choose Edit > Advanced > Format Document.
If it's good xml, this should format the output like you're probably used to seeing:
<?xml version="1.0" ?>
<Employees>
<Employee>
<LoginID>adventure-works\ken0</LoginID>
<JobTitle>Chief Executive Officer</JobTitle>
<BirthDate>1963-03-02</BirthDate>
<MaritalStatus>S</MaritalStatus>
</Employee>
<Employee>
<LoginID>adventure-works\terri0</LoginID>
<JobTitle>Vice President of Engineering</JobTitle>
<BirthDate>1965-09-01</BirthDate>
<MaritalStatus>S</MaritalStatus>
</Employee>
.......... etc ...........
Quick and Dirty Alternative
If you have MS Access available you can sometimes export a table from that as xml very easily. This is, however, going to be a rather unreliable and manual approach.
Create a new MS Access database.
Close the default table you start with.
You can clipboard the output from a SQL Server Management Studio select query, open a new table in datasheet view and paste your data in as a new table, save as something more meaningful than table1.
Alternatively, link a table:
From the "External Data" menu choose ODBC Database.
Choose the Link option.
Create a connection to your SQL Server database, or pick a DSN you already have.
Pick your table(s),
Note that this will still be reading data and you remember the precautions section up there at the top.
You can right click a linked Table or View and choose Export > xml.
This will mysteriously fail fairly often - back to Plan A.