Load CSV file with slightly different schemas during each run

Question

I have a CSV file with Column headers as the first line but often sometimes the column position will vary but will be available in a different place. Please help me to advise how to load this dynamic schema change in CSV file to SQL server tables. Currently I am using .net to load this data from CSV to SQL tables, but open to any other language or tool which can handle this change without modifying much in the code for future.

Answer

Hi @Shreedevi Sendarapatti Ramamoorthy,

*.csv/txt flat files always have host of problems where column delimiters, column separators, invisible characters like null terminators \0, special characters based on encoding, and line breaks are in the middle of the actual data. Or data elements are missing completely, i.e. column positions will vary.

The most reliable format for data feeds is XML enforced by an XSD. An XSD plays a role of a data contract between sender and receiver. It will guarantee proper data format/shape, data types, cardinality, and enforce data quality.

For example, your XML file has the following structure:



  
    Book A
    Jack Welch
  
  
    Book B
    Henry Ford

Sequential position or missing XML elements title, author, etc. it is all irrelevant. Pass a fully qualified XML file path @Filepath to a stored procedure.

/*
Step #1: Load XML file into single row db table
*/
DECLARE @tbl TABLE (XMLColumn XML);
DECLARE @Filepath VARCHAR(100) = 'e:\Temp\books.xml';
DECLARE @sql NVARCHAR(MAX) = 
N'SELECT BulkColumn 
   FROM Openrowset(Bulk ' + QUOTENAME(@Filepath, NCHAR(39)) + ', Single_Blob) x;';
PRINT @sql;
INSERT INTO @tbl (XMLColumn)
EXEC sys.sp_executesql @sql;
-- just to see
SELECT * FROM @tbl;
/*
Step #2: Shred XML without dynamic SQL
*/
--INSERT INTO 	-- uncomment when you are ready
SELECT c.value('@id', 'INT') AS id
	, c.value('(title/text())[1]', 'NVARCHAR(50)') As title
	, c.value('(author/text())[1]', 'NVARCHAR(50)') As author
FROM @tbl
CROSS APPLY XMLColumn.nodes('/books/book') AS t(c);

Share via

Load CSV file with slightly different schemas during each run

1 answer