Shredding multiline column using xml

I had an ETL project to copydata from one SQL Server to another. The conversion rules, the data constraints and other metadata were in an Excel spreadsheet. Many of the cells consisted of multiple rows of data. But how to extract the rows? I was able to shred these cells into SQL rows by using the SQL Server XML XQuery methods. I convert each line in the cell into an XML, then I use the SQL XML nodes method to shred each element into separate rows.

Before we can work with the xml, we need to do a little bit of housekeeping. We need to convert certain characters to ones that work with XML. For example XML doesn’t work with &, <, and >.  Then you append the opening and closing tag for each row (ie <row>…</row>) around the entire text of the column. Next you replace the end of line character with the closing tag followed by the opening tag (</row><row>). You would end up with a column value that looks like:

<row>First row</row><row>More Rows</row>

Finally you convert the text to the XML data type. You can use the .nodes method to shred the XML into separate records. Finally, you get the value for each row using the .value method.

Here is the script.

declare @tbl table (
col nvarchar(1000)

declare @nvalue nvarchar(1000);
declare @i int = 0;

while @i &lt; 10
set @nvalue = N'Row ' + cast(@i as nvarchar(2)) + nchar(13)+ nchar(10)
 + N'Date ' + cast(sysdatetime() as nvarchar(40)) + nchar(13)+ nchar(10)
 + N'Random ' + cast (rand() as nvarchar(40)) + nchar(13)+ nchar(10)
 + N'Control characters &amp; &lt; &gt;';

 insert @tbl values (@nvalue);

set @i += 1;

; with xmlRows as(
select t.*, 
cast ( (N'&lt;row&gt;' +
Replace(Replace(Replace(Replace(col, N'&amp;', '&amp;amp;')
, '&lt;', '&amp;lt;') , '&gt;', '&amp;gt;')
, nchar(13)+ nchar(10) , N'&lt;/row&gt;&lt;row&gt;')
+ N'&lt;/row&gt;')
as xml ) as xml 
from @tbl  t

select x.*, ('text()[1]', 'nvarchar(1000)') as rowValue
from xmlRows x 
cross apply x.xml.nodes ('row') as row(data);

This technique will work with other delimiters including tab, comma and pipe.

About Russ

Russ Loski is a data warehouse developer focusing on SSIS and SQL Server development. He has worked with SQL Server since 1998. He has used various technologies to move data from one source to others. He has worked for well-known companies such as the Bank of America, the Dallas Cowboys Football Club, T-Mobile and Coca Cola Enterprises. Currently he is developing a Data Warehouse for a large health care organization.
This entry was posted in Uncategorized and tagged , , . Bookmark the permalink.

Leave a Reply