Adding processing instructions and comments using XSLT

I have been playing with XSLT. I have learned how to return the entire XML document, adding an attribute here, an element there. But how do you insert a processing instruction or comment into your document?

This XSLT will simply copy your XML document, formatting it a little:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

The template element handles any node that matches its match attribute. In this case the template matches either of two kinds of nodes: @* matches any attribute, and node() matches any other node, including elements, text, processing instructions, and comments. When any of these nodes (i.e. anything) is found, the template makes a copy of it and applies the templates to its child attributes and nodes. Text nodes are copied along with everything else. Thus this template (the classic identity template) copies the entire document.

If I want to put a processing instruction and a comment before the root element of the document, I make the following change:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>
  <xsl:output method="xml" indent="yes"/>
    <xsl:template match="/">
    	<xsl:processing-instruction name="test"><xsl:text>value="1.0"</xsl:text></xsl:processing-instruction>
 	<xsl:comment><xsl:text>Something brilliant.</xsl:text></xsl:comment>
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

The template with match equal to "/" matches the document root. It first creates a processing instruction named "test" with the text value="1.0", then a comment, and then copies the rest of the document as before. This element:

<xsl:processing-instruction name="test"><xsl:text>value="1.0"</xsl:text></xsl:processing-instruction>

Produces:

<?test value="1.0"?>

Then it creates a comment.

<xsl:comment><xsl:text>Something brilliant.</xsl:text></xsl:comment>

Produces:

<!--Something brilliant.-->

Pretty powerful for a little bit of code!


Getting metadata from query using sp_describe_first_result_set

One of the tasks that eats a lot of my ETL development time is documenting the views and procedures I write. I need to know where exactly the data is coming from. sp_describe_first_result_set provides a flag for outputting the source for many of the columns in a query.

exec sys.sp_describe_first_result_set N'exec [Integration].[GetOrderUpdates] ''20100101'', ''20170122''', null, 1;

The final parameter, @browse_information_mode = 1, tells the procedure to dig a little to find the source for each of the columns. Now when I say "dig" I mean more like scraping the surface than deep research. It can identify the source table and column for a column that has been aliased, but even a simple function hides the source column from the procedure. For example, it is unable to identify the source column in the following expression:

CAST(o.OrderDate AS date) AS [Order Date Key],

But it is a great start. I was able to identify the source of about half of the columns returned from this query, and that is half that I don't have to hunt down.
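
If you want to filter or join this metadata, the same information is exposed through a table-valued counterpart, sys.dm_exec_describe_first_result_set, which takes the same arguments. A small sketch pulling just the documentation-related columns for the same call:

select column_ordinal, name, source_schema, source_table, source_column
from sys.dm_exec_describe_first_result_set(
    N'exec [Integration].[GetOrderUpdates] ''20100101'', ''20170122''', null, 1)
order by column_ordinal;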


Using a temporary table to handle sp_describe_first_result_set

In a previous post (Easing the Testing of Stored Procedures with Temporary Tables) I introduced the sp_describe_first_result_set stored procedure. This procedure works like sp_executesql, except that it returns the columns that a query would return rather than the results of the query. But it has a problem: the column name and the datatype come back in separate columns, so I have to do some shaping before I can use the results in another query.

But why not load a temporary table with the results of sp_describe_first_result_set and shape the results the way I like? I can then take the results, copy them into my test code, and away I go. You can tell I got into programming because I was lazy.

Drop table if exists #ResultStructure;
create table #ResultStructure (is_hidden bit NOT NULL
, column_ordinal int NOT NULL
, name sysname NULL
, is_nullable bit NOT NULL
, system_type_id int NOT NULL
, system_type_name nvarchar(256) NULL
, max_length smallint NOT NULL
, precision tinyint NOT NULL
, scale tinyint NOT NULL
, collation_name sysname NULL
, user_type_id int NULL
, user_type_database sysname NULL
, user_type_schema sysname NULL
, user_type_name sysname NULL
, assembly_qualified_type_name nvarchar(4000)
, xml_collection_id int NULL
, xml_collection_database sysname NULL
, xml_collection_schema sysname NULL
, xml_collection_name sysname NULL
, is_xml_document bit NOT NULL
, is_case_sensitive bit NOT NULL
, is_fixed_length_clr_type bit NOT NULL
, source_server sysname NULL
, source_database sysname NULL
, source_schema sysname NULL
, source_table sysname NULL
, source_column sysname NULL
, is_identity_column bit NULL
, is_part_of_unique_key bit NULL
, is_updateable bit NULL
, is_computed_column bit NULL
, is_sparse_column_set bit NULL
, ordinal_in_order_by_list smallint NULL
, order_by_list_length smallint NULL
, order_by_is_descending smallint NULL
, tds_type_id int NOT NULL
, tds_length int NOT NULL
, tds_collation_id int NULL
, tds_collation_sort_id tinyint NULL
);

Insert #ResultStructure
exec sys.sp_describe_first_result_set N'exec [Integration].[GetOrderUpdates] ''20100101'', ''20170122''';

select case when column_ordinal = 1 then 'create table #out (' else ', ' end
    + QUOTENAME(name) + ' ' + system_type_name
    + case when column_ordinal = max(column_ordinal) over () then ');' else '' end
from #ResultStructure
order by column_ordinal;
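
On SQL Server 2017 or later, STRING_AGG can assemble the whole CREATE TABLE statement as a single string instead of one fragment per row. A minimal sketch against the same temp table:

select 'create table #out ('
    + string_agg(quotename(name) + ' ' + system_type_name, ', ')
        within group (order by column_ordinal)
    + ');'
from #ResultStructure;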

 

Using view as source for application

Recently someone posed a puzzling question on the MSDN forums (How to reorder a column in a table which also has a foreign key). They needed to be able to add columns to a table in a specific order. If you know the basics of relational theory, you know that the order of columns in a table is not guaranteed. But the problem was that the software referencing the table expected columns in a particular order rather than by name. Olaf Helper had a suggestion which the person asking the question totally ignored: create a view with the columns in the correct order and reference the view from the application.

I admit that I can get very lazy when developing applications that reference SQL Server. It is easy to simply reference the table or use "Select * from." It works until someone makes a change (removes a column, etc.).

However, I can create a view and reference that view from my application code. I don't have to worry about cosmetic changes (column order, column renaming) affecting my application. I make sure that the view has the columns in the correct order and with the correct data types to make the process work. In fact, if I completely drop a column, I can replace it in the view with a calculated column (NULL, a constant, an expression). The only catch is that such a column is read-only.
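
Here is a minimal sketch of the idea; the table, view, and column names are invented for illustration:

-- Hypothetical base table; the application should not reference it directly.
create table dbo.Customer (
    CustomerID   int           not null primary key
  , CustomerName nvarchar(100) not null
  , CreatedWhen  datetime2(0)  not null
);
go

-- The application binds to this view instead. Column order, names, and even
-- a column dropped from the table (exposed here as a constant) are managed
-- in the view without touching application code.
create view dbo.Customer_AppView
as
select CustomerID
     , CustomerName
     , cast(null as nvarchar(20)) as LegacyRegionCode  -- dropped column; read-only placeholder
     , CreatedWhen
from dbo.Customer;
go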

This is very much the principle behind using interfaces when programming. As long as my interface doesn’t change, I can change how the code is implemented without breaking the application that uses the interface.

An additional benefit is that views are much easier to test. The application doesn't have to get every detail of the query just right; all of that is handled behind the scenes in the view.


Splitting strings in SQL Server

One of the query pains in SQL Server has been the simple split of a string. You want to pass a comma-delimited list of items into a stored procedure and use that list in the procedure. To get a tabular result from that string, you had to write your own function. But no longer: SQL Server 2016 provides two methods for splitting a string.

Erik Darling describes the STRING_SPLIT function (https://www.brentozar.com/archive/2016/03/splitting-strings-sql-server-2016-rescue/):

declare @list nvarchar(200) = N'1,2,3,4,5';

select * from string_split(@list, N',') as ss

This amazing new function takes the string and a delimiter and splits the string into rows. As Erik points out, it also works very nicely with the APPLY keyword.

declare @t table (list nvarchar(200));

insert @t values (N'1,2,3,4,5');

select * from @t t
CROSS APPLY string_split(t.list, N',') as ss

Microsoft doesn't like to provide just one way to achieve a goal. You can do the same thing using the OPENJSON function.

declare @list nvarchar(200) = N'1,2,3,4,5';

select * from
OPENJSON(N'[' + @list + N']', '$')
WITH (Number int N'$') as arr;

select * from string_split(@list, N',') as ss

And the same pair of queries against the table variable:

declare @t table (list nvarchar(200));

insert @t values (N'1,2,3,4,5');

select * from @t t
CROSS APPLY OPENJSON(N'[' + t.list + N']', '$')
WITH (Number int N'$') as arr;

select * from @t t
CROSS APPLY string_split(t.list, N',') as ss
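
One difference worth noting: if you leave off the WITH clause, OPENJSON also returns each element's position in a [key] column, which STRING_SPLIT (as of SQL Server 2016) does not expose. A small sketch:

declare @list nvarchar(200) = N'1,2,3,4,5';

-- [key] holds the zero-based position of each element in the JSON array
select [key] as position, [value] as Number
from OPENJSON(N'[' + @list + N']');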

I have not tested to see which performs better. However, it is great to be able to put aside my custom splitting code.


Easing the Testing of Stored Procedures with Temporary Tables

I think that we would all agree that we should test our code before we release it. Sometimes it is very easy to do so. For example, testing a view is not difficult. Say you had to make a small change to a view: you create a new view that returns the same results that the previous one did, and then you can run various queries to look for differences. Here is one example:
 

Select * from firstView
Except 
Select * from secondView
 
Select * from secondView
Except 
Select * from firstView

Any rows that are returned indicate a difference. Or you could join on the keys in each view and do brute-force comparisons. And if the view takes a long time to run, you could simply SELECT INTO a temporary table first.
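
A quick sketch of that SELECT INTO variation, using the same two views:

Select * into #firstResults from firstView;
Select * into #secondResults from secondView;

Select * from #firstResults
Except
Select * from #secondResults;

Select * from #secondResults
Except
Select * from #firstResults;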
 
However, sometimes testing is challenging to do.  How do you compare the results from two stored procedures? You can’t create an EXCEPT query with the results from two stored procedures.

You could insert the results of each stored procedure into temporary tables:
 

Insert #tempFirst
Exec dbo.usp_myFirstProc
 
Insert #tempSecond
Exec dbo.usp_mySecondProc
 
Select * from #tempFirst
Except 
Select * from #tempSecond
 
Select * from #tempSecond
Except 
Select * from #tempFirst

 
But before that, you have to go through all of the tedious work of building the temporary table.  Identifying the columns is easy. Determining the datatypes is tedious.

Since SQL Server 2012, Microsoft has provided a hidden gem: sp_describe_first_result_set (https://msdn.microsoft.com/en-us/library/ff878602.aspx). This procedure takes dynamic SQL, like the sp_executesql stored procedure, but instead of executing it, it returns the structure of the first result set that the dynamic SQL would return: column names, datatypes, and much more.

Let’s see how it works. There is a stored procedure in the WideWorldImporters database called GetOrderUpdates. I want to test the results of the procedure. To get the columns and the data types, I run the following query:

exec sys.sp_describe_first_result_set N'exec [Integration].[GetOrderUpdates] ''20100101'', ''20170122''';

[Screenshot: sp_describe_first_result_set output]

I need the name and system_type_name column values to create my temporary table. What I do is copy the results to Excel. Then I use an Excel function to create the column statements for the CREATE TABLE.

CREATE TABLE #Results (
 [Order Date Key] date NULL
, [Picked Date Key] date NULL
, [WWI Order ID] int NULL
, [WWI Backorder ID] int NULL
, [Description] nvarchar(100) NULL
, [Package] nvarchar(50) NULL
, [Quantity] int NULL
, [Unit Price] decimal(18,2) NULL
, [Tax Rate] decimal(18,3) NULL
, [Total Excluding Tax] decimal(29,2) NULL
, [Tax Amount] numeric(38,6) NULL
, [Total Including Tax] numeric(38,6) NULL
, [WWI City ID] int NULL
, [WWI Customer ID] int NULL
, [WWI Stock Item ID] int NULL
, [WWI Salesperson ID] int NULL
, [WWI Picker ID] int NULL
, [Last Modified When] datetime2(7) NULL);

insert #Results
exec [Integration].[GetOrderUpdates] '20100101', '20170122'

select * from #Results

After I run this statement I can run various queries. As long as I don’t close the connection, the temporary table persists. I could even load a second table and do a comparison.
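
For example, the comparison might look like this, where #Expected is a hypothetical second table loaded the same way from a known-good version of the procedure:

Select * from #Results
Except
Select * from #Expected;

Select * from #Expected
Except
Select * from #Results;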

This procedure can save a lot of boring work.


Displaying parameterized string in SQL Server

One of the things that I do frequently is put together a message that involves a template filled in with variables. For example, I want my error messages to fit a certain pattern, or I want to write a meaningful log message. In SQL Server, I repeat the following ugliness in my code:

SET @message = N'This is my message with ' 
  + Cast(@NumVariable as nvarchar(100))
  + N' as a number. Additional information:  '
  + @StringVariable;

Look at how I could represent this using the string.Format method in .NET:

message = string.Format("This is my message with {0} as a number.  Additional information:  {1}",
   numVariable, stringVariable);

The code is much cleaner. I can see right away the template part and the variables. I don’t have to worry about casting numbers to a string data type.

I wish I could do that in SQL Server.

But you can! Transact-SQL has a FORMATMESSAGE function, and starting with SQL Server 2016 it accepts an ad hoc message string rather than only a message number from sys.messages. The function takes two groups of parameters: the first is the template text, and the second is the list of values that fill the parameter slots in the template. I can replace my code above with the cleaner:

SET @message = FORMATMESSAGE(N'This is my message with %i as a number.  Additional information:  %s'
  , @NumVariable,  @StringVariable);

This is much easier to read. It handles type casting very naturally. It has simplified my TSQL code considerably.
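
Here is a self-contained sketch of the same call with the declarations included (the variable values are invented for illustration):

declare @NumVariable int = 42;
declare @StringVariable nvarchar(100) = N'the file arrived late';
declare @message nvarchar(2048);

-- %i is the placeholder for an integer argument, %s for a string argument
set @message = FORMATMESSAGE(N'This is my message with %i as a number.  Additional information:  %s'
    , @NumVariable, @StringVariable);

select @message;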


Solution appropriate to user

I am working on a small project for a small non-profit. It has been an interesting project that could involve a variety of technologies. But if it is going to be useful, that company needs to be able to support the solution going forward. What are the factors that might make the project work for this client?

The first factor (not the most important) is technology. They want to be able to query their NeonCRM system. NeonCRM is a donor system that provides rudimentary querying capabilities, but it also provides a REST API that returns JSON. I found that using PowerQuery I can populate a PowerPivot model. Once the data is in the model, they can write sophisticated queries against it.

The problem is that most of the computers in their shop are Macs. Neither PowerQuery nor PowerPivot can be installed on the Mac. They would need to have dedicated Windows PCs to allow them to use the solution. They will need to become comfortable with what is to them a foreign O/S.

Power BI provides the same functionality, and Macs can consume data from a published Power BI model. But developing reports against that model requires the Power BI Desktop application, which again requires a Windows PC.

The second factor, the more important one, is the skill set needed to work with the technology. I live in another country from this non-profit, and at some point I need to hand the solution over to them to manage and expand. Creating most reports from a PowerPivot model is not much more difficult than creating Excel pivot tables, which is well within the competence of the staff of this non-profit.

But what happens when a more complex need comes up? One of the early questions we had was the date of the last donation in each year. This is doable, but it required some DAX expressions, which are clearly outside the staff's expertise.

And if they need to customize the PowerQuery code, those changes challenge even me. There is no way to hand this off to the non-profit.

By the way, I don’t want to suggest that they do not have the intelligence to do this. I think that they could learn these things. But they need to be putting their time and energy and thinking into running their non-profit, something that I would be incompetent at.

I am now looking at a solution closer to their technology and skills. It won’t be as sexy as the previous solution. But I expect them to be happier with that solution.


SSIS Script Component – Last Row

One problem that you might encounter with data is where you are gathering information from multiple rows and want to output only one row. If you simply want to aggregate the results, you can use the Aggregate transformation. But what do you do if you want to append one row's value to another?
For example, what if I want to take the following and output one row per state:

State     City           Population
Texas     Houston           2239558
Texas     San Antonio       1436697
Texas     Dallas            1281057
Oklahoma  Oklahoma City      610613
Oklahoma  Tulsa              398121
Oklahoma  Norman             118197

Let’s say that my output is simply:

State     Cities
Texas     Houston:2239558;San Antonio:1436697;Dallas:1281057
Oklahoma  Oklahoma City:610613;Tulsa:398121;Norman:118197

The way that I handle this is to have one variable to store the key (in this case the state) that I am grouping by and another that I use to gather the text. In SSIS I usually override the ProcessInputRow method for the input buffer.

public override void Input0_ProcessInputRow(Input0Buffer Row)

Basically the way that this works is that I have a variable that identifies the key for the output. In this case, I need to store a string: state. Then I have a second variable to store the text.
The program is structured as follows:
If Row.key != key, then output the data that was in the variables. Set the variables equal to the values from the Row.
If Row.key == key then append the text to the text variable.
This is the code:

string state = string.Empty;
bool firstRow = true;
string cities = string.Empty;

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    if (Row.State == state)
    {
        // Same state as the previous row: keep appending to the text.
        cities += string.Format(";{0}:{1}", Row.City, Row.Population);
    }
    else
    {
        if (!firstRow)
        {
            // A new state: output the row we have been accumulating.
            CitiesBuffer.AddRow();
            CitiesBuffer.State = state;
            CitiesBuffer.Cities = cities;
        }
        state = Row.State;
        cities = string.Format("{0}:{1}", Row.City, Row.Population);
    }
    firstRow = false;
}

But here's the problem. When I run this, it outputs the data for every state except the last one. I need to output the accumulated values once the last row has been processed, but ProcessInputRow doesn't tell you when you have hit the last row.
I tried using the PostExecute method, but it appears that the output buffers are already closed down by the time PostExecute fires.
I found that I needed to override the ProcessInput method associated with my input. In ProcessInput my code walks through all of the rows in the input itself. Buffer.NextRow() advances to the next row and returns false when there are no more rows.


public override void Input0_ProcessInput(Input0Buffer Buffer)
{
    // Do not call base.Input0_ProcessInput(Buffer); this method replaces the
    // row-by-row processing with its own loop over the input buffer.
    string state = string.Empty;
    bool firstRow = true;
    string cities = string.Empty;

    while (Buffer.NextRow())
    {
        if (Buffer.State == state)
        {
            cities += string.Format(";{0}:{1}", Buffer.City, Buffer.Population);
        }
        else
        {
            if (!firstRow)
            {
                CitiesBuffer.AddRow();
                CitiesBuffer.State = state;
                CitiesBuffer.Cities = cities;
            }
            state = Buffer.State;
            cities = string.Format("{0}:{1}", Buffer.City, Buffer.Population);
        }
        firstRow = false;
    }

    // NextRow() returned false: the last row has been read, so flush the
    // values accumulated for the final state.
    if (!firstRow)
    {
        CitiesBuffer.AddRow();
        CitiesBuffer.State = state;
        CitiesBuffer.Cities = cities;
    }
}

As soon as the routine runs out of input rows, I can output the values I have been accumulating for the final state.


SSIS Pivot Transformation with Multiple Set Keys

When you pivot data, there are three dimensions that you need to set: the Set Keys, the Pivot Keys and the Pivot Values (to use the SSIS terms). In Excel it is very easy to select multiple Set Keys (ROWS), multiple Pivot Keys (COLUMNS) and even multiple Pivot Values (Values). Using the SSIS Pivot Transformation editor, however, you can only choose one Set Key, one Pivot Key and one Pivot Value. Fortunately, you can have multiple Set Key columns by using the Advanced Editor.

The Pivot key in the SSIS Pivot Transformation is the column that contains the values that are going to define the new columns. The Pivot Value is the value that is going to be inserted into these new columns. Basically, you map a value from the Pivot Key column to a new output column. And SSIS will put the value from the Pivot Value column into those output columns. The Set Key identifies the column combination that defines the rows.

Why do we need multiple Set Keys? In many situations, a single column does not define a unique row. For example, one might be loading data keyed by state and county, as in the following data set.

SUMLEV  STNAME    CTYNAME            Year  Population
50      Alabama   Washington County  2010       17610
50      Arkansas  Washington County  2010      204026
50      Colorado  Washington County  2010        4801
50      Alabama   Washington County  2011       17336
50      Arkansas  Washington County  2011      207882
50      Colorado  Washington County  2011        4809

If I were to use the county name as the Set Key, then I would have duplicates that could cause the Pivot transformation to fail. For this small set, you need the STNAME and the CTYNAME to distinguish the rows.

The District of Columbia introduces a different twist: District of Columbia is both the name of the county and the name of the state. The SUMLEV distinguishes the usage for each row (50 means county level, 40 means state level).

SUMLEV  STNAME                CTYNAME               Year  Population
50      District of Columbia  District of Columbia  2010      605126
40      District of Columbia  District of Columbia  2011      620472
50      District of Columbia  District of Columbia  2011      620472
40      District of Columbia  District of Columbia  2010      605126
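
If it helps to see what multiple Set Keys mean, here is a rough T-SQL sketch of the same grouping, assuming the rows above sit in a hypothetical staging table dbo.CountyPopulation. The GROUP BY columns play the role of the Set Keys, Year is the Pivot Key, and Population is the Pivot Value:

select SUMLEV
     , STNAME
     , CTYNAME
     , max(case when [Year] = 2010 then Population end) as [2010]
     , max(case when [Year] = 2011 then Population end) as [2011]
from dbo.CountyPopulation
group by SUMLEV, STNAME, CTYNAME;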

Never fear, it is possible to have multiple columns in your Set key. You just have to use the Advanced editor. Let’s start.

First create the package (I have a finished example at the end). Add a Flat File source to get the data. Then add a Sort transformation to sort by SUMLEV, STNAME, and CTYNAME. Finally, add the Pivot transformation and configure it like the following:
[Screenshot: Pivot transformation editor after generating the output columns]

You will need to identify the Pivot Key values that you need to pivot on and paste them into the "Generate pivot output columns from values" textbox. Then click the "Generate Columns Now" button. This generates the column names in the "Existing pivoted output columns" textbox (you can rename these columns in the Advanced Editor).

The package will run at this point. However, it doesn’t show whether the row is about the STATE or about a COUNTY in the STATE (SUMLEV). Also, it doesn’t identify the STATE that each COUNTY is associated with. I want to see the SUMLEV and STNAME for each row and I want to guarantee that SSIS groups by SUMLEV, STNAME and CTYNAME, not just CTYNAME.

To do that, right-click the Pivot transformation and select Advanced Editor:

[Screenshot: Advanced Editor for the Pivot transformation]

Then go to the Input Columns tab.

[Screenshot: Input Columns tab with the Set Key columns selected]

Select all of the columns that you want to output, including those you don’t want to include in the Set Key. In this example, I have only selected the Set Key columns.
Next, go to the Input and Output Properties tab. You will need to do a couple of things here.

[Screenshot: Input column properties showing the LineageID and PivotUsage properties]

Note the LineageID property. You will need that property value when you create a Pivot Default Output column corresponding to this column. Next, change PivotUsage from 0 to 1 (leave this value 0 if you want to simply pass the column through, without including it in the Set key).

Next, go to the Pivot Default Output and add a new output column. Give it a name (it can be the same name as the input column). Set its SourceColumn property to the LineageID value from the corresponding input column; this also sets the DataType property for the output.

[Screenshot: Pivot Default Output with the SUMLEV column added]

If you want to simply pass a column through, leave its PivotUsage equal to 0 in the Input Columns. Then create the corresponding column in the Output Columns and set its SourceColumn equal to the LineageID of the input column.

SSIS allows you to do a lot more than shows on the surface. The Advanced editor provides some interesting modifications.

An example of using the Pivot transformation can be downloaded from here: Pivot transformation Example.
