Displaying a parameterized string in SQL Server

One of the things that I do frequently is to put together a message built from a template filled in with variables. For example, I want my error messages to fit a certain pattern, or I want to write a meaningful log message. In SQL Server, I repeat the following ugliness in my code:

SET @message = N'This is my message with ' 
  + Cast(@NumVariable as nvarchar(100))
  + N' as a number. Additional information:  '
  + @StringVariable;

Look at how I could express this using the string.Format method in .NET:

message = string.Format("This is my message with {0} as a number.  Additional information:  {1}",
   numVariable, stringVariable);

The code is much cleaner. I can see right away the template part and the variables. I don’t have to worry about casting numbers to a string data type.

I wish I could do that in SQL Server.

But you can! Transact-SQL has long had a FORMATMESSAGE function, and starting with SQL Server 2016 it accepts an ad hoc message string instead of just a message number from sys.messages. The function takes two groups of parameters. The first is the template text. The second is the list of values that fill the parameter slots in the template. I can replace my code above with the cleaner:

SET @message = FORMATMESSAGE(N'This is my message with %i as a number.  Additional information:  %s'
  , @NumVariable,  @StringVariable);

This is much easier to read. It handles type casting very naturally. It has simplified my TSQL code considerably.
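
Here is a minimal, self-contained sketch of the pattern (my own example; the ad hoc message string form of FORMATMESSAGE requires SQL Server 2016 or later):

DECLARE @NumVariable int = 42;
DECLARE @StringVariable nvarchar(200) = N'connection retried three times';
DECLARE @message nvarchar(2048);

SET @message = FORMATMESSAGE(
    N'This is my message with %i as a number.  Additional information:  %s',
    @NumVariable, @StringVariable);

PRINT @message;
-- This is my message with 42 as a number.  Additional information:  connection retried three times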

Posted in TSQL

Solution appropriate to user

I am working on a small project for a small non-profit company. It has been an interesting project that could involve a variety of interesting technologies. But if it is going to be useful, that company needs to be able to support the solution going forward. What are the factors that might make the project work for this client?

The first factor (not the most important) is technology. They want to be able to query their NeonCRM system. NeonCRM is a donor system that provides only rudimentary querying capability. But it also provides a REST API that returns JSON. I found that using PowerQuery I can populate a PowerPivot model. Once the data is in the model, they can write sophisticated queries against it.

The problem is that most of the computers in their shop are Macs. Neither PowerQuery nor PowerPivot can be installed on the Mac. They would need to have dedicated Windows PCs to allow them to use the solution. They will need to become comfortable with what is to them a foreign O/S.

Power BI provides the same functionality, and Macs can consume the data from a published Power BI model. But developing reports against that model requires installing the Power BI Desktop application, which again would require a PC.

The second factor, the more important one, is the skill needed to work with the technology. I live in another country from this non-profit. At some point I need to hand the solution over to them to manage and expand. Creating most reports from a PowerPivot model is not much more difficult than creating Excel pivot tables. This is well within the competence of the staff of this non-profit.

But what happens when a more complex need comes up? One of the early questions we had was the date of the last donation in each year. This is do-able, but it required some DAX expressions, which are clearly outside the expertise of the staff.

And if they need to customize the PowerQuery code, those changes challenge even me. There is no way to hand that off to the non-profit.

By the way, I don’t want to suggest that they do not have the intelligence to do this. I think that they could learn these things. But they need to be putting their time and energy and thinking into running their non-profit, something that I would be incompetent at.

I am now looking at a solution closer to their technology and skills. It won’t be as sexy as the previous solution, but I expect them to be happier with it.

Posted in Consulting, Solutions

SSIS Script Component – Last Row

One problem that you might encounter is data where you are gathering information from multiple rows and want to output only one row. If you simply want to aggregate the results, you can use the Aggregate transformation. But what do you do if you want to append one row’s value to another’s?
For example, what if I want to take the following and output one row per state:

State City Population
Texas Houston 2239558
Texas San Antonio 1436697
Texas Dallas 1281057
Oklahoma Oklahoma City 610613
Oklahoma Tulsa 398121
Oklahoma Norman 118197

Let’s say that my output is simply:

State Cities
Texas Houston=2239558;San Antonio=1436697;Dallas=1281057
Oklahoma Oklahoma City=610613;Tulsa=398121;Norman=118197

The way that I handle this is to have one variable to store the key (in this case the state) that I am grouping by and another that I use to gather the text. In SSIS I usually override the ProcessInputRow method for the input buffer.

public override void Input0_ProcessInputRow(Input0Buffer Row)

Basically the way that this works is that I have a variable that identifies the key for the output. In this case, I need to store a string: state. Then I have a second variable to store the text.
The program is structured as follows:
If Row.key != key, then output the data that was in the variables. Set the variables equal to the values from the Row.
If Row.key == key then append the text to the text variable.
This is the code:

string state = string.Empty;   // the key (state) of the group currently being built
bool firstRow = true;
string cities = string.Empty;  // the accumulated text for that group

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    if (Row.State == state)
    {
        // Same state as the previous row: append this city to the text.
        cities += string.Format(";{0}={1}", Row.City, Row.Population);
    }
    else
    {
        // New state: write out the group we were building (unless this is the very first row).
        if (!firstRow)
        {
            CitiesBuffer.AddRow();
            CitiesBuffer.State = state;
            CitiesBuffer.Cities = cities;
        }
        state = Row.State;
        cities = string.Format("{0}={1}", Row.City, Row.Population);
    }
    firstRow = false;
}

But here’s the problem. When I run this, it outputs the data for every state except the last one. The problem is that I need to output the final group once I have read the last row, but the Row doesn’t tell you when you have hit the last row.
I tried using the PostExecute method, but it appears that the buffers are all closed down when the PostExecute fires.
I found that I needed to override the ProcessInput method associated with my input. In ProcessInput my code walks through all of the rows in the input itself. Buffer.NextRow() advances to the next row and returns false when there are no more rows.


public override void Input0_ProcessInput(Input0Buffer Buffer)
{
    // Do not call base.Input0_ProcessInput(Buffer); we walk the rows ourselves.
    // Note: this assumes the whole input arrives in a single pipeline buffer. For larger
    // inputs, move these variables to class level and only flush when Buffer.EndOfRowset()
    // returns true, since ProcessInput is called once per buffer.
    string state = string.Empty;
    bool firstRow = true;
    string cities = string.Empty;

    while (Buffer.NextRow())
    {
        if (Buffer.State == state)
        {
            cities += string.Format(";{0}={1}", Buffer.City, Buffer.Population);
        }
        else
        {
            if (!firstRow)
            {
                CitiesBuffer.AddRow();
                CitiesBuffer.State = state;
                CitiesBuffer.Cities = cities;
            }
            state = Buffer.State;
            cities = string.Format("{0}={1}", Buffer.City, Buffer.Population);
        }
        firstRow = false;
    }

    // No more rows: write out the last group that was being built.
    if (!firstRow)
    {
        CitiesBuffer.AddRow();
        CitiesBuffer.State = state;
        CitiesBuffer.Cities = cities;
    }
}

As soon as the routine runs out of rows from the input, I can then output what is in the buffer that I had been building.

Posted in ETL, SSIS Script

SSIS Pivot Transformation with Multiple Set Keys

When you pivot data, there are three dimensions that you need to set: the Set Keys, the Pivot Keys and the Pivot Values (to use the SSIS terms). In Excel it is very easy to select multiple Set Keys (ROWS), multiple Pivot Keys (COLUMNS) and even multiple Pivot Values (Values). Using the SSIS Pivot Transformation editor, however, one can only choose one Set Key, one Pivot Key and one Pivot Value. Fortunately, you can have multiple Set Key columns by using the Advanced Editor.

The Pivot key in the SSIS Pivot Transformation is the column that contains the values that are going to define the new columns. The Pivot Value is the value that is going to be inserted into these new columns. Basically, you map a value from the Pivot Key column to a new output column. And SSIS will put the value from the Pivot Value column into those output columns. The Set Key identifies the column combination that defines the rows.

Why do we need multiple Set Keys? In many situations, what defines a unique row is not a single column. For example, one might be loading data from a source keyed by state and county, such as the following data set:

SUMLEV STNAME CTYNAME Year Population
50 Alabama Washington County 2010 17610
50 Arkansas Washington County 2010 204026
50 Colorado Washington County 2010 4801
50 Alabama Washington County 2011 17336
50 Arkansas Washington County 2011 207882
50 Colorado Washington County 2011 4809

If I were to use the county name as the Set Key, then I would have duplicates that could cause the Pivot transformation to fail. For this small set, you need the STNAME and the CTYNAME to distinguish the rows.

The District of Columbia introduces a different twist. District of Columbia is the name of the county and the name of the state. The SUMLEV distinguishes this usage for each row (50 means County level, 40 means State level).

SUMLEV STNAME CTYNAME Year Population
50 District of Columbia District of Columbia 2010 605126
40 District of Columbia District of Columbia 2011 620472
50 District of Columbia District of Columbia 2011 620472
40 District of Columbia District of Columbia 2010 605126

Never fear, it is possible to have multiple columns in your Set key. You just have to use the Advanced editor. Let’s start.

First create the package (I have a finished example at the end). Add the flat file source to get the data. Then add a Sort transformation to sort by SUMLEV, STNAME, and CTYNAME. Finally, add the Pivot Transformation and configure it like the following:
[Screenshot: the Pivot transformation editor after generating the output columns]

You will need to identify the Pivot Key values that you need to pivot on and paste them into the "Generate pivot output columns from values" textbox. Then click the "Generate Columns Now" button. This will generate the column names in the "Existing pivoted output columns" textbox (you can rename these columns in the advanced editor).

The package will run at this point. However, it doesn’t show whether the row is about the STATE or about a COUNTY in the STATE (SUMLEV). Also, it doesn’t identify the STATE that each COUNTY is associated with. I want to see the SUMLEV and STNAME for each row and I want to guarantee that SSIS groups by SUMLEV, STNAME and CTYNAME, not just CTYNAME.
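
To make the goal concrete, the output I am after looks roughly like this (my own illustration built from the sample rows above, with the generated year columns renamed):

SUMLEV  STNAME    CTYNAME            2010     2011
50      Alabama   Washington County  17610    17336
50      Arkansas  Washington County  204026   207882
50      Colorado  Washington County  4801     4809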

To add those columns, right-click the Pivot transformation and select Advanced Editor:

[Screenshot: the Advanced Editor]
Then go to the Input Columns tab.
[Screenshot: the Advanced Editor, Input Columns tab, with the additional columns selected]

Select all of the columns that you want to output, including those you don’t want to include in the Set Key. In this example, I have only selected the Set Key columns.
Next, go to the Input and Output Properties tab. You will need to do a couple things here.

[Screenshot: setting the PivotUsage property on an input column]

Note the LineageID property. You will need that property value when you create a Pivot Default Output column corresponding to this column. Next, change PivotUsage from 0 to 1 (leave this value 0 if you want to simply pass the column through, without including it in the Set key).

Next, go to the Pivot Default Output and add a new Output Column. Give it a name (it can be the same name as the input column). Set its SourceColumn property to the LineageID value from the corresponding input column. This will also set the DataType property for the output.

[Screenshot: adding the SUMLEV output column]

If you simply want to pass a column through, leave its PivotUsage equal to 0 in the Input Columns. Then create the corresponding column in the Output Columns and set its SourceColumn equal to the LineageID of the input column.

SSIS allows you to do a lot more than shows on the surface. The Advanced editor provides some interesting modifications.

An example of using the Pivot transformation can be downloaded from here: Pivot transformation Example.

Posted in ETL, Pivot, SSIS

SSIS Flexible UnPivot

One of the problems I have encountered is trying to unpivot a file that has a variable number of columns. There is a set of fixed columns that represent the row keys, and then a variable number of additional columns. This is especially common when the columns are based on dates. This sample is based on a question that was raised on the MSDN forums (https://social.technet.microsoft.com/Forums/sqlserver/en-US/cdd2cbd6-bed1-482a-be1c-f4cf434ed1ba/how-to-create-a-ssis-package-dynamically?forum=sqlintegrationservices&prof=required).

The UnPivot transformation, like most SSIS transformations, is rigid about the expected metadata. It expects the columns to have the same types, the same count, and the same names from run to run. A file like the following causes havoc with that assumption:

Product,2015Jan,2015Feb,2015Mar,2016Jul,2016Aug
0000001 - Product 0000001,52.31,48.97,47.94,48.47,49.52

I want to output this as:

Product,Year,Month,Margin
0000001 - Product 0000001,2015,Jan,52.31
0000001 - Product 0000001,2015,Feb,48.97

The file might not start with the 2015Jan column; it might suddenly start with 2016Jan. And columns are likely to be added at the end.

The Script Component provides the flexibility to handle this. First, create a File Connection Manager that points to the file you are reading. Next, create a Data Flow and add a Script Component, setting its type to Source. Then configure the Script Component. You add the Connection Manager to the script:

[Screenshot: the SSIS Script Component Connection Manager]

You add the output columns:
[Screenshot: the SSIS Script Component output columns]

Then you can edit the script:

  // Requires "using System.IO;" at the top of the script for File and StreamReader.
  int rowNum = 0;

  public override void CreateNewOutputRows()
  {
    // Open the file for reading (SourceFile is the connection manager added to the component).
    using (StreamReader rdr = File.OpenText(this.Connections.SourceFile.ConnectionString))
    {
      string[] months = new string[1]; // Placeholders for the months and years parsed from the header
      int[] years = new int[1];

      while (rdr.EndOfStream == false)
      {
        string line = rdr.ReadLine();
        string[] cols = line.Split(',');
        if (cols.Length > 1)
        {
          rowNum++; // Count the rows
          if (rowNum > 1) // Not the header row: emit one output row per value column
          {
            for (int i = 1; i < cols.Length; i++)
            {
              Output0Buffer.AddRow();
              Output0Buffer.Product = cols[0];
              Output0Buffer.Year = years[i - 1];
              Output0Buffer.Month = months[i - 1];
              Output0Buffer.Margin = decimal.Parse(cols[i]);
            }
          }
          else // The header row: remember the year and month each value column represents
          {
            months = new string[cols.Length - 1];
            years = new int[cols.Length - 1];

            for (int i = 1; i < cols.Length; i++)
            {
              months[i - 1] = cols[i].Substring(4);              // "2015Jan" -> "Jan"
              years[i - 1] = int.Parse(cols[i].Substring(0, 4)); // "2015Jan" -> 2015
            }
          }
        }
      }
    }
  }

The script works like this. I track the row number because the first row (the header) is treated differently from the rest. From the header I gather the identifiers that I am going to use; that is the else clause in the rowNum test, which simply puts the values into the months and years arrays. For the other rows, I take the product from the first column (cols[0]) and then create one output row for each column after it.

I have not included error handling. You can download the source at FlexibleUnpivot.

Posted in ETL, Pivot, SSIS, SSIS Script

SSIS Removing Duplicate Rows Using Conditions

Sometimes your data source has duplicates. SSIS has an out-of-the-box tool to remove them: the Sort transformation has a "Remove rows with duplicate sort values" property (https://www.mssqltips.com/sqlservertip/3036/removing-duplicates-rows-with-ssis-sort-transformation/). But it has a problem: it is essentially random about which of the duplicate rows it removes. How do you specify which row to keep?

One scenario you might encounter is bringing in data from multiple sources: if a row occurs in the first source, use it; otherwise use the row from the second source. I have built a simple SSIS package that you can download from Sort Remove Duplicates Package. It has the following data flow:

[Screenshot: the SSIS delete-duplicates data flow]

The data sources are both Script components.

    public override void CreateNewOutputRows()
    {
        /*
          Add rows by calling the AddRow method on the member variable named "<Output Name>Buffer".
          For example, call MyOutputBuffer.AddRow() if your output was named "MyOutput".
        */
        // Primary data set: seven rows, IDs 1 through 7, all marked Set = 1 with Val = 100.
        int i = 0;
        int s = 1;
        for (int row = 0; row < 7; row++)
        {
            Output0Buffer.AddRow();
            Output0Buffer.Set = s;
            Output0Buffer.ID = ++i;
            Output0Buffer.Val = 100;
        }
    }

The script for the Secondary Data Set is roughly the same. I set the s variable to 2 so that, after the sort, a secondary row always follows the corresponding Primary Data Set row when both exist. I start the variable i at 4 so that the first IDs (1 – 4) have only set 1 rows, the next three IDs have rows from both sets, and the last IDs are exclusively set 2. I also assign a different value to the set 2 output, as sketched below.
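
Based on that description, the Secondary Data Set script would look roughly like this (my reconstruction; the actual script is in the downloadable package, and the Val of 200 is just an assumption so the two sets are distinguishable):

    public override void CreateNewOutputRows()
    {
        // Secondary data set: rows marked Set = 2, starting at ID 5 so IDs 5-7 duplicate
        // the primary set and the remaining IDs exist only here.
        int i = 4;
        int s = 2;
        for (int row = 0; row < 7; row++)   // seven rows here, purely for illustration
        {
            Output0Buffer.AddRow();
            Output0Buffer.Set = s;
            Output0Buffer.ID = ++i;
            Output0Buffer.Val = 200;        // assumed value; anything different from the primary set's 100 works
        }
    }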

The next transformation is a Sort. The sort key is ID, then Set. I want one row for each ID, and within an ID the Set value determines which row comes first and is therefore the one kept.

Finally, I have a Script component to only output the first row encountered for each ID. I have to configure a few elements. The first thing that I do is to select the ID and Set columns as Input columns. That is so that I can refer to these columns in my Script component.

I also have to provide some way to exclude and include rows. To do that I use Exclusion Groups.

[Screenshot: the SSIS Script Component Exclusion Group setting]

Set the ExclusionGroup to some number other than 0. This allows the script to direct a particular row to that exclusion group's output. Also note the SynchronousInputID; it should point to the component's only input.

The script is rather simple. You override the Input0_ProcessInputRow method:

    int lastID = -1;

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        if (Row.ID != lastID)
        {
            Row.DirectRowToOutput0();
            lastID = Row.ID;
        }
    }

Note that I have added a class level variable lastID. Every row is going to execute this code. If the ID column value (Row.ID) is not the same as lastID, then simply DirectRowToOutput0(). Then, don’t forget to set lastID equal to Row.ID for the next pass. What happens is that if a new ID is encountered, then the row will be sent on. Otherwise the Row disappears.

Please let me know if there is a simpler way to do this. Also, let me know how I can improve this script.

Posted in ETL, SSIS, SSIS Script

Reordering Columns Using PowerShell

One of the regular issues discussed on the SSIS forum is what to do when a source file changes the order of its columns. You build your SSIS package to handle Col1, Col2, Col3 and then you receive files with the columns in the order Col2, Col3, Col1. Or worse, you receive only Col2, Col1. How do you handle this case? It usually involves some nasty .NET code.

But why not preprocess the file using PowerShell? In previous posts (listed at the end), I showed that one can use PowerShell to preprocess a comma delimited file: sorting the result, removing columns and filtering rows. The same technique can reorder the columns, and even add a column.

Here is my base file.

ID,Name,State,Country
1,Russ,Sandys,Bermuda
2,Katie,Texas,USA
3,Gail,Sandys,Bermuda
4,Melanie,Oregon,USA
5,Stephanie,Arizona,USA

I expect to find columns in the following order: ID, Name, State, Country.

But sometimes the process that produces the file provides you with one of the following formats:

Country,ID,Name,State
Bermuda,1,Russ,Sandys
USA,2,Katie,Texas
Bermuda,3,Gail,Sandys
USA,4,Melanie,Oregon
USA,5,Stephanie,Arizona

Or worse:

ID,Name,State
1,Russ,Sandys
2,Katie,Texas
3,Gail,Sandys
4,Melanie,Oregon
5,Stephanie,Arizona

Either of these will break the import.

Using PowerShell, I can very easily reformat the data to the expected format:

cd C:\MyFolder\
Import-Csv -Path SampleData.csv `
  | SELECT  ID, Name, State, Country `
  | ConvertTo-CSV -NoTypeInformation `
  | % {$_ -replace  `
  '\G(?<start>^|,)(("(?<output>[^,"]*?)"(?=,|$))|(?<output>".*?(?<!")("")*?"(?=,|$)))' `
  ,'${start}${output}'} `
  | Out-File SampleDataSorted.csv -fo -en ascii ; 

In Removing quotes from CSV created by PowerShell I explain why I use the regular expression replace.

The SELECT line of the PowerShell script is what determines the order of the columns. Interestingly, if a column does not exist in the input, it simply adds the column with an empty value.

This very simple script will always put the columns in the expected order, producing an empty column where the input does not have a value. By the way, while this may prevent the ETL process from failing, it can leave you with invalid data.

Let me know if you have questions about this process. Can you see any use for this in your environment?

Previous posts:

Using PowerShell to shape comma delimited file
Shaping a comma delimited file with PowerShell
Removing quotes from CSV created by PowerShell

Posted in ETL, PowerShell, SSIS

Removing quotes from CSV created by PowerShell

In a previous post, I demonstrated how to reshape a comma delimited file (CSV) using PowerShell. I noted that the result puts all of the column values in double quotes:

"ID","Name","State"
"5","Stephanie","Arizona"
"4","Melanie","Oregon"
"2","Katie","Texas"

I used this code to produce this result:

cd c:\MyFolder
Import-Csv -Path SampleData.csv|Where {$_.Country -eq "USA"} `
  | SELECT ID, Name, State `
  |Sort-Object State `
  |Export-Csv SampleDataSorted.csv `
  -NoTypeInformation; 

In this post, I remove the double quotes.

A Microsoft blog suggests one way to remove these double quotes: use the ConvertTo-CSV cmdlet rather than Export-CSV and do a string replace of the double quotes with an empty string (code from the article):

dir c:\fso -Filter *.csv | ? {$_.basename -like 'users?'} `
 | Import-Csv `
 |  sort lname,fname `
 | convertto-csv -NoTypeInformation `
 | % { $_ -replace '"', ""} `
 | out-file c:\fso\usersconsolidated.csv -fo -en ascii

I convert my command to:

cd c:\MyFolder
Import-Csv -Path SampleData.csv `
  | Where {$_.Country -eq "USA"} `
  | SELECT ID, Name, State `
  | Sort-Object State `
  | ConvertTo-CSV -NoTypeInformation `
  | % {$_ -replace '"',""} `
  | Out-File SampleDataSorted.csv -fo -en ascii ; 

Breaking that pipeline down:

  Import-Csv -Path SampleData.csv               Read the source CSV
  Where {$_.Country -eq "USA"}                  Filter the rows
  SELECT ID, Name, State                        Set the output columns
  ConvertTo-CSV -NoTypeInformation              Create the CSV output
  % {$_ -replace '"',""}                        For each row, replace double quotes with an empty string
  Out-File SampleDataSorted.csv -fo -en ascii   Write the output to a file

This produces the results I expect.

ID,Name,State
5,Stephanie,Arizona
4,Melanie,Oregon
2,Katie,Texas

But here is the problem. ETL scenarios can be messy, and CSV files frequently need the double quotes: when a column value contains a comma, you need to put quotes around it. For example, a file like this is problematic:

ID,Name,State,Country
1,Russ,Sandys,Bermuda
3,Gail,Sandys,Bermuda
4,Melanie,Oregon,USA
5,Stephanie,Arizona,USA
6,"Katie, ""Scott"", Arianna",Texas,USA

I want to keep the quotes around the second column in that last row.

Regular expressions to the rescue. The -replace operator does regular expression replacement without any additional work. That isn’t to say that building the regular expression isn’t painful.

cd c:\MyFolder
Import-Csv -Path SampleData.csv `
  | Where {$_.Country -eq "USA"} `
  | SELECT ID, Name, State `
  | Sort-Object State `
  | ConvertTo-CSV -NoTypeInformation `
  | % {$_ -replace  `
  '\G(?<start>^|,)(("(?<output>[^,"]*?)"(?=,|$))|(?<output>".*?(?<!")("")*?"(?=,|$)))' `
  ,'${start}${output}'} `
  | Out-File SampleDataSorted.csv -fo -en ascii ; 

Let’s break down the expression. The first argument (which I will explain in a moment) is the expression to be matched. The second ('${start}${output}') is what each match is replaced with. I use named captures (start, output) to identify what I want the match to be replaced with. One way to view named captures is as variables: if the pattern is matched, the matched text is assigned to that variable. Note that the single quotes around the replacement expression are important (see http://www.johndcook.com/blog/powershell_perl_regex/#capture). So every match that is found will be replaced with the capture called "start" followed by the capture called "output."
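
For a quick feel for named captures, here is a toy example of my own (not part of the original expression):

'abc123' -replace '(?<digits>\d+)', '[${digits}]'
# Output: abc[123]  (the matched digits are assigned to the "digits" capture and reused in the replacement)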

The first expression is interpreted as follows.

\G
Must occur where the previous match ended. That means the expression won’t match unless there is an uninterrupted sequence of matches, one right after the other.

(?<start>^|,)
This defines the "start" capture, which is either the beginning of the string (^) or a comma.

((…)|(…))
Try to match first expression and if it fails, try the second. The pipe | between the two expressions in parentheses means to try the first pattern and as soon as it fails, try the second pattern.

("(?<output>[^,"]*?)"(?=,|$))
This is the first alternative, piece by piece:

"…"
The value starts and ends with a double quote.

(?=,|$)
After the second double quote, a comma must be present or it must be the end of the string. The ?= construction is a lookahead, so it is not part of the match. That is important since I need that comma available for the next match.

(?<output>…)
This is what will be put in the "output" capture. The way I understand this is that there is a variable called "output": if this pattern matches, the "output" variable is assigned the text that is found. The quotes will not appear in the replacement because they are outside of the "output" parentheses.

[^,"]*?
A sequence of zero or more characters that are neither a comma nor a double quote. The * means zero or more; the ? after the * keeps the match from running to the end of the string. This is what is assigned to the "output" capture (variable) if the pattern is matched.

(?<output>".*?(?<!")("")*?"(?=,|$))
This is the pattern that is matched if the first alternative fails. Note that the whole pattern, including the surrounding double quotes, is inside the "output" capture and thus will be kept by the replace statement. Piece by piece:

.*?
Match any characters, lazily. It stops when it finds a double quote followed by a comma or the end of the string.

("")*?
This handles the case where there are two double quotes before a comma. I want to leave those alone: if there is an even number of double quotes before the comma, continue looking for the end of the column.

(?<!")
A double quote cannot precede the next part of the pattern. What I am trying to avoid is interpreting exactly two (or any even number of) double quotes preceding a comma as the end of the match. If there are two quotes, treat the comma as part of the current column; if there is one quote, treat the comma as the break between two columns.

""
This represents a sequence of two double quotes. Basically, if there is an even number of double quotes before a comma, treat that comma as part of the current column and continue looking for the next double quote followed by a comma to end the column.
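
As a quick sanity check (my own test, not from the original post), you can run the replace against a single tricky row in a PowerShell console:

$pattern = '\G(?<start>^|,)(("(?<output>[^,"]*?)"(?=,|$))|(?<output>".*?(?<!")("")*?"(?=,|$)))'
'"6","Katie, ""Scott"", Arianna","Texas","USA"' -replace $pattern, '${start}${output}'
# Output: 6,"Katie, ""Scott"", Arianna",Texas,USA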

Let me know if I can improve my regular expression. And definitely post any questions.

Posted in ETL, PowerShell

Shaping a comma delimited file with PowerShell

Here is the use case I am thinking of. You have a comma delimited file that you need to load into SQL Server, a very large file that you need to join to an equally large table in a database. For example, at one place I worked we had a set of account balances that we needed to join to a customer table with 15 million rows. We used a Lookup, but it took about 20 minutes to load the customer table into cache. My thought at the time was whether we could use two data sources, both sorted on the customer id, and join them. Could that have sped up the process?

There would be a problem if you are using a comma delimited file: sorting the file would have to be done inside the SSIS process. But what if the comma delimited file had already been sorted before it reached the data source? You could simply set the IsSorted flag on the source output and use the Merge Join.

But to avoid putting that load on the SSIS server, why not off-load the sort to a less busy server? Sort the file, then transfer it to the SSIS server (or read it from a shared drive). But if that server is only going to sort a file, why install SSIS on it? Why not use a lighter tool?

Import-Csv -LiteralPath SampleData.csv|Sort-Object Country|Export-Csv SampleDataSorted.csv

Posted in ETL, PowerShell

Using PowerShell to shape comma delimited file

PowerShell has the power to handle many ETL tasks (see PowerShell as an ETL tool). In this post I illustrate how you can use PowerShell to shape a comma delimited file: how to remove columns, filter rows and sort the output.

Let’s start with a simple file (SampleData.csv):

ID,Name,State,Country
1,Russ,Sandys,Bermuda
2,Katie,Texas,USA
3,Gail,Sandys,Bermuda
4,Melanie,Oregon,USA
5,Stephanie,Arizona,USA

I want to reduce my set to only people in the USA. I want to remove the Country column and I want to sort by State. This is my intended output (SampleDataSorted.csv).

ID,Name,State
5,Stephanie,Arizona
4,Melanie,Oregon
2,Katie,Texas

This would be very easy to do using SSIS. I would need a Flat File Source to read my csv file. I would need a Conditional Split transformation to remove the rows where the Country is not USA. And I would require a Sort transformation. Finally, I would use a Flat File Destination to output only the three columns.

I can do this with PowerShell as well (for reasons why, see PowerShell as an ETL tool).

So here is my ETL package:

cd c:\MyFolder
Import-Csv -Path SampleData.csv|Where {$_.Country -eq "USA"} `
  | SELECT ID, Name, State `
  |Sort-Object State `
  |Export-Csv SampleDataSorted.csv ; 

SSIS Equivalent         Task                  PowerShell
Flat File Source        Read the source       Import-Csv -Path SampleData.csv
Conditional Split       Filter the rows       Where {$_.Country -eq "USA"}
Flat File Destination   Reduce the columns    SELECT ID, Name, State
Sort Transformation     Sort                  Sort-Object State
Flat File Destination   Output the CSV        Export-Csv SampleDataSorted.csv

But it has a problem. This is the output.

#TYPE System.Management.Automation.PSCustomObject
"ID","Name","State"
"5","Stephanie","Arizona"
"4","Melanie","Oregon"
"2","Katie","Texas"

It would be a major pain to handle the #TYPE header. How do I get rid of this? And how do I remove the double quotes? I type this to learn more about the Export-CSV command:

help Export-CSV

That gives a little information, but the response suggests that I try this:

Get-help Export-Csv -Online

This brings me to a web page with all kinds of information. I learn that the command has a -NoTypeInformation switch that will remove the #TYPE line.

Unfortunately, you can’t remove the quotes from the Export-CSV output. This is a minor problem; I will examine solutions for it in future posts.

So the final command and its output are:

cd c:\MyFolder
Import-Csv -Path SampleData.csv|Where {$_.Country -eq "USA"} `
  | SELECT ID, Name, State `
  |Sort-Object State `
  |Export-Csv SampleDataSorted.csv `
  -NoTypeInformation; 

"ID","Name","State"
"5","Stephanie","Arizona"
"4","Melanie","Oregon"
"2","Katie","Texas"

Posted in ETL, PowerShell