SSIS Removing Duplicate Rows Using Conditions

Sometime your data source has duplicates. SSIS has an out of the box tool to remove these duplicates. The Sort transformation has a “Remove rows with duplicate sort values” property (https://www.mssqltips.com/sqlservertip/3036/removing-duplicates-rows-with-ssis-sort-transformation/). But it has a problem. It is very random as to the values that it removes. How do you specify which row to remove?

One scenario that one might encounter is a situation where you are bringing in data from multiple sources. If the data occurs in one source, then use that row, otherwise use data from the second source. I have built a simple SSIS package that you can download from
Sort Remove Duplicates Package. It has the following data flow:

SSIS DataFlow

SSIS Delete Duplicate DataFlow


The data sources are both Script components.

 public override void CreateNewOutputRows()
    {
        /*
          Add rows by calling the AddRow method on the member variable named "<Output Name>Buffer".
          For example, call MyOutputBuffer.AddRow() if your output was named "MyOutput".
        */
        int i = 0;
        int s = 1;
        Output0Buffer.AddRow();
        Output0Buffer.Set = s;
        Output0Buffer.ID = ++i;
        Output0Buffer.Val = 100;
        Output0Buffer.AddRow();
        Output0Buffer.Set = s;
        Output0Buffer.ID = ++i;
        Output0Buffer.Val = 100;
        Output0Buffer.AddRow();
        Output0Buffer.Set = s;
        Output0Buffer.ID = ++i;
        Output0Buffer.Val = 100;
        Output0Buffer.AddRow();
        Output0Buffer.Set = s;
        Output0Buffer.ID = ++i;
        Output0Buffer.Val = 100;
        Output0Buffer.AddRow();
        Output0Buffer.Set = s;
        Output0Buffer.ID = ++i;
        Output0Buffer.Val = 100;
        Output0Buffer.AddRow();
        Output0Buffer.Set = s;
        Output0Buffer.ID = ++i;
        Output0Buffer.Val = 100;
        Output0Buffer.AddRow();
        Output0Buffer.Set = s;
        Output0Buffer.ID = ++i;
        Output0Buffer.Val = 100;

    }

The script for the Secondary Data Set is roughly the same. I set the s variable to 2 for the Secondary Data Set. This is so that its data always follows the corresponding Primary Data Set row if it exists. I set the variable i to 4 so that the first rows (1 – 4) only have set 1 values, the following 3 rows have both set 1 and set 2 rows and the last rows are exclusively set 2. I also assign a different value to the set 2 output.

The next transformation is a Sort. The key I use is ID, Set. I want one row for each ID. I want to pick the first Set for that ID and I want to use Set to determine which is the first row for that ID.

Finally, I have a Script component to only output the first row encountered for each ID. I have to configure a few elements. The first thing that I do is to select the ID and Set columns as Input columns. That is so that I can refer to these columns in my Script component.

I also have to provide some way to exclude and include rows. To do that I use Exclusion Groups.

SSIS Script Component Exclusion Group

SSIS Script Component Exclusion Group


What you do is to set the ExclusionGroup to some number other than 0. What this will do is to allow my script to direct a particular row to that ExclusionGroup. Also note the SynchronousInputID. It should have a value that points to the only input.

The script is rather simple. You override the Input0_ProcessInputRow method:

    int lastID = -1;

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        if (Row.ID != lastID)
        {
            Row.DirectRowToOutput0();
            lastID = Row.ID;
        }
    }

Note that I have added a class level variable lastID. Every row is going to execute this code. If the ID column value (Row.ID) is not the same as lastID, then simply DirectRowToOutput0(). Then, don’t forget to set lastID equal to Row.ID for the next pass. What happens is that if a new ID is encountered, then the row will be sent on. Otherwise the Row disappears.

Please let me know if there is a simpler way to do this. Also, let me know how I can improve this script.

About Russ

Russ Loski is a data warehouse developer focusing on SSIS and SQL Server development. He has worked with SQL Server since 1998. He has used various technologies to move data from one source to others. He has worked for well-known companies such as the Bank of America, the Dallas Cowboys Football Club, T-Mobile and Coca Cola Enterprises. Currently he is developing a Data Warehouse for a large health care organization.
This entry was posted in ETL, SSIS, SSIS Script. Bookmark the permalink.

Leave a Reply