Cost of using SSIS Variables

When I think of a variable in programming, I thing of a construct that is not much more costly to use than the equivalent constant.  Thus, if I had the following loop in C#, I could replace the variable “splitString” with a constant and the performance impact would be minimal.

int splitLength = 3;
string rowVal = "This is the string to split";
int lenRow = rowVal.Length;
System.Text.StringBuilder sb = new System.Text.StringBuilder();
for (int i = 0; i < lenRow; i += splitLength)
  {
    sb.Append("-" 
     + rowVal.Substring(i, Math.Min(splitLength, lenRow - i)));
  }

I think that we make the same assumption about SSIS variables.  We build SSIS Script Components and in the ProcessInput method, we unthinkingly use the SSIS variable. Thus, many of us would be surprised to learn that calling an SSIS variable is a costly operation.  When a Script Component is called to process millions of rows, that cost is unacceptable.

The following code snippets copy a string column, inserting a hyphen at regular intervals.  That interval is stored in an SSIS variable.  I ran this against 16 million rows.  I created two versions of this code: one that directly references the SSIS variable in the ProcessInput method.  The second reads the SSIS variable into a standard instance variable in the PreExecute method.  In the ProcessInput method, I reference the instance variable.

Direct SSIS Variable Reference

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    System.Text.StringBuilder sb = new System.Text.StringBuilder();
    string rowVal = Row.Column;
    int lenRow = rowVal.Length;

    for (int i = 0; i < lenRow; i += this.Variables.SplitLength)
    {
        sb.Append("-" 
          + rowVal.Substring(I
           , Math.Min(this.Variables.SplitLength, lenRow - i)));
    }
    Row.ColumnOut = sb.ToString();
}

Component Using Instance Variable

// Instance variable
int splitLength = 2;

public override void PreExecute()
{
    base.PreExecute();
    splitLength = this.Variables.SplitLength;
}
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    System.Text.StringBuilder sb = new System.Text.StringBuilder();
    string rowVal = Row.Column;
    int lenRow = rowVal.Length;

    for (int i = 0; i < lenRow; i += splitLength)
    {
        sb.Append("-" 
          + rowVal.Substring(I
           , Math.Min(splitLength, lenRow - i)));
    }
    Row.ColumnOut = sb.ToString();
}

When I ran both of these, the difference in performance was striking. The package using the direct SSIS variable reference as in the first code sample takes about 9 minutes to process 16 million rows. The same package using an instance variable as in the second code sample runs in less than 10 seconds.

Here is the problem, when you use the SSIS variable, you are actually calling a property method. This method is not trivial. It needs to read the SSIS variable value from the SSIS variables collection. If you do this 16 million times, this is costly.

Code for the SplitLength Variable Property

public Int32 SplitLength
{
    get
    {
        return (Int32)(ParentComponent.ReadOnlyVariables
          ["SplitLength"].GetValueWithContext
          (ScriptComponent.EvaluatorContext));
    }
}

My recommendation is that if you need to read an SSIS variable in your Script Component, read that variable in the PreExecute method into an instance variable and reference that instance variable in the ProcessInput method.  That is unless you need to slow your ETL process down.