Microsoft quietly added two functions to their SSIS expression language: Token and TokenCount. I think that these functions will find their way into any serious SSIS developer’s tool belt. Here I show how you can use this to solve a very common problem, interpreting dates.
When importing data, frequently one column of data will need to be split into multiple columns. For example, the source has a person’s name, but you need to parse it into First Name and Last Name. To do that with a derived column involves using the FindString and SubString functions: FindString to find the breaks between words and SubString gets the text at that break. The code that you create is hard to read and maintain.
The function Token (technet.microsoft.com/en-us/library/hh213216.aspx), splits up a text based on a string and returns one of the tokens. The function looks something like this.
Token("Russ Loski", " ", 1)
This function call splits the string “Russ Loski” by the second argument ” ” (space) and returns the first token that it found: “Russ”. If I replace the final argument with 2, then it will return “Loski”. By the way, if I use 3, there is no third token in this string and thus Null is returned.
What if I want to get the last word? This is the function of the TokenCount function (technet.microsoft.com/en-us/library/hh213135.aspx). TokenCount returns the number of tokens when parsed by the string.
TokenCount("Russ Loski", " ")
This example returns 2, as there are two tokens in “Russ Loski” when split by space ” “.
I can put these together to get the last word in my string.
Token("Russ Loski", " ", TokenCount("Russ Loski", " "))
This returns “Loski” the last word in this sentence.
One of the problems that ETL developers encounter are dates that SSIS is unable to parse. In almost all cases the pattern is easy to see, it is a challenge to write the parsing routine. Take the following two dates that are in the Date column:
12/1/2013 3:45 13/1/2013 3:45:20
Casting both of these strings to date works on my machine (language set to US English).
(DT_DBTIMESTAMP)"12/1/2013 3:45" (DT_DBTIMESTAMP)"13/1/2013 3:45:20"
SSIS interprets the first as December 1, 2013 and the second as January 13, 2013. It appears that the first date it uses my computer’s language setting. In the second, SSIS acts like it is trying to make sense out of an invalid date, by treating the digits greater than 12 as a date and digits less than 13 as month.
This is almost certainly wrong. Unless your data source is completely unreadable, it will use a consistent date format. And if it isn’t, then the providers need to make the data consistent at the source.
I would assume therefore that the format of the date is date / month / year (space) hour : minute. How do I convert this a datetime data type? I can use the Token function to get the data from the Date column.
Token(Column, " /:", 1)
Notice that the second argument has three characters, not one. The way that this works is that the string will be split by any of these three characters.
So I can treat token 1 as day, token 2 as month, token 3 as year, token 4 as hour and token 5 as minute. I can put these together to create an ISO date that SSIS will unambiguously interpret.
(DT_DBTIMESTAMP)( TOKEN(Column,"/ :",3) + "-" + RIGHT("00" + TOKEN(Column,"/ :",2),2) + "-" + RIGHT("00" + TOKEN(Column,"/ :",1),2) + " " + RIGHT("00" + TOKEN(Column,"/ :",4),2) + ":" + RIGHT("00" + TOKEN(Column,"/ :",5),2) )
Try to write this using SubString and FindString. And be sure to have an aspirin ready, as you will need it before you are done. The Token function is so much easier to work with.
To get a summary of the new functions in SSIS 2012, check out Devon Knight’s post (www.bidn.com/blogs/DevinKnight/ssis/2749/new-ssis-2012-expressions