What to include in source view

In a previous post (ETL Source: full query or view) I suggest that you create a view (or function) to use as the source for your data load. In this post, I ask what should you include in your source view? I argue that in this case more really is more.

There is a lot to be said for including in your view just the columns and rows needed for the load. For example, when I am loading my fact table, I will need to find the data mart dimension keys. One of the design principles the Kimball group advocates is using surrogate keys for your dimension tables and relating your fact tables to the dimension tables using those surrogate keys (http://www.kimballgroup.com/1998/05/surrogate-keys/). The source table will have business keys. I just need to join the business keys to the dimension tables to get the dimension surrogate keys.

Here’s the rub. I really don’t need the business key, right? So why show that column? I really don’t intend to insert rows that don’t have a corresponding dimension row, so why not use an inner join to exclude rows without matching dimension rows? I may also be filtering out rows from the staging table. Why not filter those rows out in the view? And if I don’t load those filter columns into the fact table, I certainly want to exclude those columns from the view output.

But then certain rows are not inserted into the fact table. What happened? If I have a lean view, I have deconstruct the view to try to identify the points where these rows were excluded. It could be an element in the filter that inexplicably filters out the row. Or it could be that the business key is not in the dimension table.

If I write the view using a left outer join and include the business key in the view output, I can find the rows where the surrogate key is missing. If I don’t put the filter in the view, I can run the query without the filter. It is possible there is an unexpected business key value. Or the filter excludes the row. I can find the row that I expect to find and determine what causes it to be excluded.

When I use the query with the view, I can apply appropriate filters to get the results I want. I can apply not null filters on the view to exclude rows not matching dimension rows and I can filter the output. And certainly I won’t use the columns that are not necessary.

Leave a Reply