Building the stage table is potentially the most challenging part of the overall task of building a data warehouse analysis area.
Most of the effort required is in the design phase, in terms of knowing what data needs to come into the fact table that will ultimately be built.
This section assumes that the decision as to what to include has been made.
Multiple Data Sources
A Stage table typically contains the change data for a detail fact table. As such, it normally maps to a specific function within the business and in many cases relates to one main OLTP table. In many cases, however, it may be necessary to combine information from a number of tables. One of the decisions required is whether or not it is practical or even possible to join the data from the different source tables.
We could, however, also include two additional source tables invoice_header and invoice_line which contain specific information relating to what was on the invoice. We may want our fact table to contain information from these tables as well. Although these two tables may contain the order_number and hence potentially allow a join with the order tables we may choose not to perform such a join for performance reasons. In this case, we have three obvious choices in terms of how we ultimately update our fact table.
The choices are:
- Join all four tables using one large join in our staging table.
- Update the staging table in two passes; one pass updates the order information, and the other pass updates the invoice information.
- Generate two Stage tables, one for order and one for invoice. Use these two staging tables to update the one sales_detail fact table.
Although all three options are viable and a normal situation in the WhereScape RED environment, options (2) and (3) will require specific coding and modifications to the generated procedures from the outset.
Given the example provided, option (2) would be the normal approach; although in some cases option (3) would be valid.
Drag and Drop
The best approach in creating a Stage table is to choose the source table that contains the most fields that we will be using and drag this table into the stage target.
We can then drag specific columns from the other source tables until we have all the source data that is required.
The process for defining the metadata is as follows:
- Double-click on the Stage Table object group in the left pane. This will result in all existing stage tables being displayed in the middle pane. This also sets the middle pane as a stage drop target.
- Browse the Data warehouse connection to display your Load tables in the right pane. This is done via the Browse > Data Warehouse menu option.
- Drag the primary load table (i.e. the one with the most columns, or the lowest data granularity) from the right pane and drop it into the middle pane. A dialog appears to create the new staging object.
- Leave the object type as 'Stage Table' and change the name to reflect what is being done.
- Once a valid name is entered, the properties for the new Stage table are displayed. Normally these would be left unchanged except perhaps for storage settings.
- Once the Properties window is closed, the columns for the new Stage table are displayed in the middle pane. This middle pane is now considered a drop target for this specific stage table. Any additional columns or tables dropped into the middle pane are considered additions to this stage table definition. Any columns that are not required can be deleted.
- Drag and drop additional columns from other source tables if appropriate.
- Drag in the dimension artificial key from each Dimension that is to be joined to the Stage/Fact table.
- You can only join a Dimension, if a business key exists amongst the stage table columns or if it is possible to derive that business key in some way from the columns or other dimensions.
If a column is being used to join information from two or more source tables, that column must only appear once in the Stage table. It is irrelevant which table is used to create the column in the new Stage table.
Once completed, your list of columns for the Stage table should look something like the list below. Notice the Source Table for each column.
The source table (src table) reflects where each column was dragged from.
In the example above, the bulk of the columns came from the load_order_line table, and the customer_code, order_date, and ship_date came from the load_order_header columns.
These two Load tables will be joined via the order_number column. This order_number column appears in both load tables but is sourced, in this example, from the load_order_line table.
Each dimension artificial key was dragged from its appropriate table. The final column 'dss_update_time' was generated by WhereScape RED and has no source.
Creating the Table
Once the Stage table has been defined in the metadata, you need to physically create the table in the database.
- Right-click the Stage table from the left pane and select Create (ReCreate) from the pop-up menu.
- The results pane displays the results of the creation. The following example shows a successful creation.
The contents of this pane are a message to the effect that the table was created followed by a copy of the actual database create a statement, and if defined the results of any index created. For the initial create no indexes will be defined.
If the table was not created then ascertain and fix the problem. A common problem is a 'Duplicate column' where a column has been accidentally added twice.
The best way to find such a column is to double-click the list heading Column Name, to sort the column names into alphabetical order.
Another double-click on the heading sorts the columns back into their create order. Column order can be changed by altering the column order value against a column's Properties.
Double-clicking the heading of a column in a list sorts the list into alphabetical order based on the column chosen.