I recently encountered a plumbing issue at my house that required the water to be shut off throughout the house for a 24-hour period. I wasn’t completely unprepared, but I could have been better prepared. Toilets don’t work, dishes can’t be cleaned, many foods can’t be cooked, and, worst of all, you have no drinking water. When I started thinking about what equipment and supplies would be needed if the water were off for more than 24 hours, I branched off into other areas where prior planning prevents problems. My thinking eventually landed on what I do every day at work: populating data warehouses.
The questions I asked were:
- If the database that hosts the data warehouse is down for more than a daily cycle, will the ETL handle that gracefully?
- If the source database is down for more than the daily cycle, how well will the ETL handle that?
- If the ETL server is down, will a range of data fail to be pulled into the data warehouse?
All three questions are primarily about the nuts and bolts of how the ETL is designed. If the ETL is designed to get today’s data, what happens if the ETL never successfully runs today?
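To see why that question matters, consider a naive pull that hard-codes “today.” This is a minimal sketch with hypothetical table and column names, not any particular ETL tool’s syntax:

```python
from datetime import date, timedelta

def naive_daily_query(run_date: date) -> str:
    """Builds a pull that only ever asks for one calendar day of changes."""
    day_start = run_date
    day_end = run_date + timedelta(days=1)
    # Hypothetical table/column names; the point is the hard-coded window.
    return (
        "SELECT * FROM orders "
        f"WHERE last_updated >= DATE '{day_start}' "
        f"AND last_updated < DATE '{day_end}'"
    )

# If the job failed on 2015-01-06, the 2015-01-07 run never asks for
# 2015-01-06 rows -- that day's changes are silently skipped.
print(naive_daily_query(date(2015, 1, 7)))
```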
What is a “proper” design? The answer is, “It depends.” Outlined below are three strategies for making your data pull bullet-proof.
1. Source-Dependent Date Range
Does your source system maintain a LAST_UPDATED column in every table? Even more importantly, you must ask, “Is the LAST_UPDATED column reliable?” If data in the source system can be manually updated without a change to the LAST_UPDATED column, then you must assume that the column is not reliable.
If the LAST_UPDATED column is reliable, then you should absolutely make use of it. However, the ETL must keep track of the range of LAST_UPDATED values that was used in the prior run. Somewhere on the ETL side, it has to know that the last pull of data from the source system covered 01/05/2015 04:01:00.000000 through 01/06/2015 04:01:02.000000. Knowing that range, the next day the ETL will look for data that changed after 01/06/2015 04:01:02.000000.
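A minimal sketch of that bookkeeping, using the Python standard library’s sqlite3 as a stand-in for the real metadata store, with hypothetical table and column names:

```python
import sqlite3
from datetime import datetime

meta = sqlite3.connect(":memory:")  # stands in for the ETL metadata database
meta.execute("CREATE TABLE etl_watermarks (table_name TEXT PRIMARY KEY, pulled_through TEXT)")
meta.execute("INSERT INTO etl_watermarks VALUES ('orders', '2015-01-06 04:01:02.000000')")

def build_incremental_pull(table_name: str, now: datetime) -> tuple[str, str]:
    """Return the pull query for everything after the stored watermark."""
    (low,) = meta.execute(
        "SELECT pulled_through FROM etl_watermarks WHERE table_name = ?",
        (table_name,),
    ).fetchone()
    high = now.strftime("%Y-%m-%d %H:%M:%S.%f")
    query = (f"SELECT * FROM {table_name} "
             f"WHERE last_updated > TIMESTAMP '{low}' "
             f"AND last_updated <= TIMESTAMP '{high}'")
    return query, high

def advance_watermark(table_name: str, high: str) -> None:
    """Call only after the pull commits; a failed run reuses the old range."""
    meta.execute("UPDATE etl_watermarks SET pulled_through = ? WHERE table_name = ?",
                 (high, table_name))

query, high = build_incremental_pull("orders", datetime(2015, 1, 7, 4, 1, 2))
print(query)  # pulls everything changed since the last successful run
advance_watermark("orders", high)
```

Because the watermark only moves forward after a successful run, a day (or a week) of downtime widens the next pull instead of losing it.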
Pros:
- Usually very reliable
- Pulls only the data that changed
- Low network overhead
- If any part of the system is down, the stored date range is still used on the next run, so nothing is missed; the ETL could easily pull a week’s worth of data with this method
Cons:
- Source systems rarely index the LAST_UPDATED column, so the incremental pull requires a full table scan, which can be very costly on large tables
- The ranges used for each source table must be carefully managed; any sloppiness in the ranges can result in lost data
- Was the change due to an insert or an update? If you don’t know the answer to that, then a Table Comparison transform or a WHERE NOT EXISTS pattern must be used (see the sketch after this list). If the source has both LAST_UPDATED and LAST_INSERTED columns, then the Table Comparison can be avoided.
- All related tables have to use the same date range; otherwise you could pull a child source row without also pulling the parent source row
- Rows deleted in the source system aren’t in the feed, so an alternate method of accounting for those rows has to be created. Since the feed is a subset of the source, the “Detect deleted rows” feature in the Table Comparison transform cannot be used.
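Here is the WHERE NOT EXISTS pattern mentioned in the insert-vs-update bullet above, sketched against sqlite3 with hypothetical stage and target tables; the same SQL shape applies on a real warehouse:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stage_orders  (order_id INTEGER PRIMARY KEY, amount REAL);  -- today's changed rows
    CREATE TABLE target_orders (order_id INTEGER PRIMARY KEY, amount REAL);  -- the warehouse table

    INSERT INTO stage_orders  VALUES (1, 10.0), (2, 99.0);   -- 1 is new, 2 changed
    INSERT INTO target_orders VALUES (2, 50.0);

    -- Rows with no match in the target must be inserts.
    INSERT INTO target_orders (order_id, amount)
    SELECT s.order_id, s.amount
    FROM stage_orders s
    WHERE NOT EXISTS (SELECT 1 FROM target_orders t WHERE t.order_id = s.order_id);

    -- Everything else in the stage is an update.
    UPDATE target_orders
    SET amount = (SELECT s.amount FROM stage_orders s WHERE s.order_id = target_orders.order_id)
    WHERE order_id IN (SELECT order_id FROM stage_orders);
""")
print(conn.execute("SELECT * FROM target_orders ORDER BY order_id").fetchall())
# [(1, 10.0), (2, 99.0)]
```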
2. Target-Dependent Comparison
When no LAST_UPDATED column is available in the source, the ETL has to do the heavy lifting. This method requires the ETL to pull the entire source data set and use a Table Comparison transform to compare it against the target data set.
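In spirit, the comparison amounts to a full diff of the two data sets. A minimal sketch, assuming single-column keys and that both data sets fit in memory (which is exactly the overhead the cons below describe):

```python
def compare(source_rows: dict, target_rows: dict):
    """Classify every source/target row as insert, update, or delete."""
    inserts = {k: v for k, v in source_rows.items() if k not in target_rows}
    updates = {k: v for k, v in source_rows.items()
               if k in target_rows and target_rows[k] != v}
    deletes = [k for k in target_rows if k not in source_rows]  # gone from source
    return inserts, updates, deletes

source = {1: ("widget", 10.0), 2: ("gadget", 99.0)}   # full source pull
target = {2: ("gadget", 50.0), 3: ("doodad", 7.0)}    # current warehouse contents
print(compare(source, target))
# ({1: ('widget', 10.0)}, {2: ('gadget', 99.0)}, [3])
```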
Pros:
- Simple and quick to design
- Very reliable. You’ll never miss any data
- Rows deleted in the source system will be handled by the Table Comparison transform
Cons:
- Very high overhead on the source database, the network, the ETL server and the target database
- Very costly in terms of execution time. This is absolutely the slowest method you could use
- Handling deleted source rows in the Table Comparison transform is costly
3. Source-Based Change Data Capture (CDC)
This solution requires a good amount of upfront design, hardware, and configuration. I have used Oracle CDC for this, and if the Oracle side is set up just right, it’s fairly easy to write the ETL using native components.
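Conceptually, the ETL then reads an ordered feed of change records that the CDC infrastructure maintains and replays each operation against the target. The operation codes and column names here are hypothetical, not Oracle’s actual CDC schema:

```python
# Each change record carries an operation code and the row image.
changes = [
    {"op": "I", "order_id": 1, "amount": 10.0},   # insert
    {"op": "U", "order_id": 1, "amount": 12.5},   # update
    {"op": "D", "order_id": 1, "amount": None},   # delete
]

target: dict[int, float] = {}

for change in changes:                 # apply in commit order
    key = change["order_id"]
    if change["op"] in ("I", "U"):
        target[key] = change["amount"]
    elif change["op"] == "D":
        target.pop(key, None)          # deletes arrive in the feed, unlike strategy 1

print(target)  # {} -- the row was inserted, updated, then deleted
```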
Pros:
- All source-side changes are captured
- Very reliable
- Works well running multiple times a day
- Hands down, this is the fastest-performing strategy
Cons:
- Requires additional hardware
- Requires additional DBA knowledge and administration
- Requires additional developer knowledge and understanding of source-based CDC
- Adds an additional point of failure into the design
Rarely does a source system work perfectly with an ETL system to populate a data warehouse, so a hybrid of the first two strategies is often used. For small tables, say fewer than one million rows, there is rarely any reason to get more sophisticated than the Target-Dependent Comparison. But huge source tables, like a record of financial transactions, present a problem that requires a much higher level of sophistication. When the system can never lose a transaction, and comparing the entire target to the entire source would take weeks, the CDC strategy is the way to go.
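One way to picture the hybrid is a per-table routing rule; the thresholds and the example below are purely illustrative:

```python
def choose_strategy(row_count: int, reliable_last_updated: bool,
                    can_lose_data: bool) -> str:
    """Illustrative routing of a source table to one of the three strategies."""
    if row_count < 1_000_000:
        return "target-dependent comparison"   # small: just compare everything
    if reliable_last_updated:
        return "source-dependent date range"   # big, but changes are trackable
    if not can_lose_data:
        return "source-based CDC"              # big, untrackable, loss-intolerant
    return "target-dependent comparison"       # otherwise, accept the cost

# e.g. a financial transaction table: huge, no reliable LAST_UPDATED, no losses
print(choose_strategy(2_000_000_000, False, can_lose_data=False))  # source-based CDC
```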
The more thought you put into creating a bullet-proof design and implementing plenty of defensive programming, the less time you’ll spend fixing data inconsistencies.