Your answer to the question posed in the title of this post should be, “Of course!” If that wasn’t your answer, then read on to see three ways your ETL can make your reporting faster.
Gathering Database Statistics
This can be the biggest bang for the buck. If the database optimizer doesn't have current statistics, it won't know about the data the ETL just loaded. Most databases do not gather statistics automatically, and those that do rarely do it in a timely manner. In Oracle, the default is to gather statistics once a week. For most of my clients, who load data daily or even hourly, once a week is not timely enough!

I worked with one group of ETL developers who thought Oracle was automatically gathering statistics. When I showed them the dates in the LAST_ANALYZED column of the ALL_TAB_STATISTICS view, they quickly realized that their statistics were between 6 and 184 days out of date. They then confirmed with their DBA that the default statistics-gathering job ran once a week, on Saturday.

The ETL knows best when it is time to gather statistics, so do not rely on your DBA to gather them for you. Your DBA does not know when the ETL has finished loading the data; instead, the DBA will schedule statistics gathering at a fixed time, say 6 am. Suppose your ETL finishes at 3 am and reporting begins immediately afterwards. The statistics are out of date, and reports that touch the most recent data can run quite slowly because the optimizer makes bad decisions. What happens if reporting starts at 6:05 am instead? There are two problems with that scenario. First, the statistics-gathering job may not have finished, so the statistics are still out of date. Second, report performance will likely suffer because the statistics job is running at the same time.

There are several approaches to gathering statistics, but two that I generally use:
- Gather statistics immediately after a table has been loaded.
- Gather statistics after the entire ETL process is complete.
The advantage of the first approach is that a table's statistics are up to date within seconds or minutes of its load. Should later steps in the ETL use the just-loaded data, they will be working from fresh statistics. The disadvantage is that pausing to gather statistics delays the start of the remaining ETL steps. A workaround is to gather statistics using a fire-and-forget call.

The second approach has the advantage that all the statistics-gathering code is encapsulated in one place instead of being spread across many jobs. Rather than retrofit statistics gathering into existing ETL jobs, most projects choose to write it as a single job that runs after all the other ETL has finished. This isn't the most effective approach, but it is far better than waiting for the DBA to do it. The first approach is the better one, especially if you are just starting ETL development.

If you are new to gathering statistics, be sure to coordinate with your DBA. In its past couple of releases, Oracle has introduced some significant changes to how it gathers statistics, and many of the old rules that were "best practices" in Oracle 9i are now very much out of date.
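To make the first approach concrete, here is a minimal Python sketch of the pattern: check how stale a table's statistics are via LAST_ANALYZED, and fire DBMS_STATS.GATHER_TABLE_STATS immediately after the load. The DBMS_STATS call itself is standard Oracle; the connection handling and the schema/table names in the usage comment are illustrative assumptions.

```python
from datetime import datetime, timedelta

# Threshold is an assumption: daily loads deserve at-least-daily statistics.
STALE_AFTER = timedelta(days=1)

def is_stale(last_analyzed, now=None):
    """True if a table's statistics are older than the threshold,
    or were never gathered at all (LAST_ANALYZED is NULL)."""
    now = now or datetime.now()
    return last_analyzed is None or (now - last_analyzed) > STALE_AFTER

def gather_stats_sql(owner, table):
    """Build the PL/SQL block the ETL can execute right after loading."""
    return (
        "BEGIN DBMS_STATS.GATHER_TABLE_STATS("
        f"ownname => '{owner}', tabname => '{table}', "
        "cascade => TRUE); END;"
    )

# Hypothetical usage with any DB-API cursor (names are illustrative):
# cur.execute("SELECT last_analyzed FROM all_tab_statistics "
#             "WHERE owner = :o AND table_name = :t",
#             o="DW", t="SALES_FACT")
# if is_stale(cur.fetchone()[0]):
#     cur.execute(gather_stats_sql("DW", "SALES_FACT"))
```

For the fire-and-forget variant, the same PL/SQL string could be submitted as a background job rather than executed inline, so the remaining ETL steps are not delayed.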
Loading Data in Sorted Order
Oracle and SQL Server both have a type of table in which the data is stored in sorted order. In Oracle this is the Index-Organized Table (IOT), and in SQL Server it is a table with a Clustered Index. As data is loaded, the database places each row into a specific block/page so that the data stays ordered. There are two benefits. First, queries that sort the data in the table's order skip the extra sorting step. Second, if the table is organized by date, and a query wants all the data for a specific date, then that data is stored together and requires less I/O (disk reads/writes).

Where the ETL can help reporting, and help itself to some extent, is to sort the data before loading it into the IOT or clustered-index target table. The database then has much less work to do to place each row into the correct block/page, and the likelihood of block/page splits is greatly reduced. When a block/page is split, the resulting blocks/pages are less full; full blocks/pages require less I/O to read, while sparsely populated ones require more. In general, the lower the I/O, the faster a report executes. If the data arrives unsorted, the database can end up thrashing, repeatedly reading and writing the same block/page as rows destined for it arrive scattered throughout the load.
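As a sketch of the idea, the snippet below sorts the extracted rows on the clustering key before the bulk insert, so the database receives them in key order. The table and column names are made up for illustration, and the insert itself is shown as a comment against a hypothetical DB-API cursor.

```python
from operator import itemgetter

# Rows as the ETL might produce them: (load_date, customer_id, amount).
rows = [
    ("2024-03-02", 17, 45.10),
    ("2024-03-01", 99, 12.00),
    ("2024-03-02", 3, 88.25),
    ("2024-03-01", 4, 7.50),
]

# Sort on the IOT / clustered-index key *before* loading, so the database
# appends rows in key order instead of splitting blocks/pages to make room.
cluster_key = itemgetter(0, 1)  # (load_date, customer_id)
rows.sort(key=cluster_key)

# Hypothetical bulk insert (executemany is standard DB-API):
# cursor.executemany(
#     "INSERT INTO sales_fact VALUES (:1, :2, :3)", rows)
```

If the source query can do the work instead, an ORDER BY on the clustering key in the extract achieves the same thing without sorting in the ETL tier.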
Loading Data Using the Bulk Loader
In Oracle, if a table has the compress property turned on (with standard compression), the data will only be compressed if the Bulk Loader (or any direct-path load) is used. Conventional insert operations will not compress the data. The ETL execution time takes a hit from the extra overhead of waiting for Oracle to compress the data, although there is a potential offset if you enable the NOLOGGING table property. The net time loss I've seen is about 5%. On the reporting side, Oracle reports that querying compressed data is about 20% faster. I'll take a 5% hit to my ETL any day to get a 20% gain in reporting. Your mileage may vary. Combine this tip with sorting the data, and you can significantly improve the compression ratio.

Keep in mind that the Bulk Loader should be used only in a Dataflow that performs nothing but inserts. It cannot be used where updates or deletes are generated, since those operations are turned into inserts when the Bulk Loader is enabled. In that case, consider compressing the table after the data has been loaded. Post-load compression can be quite time consuming, however, especially on very large tables, so consider re-compressing on a periodic basis or whenever the compression ratio drops below a certain threshold.

So, have you tried any of the above techniques? Did you use them in response to poorly performing report queries? If so, did you notice a performance increase afterwards? If you would like further consultation on these advanced ETL techniques, or you just want a Health Check of your current ETL implementation, send us an email. We would love to assist!
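As a closing footnote to the compression tip: Oracle's standard compression works by deduplicating repeated values within a block rather than by running a general-purpose compressor, but the benefit of sorting is analogous and easy to sanity-check with an ordinary one. The sketch below uses Python's zlib on a simulated low-cardinality column; it illustrates the principle, not Oracle's actual ratios.

```python
import random
import zlib

random.seed(42)
# Simulate a column of repeated dimension values, as a fact table would hold.
values = [random.choice(["EMEA", "APAC", "AMER", "LATAM"])
          for _ in range(5000)]

unsorted_bytes = "".join(values).encode()
sorted_bytes = "".join(sorted(values)).encode()

ratio_unsorted = len(zlib.compress(unsorted_bytes)) / len(unsorted_bytes)
ratio_sorted = len(zlib.compress(sorted_bytes)) / len(sorted_bytes)

# Sorting clusters identical values together, so the compressor sees long
# runs of the same value and the sorted stream compresses much better.
assert ratio_sorted < ratio_unsorted
```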