Monday, 30 December 2013

Oracle Business Intelligence Application (OBIA) – Data Corrections Approaches


The data in the DWH or BAW (Business Analytics Warehouse) should be correct and accurate is a rule and basis of acceptability of any reporting solution.  The data inaccuracy and non-correct data always affect the credibility of the reporting solution. I believe this is one of most influential ingredient for the recipe of the failure of BI system.

On other hand it is reality, the data in the DWH or BAW could be incorrect or inaccurate due to variety of reasons. There is a need to accept this reality and chalk out the plan for corrective measures for incorrect data and put preventive measures to avoid future inaccuracies. This blog is an attempt to discuss the possible approaches for data corrections.

 Data Inaccuracy Reasons 

As I mentioned earlier data in the DWH could be incorrect or inaccurate due to a variety of reasons, mainly

  • Incorrect requirement or missed, leading to incorrect ETL.
  • Incorrect interpretation of requirements leading to incorrect ETL.
  • Uncaught coding defects.
  • Incorrect data from source.
The reasons mentioned above would require revisiting the ETL code and respect to the incorrect requirements and uncaught defects.

After detection of the inaccurate or incorrect data in DWH, the next step is to analyze the data and metadata and find out the root cause of inaccuracy, once done the next step is to correct the data already loaded in data warehouse.

In my view there are three basic data correction options: Cut Over/Data Correction ETL Patch, Full Load or Partial Full Load.

Data Correction ETL Patch

Creating ETL Patch for historical data corrections is the recommended approach. It is one of patch which would correct the historical data.

The main steps involved are as follows

  • Analyze the ETL and identify the changes in it.
  • Update the design.
  • Correct the code.
  • Test the code.
  • Create a patch to update the historical data (data already in DWH) to correct it.
  • Test the patch.
  • Run the patch.
  • Check the report for correction.
  • If the reports are correct, then implement the corrected ETL.

This is the most widely used and wise option to correct data inaccuracies. A through impact analysis routine would help to minimize the unforeseen adverse effect on overall report system. I have came across many cut over procedures at customer place where the cut over solution has resolved the issues on area and created number of issues on another areas of reporting spectrum.

Full Load

In my view this is no go option in many situations, but full load would be a recommended approach under following circumstances

  • Data Corrections / Rebuild of confirmed dimension
  • The configuration/design efforts for data corrections are more than time taken to do full load
  • The full load if data corrections are impossible to work out e.g. deletions
  • The full load against small data sources
  • When there is no snapshots and aggregate dependency

I have been on project where a complete full load took some around 7 to 10 days to finish, it is oblivious that is a most time consuming and expensive data correction option.

Partial Full Load

The partial full load would be a recommended approach under following circumstances, it is a sophisticated approach than the full load, and it is more selective full load for required transactional facts

  • Data Corrections / Rebuild of  transactional fact only, not for confirmed dimension
  • The configuration/design efforts for data corrections are more than time taken to do partial full load
  • The partial full load if data corrections are impossible to work out e.g. deletions
  • The full load against small data sources
  • When there is no snapshots and aggregate dependency

This approach is a right alternative to avoid full load for all business areas, as I said earlier it is more like a selective full load. The reporting solution data load procedures selectivity and flexibility to go with selective data load is a key for success of this approach.
 
Conclusion

In a life cycle of data warehouse of 5 years, we have to go with all three options; the only choice is to which option we should go for the moment of time. 

I would recommend

  • Data Correction ETL Patch is ideal solution for transactional as well as confirmed dimension data corrections

  • Complex data corrections for confirmed dimension and transaction facts then the full load would be a right option

  • Full load is impossible, partial full load for transactional fact is a viable option

I hope the information in the blog is useful.

No comments:

Post a Comment