The data in the DWH or BAW
(Business Analytics Warehouse) should be correct and accurate is a rule and
basis of acceptability of any reporting solution. The data inaccuracy and non-correct data
always affect the credibility of the reporting solution. I believe this is one
of most influential ingredient for the recipe of the failure of BI system.
On other hand it is reality,
the data in the DWH or BAW could be incorrect or inaccurate due to variety of
reasons. There is a need to accept this reality and chalk out the plan for
corrective measures for incorrect data and put preventive measures to avoid
future inaccuracies. This blog is an attempt to discuss the possible approaches
for data corrections.
Data
Inaccuracy Reasons
As I mentioned earlier data in
the DWH could be incorrect or inaccurate due to a variety of reasons, mainly
- Incorrect requirement or missed, leading to incorrect ETL.
- Incorrect interpretation of requirements leading to incorrect ETL.
- Uncaught coding defects.
- Incorrect data from source.
The reasons mentioned above
would require revisiting the ETL code and respect to the incorrect requirements
and uncaught defects.
After detection of the
inaccurate or incorrect data in DWH, the next step is to analyze the data and
metadata and find out the root cause of inaccuracy, once done the next step is
to correct the data already loaded in data warehouse.
In my view there are three basic
data correction options: Cut Over/Data Correction ETL Patch, Full Load or
Partial Full Load.
Data Correction ETL Patch
Creating ETL Patch for historical
data corrections is the recommended approach. It is one of patch which would
correct the historical data.
The main steps involved are as
follows
- Analyze the ETL and identify the changes in it.
- Update the design.
- Correct the code.
- Test the code.
- Create a patch to update the historical data (data already in DWH) to correct it.
- Test the patch.
- Run the patch.
- Check the report for correction.
- If the reports are correct, then implement the corrected ETL.
This is the most widely used and
wise option to correct data inaccuracies. A through impact analysis routine
would help to minimize the unforeseen adverse effect on overall report system.
I have came across many cut over procedures at customer place where the cut
over solution has resolved the issues on area and created number of issues on
another areas of reporting spectrum.
Full Load
In my view this is no go option
in many situations, but full load would be a recommended approach under
following circumstances
- Data Corrections / Rebuild of confirmed dimension
- The configuration/design efforts for data corrections are more than time taken to do full load
- The full load if data corrections are impossible to work out e.g. deletions
- The full load against small data sources
- When there is no snapshots and aggregate dependency
I have been on project where a
complete full load took some around 7 to 10 days to finish, it is oblivious
that is a most time consuming and expensive data correction option.
Partial Full Load
The partial full load would be a
recommended approach under following circumstances, it is a sophisticated
approach than the full load, and it is more selective full load for required
transactional facts
- Data Corrections / Rebuild of transactional fact only, not for confirmed dimension
- The configuration/design efforts for data corrections are more than time taken to do partial full load
- The partial full load if data corrections are impossible to work out e.g. deletions
- The full load against small data sources
- When there is no snapshots and aggregate dependency
This approach is a right
alternative to avoid full load for all business areas, as I said earlier it is
more like a selective full load. The reporting solution data load procedures
selectivity and flexibility to go with selective data load is a key for success
of this approach.
Conclusion
In a life cycle of data warehouse
of 5 years, we have to go with all three options; the only choice is to which
option we should go for the moment of time.
I would recommend
- Data Correction ETL Patch is ideal solution for transactional as well as confirmed dimension data corrections
- Complex data corrections for confirmed dimension and transaction facts then the full load would be a right option
- Full load is impossible, partial full load for transactional fact is a viable option
I hope the information in the
blog is useful.
No comments:
Post a Comment