With more than 25 years of experience as a Data Architect at both Fortune 500 companies and start-ups, I've seen plenty of the pain points of complex data integration. Let's list the typical issues facing data integration:
- Poorly designed data models (this one deserves five bullet points of its own)
- Large and/or complex schema
- Poorly structured flat files
- Large SQL code base
- High error rates
- No documentation
- High cost of ETL programmers
- Long implementation times for changes
The typical problem in fixing these issues is the broad scope needed to redesign the entire platform. My experience has shown that they can be broken down and attacked efficiently. Just as the abbreviation ETL stands for Extract, Transform, and Load, we need to manage all three categories. In addition, we need to account for the three major use cases of data integration: database to database, file import/export from a database, and streaming data from an application via a microservice/REST protocol.
Before diving into the details, I want to state my overall design goal for managing complex ETL: use a "rules" methodology rather than manual coding. If every problem you're facing requires writing code, you're not working efficiently.
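To make the rules-versus-code distinction concrete, here is a minimal sketch of a rules-driven transform: the mappings live in a data structure rather than in hand-written per-table code. All table, column, and function names here are hypothetical, not taken from any particular platform.

```python
# Hypothetical rule table: each rule maps a source column to a target column
# with an optional transform. Adding a column means adding a rule, not code.
RULES = [
    # (source_column, target_column, transform)
    ("cust_nm", "customer_name", str.strip),
    ("cust_st", "state_code",    str.upper),
    ("ord_amt", "order_amount",  float),
]

def apply_rules(row: dict) -> dict:
    """Transform one source row into a target row by walking the rule table."""
    return {target: fn(row[source]) for source, target, fn in RULES}

source_row = {"cust_nm": "  Acme Corp ", "cust_st": "tx", "ord_amt": "19.95"}
print(apply_rules(source_row))
```

The same `apply_rules` function handles every table; only the rule data changes, which is what keeps change times short as the schema grows.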
Let's start with a source that is a SQL-style database; assume it is part of a complex application with hundreds or even thousands of tables. I prefer a separate extract process that has minimal performance impact on the application, which means implementing staging tables or a change data capture (CDC) process. That contrasts with trying to query the existing application tables directly, which has several issues:
- The tables may not have adequate timestamps to properly identify records that have changed.
- Larger tables need an index on that timestamp column to query the changed data efficiently.
- The application code or SQL code across a complex application would need to guarantee the timestamp is properly updated on every table.

Fortunately, setting up change data capture is not too complex. Many database vendors provide this technology, and even a custom CDC solution can be implemented with modest effort. Another valuable point is that CDC provides additional metadata columns for the ETL process and, importantly, consistent column names for that metadata. This allows the ETL to follow a rules methodology.
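To illustrate the custom-CDC option, here is a small sketch using SQLite triggers: every change to an application table is copied into a change table that carries uniform metadata columns, so downstream ETL can process any table's changes with the same rules. The table names and metadata columns (`cdc_table`, `cdc_op`, `cdc_ts`) are illustrative assumptions, not a vendor's actual CDC schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);

-- One change table per source table, sharing the same metadata columns.
CREATE TABLE customer_cdc (
    cdc_table TEXT,                          -- source table name
    cdc_op    TEXT,                          -- I, U, or D
    cdc_ts    TEXT DEFAULT CURRENT_TIMESTAMP,
    id        INTEGER,
    name      TEXT
);

CREATE TRIGGER customer_ins AFTER INSERT ON customer BEGIN
    INSERT INTO customer_cdc (cdc_table, cdc_op, id, name)
    VALUES ('customer', 'I', NEW.id, NEW.name);
END;

CREATE TRIGGER customer_upd AFTER UPDATE ON customer BEGIN
    INSERT INTO customer_cdc (cdc_table, cdc_op, id, name)
    VALUES ('customer', 'U', NEW.id, NEW.name);
END;
""")

conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Acme')")
conn.execute("UPDATE customer SET name = 'Acme Corp' WHERE id = 1")

# The ETL reads only the change table, never the live application tables,
# so the extract has minimal impact on the running application.
for row in conn.execute("SELECT cdc_table, cdc_op, id, name FROM customer_cdc"):
    print(row)
```

Because every change table exposes the same metadata column names, a single rules-driven extract loop can be pointed at any table, which is exactly what makes the rules methodology possible.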