ETL MongoDB and SQL

ETL MongoDB to Oracle, SQL Server, MySQL & Postgres

Note: This article is a work in progress.

MongoDB is a Json based NoSQL database that stores your hierarchical data in a collection (equivalent to a table in a SQL database). MongoDB is schema-less because it’s Json schema can be modified without executing explicit commands. SQL databases schema is managed using DDL commands (Data Definition Language). MongoDB’s schema-less design is more agile, flexible and generally accelerates application development. But it can complicate data integration if your using legacy ETL tools. That is because legacy ETL tools “bind” their code to the database schema thereby negating the advantage of being schema-less, at least in the context of ETL. Ideally the ETL tool should share the agile, rapid application development that MongoDB provides. To do that requires 3 features of the ETL tool. First it must also be Json based. The second requirement is it must be metadata managed. Third, it needs to have a data rules engine. The term rules engine means it implements the data architecture and data model using the metadata and configuration. Compared tha to manual ETL coding that explicitly defines data mappings and transformation.

Semiotic Semantic ETL and Rules Based Processing

MongoDB requires all search parameters to reside in the same index (document store). This typically means that data from several SQL tables needs to be merged into a single hierarchical Json document. Let’s examine the relatively simplistic case of a header table with one or more detail tables. You can’t just flatten the details records or create dynamic smart keys in MongoDB. For example, you can’t create field name that contains a number like order number or order detail number. This allows text searches across the document while minimizing the number key words (Json element unique names).

The easiest way to manage updates in MongoDB is to simply replace the entire document. But with a SQL database as the source that may be easier said then done if your having to retrieve a complex data model. Any change to any value within the document means you need to retrieve the entire document. So depending on what feeds the ETL process; CDC, replication or staging tables etc. that would need account for retrieving the whole document.

In database update/merging of data in MongoDB

MongoDB supports updates to root elements fairly easily if you have provide the entire root node’s data. MongoDB has some functionality updating partial documents and merging arrays but it is quite complex and limited in scope.  So to pull this all together requires coding a significantly complex data persistence layer from your ETL platform.

Semiotic has already resolves these issues

Semiotic resolves this complexity with a metadata driven rules engine that has all the MongoDB specific rules already defined. Semiotic’s data dictionary specifies all this functionality without manual coding. The data dictionary has MongoDB specific metadata tags that outlines the ETL. Semiotic also has write caching with primary key queuing that allows merging of data in memory from separate ETL data streams.

For additional information