ETL SQL into Elasticsearch

Low-latency synchronization of complex SQL data models with Elasticsearch

Elasticsearch (ES) is a NoSQL database optimized for the fuzzy logic of text search. SQL databases tend to struggle with text search: it is CPU and I/O intensive, and their search logic is not as flexible. So there is a strong need to offload text searches from a SQL database onto a search engine. The original use case for Elasticsearch was sequential inserts from logs. This simple use case could be called "pump and dump" ETL, and Logstash is well suited for that particular task. A SQL database's data model and data changes are more complex and require more technical care to keep the data synchronized with Elasticsearch.

The pitfalls of synchronizing SQL data to Elasticsearch

Elasticsearch requires all search parameters to reside in the same index (document store). This typically means that data from several SQL tables needs to be merged into a single hierarchical JSON document. Let's examine the simple case of a header table with one or more detail tables. You can't just flatten the detail records or create dynamic smart keys in ES. For example, you can't create a field name that contains a number such as an order number or order detail number: ES has a performance limit on the number of field names an index can use, and each dynamically numbered field counts toward that limit. The detail records should therefore be defined as a JSON array node with a consistent structure of primitives or JSON objects. This allows text searches across the document while minimizing the number of unique field names (JSON keys).
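A minimal sketch of the header/detail shape described above, using invented table and field names (`order_header`, `lines`, `sku`, etc.): the detail rows land under one fixed array field, so the count of distinct field names stays constant no matter how many detail rows an order has.

```python
import json

# Hypothetical rows as an ETL step might fetch them from SQL.
order_header = {"order_id": 1001, "customer": "Acme Corp", "status": "shipped"}
order_details = [
    {"order_id": 1001, "line_no": 1, "sku": "WIDGET-A", "qty": 2},
    {"order_id": 1001, "line_no": 2, "sku": "WIDGET-B", "qty": 5},
]

def build_document(header, details):
    """Merge header and detail rows into one hierarchical document.

    The details become a JSON array under a single fixed field name
    ("lines") instead of numbered fields like "line_1", "line_2", so
    no new field names are created as detail rows accumulate.
    """
    doc = dict(header)
    doc["lines"] = [
        {k: v for k, v in d.items() if k != "order_id"}  # drop the join key
        for d in details
    ]
    return doc

doc = build_document(order_header, order_details)
print(json.dumps(doc, indent=2))
```

The same pattern extends to multiple detail tables: each one becomes its own consistently named array node in the document.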

The easiest way to manage updates in ES is to simply replace the entire document. But with a SQL database as the source, that may be easier said than done when you have to retrieve a complex data model: a change to any value within the document means you need to retrieve the entire document. So whatever feeds the ETL process (CDC, replication, staging tables, etc.) must account for retrieving the whole document.
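To illustrate the "retrieve the whole document" requirement, here is a sketch using Python's built-in sqlite3 as a stand-in for the source database (all table and column names are invented): when a change event arrives for an order, the header and all of its detail rows are re-read and the full replacement document is rebuilt.

```python
import sqlite3

# Stand-in source database with a header table and a detail table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE order_header (order_id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE order_detail (order_id INTEGER, line_no INTEGER, sku TEXT);
    INSERT INTO order_header VALUES (1001, 'Acme Corp');
    INSERT INTO order_detail VALUES (1001, 1, 'WIDGET-A'), (1001, 2, 'WIDGET-B');
""")

def rebuild_document(order_id):
    """Re-read the entire order and rebuild the ES document from scratch.

    Even if only one column of one detail row changed, the whole
    document must be reassembled before it can replace the old copy.
    """
    header = conn.execute(
        "SELECT order_id, customer FROM order_header WHERE order_id = ?",
        (order_id,)).fetchone()
    doc = {"order_id": header[0], "customer": header[1], "lines": []}
    rows = conn.execute(
        "SELECT line_no, sku FROM order_detail "
        "WHERE order_id = ? ORDER BY line_no", (order_id,))
    for line_no, sku in rows:
        doc["lines"].append({"line_no": line_no, "sku": sku})
    return doc

# A CDC event touching any column of order 1001 triggers a full rebuild:
doc = rebuild_document(1001)
```

In practice the rebuild query is the expensive part: a wide join (or several queries) per changed document, which is why the feed mechanism has to be designed around it.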

In-place updating and merging of data in Elasticsearch

ES supports updates to root elements fairly easily if you provide the entire root node's data. Elasticsearch also has some functionality for updating partial documents and merging arrays, but it is quite complex and limited in scope. Pulling this all together requires coding a significantly complex data persistence layer in your ETL platform.
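One piece of that persistence layer is array merging done in application code rather than in ES: upserting an incoming detail row into the document's array by a business key before writing the document back. A minimal sketch (field names `lines` and `line_no` are invented for the example):

```python
def merge_detail(doc, detail, key="line_no"):
    """Upsert one detail row into the document's "lines" array.

    If an array element with the same key already exists, its fields
    are overlaid with the incoming values; otherwise the row is
    appended. The caller then replaces the whole document in ES.
    """
    lines = doc.setdefault("lines", [])
    for i, existing in enumerate(lines):
        if existing[key] == detail[key]:
            lines[i] = {**existing, **detail}  # field-level merge
            return doc
    lines.append(detail)
    return doc

doc = {"order_id": 1001, "lines": [{"line_no": 1, "sku": "WIDGET-A", "qty": 2}]}
merge_detail(doc, {"line_no": 1, "qty": 3})          # update existing line
merge_detail(doc, {"line_no": 2, "sku": "WIDGET-B"})  # append new line
```

This is the kind of logic that must be written, tested, and maintained per entity when building the layer by hand.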

Semiotic has already resolved these issues

Semiotic resolves this complexity with a metadata-driven rules engine that has all the Elasticsearch-specific rules already defined. Semiotic's data dictionary specifies all this functionality without manual coding; it carries ES-specific metadata tags that outline the ETL. Semiotic also has write caching with primary-key queuing, which allows data from separate ETL data streams to be merged in memory.
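Semiotic's internals are not shown here, but the general idea of write caching with primary-key queuing can be sketched as follows (an illustrative toy, not the product's implementation): fragments arriving from separate streams are merged in memory by primary key, so each document reaches Elasticsearch as one consolidated write.

```python
class WriteCache:
    """Toy write cache: merge document fragments by primary key.

    Fragments from different ETL streams for the same key are combined
    in memory; flush() hands back one consolidated document per key,
    which a real pipeline would send as a single bulk request.
    """

    def __init__(self):
        self._pending = {}  # primary key -> merged document fragment

    def queue(self, pk, fragment):
        self._pending.setdefault(pk, {}).update(fragment)

    def flush(self):
        batch, self._pending = self._pending, {}
        return batch

cache = WriteCache()
cache.queue(1001, {"customer": "Acme Corp"})  # from a header stream
cache.queue(1001, {"status": "shipped"})      # from a separate stream
batch = cache.flush()                          # one write for key 1001
```

Coalescing writes this way also reduces index pressure on ES, since rapid successive changes to one document collapse into a single reindex.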

For additional information