Implementing low-latency SQL data sync into Elasticsearch
Elasticsearch is a powerful text search engine that provides many features not found in typical SQL databases. Text search, as implemented in SQL databases, tends to be CPU and I/O intensive, so there is a strong need to offload text searches from a SQL database onto a search engine like Elasticsearch. The original use case for Elasticsearch was sequential inserts from logs, and the ELK stack, Logstash in particular, is well suited to that task. But a SQL database's DML (insert/update/delete) is far more complex and requires extra care when merging into and updating the ES index.
The pitfalls of synchronizing SQL data to Elasticsearch
Elasticsearch requires all search parameters to reside in the same index (document store). This typically means that data from several SQL tables must be merged into a single hierarchical JSON document. Consider the relatively simple case of a header table with one or more detail tables. You can't just flatten the detail records or create dynamic smart keys in ES. For example, you can't create a field name that embeds a number such as an order number or order detail number: ES limits the number of field names an index can hold for performance reasons, and each dynamically generated field counts against that limit. The detail records should therefore be defined as a JSON array node with a consistent structure of primitives or JSON objects. This allows text searches across the document while minimizing the number of unique JSON element names.
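A minimal sketch of the contrast described above. The field names (`order_number`, `details`, `sku`, etc.) are illustrative assumptions, not anything prescribed by Elasticsearch or METL:

```python
# Anti-pattern: embedding the detail number in the field name.
# Each unique key ("detail_1001", "detail_1002", ...) becomes a separate
# mapped field in the index and counts against the field-name limit.
bad_doc = {
    "order_number": 5001,
    "detail_1001": {"sku": "A-100", "qty": 2},
    "detail_1002": {"sku": "B-200", "qty": 1},
}

# Preferred: a fixed-name array node with a consistent object structure.
# Only "details", "detail_number", "sku" and "qty" are mapped, no matter
# how many detail rows the order has.
good_doc = {
    "order_number": 5001,
    "details": [
        {"detail_number": 1001, "sku": "A-100", "qty": 2},
        {"detail_number": 1002, "sku": "B-200", "qty": 1},
    ],
}
```

With the array form, adding a thousandth detail line adds no new field names to the index mapping.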
The easiest way to manage updates in ES is simply to replace the entire document. But with a SQL database as the source, that may be easier said than done if you have to retrieve a complex data model. A change to any value within the document means you need to re-retrieve the entire document, so whatever feeds the ETL process (CDC, replication, staging tables, etc.) must account for retrieving the whole document.
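The full-replace approach can be sketched as follows. The row shapes and the `orders` index name are hypothetical, and the final index (PUT) request is shown only as a comment since it requires a live cluster:

```python
def build_order_doc(header_row, detail_rows):
    """Merge one header row and its detail rows into a single
    JSON-ready document suitable for a full-document replace."""
    doc = dict(header_row)
    doc["details"] = [dict(d) for d in detail_rows]
    return doc

# Rows as they might arrive from the SQL source (illustrative values).
header = {"order_number": 5001, "customer": "ACME"}
details = [
    {"detail_number": 1001, "sku": "A-100", "qty": 2},
    {"detail_number": 1002, "sku": "B-200", "qty": 1},
]

doc = build_order_doc(header, details)
# The whole document would then be sent as a full replace, e.g.:
#   PUT /orders/_doc/5001  with `doc` as the body
```

Note that this is exactly why any change event, even to a single detail column, forces the ETL layer to re-fetch the header and all detail rows before it can write.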
Updating and merging data in Elasticsearch
ES supports updates to root elements fairly easily, provided you supply the entire root node's data. Elasticsearch has some functionality for updating partial documents and merging arrays, but it is quite complex and limited in scope. Pulling this all together requires coding a significantly complex data persistence layer on top of your ETL platform.
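To make the array-merging problem concrete, here is a pure-Python sketch of the logic a partial update has to perform: replace the array element that matches a key, or append it if absent. This mirrors what a scripted upsert would need to do inside Elasticsearch; the key name `detail_number` is an assumption carried over from the earlier example:

```python
def merge_detail(doc, new_detail, key="detail_number"):
    """Replace the details element with a matching key, or append it.
    This is the in-memory equivalent of an array-merge partial update."""
    details = doc.setdefault("details", [])
    for i, existing in enumerate(details):
        if existing.get(key) == new_detail.get(key):
            details[i] = new_detail  # same key: replace in place
            return doc
    details.append(new_detail)      # new key: append
    return doc

doc = {
    "order_number": 5001,
    "details": [{"detail_number": 1001, "sku": "A-100", "qty": 2}],
}
merge_detail(doc, {"detail_number": 1001, "sku": "A-100", "qty": 5})  # replaces
merge_detail(doc, {"detail_number": 1002, "sku": "B-200", "qty": 1})  # appends
```

Even this simple case needs key-matching rules per array node, which is the kind of per-document logic that balloons into a full persistence layer.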
METL already resolves these issues
METL resolves this complexity with a metadata-driven rules engine that has all the Elasticsearch-specific rules already defined. METL's data dictionary specifies all of this functionality without manual coding, using ES-specific metadata tags that outline the ETL. METL also has write caching with primary-key queuing, which allows merging of data in memory from separate ETL data streams.
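The write-caching idea can be illustrated with a small sketch. This is not METL's implementation; it is a hypothetical cache showing how changes queued under the same primary key can be merged in memory so that only one document per key reaches Elasticsearch on flush:

```python
from collections import OrderedDict

class WriteCache:
    """Queue pending writes by primary key; later changes to the same key
    are merged in memory, so each flush emits one document per key."""

    def __init__(self):
        self._pending = OrderedDict()  # preserves first-seen key order

    def put(self, pk, partial_doc):
        merged = dict(self._pending.get(pk, {}))
        merged.update(partial_doc)  # shallow merge; a real engine would
        self._pending[pk] = merged  # apply per-node array-merge rules here

    def flush(self):
        batch = list(self._pending.items())
        self._pending.clear()
        return batch

cache = WriteCache()
cache.put(5001, {"customer": "ACME"})      # from one ETL stream
cache.put(5001, {"status": "shipped"})     # from another stream, same key
cache.put(5002, {"customer": "Globex"})
batch = cache.flush()
```

Coalescing writes this way reduces both the number of round trips to ES and the version conflicts that separate streams would otherwise cause.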