
SQL ETL into Elasticsearch

Low-latency, multi-table SQL data is pulled and merged into a hierarchical JSON document in Elasticsearch

There is a strong need for SQL-style databases to offload text searches onto a search engine such as Elasticsearch. With non-trivial data models, several technical hurdles must be resolved first. Elasticsearch requires all search parameters to reside in the same index (document store). This typically means that data from several SQL tables needs to be merged into a hierarchical JSON document. The data model is typically a header table with multiple detail tables. The JSON schema implements the detail records as an array in a node. This allows text searches across the whole document while minimizing the number of keys (unique JSON element names). Attempts to eliminate arrays by “flattening” or pivoting the detail records, either vertically or horizontally, are problematic because Elasticsearch limits an index to 1,000 key names by default.

The next technical hurdle arises when detail records are modified. This forces a design choice: either fetch the entire JSON document from the source database, or merge just the new or changed data into the existing Elasticsearch document. In trivial data models, fetching the entire document is viable, but in larger, complex data models it places additional load on the OLTP database, because any time any data element changes, the entire document must be regenerated. Since the design goal is to offload work from the source database onto Elasticsearch, adding load back to the source database is counterproductive. However, updating partial documents and merging arrays is complex in Elasticsearch.

Intelligent Integration resolves this complexity by pulling only changed data, with minimal load on the source server, and by using efficient Elasticsearch syntax to merge the data on the Elasticsearch server. Intelligent Integration implements this codelessly using its metadata data dictionary. In addition, Intelligent Integration recovers automatically if data falls out of sync after a system outage, and merges the JSON document correctly.
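To make the header/detail document shape concrete, here is a minimal Python sketch of merging rows from one header table and two detail tables into hierarchical documents, with each detail table stored as an array in its own node. The table names, column names, and the build_documents helper are hypothetical and chosen only for illustration; this is not how Intelligent Integration builds documents, which the text above describes as codeless and metadata-driven.

```python
import json

# Hypothetical rows, as they might come back from three SQL queries.
headers = [
    {"order_id": 1, "customer": "Acme"},
    {"order_id": 2, "customer": "Globex"},
]
detail_tables = {
    "lines": [
        {"order_id": 1, "line_id": 10, "description": "blue widget"},
        {"order_id": 1, "line_id": 11, "description": "red widget"},
        {"order_id": 2, "line_id": 20, "description": "green gadget"},
    ],
    "notes": [
        {"order_id": 1, "note_id": 100, "text": "deliver to rear entrance"},
    ],
}


def build_documents(headers, detail_tables, key="order_id"):
    """Return {header key: nested document}, with one array node per detail table."""
    docs = {}
    for row in headers:
        doc = dict(row)
        for node in detail_tables:
            doc[node] = []  # every document gets the array, even if it stays empty
        docs[row[key]] = doc
    for node, rows in detail_tables.items():
        for row in rows:
            parent = docs.get(row[key])
            if parent is None:
                continue  # orphan detail row: no header document to attach to
            # Drop the foreign key; it is implied by the parent document.
            parent[node].append({k: v for k, v in row.items() if k != key})
    return docs


documents = build_documents(headers, detail_tables)
print(json.dumps(documents[1], indent=2))
```

However many detail rows an order has, the document uses the same small set of key names (lines.line_id, lines.description, and so on), which is why the array form stays well under the key limit while a pivoted form would add new keys for every detail row.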

Here is a real world use case at

Data architects want to move text searches off SQL Server and onto Elasticsearch. Columns from 8 tables must be merged into a single hierarchical JSON document for a single Elasticsearch index.
Note: Change Data Capture (CDC) has been implemented to log OLTP data changes.

Intelligent Integration pulls the CDC data and maps the columns into a partial JSON document. The partial JSON is then merged into a single document on Elasticsearch. Pulling only the data that changed from the CDC tables, rather than retrieving all the data needed for the full JSON document, reduces the load on the source server. CDC is used for incremental loads, and the base tables are queried if a full load is required.
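The incremental path described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the product's actual implementation: the to_update_body and apply_locally helpers, the node and column names, and the Painless script are all hypothetical. The script has not been verified against a live cluster; it follows the common pattern of a scripted update with an upsert, as accepted by the Elasticsearch update API. The __$operation codes are the ones SQL Server CDC change tables use (1 = delete, 2 = insert, 3 = update before-image, 4 = update after-image).

```python
# SQL Server CDC operation codes found in the __$operation column.
CDC_DELETE, CDC_INSERT, CDC_UPDATE_BEFORE, CDC_UPDATE_AFTER = 1, 2, 3, 4

# Assumed Painless script: replace (or remove) one array element by its key.
MERGE_SCRIPT = """
if (ctx._source[params.node] == null) { ctx._source[params.node] = []; }
ctx._source[params.node].removeIf(d -> d[params.key] == params.row[params.key]);
if (!params.deleted) { ctx._source[params.node].add(params.row); }
"""


def to_update_body(change, node, key, parent_key):
    """Map one CDC row to an update request body, or None for before-images."""
    op = change["__$operation"]
    if op == CDC_UPDATE_BEFORE:
        return None  # the after-image (code 4) carries the new values
    # Strip CDC metadata columns (they all start with "__$") and the foreign key.
    row = {k: v for k, v in change.items()
           if not k.startswith("__$") and k != parent_key}
    deleted = op == CDC_DELETE
    return {
        "script": {
            "lang": "painless",
            "source": MERGE_SCRIPT,
            "params": {"node": node, "key": key, "row": row, "deleted": deleted},
        },
        # Used only if the target document does not exist yet.
        "upsert": {node: [] if deleted else [row]},
    }


def apply_locally(doc, body):
    """Pure-Python mirror of MERGE_SCRIPT, to check the merge without a cluster."""
    p = body["script"]["params"]
    items = [d for d in doc.get(p["node"], []) if d[p["key"]] != p["row"][p["key"]]]
    if not p["deleted"]:
        items.append(p["row"])
    doc[p["node"]] = items
    return doc


# Hypothetical existing document and a batch of CDC rows for its "lines" node.
doc = {"order_id": 1, "lines": [{"line_id": 10, "description": "blue widget"}]}
changes = [
    {"__$operation": 3, "order_id": 1, "line_id": 10, "description": "blue widget"},
    {"__$operation": 4, "order_id": 1, "line_id": 10, "description": "navy widget"},
    {"__$operation": 2, "order_id": 1, "line_id": 11, "description": "red widget"},
]
for change in changes:
    body = to_update_body(change, node="lines", key="line_id", parent_key="order_id")
    if body is not None:
        apply_locally(doc, body)
print(doc)
```

Each request body carries only the changed detail row, so the source database is never asked to regenerate the full document. Replacing the array element by key also makes the merge safe to replay, which matters when changes are re-sent after an outage.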

For additional information