Auto-etl: a Generic Methodology for Transforming Entity-attribute-value (Eav) Data Model Into Flat Tables Using Form Metadata to Optimize Report Generation. A Case Study of Kenya EMR

Ojwang’, Antony

View/Open

Full Text (2.095Mb)

Date

2018

Author

Ojwang’, Antony

Type

Thesis

Language

Metadata

Show full item record

Abstract

The use of Entity Attribute Value (EAV) data model is strongly favored by the existing research in both medical and other domains because of its flexibility and extensible nature that makes it suitable for storing highly sparse and heterogeneous medical data/observations. The model is however slow in data extraction and reporting and as a result affects timely decisions for patient care. Research has shown that pivoting of the EAV data through ETL (Extract, Transform and Load) process is the ideal solution to the data extraction and reporting problems in the EAV based systems. The ETL process has largely been a manual process which was costly, slow, tedious, and prone to human errors. At present, a number of studies have documented approaches to automate the different parts of the ETL process which consist of generation of ETL specifications, rule composition, and SQL statement generation. With automation, the process enjoys reduced human involvement and consequently improve the accuracy and reduce the overall time and cost in the design, development, and implementation of the process. However, there has not been full automation and experts are still involved, either fully or partially, in supporting the generation of rules for the process hence the disadvantages of a manual process. This study introduces Auto-ETL methodology that uses form metadata to fully automate the ETL rule generation and SQL generation phases and partly the less demanding ETL specifications phase of the process. To evaluate the new methodology, we conducted a pilot experiment to compare efficiency and accuracy of the Auto-ETL methodology to a purely manual process. We also conducted analysis of triage variables (weight, height, temperature, respiratory rate) data extracted from both the EAV model and the generated tables from two facility EMR databases. Results of the pilot experiment shows that Auto-ETL is faster and more accurate in generating queries. The generated datasets also showed great simplification in SQL query complexity and performance compared to those for EAV model. For analysis, descriptive data characteristics from the source and generated tables was presented by use of median and range (minimum, maximum) for the continuous variables. Lin’s concordance correlation with 95% confidence interval (CI) was used to measure the agreement between the source dataset and generated tables for the study variables and an agreement will be considered if the concordance correlation is closer to 1. The R squared was used to report the relationship between the variables form the source dataset and generated tables and considered to have a strong relationship if closer to 1.Statistically significant association will be considered when p < 0.05. For descriptive characteristics, results showed same median across the two data sources and a p-value of <0.001 for R squared and CI value of 0.997 for concordance relationship. These are indications of a strong concordance between source data and ETL data and consequently informs the conclusion that form metadata can be used to support generation of SQL statements for ETL process with zero data loss.