Data Integrity Validation Methodologies for High-Volume Healthcare ETL Pipelines: Automated Testing Strategies and Quality Assurance Frameworks

Authors

  • Mehulkumar Joshi

Keywords

data integrity, healthcare ETL, ELT, data validation, automated testing, data quality, interoperability, dbt testing, auditability, quality assurance

Abstract

The article examines methodologies for validating data integrity in high-volume healthcare ETL/ELT pipelines, where heterogeneous source systems, evolving interoperability standards, and strict compliance constraints amplify the consequences of defects. The topic is relevant because clinical analytics, revenue-cycle reporting, and population health workflows depend operationally on accurate, traceable, and scalable transformed data. The novelty lies in an integrated validation model that combines data-quality theory, secure-processing guidance, and modern transformation testing practices in a single quality-assurance workflow tailored to healthcare semantics. The work aims to synthesize automated testing strategies that reduce undetected schema drift, mapping errors, and business-rule violations across batch and near-real-time processing. To this end, the article applies analytical review, comparative synthesis, and structured mapping of controls to pipeline stages, drawing on recent peer-reviewed research and authoritative standards. The concluding section formulates implementation-ready recommendations for layered checks, evidence logging, and governance linkages. The article is intended for healthcare data engineers, analytics leaders, and compliance stakeholders responsible for reliable data delivery.
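
To make the idea of layered checks with evidence logging concrete, the following minimal Python sketch runs three layers over one batch: schema drift, source-to-target reconciliation, and business rules. It is an illustration under stated assumptions, not the article's implementation. The table layout in EXPECTED_SCHEMA, the names ValidationReport and validate_batch, and the two business rules are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date, datetime, timezone

# Hypothetical contract for an illustrative encounters table (not from the article).
EXPECTED_SCHEMA = {"patient_id": str, "encounter_date": date, "charge_amount": float}


@dataclass
class ValidationReport:
    """Keeps one evidence entry per check so a run can be audited afterwards."""
    evidence: list = field(default_factory=list)

    def record(self, layer: str, check: str, failures: list) -> None:
        self.evidence.append({
            "layer": layer,
            "check": check,
            "passed": not failures,
            "failures": failures,
            "checked_at": datetime.now(timezone.utc).isoformat(),
        })

    @property
    def passed(self) -> bool:
        return all(entry["passed"] for entry in self.evidence)


def validate_batch(rows: list, source_row_count: int) -> ValidationReport:
    report = ValidationReport()

    # Layer 1: schema drift. Flag rows whose columns or exact types
    # differ from the expected contract.
    drifted = [
        i for i, row in enumerate(rows)
        if set(row) != set(EXPECTED_SCHEMA)
        or any(type(row[col]) is not typ for col, typ in EXPECTED_SCHEMA.items())
    ]
    report.record("schema", "columns_and_types_match_contract", drifted)

    # Layer 2: reconciliation. The loaded row count must equal the count
    # reported by the source extract.
    mismatch = [] if len(rows) == source_row_count else [
        f"expected {source_row_count} rows, got {len(rows)}"
    ]
    report.record("mapping", "row_count_matches_source", mismatch)

    # Layer 3: business rules, evaluated only on rows that passed layer 1.
    # Assumed rules: non-empty patient_id, non-negative charge, no future dates.
    today = date.today()
    violations = [
        i for i, row in enumerate(rows)
        if i not in drifted and (
            not row["patient_id"]
            or row["charge_amount"] < 0
            or row["encounter_date"] > today
        )
    ]
    report.record("business_rule", "encounter_values_plausible", violations)

    return report
```

Each call returns a report whose evidence list can be written to an audit store, with report.passed acting as a gate before downstream models run. In a dbt-based pipeline the same layers would more likely be expressed as model contracts and data tests; plain Python is used here only to keep the sketch self-contained.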

Author Biography

  • Mehulkumar Joshi

    Senior Analytics Engineer, RXNT, Philadelphia, PA

Published

2026-04-17

Issue

Vol. 57 No. 1 (2026)

Section

Articles

How to Cite

Mehulkumar Joshi. (2026). Data Integrity Validation Methodologies for High-Volume Healthcare ETL Pipelines: Automated Testing Strategies and Quality Assurance Frameworks. International Journal of Computer (IJC), 57(1), 255-264. https://ijcjournal.org/InternationalJournalOfComputer/article/view/2516