Data Processes and Machine Learning for Health Research in VA

Dr. Andrew J. Zimolzak; Internal Medicine, Baylor College of Medicine
Zoom Meeting ID: 970 7656 5407 Password: 477211

The US Department of Veterans Affairs (VA) runs one of the largest healthcare systems in the country, with 400,000 full-time employees and a health record database of 20 million patients and billions of patient encounters. However, the use of any medical record data for research presents challenges that require unique approaches, and the scale of VA health records and the VA's organizational structure add further challenges. We will introduce the discipline of clinical research informatics and describe the flow and structure of VA medical record data as it is used for research. We will present simple and advanced approaches to data cleaning, integration, and harmonization. We will discuss several medical research projects that have benefited from data cleaning efforts, such as randomized drug trials and large-scale identification of diagnostic error.


Andrew J. Zimolzak is an assistant professor in the department of internal medicine at Baylor College of Medicine. Dr. Zimolzak has studied the secondary use of routinely collected medical data for 10 years. He has direct experience with retrieval and analysis of data from electronic medical records from multiple health care systems, as well as medical insurance claims. This work has been applied in projects such as randomized trials of medications for hypertension and heart failure, pharmacogenomics, lung cancer genomics and precision medicine, kidney failure prediction, physicians’ delayed follow-up of patient test results, diagnostic error in the emergency department, and outcome prediction in COVID-19. Dr. Zimolzak has practiced general internal medicine in urgent care and inpatient hospital settings for over 10 years, and in addition to his research efforts he is currently a teaching hospitalist at the Michael E. DeBakey VA Medical Center in Houston. His interests include deriving accurate phenotype information from medical records, machine learning for improved efficiency of data cleaning, and research code reproducibility and sharing. He has been funded by the Gordon and Betty Moore Foundation and the Agency for Healthcare Research and Quality.