Bridging the gap between noisy healthcare data and knowledge: causality and portability

Lecture | January 23 | 10 a.m.-12 p.m. | 5101 Berkeley Way West

 Xu Shi, PhD

School of Public Health

Routinely collected healthcare data present numerous opportunities for biomedical research but also come with unique challenges, including data quality problems, unmeasured and mismeasured confounding, high-dimensional covariates, and patient privacy concerns. In this talk, I present tailored causal inference methods and an automated data quality control pipeline that aim to overcome these challenges and make the transition from data to knowledge. I detail the challenge of inconsistent “languages” used by different healthcare systems and coding systems. In particular, different healthcare providers may use different medical codes to record the same diagnosis or procedure, which limits the transportability of phenotyping algorithms and statistical models across healthcare systems. I formulate medical code translation as a statistical problem: inferring a mapping between two sets of multivariate, unit-length vectors, each learned from a different healthcare system. The problem is particularly interesting because a fraction of the response-predictor pairs in the training data are mismatched, whereas classical regression analysis tacitly assumes that each response is correctly linked to its predictor. I propose a novel method for mapping recovery and establish theoretical guarantees for estimation and model selection consistency.

CA, 510-643-8154