We just published an article in Patterns (Cell Press), "Transitive Sequencing Medical Records for Mining Predictive and Interpretable Temporal Representations," describing a high-throughput approach that learns from a set of phenotypes to model others (link to open access publication).

Transitive Sequencing Medical Records for Mining Predictive and Interpretable Temporal Representations

The Bigger Picture

Over the past decade, billions of dollars have been spent to institute meaningful use of electronic health record (EHR) systems. For a multitude of reasons, however, EHR data are still complex and have ample quality issues, which make it difficult to leverage these data to address pressing health issues, especially during pandemics such as COVID-19, when rapid responses are needed. In this paper, we propose a transitive sequential pattern mining algorithm for exploiting the temporal information in EHRs, which is distorted by layers of administrative and healthcare system processes. Perhaps more importantly, we propose a machine learning (ML) pipeline that can engineer predictive features for modeling diseases and health outcomes without the need for expert involvement. Together, the temporal sequences and the ML pipeline can be rapidly deployed to develop computational models for identifying and validating novel disease markers and advancing medical knowledge discovery.


Electronic health records (EHRs) contain important temporal information about the progression of disease and treatment outcomes. This paper proposes a transitive sequencing approach for constructing temporal representations from EHR observations for downstream machine learning. Using clinical data from a cohort of patients with congestive heart failure, we mined temporal representations by transitive sequencing of EHR medication and diagnosis records for classification and prediction tasks. We compared the classification and prediction performance of the transitive sequential representations (bag-of-sequences approach) with the conventional approach of using aggregated vectors of EHR data (aggregated vector representation) across different classifiers. We found that the transitive sequential representations are better phenotype “differentiators” and predictors than the “atemporal” EHR data. Our results also demonstrated that data representations obtained from transitive sequencing of EHR observations can present novel insights about the progression of the disease that are difficult to discern when clinical data are treated independently of the patient’s history.
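
To make the contrast between the two representations concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the implementation from the paper: the function names, the toy patient record, the code labels, the choice to keep only the first occurrence of each code, and the choice to skip same-day pairs are all hypothetical. The idea it shows is that transitive sequencing pairs each observation with every later observation, not only the next one, while the aggregated vector discards ordering entirely.

```python
from collections import Counter
from itertools import combinations


def transitive_sequences(records):
    """Return the set of (earlier_code, later_code) pairs for one patient.

    `records` is a list of (iso_date, code) tuples. Assumptions of this
    sketch: only the first occurrence of each code is used, and codes
    first recorded on the same day are not sequenced against each other.
    """
    first_seen = {}
    for date, code in sorted(records):
        # Keep the earliest date on which each code appears.
        first_seen.setdefault(code, date)

    ordered = sorted(first_seen.items(), key=lambda item: item[1])
    pairs = set()
    # Every earlier code is paired with every later code, which is what
    # makes the sequencing transitive rather than adjacent-only.
    for (code_a, date_a), (code_b, date_b) in combinations(ordered, 2):
        if date_a < date_b:
            pairs.add((code_a, code_b))
    return pairs


def aggregated_vector(records):
    """Return atemporal counts per code (the conventional representation)."""
    return Counter(code for _, code in records)


# Hypothetical toy record, for illustration only.
patient = [
    ("2019-01-05", "dx:hypertension"),
    ("2019-03-10", "rx:lisinopril"),
    ("2019-03-10", "dx:hypertension"),
    ("2020-02-01", "dx:heart_failure"),
]

sequences = transitive_sequences(patient)
counts = aggregated_vector(patient)
```

In this toy record, `sequences` contains the non-adjacent pair `("dx:hypertension", "dx:heart_failure")` alongside the two adjacent pairs, whereas `counts` only records how often each code appeared. In a bag-of-sequences setup, each distinct pair would become one feature column for a downstream classifier.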