Generative models for computational phenotyping
A generative modeling approach for ML-based phenotyping with observational data
Created Jun 12, 2018 - Last updated: Oct 12, 2020
Almost all of the ML-based phenotyping work published in recent years has relied on discriminative learning approaches. The prevailing methodology enhances logistic regression (or other discriminative models) with regularization techniques that shrink the parameter space and thereby act as an embedded feature selection method. While discriminative classifiers have typically been favored in industry, generative learning approaches hold significant promise for phenotyping: they scale to large datasets and can learn from limited silver/gold standard labeled data.
Let us consider X as a vector of features, X = {X1, …, Xn}, and Y as the target variable (e.g., Y = 1 indicates a positive diabetes diagnosis), both derived from Electronic Health Records (EHR) data. Here, both the features X and the target variable Y are random variables with a joint distribution over (X, Y). The objective of the feature engineering task is to identify which features Xi are statistically associated with Y.
Discriminative classifiers, such as logistic regression, SVMs, and decision trees, are designed to delineate explicit hard or soft boundaries between the classes of Y within the dataset. They directly model the conditional probability \( p(y|X) \), the posterior probability of outcome y given the set of features X. Generative classifiers, by contrast, make structural assumptions about the data that help to mitigate overfitting. They learn the distribution of the classes of Y within the data by modeling the joint probability \( p(X,y) \), which amounts to modeling both \( p(X \mid y) \) and \( p(y) \). Figure 1 illustrates the differing approaches of discriminative and generative learning in a two-class classifier.
Figure 1: Discriminative vs. generative approaches in a two-class classifier.
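To make the distinction concrete, here is a minimal sketch in Python (assuming scikit-learn and synthetic data; naive Bayes stands in as the canonical generative classifier, none of which is prescribed by the text above). The discriminative model exposes only boundary weights, while the generative model exposes its learned prior p(y) and class-conditional feature distributions p(X | y):

```python
# Minimal sketch: what a discriminative vs. a generative classifier learns.
# Assumes scikit-learn; the synthetic dataset is purely illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=4, random_state=0)

# Discriminative: models p(y | X) directly as a decision boundary.
disc = LogisticRegression(max_iter=1000).fit(X, y)
print("boundary weights (no model of the data itself):", disc.coef_)

# Generative: models the prior p(y) and the class-conditionals p(X | y)
# (here, per-class Gaussians); the posterior follows from Bayes' rule.
gen = GaussianNB().fit(X, y)
print("learned prior p(y):", gen.class_prior_)
print("per-class feature means of p(X | y):", gen.theta_)

# Both yield the posterior p(y | X) at prediction time.
print("discriminative p(y | x):", disc.predict_proba(X[:1]))
print("generative     p(y | x):", gen.predict_proba(X[:1]))
```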
Based on probability theory, we know that:
$$ p(X, Y=y) = p(Y=y \mid X) \cdot p(X) $$
where
$$ p(X) = p(X \mid Y=y) \times p(Y=y) + p(X \mid Y \neq y) \times p(Y \neq y) $$
Here, $ p(Y=y \mid X) $ is the posterior probability of $ Y $ taking the value $ y $, given the observed features $ X $. According to Bayes' rule, in a generative learning framework, we have:
$$ p(Y=y \mid X) = \frac{p(X \mid Y=y) \times p(Y=y)}{p(X)} $$
Within the generative Bayesian approach, $ p(X \mid Y=y) $ serves as the generative function (the likelihood), while $ p(Y=y) $ represents the prior probability of $ Y $ taking the value $ y $, independent of the observed features $ X $. In other words, $ p(Y=y) $ corresponds to the phenotype. The generative approach provides more than just the posterior conditional probability $ p(Y=y \mid X) $; it also learns the phenotype $ p(Y=y) $ and allows us to refine the posterior by incorporating additional data and updating the prior $ p(Y=y) $.
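As a minimal worked instance of this update (all numbers below are hypothetical, chosen only to trace the arithmetic):

```python
# Hypothetical numbers, purely to trace Bayes' rule end to end.
prior = 0.10      # assumed prior p(Y = y), e.g., phenotype prevalence
lik_pos = 0.80    # assumed likelihood p(X | Y = y)
lik_neg = 0.15    # assumed likelihood p(X | Y != y)

# Evidence via the law of total probability.
evidence = lik_pos * prior + lik_neg * (1 - prior)   # p(X) = 0.215

# Bayes' rule: posterior = likelihood * prior / evidence.
posterior = lik_pos * prior / evidence
print(f"p(Y = y | X) = {posterior:.3f}")             # 0.372
```

Under these assumed numbers, a modest 10% prior is pulled up to roughly 37% by features that are about five times more likely under the phenotype than without it; updating the prior (say, with prevalence data from a new site) would shift the posterior accordingly.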
Generative models in a clinical research context
In a clinical research context, consider a scenario where Y represents the label used to determine the diagnosis of Type 2 Diabetes Mellitus (T2DM), with binary classes Y ∈ {1, 0}. In this case, p(Y = 1) denotes the phenotype associated with T2DM and also represents the prior probability of T2DM prevalence within a specific healthcare system or geographic region.
The probability p(X) denotes the likelihood of observing a set of features X derived from EHR data; the feature set itself can be further refined using domain expertise and natural language processing (NLP)-based knowledge discovery techniques.
In a generative or Bayesian framework, this relationship is succinctly expressed as:
Posterior Probability = Generative Model (likelihood) × Prior, normalized by the evidence p(X)
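A naive Bayes model is one simple way to instantiate this relationship over several EHR features at once. The sketch below is illustrative only: the feature names, prevalence, and all likelihood values are invented assumptions, and the conditional independence of features given Y is itself a modeling assumption:

```python
# Naive Bayes sketch over hypothetical binary EHR features for T2DM.
# All feature names and probabilities are invented for illustration.
import math

PRIOR_T2DM = 0.10  # assumed prevalence p(Y = 1) in this population

# Assumed likelihoods p(X_i = 1 | Y): (given T2DM, given no T2DM).
LIKELIHOODS = {
    "icd10_E11_code": (0.85, 0.02),
    "metformin_rx":   (0.70, 0.01),
    "elevated_hba1c": (0.90, 0.05),
}

def posterior_t2dm(observed):
    """p(Y = 1 | X), assuming features are independent given Y."""
    log_pos = math.log(PRIOR_T2DM)
    log_neg = math.log(1 - PRIOR_T2DM)
    for name, present in observed.items():
        p1, p0 = LIKELIHOODS[name]
        log_pos += math.log(p1 if present else 1 - p1)
        log_neg += math.log(p0 if present else 1 - p0)
    pos, neg = math.exp(log_pos), math.exp(log_neg)
    return pos / (pos + neg)   # normalize by the evidence

print(posterior_t2dm({"icd10_E11_code": True,
                      "metformin_rx": True,
                      "elevated_hba1c": False}))   # ~0.97 with these numbers
```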
In the generative learning approach, the model's ability to generate data (observations) is contingent on knowledge of the features and, conversely, the features can be inferred from the generated data. This relationship is illustrated in Figure 2.
Figure 2: The generative learning approach.
The generative modeling approach strikes a nuanced balance between a fully data-driven paradigm and a domain-/knowledge-driven paradigm that relies on expert judgment and annotations.
Generative models, owing to their inherently lower degrees of freedom compared to discriminative models, exhibit greater robustness and reduced susceptibility to overfitting: their structural assumptions act as a built-in form of regularization. This allows generative models to potentially outperform discriminative models, particularly in scenarios involving small datasets.
More critically, generative models can impose structural assumptions on the data, which mitigates the risk of overfitting. They aim to learn the distribution of the classes of the phenotype Y within the dataset by estimating the joint probability p(X, y); in essence, generative models consider both p(X | y) and p(y). Consequently, generative models are particularly advantageous for computing disease probabilities, as they can learn effectively from limited labeled data.
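The sketch below is one way to probe that limited-label claim empirically (the setup and any observed gap are illustrative, not a guarantee): fit a generative and a discriminative classifier on progressively smaller labeled subsets of the same synthetic data and compare held-out accuracy.

```python
# Illustrative small-label comparison; the dataset and sample sizes are
# assumptions, and results will vary with the data-generating process.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1)

for n_labeled in (20, 50, 200, 1000):  # shrinking labeled-data budgets
    Xs, ys = X_pool[:n_labeled], y_pool[:n_labeled]
    acc_gen = accuracy_score(y_test, GaussianNB().fit(Xs, ys).predict(X_test))
    acc_dis = accuracy_score(
        y_test,
        LogisticRegression(max_iter=1000).fit(Xs, ys).predict(X_test))
    print(f"n={n_labeled:4d}  generative={acc_gen:.3f}  "
          f"discriminative={acc_dis:.3f}")
```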
The crux of phenotyping lies in discerning the underlying distributional characteristics of the phenotype, Y, within the data by learning the joint probability p(X, y). This endeavor aligns with the true objective of phenotyping, which is to characterize generalizable traits of a phenotype rather than merely identifying the presence or absence of the phenotype in individuals. In other words, phenotyping involves a broader conceptualization of phenotypic traits. Therefore, generative models are particularly well-suited for computational phenotyping, as they provide an effective framework for capturing and modeling these generalized traits.