Skip to main content
  • Original Research
  • Open access
  • Published:

A structural equation model for imaging genetics using spatial transcriptomics

Abstract

Imaging genetics deals with relationships between genetic variation and imaging variables, often in a disease context. The complex relationships between brain volumes and genetic variants have been explored with both dimension reduction methods and model-based approaches. However, these models usually do not make use of the extensive knowledge of the spatio-anatomical patterns of gene activity. We present a method for integrating genetic markers (single nucleotide polymorphisms) and imaging features, which is based on a causal model and, at the same time, uses the power of dimension reduction. We use structural equation models to find latent variables that explain brain volume changes in a disease context, and which are in turn affected by genetic variants. We make use of publicly available spatial transcriptome data from the Allen Human Brain Atlas to specify the model structure, which reduces noise and improves interpretability. The model is tested in a simulation setting and applied on a case study of the Alzheimer’s Disease Neuroimaging Initiative.

1 Introduction

The aim of imaging genetics studies is to find associations between genetic variants and imaging features, often in a disease context [1]. This scheme extends beyond traditional genome-wide association studies (GWAS) by identifying genetic associations of imaging biomarkers with the assumption that these biomarkers are a more direct reflection of the genetic effects. Thus, they could provide a stronger association signal [2]. Additionally, the identified associations are likely to provide new insights into the underlying disease mechanisms as well as new hypotheses about the anatomical and/or functional locations involved in complex diseases [3].

So far, imaging genetics studies have been largely focused on the brain [1, 3,4,5,6], despite efforts to extend their application to other fields [7]. Several large consortia have gathered data from thousands of subjects to understand the effects of genetic variants on brain structure and function [8]. One of the hallmark sources for imaging genetics studies is the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database [9]. This database contains single nucleotide polymorphism (SNP) and structural MRI data for Alzheimer’s patients, individuals with late mild cognitive impairment, and cognitive normal controls.

One of the largest challenges facing imaging genetics studies is the statistical power needed to identify reliable associations. In a typical GWAS, researchers have to correct for the number of independent tests performed (i.e. number of independent SNPs tested) in order to limit the number of false-positive discoveries. However, a genome-wide brain-wide imaging genetic study will not only have to correct for the number of independent SNPs, but also for the number of independent imaging features tested. As a result, many studies are underpowered to identify reliable associations. One of the largest imaging genetics studies [10] analysed over 30,000 individuals within the Enhancing Neuro Imaging Genetics through Meta-Analysis (ENIGMA) consortium. They performed a genome-wide association of SNPs with seven brain volumes and identified only eight genome-wide significant SNPs.

Despite the high dimensionality of the imaging data (millions of voxels), the actual number of independent tests for which we need to correct in an imaging genetics study is far smaller than the number of voxels. Due to the spatial relationships between voxels, measurements from neighbouring voxels are usually highly correlated. A common approach is to test genetic associations for anatomically defined brain regions [2]. Several studies have shown that both neuroanatomical parcellation and connectivity of the brain are strongly reflected in gene expression patterns across the brain [11,12,13]. The public availability of brain transcriptome atlases from the Allen Institute for Brain Science [14] provides an opportunity to use these transcriptional signatures to group the anatomically defined brain regions, further limiting the number of effective tests.

Several methods have been proposed to identify associations between genetic variants and imaging features by applying dimension reduction, such as variations of canonical correlation analysis [15], and independent component analysis (commonly used in a functional MRI context) [4]. Others have opted to model the interactions between the different data types explicitly, for instance using graphical Bayesian models [16, 17] which capture a more mechanistic causal view of the data. These models consist of a directed acyclic graph, which can easily be made to incorporate covariates, including possible confounding factors. Both of these studies use relatively small candidate SNP sets, because they aim for understanding SNP–brain relationships rather than the discovery of genome-wide associations. However, these Bayesian models are quite challenging to specify and fit.

In this work, we propose a method to identify associations between candidate genetic variants and imaging features allowing for the incorporation of prior knowledge. The proposed method combines a graphical model with dimension reduction to model the effect of SNPs on brain imaging features through a set of latent variables. We use a maximum likelihood structural equation modelling (SEM) approach to find the edge weights of our model [18]. By performing dimensionality reduction within the model, we reduce the number of parameters to be estimated. In addition, the model allows for easy incorporation of information from the Allen Human Brain Atlas [12] to inform the grouping of brain regions based on the similarity of their transcriptional profiles.

Our model uses the transcriptional profiles for grouping because we consider gene expression to be an intermediate phenotype, that links SNPs to brain imaging features. Most disease-associated SNPs are located near regulatory regions of the genome [19], and the effects of SNPs on expression tend to be tissue and cell type specific [20]. Gene expression data of brain regions reflect cell type composition and anatomical similarity [12] and capture a wide range of brain-specific molecular pathways [21]. For these reasons the region groups in the dimension reduction are based on spatial gene expression data of the brain.

2 Materials and methods

The interplay between genetic variation, brain anatomy, and disease symptoms is complex. We use a structural equation model with latent variables [18] to model these relationships. We pose that the genetic variation is exogenous; in other words, the genetic variation in a study population is not caused by disease or brain anatomy. This variation does have an effect on the brain. For example, in Alzheimer’s disease, genetic variants may influence the immune response and amyloid \(\beta\) concentrations in the brain, which may in turn lead to shrinkage in several brain areas [22]. Large-scale imaging initiatives, such as ADNI, offer a possibility to study this shrinkage of brain regions. This can be estimated from MRI data of diseased individuals and controls, and expressed in cortical thickness and subcortical volume measurements.

In our graphical model, we define groups of brain regions, based on the transcriptional profiles of these areas in the healthy brain. Areas that share patterns of gene expression in a normal brain may be similarly affected by genetic variations. For each of the region groups, we introduce one latent variable. This latent variable is affected by the genetic variations and causes changes in relevant brain regions. This makes our model similar to principal component analysis (PCA) on sets of brain regions, combined with a regression for the latent variables. However, in our model the weights are estimated together, and the latent variables reflect not only the correlations between the regions (as in a conventional PCA), but also those between regions and SNPs and among the SNPs.

2.1 Variables used

We model the relationship between single nucleotide polymorphisms (SNPs) and brain region measurements. Let \({\mathbf {g}}_i \in {\mathbb {R}}^{p}\) be a vector of centred (zero-mean) SNP values, and \({\mathbf {x}}_i \in {\mathbb {R}}^{q}\) a vector of centred (zero-mean) and scaled (\({\text {sd}}=1\)) brain region measurements, both for individual i. The reason both types of measurements are centred is to eliminate intercepts from the model. The brain measurements are, in addition, scaled to unit variance to compensate for the considerably larger variance in thickness or volume for larger brain areas. The genetic variants and brain measurements are connected in the model by a set of latent variables, \({\mathbf {z}}_i \in {\mathbb {R}}^{m}\).

In addition to the variables included in the model, we have two other sources of information. In defining the model structure, we make use of external information on the brain region measurements, in the form of brain region groups with a shared transcriptional profile. These groups are defined based on the spatial gene expression data of the healthy adult brain. Finally, the goal is to understand disease-related phenotypes. The disease labels are not used in the modelling stage. However, we hypothesize that if the variation in the data is related to a disease state, the latent variables will reflect this. After model fitting, we therefore associate each individual’s estimated latent variable score with his or her disease status.

2.2 The graphical model

We model the relationship between brain SNP values and brain region measurements in a structural equation model (SEM). It consists of two parts. The first part is a linear model for brain region measurements as a function of the latent variables,

$$\begin{aligned} {\mathbf {x}}_i = {\mathbf {B}}{\mathbf {z}}_i+\varvec{\zeta}_{i}, \end{aligned}$$
(1)

where \({\mathbf {x}}_i\) contains the observed brain region measurements, \({\mathbf {z}}_i\) are the latent variables, and \(\varvec{\zeta}_{i}\) is a zero-mean normally distributed error variable. The matrix \({\mathbf {B}}\) contains the weights of the latent variables that explain the brain region measurements. The second part of the SEM is a linear model for these latent variables as a function of the SNP values,

$$\begin{aligned} {\mathbf {z}}_i = {\mathbf {A}}{\mathbf {g}}_i+\varvec{\varepsilon }_i, \end{aligned}$$
(2)

where \({\mathbf {g}}_i\) contains the observed SNP measurements and \(\varvec{\varepsilon }_i\) is a zero-mean normally distributed error variable. The matrix \({\mathbf {A}}\) contains regression weights, representing the effects of the SNPs on the latent variables. Combined, these equations mean that region changes are viewed as a manifestation of the latent values, while the SNP values are considered causal to them. The latent variables represent some intermediate phenotype, related to the molecular state of the connected brain regions.

The number of latent variables is equal to the number of brain region groups, which are defined based on external spatial gene expression data. A region group contains the brain regions with a similar transcriptional profile, as these may react similarly to differences in genetic background. We restrict each latent variable to only predict the brain region measurements for its own region group. This results in a restriction on the weight matrix \({\mathbf {B}}\), where each latent variable (corresponding to a column in \({\mathbf {B}}\)) has a unique set of nonzero entries. Figure 1 shows the model for two latent variables, where we can see that each latent variable is connected to its own set of brain regions.

Fig. 1
figure 1

The graphical structural equation model. Observed variables are shown in grey circles, latent variables in white circles, and error variables without circles. This example contains two latent variables, both with their own set of observed brain region measurements. This structure, where the latent variables define groups of region measurements, is defined by prior knowledge on these brain regions. We use spatially resolved gene expression data of the healthy human brain to define these region groups

2.3 Model implied covariance

In linear Gaussian structural equation modelling, we learn the parameters of a model by optimizing the correspondence between the observed covariance \({\mathbf {S}}\) (from the data) and the model implied covariance \(\varSigma\). The model implied covariance can be divided into a block matrix, by defining

$$\varSigma = \left[\begin{array}{cc} \varSigma _{gg}&\varSigma _{gx} \\ \varSigma _{xg}&\varSigma _{xx} \end{array}\right].$$

Note that this implied covariance does not contain any components for the latent variables in \({\mathbf {z}}\). The latent variables are not observed, and therefore, we cannot use their observed covariance in fitting the model.

The elements of the implied covariance can be parameterized in terms of the model coefficients. The first element is

$$\begin{aligned} \varSigma _{gg} = {\mathbf {E}}\left[ {\mathbf {g}}{\mathbf {g}}^{\mathrm{T}}\right] . \end{aligned}$$

This is the covariance of the SNPs (since these values are centred). The SNPs are exogenous in our model: \({\mathbf {g}}\) does not have any causal variables within our model. As a result, the implied covariance of the SNPs is not parameterized in terms of model coefficients. We can estimate this covariance term simply by taking the observed covariance between the SNPs.

The next element of the implied covariance matrix is

$$\begin{aligned} \varSigma _{xg}&= {\mathbf {E}}\left[ {\mathbf {x}}{\mathbf {g}}^{\mathrm{T}}\right] \\&= {\mathbf {E}}\left[ {\mathbf {B}}{\mathbf {A}}{\mathbf {g}}{\mathbf {g}}^{\mathrm{T}}+\varvec{\varepsilon }{\mathbf {g}}^{\mathrm{T}}+\varvec{\zeta }{\mathbf {g}}^{\mathrm{T}}\right] , \end{aligned}$$

and similarly

$$\begin{aligned} \varSigma _{gx}&= {\mathbf {E}}\left[ ({\mathbf {x}}{\mathbf {g}}^{\mathrm{T}})^{\mathrm{T}}\right] \\&= {\mathbf {E}}\left[ {\mathbf {g}}{\mathbf {g}}^{\mathrm{T}}{\mathbf {A}}^{\mathrm{T}}{\mathbf {B}}^{\mathrm{T}}+{\mathbf {g}}\varvec{\varepsilon }^{\mathrm{T}}+{\mathbf {g}}\varvec{\zeta }^{\mathrm{T}}\right] . \end{aligned}$$

The final element of the implied covariance matrix is the model implied covariance among the brain regions. This is given by

$$\begin{aligned} \varSigma _{xx}&= {\mathbf {E}}\left[ {\mathbf {x}}{\mathbf {x}}^{\mathrm{T}}\right] \\&= {\mathbf {E}}\left[ {\mathbf {B}}{\mathbf {A}}{\mathbf {g}}{\mathbf {g}}^{\mathrm{T}}{\mathbf {A}}^{\mathrm{T}}{\mathbf {B}}^{\mathrm{T}} +{\mathbf {B}}{\mathbf {A}}{\mathbf {g}}\varvec{\varepsilon }^{\mathrm{T}}{\mathbf {B}}^{\mathrm{T}} +{\mathbf {B}}\mathbf {A}{\mathbf {g}}\varvec{\zeta }^{\mathrm{T}} \right. \\&\left. \quad +\,{\mathbf {B}}\varvec{\varepsilon }{\mathbf {g}}^{\mathrm{T}}{\mathbf {A}}^{\mathrm{T}}{\mathbf {B}}^{\mathrm{T}} +{\mathbf {B}}\varvec{\varepsilon }\varvec{\varepsilon }^{\mathrm{T}}{\mathbf {B}}^{\mathrm{T}} +{\mathbf {B}}\varvec{\varepsilon }\varvec{\zeta }^{\mathrm{T}}\right. \\&\left. \quad +\,\varvec{\zeta }{\mathbf {g}}^{\mathrm{T}}{\mathbf {A}}^{\mathrm{T}}{\mathbf {B}}^{\mathrm{T}} +\varvec{\zeta }\varvec{\varepsilon }^{\mathrm{T}}{\mathbf {B}}^{\mathrm{T}} +\varvec{\zeta }\varvec{\zeta }^{\mathrm{T}}\right] . \end{aligned}$$

2.4 Model assumptions and estimation

Some elements of the implied covariance are often assumed to be zero. These assumptions lead to a strong simplification of the implied covariance. It is common in a regression setting to pose that the predictor variables and error variables are independent. In our case, the error independence assumption leads to \({\mathbf {g}}\varvec{\varepsilon }^{\mathrm{T}} = \varvec{\varepsilon }{\mathbf {g}}^{\mathrm{T}} = 0\). In addition, we assume that the errors in the brain region predictions (Eq. 1) are independent of the errors in the latent variable predictions (Eq. 2). This means that \(\varvec{\zeta }\varvec{\varepsilon }^{\mathrm{T}} = \varvec{\varepsilon }\varvec{\zeta }^{\mathrm{T}} = 0\). Finally, we assume that the errors in brain region prediction are independent of the SNPs, so \({\mathbf {g}}\varvec{\zeta }^{\mathrm{T}} = \varvec{\zeta }{\mathbf {g}}^{\mathrm{T}} = 0\).

As a result of these assumptions, the full implied covariance matrix of the model reduces to

$$\begin{aligned}&\left[\begin{array}{cc} \varSigma _{gg}&\quad \varSigma _{gx} \\ \varSigma _{xg}&\quad \varSigma _{xx} \end{array}\right]\nonumber \\&\quad = {\mathbf {E}} \left[\begin{array}{cc} {\mathbf {g}}{\mathbf {g}}^{\mathrm{T}}&\quad {\mathbf {g}}{\mathbf {g}}^{\mathrm{T}}{\mathbf {A}}^{\mathrm{T}}{\mathbf {B}}^{\mathrm{T}} \\ {\mathbf {B}}{\mathbf {A}}{\mathbf {g}}{\mathbf {g}}^{\mathrm{T}}&\quad {\mathbf {B}}{\mathbf {A}}{\mathbf {g}}{\mathbf {g}}^{\mathrm{T}}{\mathbf {A}}^{\mathrm{T}}{\mathbf {B}}^{\mathrm{T}}+{\mathbf {B}}\varvec{\varepsilon }\varvec{\varepsilon }^{\mathrm{T}}{\mathbf {B}}^{\mathrm{T}} +\varvec{\zeta }\varvec{\zeta }^{\mathrm{T}} \end{array}\right]. \end{aligned}$$
(3)

For normally distributed data, the maximum likelihood estimate of the covariance matrix is

$$\begin{aligned} \max _\varSigma \left( -\log (|\varSigma |)-{\text {tr}}\left( {\mathbf {S}}\varSigma ^{-1}\right) \right) , \end{aligned}$$
(4)

where \({\mathbf {S}}\) is the observed covariance matrix. The SNP data we use are discrete and can therefore not be considered normally distributed. To compensate for this, we will estimate robust standard errors. In Eq. (4), the covariance \(\varSigma\) is parameterized according to Eq. (3), so we can perform the optimization over the parameter values.

Model fitting is performed in the lavaan package in R [23]. For identifiability, we fix the loading of the first brain region measurement per region group (latent variable) to 1. This does not only fix the scales of the latent variables, but it also has the advantage that the resulting latent variables will have the same direction of effect as the first brain region measurement. For example, a reduction in volume of the first brain region will result in a reduction in the corresponding latent variable. All the error variances on the brain region measurements (variance of \(\varvec{\zeta }\)) are assumed to be equal within each region group, which is the same as in principal component analysis.

The model fit in lavaan yields estimates for \({\mathbf {B}}\), \(\mathbf {A}\), and the covariance matrices of the error variables \(\varvec{\varepsilon }\) and \(\varvec{\zeta }\). Each of these parameter estimates is provided with robust p values (for the hypothesis of being equal to zero), when using the MLM estimation procedure [23]. Using the estimated model parameters, one can then calculate unbiased Bartlett scores for the latent variables [24].

2.5 Data

Simulated data The model is evaluated on both simulated and real data. In the simulation, we first generated SNP values (\({\mathbf {g}}_i\)) in accordance with Hardy–Weinberg equilibrium. The minor allele frequencies were independently drawn from a beta distribution with shape parameters \(\alpha =1\) and \(\beta =2\). Then we simulated latent variables (\({\mathbf {z}}_i\)) as a linear combination of the SNP values, with Gaussian noise (\({\hbox {sd}} = 2\)). Each of these latent variables determined the region measurements (\({\mathbf {x}}_i\)) of a set of regions (a region group), with added Gaussian noise (\({\hbox {sd}} = 2\)). This part of the simulation is in line with Eqs. (1) and (2) and Fig. 1. Finally, we used a logistic model in which a linear combination of some of the latent variables determined the probability of observing a phenotype. These binary phenotypes (disease versus healthy) were then drawn from a Bernoulli distribution.

We simulated 100 independent data sets for 500 individuals. Each time, we set the number of SNPs to 20 and the number of latent variables (and therefore region groups) to 5. We randomly selected 10 SNP-to-latent weights (\({\mathbf {A}}\)) to be either 1 or \(-\,1\). The 5 region groups contain 20, 10, 10, 5, and 5 regions, respectively, for a total of 50 brain region measurements. Each latent variable has latent-to-brain-region weights (in \({\mathbf {B}}\)) which were uniformly sampled between 0.5 and 1.5. All other elements of \({\mathbf {B}}\) were set to zero, which effectively restricts each latent variable to affect only its own region group. Finally, two out of the five latent variables were randomly selected to affect the disease probability, with weights of either 10 or \(-10\). All other latent-to-phenotype weights were set to zero.

To test the robustness of our method, we also simulated data for a range of alternative parameter settings. We varied the amount of noise in the latent variables (\({\mathbf {z}}_i\)) and the region measurements (\({\mathbf {x}}_i\)) between 1 and 5. The number of nonzero SNP-to-latent weights (in \({\mathbf {A}}\)) was varied from 2 to 20. Finally, we constructed data sets with misspecified latent-to-brain-region weights (in \({\mathbf {B}}\)). To this end, we swapped links between latent variables and regions. In each swap, a region was disconnected from its original latent variable and instead connected to another latent variable. To retain the sizes of the region groups, another region of that second latent variable was then connected to the first latent variable. Each swap therefore resulted in two misspecified links. We made sure not to swap regions back to their original latent variables.

ADNI data and preprocessing The real data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu) [9]. The ADNI was launched in 2003 as a public–private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD). For up-to-date information, see www.adni-info.org.

The ADNI database contains measurements on a large number of cognitive normal (CN) controls, individuals with late mild cognitive impairment (LMCI), and individuals with Alzheimer’s disease (AD). The measurements in the database include patient demographics, raw and processed MRI data, biomarker data, and SNP data. For the brain volumes we made use of the UCSF cross-sectional FreeSurfer (version 4.3) cortical thickness and white matter parcellation measurements. For the SNPs we made use of the ADNI 1 Illumina Human 610-Quad BeadChip data, with imputation as previously described [17]. In the end, we selected volumes, SNPs, and diagnoses for 746 individuals. These data were split into two equal parts of 373 individuals, one as a training set and one as a validation set, to prevent over-fitting in the modelling process.

Our methodology is not suited to genome-wide analysis. Instead, it tries to find the effects of specific SNPs on a set of latent variables. As candidate SNPs we selected a set of 35 polymorphisms associated with Alzheimer’s disease according to the International Genomics of Alzheimer’s Project (IGAP) study results [25]. IGAP is a two-stage GWAS on individuals of European ancestry for Alzheimer’s disease. In stage 1, IGAP used genotyped and imputed data on 7,055,881 SNPs of 17,008 Alzheimer’s disease cases and 37,154 controls. In stage 2, 11,632 SNPs were genotyped and tested for association in an independent set of 8,572 Alzheimer’s disease cases and 11,312 controls. Finally, a meta-analysis was performed combining results from stages 1 and 2. We selected the known SNPs, stage 1 discoveries, and stage 1 and stage 2 discoveries from table 2, and the suggestive SNPs from supplemental table 4 of [25].

The volume data were present for 112 regions. We corrected it for individual age, gender, and whole brain volume (using linear regression), with the goal of maintaining all meaningful variation in brain region volumes, possibly related to the disease phenotype. For our latent variable model, the brain region volumes were linked to region groups. We defined these region groups based on the transcriptional profiles in the healthy adult human brain, as provided by the Allen Atlas [12]. This gene expression resource contains anatomically labelled measurements taken from six human brains. Regions with measurements in each of the six brains were selected, and the expression values were averaged to obtain a single value for each of the 19,992 genes in each of the 105 regions of the Allen Atlas [21]. We then performed a t-distributed neighbourhood embedding (t-SNE) analysis to obtain a two-dimensional map of the brain regions. Brain regions are placed nearby in this map if they have a similar expression profile across all genes. This map was then used to manually define nine groups of brain regions, as is shown in figure 2 of [21]. The regions of the ADNI data were manually linked to the nine region groups, as shown in Additional file 1: Table S1. The anatomical atlas used for the Allen Atlas is hierarchical: it has a tree-like structure with large regions containing smaller regions. Table 1 shows a higher-level description of the regions in the nine region groups. In most cases the Allen Atlas regions were more general (larger) than the FreeSurfer regions of the ADNI data. Out of the 112 regions, 105 regions were linked to a region group, while the other 7 regions did not have corresponding samples in the Allen Atlas data and were therefore left out.

Table 1 The nine region groups (corresponding to the latent variables), with the brain regions they contain

3 Results

3.1 Simulation

To evaluate the performance of our model, the SEM was fitted to each of the simulated data sets. We considered two measures for model comparison. First, we set out to assess the prediction of phenotypes from the latent variables, with a logistic regression. In each of the 100 simulated data sets, we estimated the latent variable scores and used only those to predict the phenotype. For each of these 100 models, we obtained an Akaike information criterion (AIC) value. We compared our model to several logistic regression models that use only the simulated data, instead of the SEM estimated latent scores. The first alternative model uses only all the brain region measurements, the second only all the SNP measurements, and the third a combination of all regions and SNP measurements. As a fourth alternative model, we performed a PCA on the volume measurements and extracted the first five principal components. Figure 2 shows that, on average, our latent variables obtain a lower AIC than models using either all brain region data, all SNP data, or both. The model using the first five principal components of the brain region data is most similar to our model, and it only has a slightly higher AIC on average than our model.

Fig. 2
figure 2

Simulation AIC model fit. The logistic regression models use either the SEM estimated latent variable scores (fitted latents), the first five principal components of the brain region data (5 PCs regions), all brain region data, all SNP and brain region data, or all SNP data

The second measure for model comparison is the ability to retrieve the correct SNPs. In each of our simulation data sets, two of the five latent variables have an effect on the phenotype (disease status). All SNPs that affect either of these two latent variables effectively impact the phenotype. We consider those SNPs to be the SNPs with a true effect. We now consider how these SNPs are ranked for importance in our SEM analysis, and two alternative approaches. From our SEM fit, we extracted the robust SNP p values for predicting the latent variables (so the p values for the estimates in \({\mathbf {A}}\)). These give an impression of the importance of a SNP in predicting the latent variables. In addition, we used the latents’ logistic regression p values for the phenotype. These show the importance of a latent variable in predicting the phenotype. As a result, the path from a SNP to the phenotype contains two p values per latent variable: one for the latent variable prediction and one for the phenotype prediction.

We considered combining these p values in two ways: (1) for each SNP we took the maximum p value of the two per latent variable and then the minimum p value over the five latent variables; or (2) for each SNP we used Fisher’s method [26] to combine the two p values per latent variable (\(-2 \sum \log (p_i)\)) and then took the minimum p value over the five latent variables. Note that Fisher’s method is meant for p values testing the same null hypothesis, which is not the case here. Both methods yield a score (p value) for SNP importance. We varied a threshold for this score from 0 to 1 and compared the set of SNPs with values below this threshold to the set of SNPs with a known true effect. In this way, we constructed a receiver operating characteristic curve for SNP retrieval and calculated the corresponding area under the curve (AUC).

We compared the performance of our methodology to a straightforward modelling approach: a logistic regression to predict the disease status phenotype from the SNPs. This was performed both in a univariate way (as in a GWAS) and in a multivariate way. Figure 3 shows the performance of our SEM-based methods, using the maximum p value per SNP–latent combination (SEM max) or using Fisher’s method (SEM Fisher), and of the GWAS-like approaches. The SEM max method has the highest average AUC, indicating that it is best able to rank the SNPs on their importance for the phenotype. Note that the SEM Fisher method has the disadvantage that either a strong SNP-to-latent or a strong latent-to-phenotype effect can lead to a low combined p value, regardless of the other value. The observed difference between the univariate and multivariate approach is very small, which is to be expected since the simulated SNP values are independent.

Fig. 3
figure 3

Simulation AUC for SNP selection. Shown are the results for two methods of p value integration for our model (SEM max and SEM Fisher), for multivariate logistic regression and univariate logistic regression. A high AUC means that the method correctly ranks the importance of the SNPs for the phenotype (disease state)

To test the robustness of our model, we also compared the models for a range of alternative simulation settings. Additional file 2: Fig. S1 shows the results of these simulations. The amount of noise on the latent variables has a similar impact on all compared methods. With a large amount of noise on the brain region measurements, the prediction of phenotypes remains best with our model, but the identification of SNPs is better with methods that do not make use of these region volume data. The number of SNPs with a nonzero effect on the latent variable has little impact on the simulation results. Misspecification of the region groups, on the other hand, has a negative impact specifically on the performance of our method. This shows that our approach is somewhat sensitive to the specification of brain region groups.

3.2 ADNI application

We apply our methodology to the Alzheimer’s Disease Neuroimaging Initiative (ADNI) data [9]. We selected 35 SNPs and 105 brain region volumes for 746 individuals. The brain regions were divided into nine region groups based on the gene expression patterns of matching brain areas in the healthy human brain [12, 21]. Each of the nine brain region groups has one corresponding latent variable, and each latent variable has a unique set of brain region measurements attached to it. Additional file 3: Fig. S2 shows the volume loadings for each of the latent variables. Since the first loading for each latent variable is set to 1, the latent variables will have the same direction of effect as this variable. All but two of the region volumes have a positive loading. Two regions in the subcortical group 2 (SubCort2) are negatively correlated to the latent variable scores, reflecting a more heterogeneous signal in this group.

Figure 4 shows the association between the nine latent variables and the selected SNPs. Only those SNPs are shown that have a nominally significant (\(p<0.05\)) association with at least one of the latent variables. After correction for multiple testing, the only significant effect is that for rs429358, located in APOE, on the hippocampal region group (Bonferroni-corrected \(p = 2.28 \times 10^{-4}\)). In the validation set, here used as a replicate, this effect was again significant (Bonferroni-corrected \(p=8.66 \times 10^{-3}\)). None of the other associations are significant after multiple testing correction. This APOE allele is known to be associated with a decrease in the hippocampal volume, both in individuals with mild cognitive impairment [27] and in Alzheimer’s disease [28].

Fig. 4
figure 4

Association between SNPs and latent variable scores, as found by the robust maximum likelihood fit of the SEM. All nominally significant associations (\(p<0.05\)) are coloured by their robust z-statistic values [23]. The linked genes [25] are shown in brackets. a The results for the training set. After Bonferroni correction for the 315 tests, only the effect of rs429358 (APOE) on the hippocampus region group remains significant. b The validation results confirm the significant effect of rs429358 (APOE) on the hippocampus region group

The latent variables reflect differences in brain region volumes across the ADNI data set. To test whether these differences in brain region volumes are related to the disease phenotype, we compared the latent variable scores between the CN, LMCI, and AD individuals. Figure 5 shows the distribution of latent variable scores for the validation set. To calculate these, we used the fitted SEM of the training data and used its parameter estimates to calculate latent variable scores for the validation data. For three region groups the latent variable scores were significantly lower in LMCI than in controls, and even lower in AD. These regions are the cerebral cortex, the hippocampal formation, and the amygdala. This reflects significant shrinkage in these areas during Alzheimer’s disease progression. The region group of sulci and spaces (SulcSpac) has a latent variable that significantly increases in LMCI and AD. The significant association between the SNP rs429358 and the latent variable scores for hippocampus reflects the importance of APOE for Alzheimer’s disease.

Fig. 5
figure 5

Association between the validation latent variable scores and diagnosis. Diagnosis is cognitive normal (CN), late mild cognitive impairment (LMCI), or Alzheimer’s disease (AD). Nominally significant differences (ANOVA \(p<0.05\)) are indicated with asterisks. The cerebral cortex (CrCortex), hippocampus (Hippocam), and amygdala (Amygdala) latent volume variables are lowered with disease progression, while the latent variable score for sulci and spaces (SulcSpac) is increased

4 Conclusion

We have proposed the use of a maximum likelihood structural equation model for combining SNP data and structural brain area measurements. The model makes use of external gene expression data, to define groups of brain regions that may respond similarly to genetic variation. For each of these region groups, we define a latent variable, which captures the relationship between the regions in a group and genetic variation. We have applied the model to a simulated data set, to show it can capture disease-relevant variation and identify causal SNPs. In addition, we have applied the model to the ADNI data set, containing Alzheimer’s patients, individuals with late mild cognitive impairment, and cognitive healthy controls. One SNP, linked to APOE, shows a reproducible significant relationship to the latent variable that captures hippocampal volume change. This latent variable, and the ones representing the cerebral cortex, amygdala, and sulci and spaces, also significantly associate with the disease diagnosis. This shows that our approach can be used to integrate several data types and yield interpretable results.

The fitting process of the structural equation model has relatively high computational cost. It is truly multivariate, which makes it infeasible at the moment to perform genome-wide analysis. It does have advantages for incorporating a large number of variables, since it allows for straightforward inclusion of constraints on the parameter estimates [23]. With a constraint on the sum of squared weights, one could for instance implement a ridge regression. In addition, the model allows for the inclusion of additional data. This can be done either in the specification of the model structure, as we have done for the region groups, or by adding observed variables to the model. In our model, we chose to group brain regions based on the similarity of their expression profiles in the healthy brain. An interesting extension to the model would be to incorporate a layer of latent variables to reflect a grouping of the SNPs. These groups could also be based on the similarity of the brain-wide expression patterns of the associated genes.

These results show that maximum likelihood SEM is a versatile approach for data integration, which can be used to elucidate the relationships between genetic variation, structural brain phenotypes, and brain disease.

References

  1. Liu J, Calhoun VD (2014) A review of multivariate analyses in imaging genetics. Front Neuroinform 8(March):29. https://doi.org/10.3389/fninf.2014.00029

    Article  Google Scholar 

  2. Hibar DP, Kohannim O, Stein JL, Chiang MC, Thompson PM (2011) Multilocus genetic analysis of brain images. Front Genet 2(October):1–11. https://doi.org/10.3389/fgene.2011.00073

    Article  Google Scholar 

  3. Franke B, Stein JL, Ripke S, Anttila V, Hibar DP, van Hulzen KJE et al (2016) Genetic influences on schizophrenia and subcortical brain volumes: large-scale proof of concept. Nat Neurosci 19(3):420–431. https://doi.org/10.1038/nn.4228

    Article  Google Scholar 

  4. Calhoun VD, Liu J, Adalı T (2009) A review of group ICA for fMRI data and ICA for joint inference of imaging, genetic, and ERP data. NeuroImage 45(1):S163–S172. https://doi.org/10.1016/j.neuroimage.2008.10.057

    Article  Google Scholar 

  5. Vounou M, Janousova E, Wolz R, Stein JL, Thompson PM, Rueckert D et al (2012) Sparse reduced-rank regression detects genetic associations with voxel-wise longitudinal phenotypes in Alzheimer’s disease. NeuroImage 60(1):700–716. https://doi.org/10.1016/j.neuroimage.2011.12.029

    Article  Google Scholar 

  6. Stein JL, Medland SE, Vasquez AA, Derrek P, Senstad RE, Winkler AM et al (2012) Identification of common variants associated with human hippocampal and intracranial volumes. Nat Genet 44(5):552–561. https://doi.org/10.1038/ng.2250.Identification

    Article  Google Scholar 

  7. Batmanghelich NK, Saeedi A, Cho M, Estepar RSJ, Golland P (2015) Generative method to discover genetically driven image biomarkers. In: Colchester ACF, Hawkes DJ (eds) Information processing in medical imaging. Volume 511 of lecture notes in computer science. Springer, Berlin, pp 30–42

    Chapter  Google Scholar 

  8. Medland SE, Jahanshad N, Neale BM, Thompson PM (2014) Whole-genome analyses of whole-brain data: working within an expanded search space. Nat Neurosci 17(6):791–800. https://doi.org/10.1038/nn.3718

    Article  Google Scholar 

  9. Mueller SG, Weiner MW, Thal LJ, Petersen RC, Jack C, Jagust W et al (2005) The Alzheimer’s Disease Neuroimaging Initiative. Neuroimaging Clin N Am 15(4):869–877. https://doi.org/10.1016/j.nic.2005.09.008

    Article  Google Scholar 

  10. Hibar DP, Stein JL, Renteria ME, Arias-Vasquez A, Desrivières S, Jahanshad N et al (2015) Common genetic variants influence human subcortical brain structures. Nature 520(7546):224–229. https://doi.org/10.1038/nature14101

    Article  Google Scholar 

  11. Ko Y, Sa Ament, Ja Eddy, Caballero J, Earls JC, Hood L et al (2013) Cell type-specific genes show striking and distinct patterns of spatial expression in the mouse brain. Proc Natl Acad Sci 110(8):3095–3100. https://doi.org/10.1073/pnas.1222897110

    Article  Google Scholar 

  12. Hawrylycz MJ, Lein ES, Guillozet-Bongaarts AL, Shen EH, Ng L, Ja Miller et al (2012) An anatomically comprehensive atlas of the adult human brain transcriptome. Nature 489(7416):391–399. https://doi.org/10.1038/nature11405

    Article  Google Scholar 

  13. Richiardi J, Altmann A, Milazzo AC, Chang C, Chakravarty MM, Banaschewski T et al (2015) Correlated gene expression supports synchronous activity in brain networks. Science 348(6240):1241–1244. https://doi.org/10.1126/science.1255905

    Article  Google Scholar 

  14. Sunkin SM, Ng L, Lau C, Dolbeare T, Gilbert TL, Thompson CL et al (2013) Allen Brain Atlas: an integrated spatio-temporal portal for exploring the central nervous system. Nucleic Acids Res 41(Database issue):D996–D1008. https://doi.org/10.1093/nar/gks1042

    Article  Google Scholar 

  15. Du L, Huang H, Yan J, Kim S, Risacher SL, Inlow M et al (2016) Structured sparse canonical correlation analysis for brain imaging genetics: an improved GraphNet method. Bioinformatics 32(10):1544–1551. https://doi.org/10.1093/bioinformatics/btw033

    Article  Google Scholar 

  16. Stingo FC, Guindani M, Vannucci M, Calhoun VD (2013) An integrative Bayesian modeling approach to imaging genetics. J Am Stat Assoc 108(503):37–41. https://doi.org/10.1080/01621459.2013.804409

    Article  MathSciNet  MATH  Google Scholar 

  17. Batmanghelich NK, Dalca AV, Quon G, Sabuncu MR, Golland P (2016) Probabilistic modeling of imaging, genetics and diagnosis. IEEE Trans Med Imaging 0062(c):1. https://doi.org/10.1109/TMI.2016.2527784

    Article  Google Scholar 

  18. Bollen KA (1989) Structural equations with latent variables. Wiley, New York

    Book  Google Scholar 

  19. Maurano MT, Humbert R, Rynes E, Thurman RE, Haugen E, Wang H et al (2012) Systematic localization of common disease-associated variation in regulatory DNA. Science 337(6099):1190–1195. https://doi.org/10.1126/science.1222794

    Article  Google Scholar 

  20. Ardlie KG, Deluca DS, Segre AV, Sullivan TJ, Young TR, Gelfand ET et al (2015) The genotype-tissue expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 348(6235):648–660. https://doi.org/10.1126/science.1262110

    Article  Google Scholar 

  21. Huisman SMH, van Lew B, Mahfouz A, Pezzotti N, Höllt T, Michielsen L et al (2017) BrainScope: interactive visual exploration of the spatial and temporal human brain transcriptome. Nucleic Acids Res 45(10):e83. https://doi.org/10.1093/nar/gkx046

    Article  Google Scholar 

  22. Bettens K, Sleegers K, Van Broeckhoven C (2013) Genetic insights in Alzheimer’s disease. Lancet Neurol 12(1):92–104. https://doi.org/10.1016/S1474-4422(12)70259-4

    Article  Google Scholar 

  23. Rosseel Y (2012) lavaan: an R package for structural equation modeling. J Stat Softw 48(2):1–20

    Article  Google Scholar 

  24. Distefano C, Zhu M, Mîndrilă D (2009) Understanding and using factor scores: considerations for the applied researcher. Pract Assess Res Eval 14(20):1–11

    Google Scholar 

  25. Lambert JC, Ibrahim-Verbaas CA, Harold D, Naj AC, Sims R, Bellenguez C et al (2013) Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer’s disease. Nat Genet 45(12):1452–1458. https://doi.org/10.1038/ng.2802

    Article  Google Scholar 

  26. Fisher R (1950) Statistical methods for research workers. Biological monographs and manuals. No. V., 11th edn. Oliver and Boyd, Edinburgh

    Google Scholar 

  27. Farlow MR, He Y, Tekin S, Xu J, Lane R, Charles HC (2004) Impact of APOE in mild cognitive impairment. Neurology 63(10):1898–1901. https://doi.org/10.1212/01.WNL.0000144279.21502.B7

    Article  Google Scholar 

  28. Schuff N, Woerner N, Boreta L, Kornfield T, Shaw LM, Trojanowski JQ et al (2009) MRI of hippocampal volume loss in early Alzheimer’s disease in relation to ApoE genotype and biomarkers. Brain 132(4):1067–1077. https://doi.org/10.1093/brain/awp007

    Article  Google Scholar 

Download references

Authors’ contributions

SMHH, AM, NKB, BPFL and MJTR were involved in initiating the study and developing the model. SMHH drafted the initial manuscript and performed the analyses. AM, NKB, BPFL and MJTR edited the manuscript. All authors read and approved the final manuscript.

Authors’ Information

Sjoerd Huisman is a PhD student in the Delft Bioinformatics Lab at Delft University of Technology. Ahmed Mahfouz is an assistant professor at the Leiden Computational Biology Center at the Leiden University Medical Center. Kayhan Batmanghelich is an assistant professor in the Department of Biomedical Informatics and the Intelligent Systems Program with secondary appointments in the Computer Science and Electrical Engineering Departments at the University of Pittsburgh. Boudewijn P.F. Lelieveldt is professor of Biomedical Imaging at the Department of Radiology of Leiden University Medical Center, Leiden, the Netherlands, where he is heading the Division of Image Processing (LKEB). Marcel Reinders is a professor in Bioinformatics within the Faculty of Electrical Engineering, Mathematics and Computer Science at the Delft University of Technology in which he heads the Pattern Recognition and Bioinformatics section.

Acknowledgements

Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense Award Number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data, but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.

Competing interests

The authors declare that they have no competing interests.

Funding

Funding was provided by the Dutch Technology Foundation STW, as part of the STW Project 12721: Genes in Space under the ImaGene STW Perspective Program, and by the European Union Seventh Framework Programme (FP7/2007-2013) under Grant Agreement 604102 (Human Brain Project). This work is partially supported by NIH Award Number 1R01HL141813-01. We thank the Competitive Medical Research Fund (CMRF) grant for their funding.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Authors and Affiliations

Authors

Consortia

Corresponding author

Correspondence to Marcel J. T. Reinders.

Additional files

Additional file 1: Table S1.

All brain regions used in the ADNI section, with their region group code, manually annotated ABA region, ADNI code, and ADNI description. Region group codes: CrCortex = cerebral cortex; Hippocam = hippocampal formation; Amygdala = amygdala; Striatum = striatum; DorsThal = dorsal thalamus; SubCort1 = sub-cortical regions 1; SubCort2 = sub-cortical regions 2; ClCortex = cerebellar cortex; SulcSpac = sulci and spaces.

Additional file 2: Fig. S1.

Simulations to test model robustness. The plots show a comparison of our model (red) to alternative approaches with varying simulation parameters. We simulated a range of noise on latent variables, noise on volumes, number of SNPs, and number of misspecified latent to volume links.

Additional file 3: Fig. S2.

Weights (loadings) from the latent variables to the region measurements. The rows correspond to latent variables (region groups), and the columns to brain regions. Each first loading per region group was set to 1, for model identifiability. The loadings show the strength of the relationship between each latent variable and the thickness/volume of its corresponding brain regions. See Additional file 1: Table S1 for the meaning of the region codes.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Huisman, S.M.H., Mahfouz, A., Batmanghelich, N.K. et al. A structural equation model for imaging genetics using spatial transcriptomics. Brain Inf. 5, 13 (2018). https://doi.org/10.1186/s40708-018-0091-0

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s40708-018-0091-0

Keywords