Chapter 1 EigenGWAS basis

This project is dedicated to EigenGWAS, a linear model analysis approach for genomic data.

EigenGWAS can be seamlessly integrated to the steps population genetic routines such as STRUCTRURE and principal component analysis. In a nutshell, it regresses the eigenvector against each marker, here often a single nucleotide polymorphism (SNP) marker, an analysis that detects the loci under selection. Its analysis pipeline may be found at RPub.

1.1 Genetic relatedness matrix \(\mathbf{G}\)

The genomic data can be represented as \(\mathbf{X}\), an \(n \times m\) matrix representing \(n\) individuals and \(m\) markers. Without loss of generality, \(x_{il}\) is the genotype code for the \(i^{th}\) individual at the \(l^{th}\) biallic locus. The data matrix \(\mathbf{X}\) can be generated from chips, NGS, or GBS.

In general, given \(\mathbf{X}\) we can construct the \(n\times n\) genetic relatedness matrix as

\[\begin{equation} \mathbf{G}=\frac{1}{m}\tilde{\mathbf{X}}\tilde{\mathbf{X}}^T \tag{1.1} \end{equation}\]

in which \(\tilde{\mathbf{X}}\) is the scaled form of \(\mathbf{X}\) (slighlty different from standardization, which will be \(\frac{x_l-2p_l}{\sqrt{2(1+F_l)p_lq_l}}\)). However, upon the mating type of the species, \(\mathbf{G}\) should be constructed accordingly. For a random mating population, \(x_l\) is scaled as \(\tilde{x}_l=\frac{x_l-2p_l}{\sqrt{2p_lq_l}}\), assuming the current population as the base population, whereas for an inbred population, \(\tilde{x}_l=\frac{x_l-2p_l}{\sqrt{4p_lq_l}}\), and \(q_l=1-p_l\) the frequency for the alternative allele.

Due to missing data, in practice for a pair of individuals \(i\) and \(j\), their pairwise relatedness is

\[\begin{equation} G_{ij}=\frac{1}{\tilde{m}}\sum_l^{\tilde{m}}\frac{(x_{il}-2p_l)(x_{jl}-2p_l)}{2(1+F)p_lq_l} \tag{1.2} \end{equation}\]

in which \(\tilde{m}\) is the number of loci genotyped at both individal \(i\) and \(j\), and \(F\) the inbreeding coefficient taking the value of 0 for random mating population and 1 for inbred population. When \(F\) is set as zero, it is assumed that the current population is the base population.

1.1.1 Statistical properties of \(\mathbf{G}\)

Let \(\mathbf{G}_o\) denote the off-diagonal elements of \(\mathbf{G}\), then we can define two population parameters, \(n_e\), the effective population size, and \(m_e\), the effective number of markers.

\[\begin{equation} n_e=\frac{-1}{mean(\mathbf{G}_o)} \tag{1.3} \end{equation}\] \(n_e\) reflects true relatedness between any pair of samples;

\[\begin{equation} m_e=\frac{1}{Var(\mathbf{G}_o)} \tag{1.4} \end{equation}\] The ratio between \(\frac{m_e}{m}\) reflects the average linkage disequilibrium between the any pair of markers, and alternatively \(m_e\) can be expressed as

\[\begin{equation} m_e=\frac{m^2}{\sum_{l_1=1}^m\sum_{l_2=1}^m\rho_{l_1l_2}^2}=\frac{1}{\bar{\rho}^2} \tag{1.5} \end{equation}\]

in which \(\rho_{l_1l_2}\) is Pearson’s correlation between a pair of SNPs, see Appendix 7.1. It is an important parameter to characterize the evolutionary process of a population.

1.2 EigenGWAS linear model

Given eigenanalysis (see wikipedia for its more details) of \(\mathbf{X}\), we have \(\mathbf{E}\) and \(\mathbf{\Lambda}\), in which \(\mathbf{\Lambda}\) is an \(n \times n\) diagonal matrix for eigenvalues and \(\mathbf{E}\) is an \(n \times n\) matrix for the eigenvectors. \(\mathbf{E}_k\) is the \(k^{th}\) eigenvector associated with the \(k^{th}\) largest eigenvalue. Regressing \(\mathbf{E}_k\) against the \(l^{th}\) marker, we have the model below

\[\begin{equation} \mathbf{E}_k=a+\beta_l\mathbf{x}_l+e \tag{1.6} \end{equation}\]

It consequently generates \(m\) estimates of \(\hat{\beta}\), \(\hat{\sigma}_{\beta}\), and their corresponding \(p\) values. The \(p\) value can be converted to one-degree-of-freedom \(\chi^2_1\). Under the context of EigenGWAS, the \(\chi^2_1\) can be intepreted as

\[\begin{equation} 4\frac{\color{red}{n_1}\color{blue}{n_2}}{n}\frac{(\color{red}{p_{1,l}}-\color{blue}{p_{2,l}})^2}{2p_lq_l}=4n\omega_1 \omega_2 F_{st}^N=nF_{st}^W \tag{1.7} \end{equation}\]

in which \(\color{red}{n_1}\) and \(\color{blue}{n_2}\) are the numbers of samples at the left and right side of “0” on the eigenvector (see the figure below), and \(\omega_g=\frac{n_g}{n}\) the proportion of a subgroup in the sample (\(g=2\) in EigenGWAS analysis), and \(\color{red}{p_{1,l}}\) and \(\color{blue}{p_{2,l}}\) are the frequencies of the reference allele in two subgroups, respectively. \(F_{st}^N=\frac{(\color{red}{p_{1,l}}-\color{blue}{p_{2,l}})^2}{2p_lq_l}\) and \(F_{st}^W=2\frac{\sum_{g=1}^2\omega_g(p_{g,l}-p_l)^2}{p_lq_l}\). It is obviously that when \(n_1=n_2=\frac{n}{2}\) the test statistic reaches its maximum.

## [1]  100 1000
## [1] "Ne= 100 Me= 1080.91076380488 given N= 100 and M= 1000"

1.2.1 \(\lambda_{GC}\) correction

Given the median, denoted as \(\chi^2_{1,median(p)}\), of the \(m\) \(\chi^2_1\) values observed, we can define \(\lambda_{GC}=\chi^2_{1,median(p)}/\chi^2_{1,0.5}\), in which \(\chi^2_{1,0.5}=0.455\). We further use subscript \(k\) to denote \(\lambda_{GC_k}\) the one that is estimated from the EigenGWAS analysis of \(\mathbf{E}_k\), as shown (1.6).

After technical correction, correspondingly

\[\begin{equation} \tilde\chi^2_1=\chi^2_1/\lambda_{GC_k} \tag{1.8} \end{equation}\]

a correction of the test statistic. Compared with its original form, the correction has several implications

  • Statistically, as (1.6) has its response variable from \(\mathbf{X}\), the correction removes its overfitting.

  • Genetically, it corrects for genetic drift such as soaked in \(\mathbf{E}_1\). Here the quantity of the genetic drift is measured by the median of the \(m\) \(\chi^2_1\) values observed.

  • \(\lambda_{GC}\) parameter has long been proposed in controlling population structure in GWAS, but finds its best fit in EigenGWAS.

In practice, the correction can be further refined by quantile-based \(\lambda_{GC}\) rather than the single median p-value only.

1.2.2 Threshold for EigenGWAS

As shown above, EigenGWAS is a linear model framework nearly identical to the conventional GWAS, and Bonferroni correction, such as \(\alpha/m\), can be used to set the threshold at the significance level \(\alpha\), such as \(\alpha=0.05\). The loci that exceeds the threshold are under selection.

1.3 Connection to singular value decomposition

Singular value decomposition (SVD, see wiki for more details) can decompose the matrix \(\mathbf{X}\) into \[\begin{equation} \mathbf{X}=\mathbf{U\Sigma V} \tag{1.9} \end{equation}\] in which

  • \(\mathbf{U}\) is an \(n\times n\) unitary matrix, corresponding to individual-level loading for each sample. \(\mathbf{U}\) is also called left-singular vectors.

  • \(\mathbf{\Sigma}\) an \(n\times m\) diagonal matrix for singular values which are square roots of the eigenvalue (\(\lambda_k\)) of GRM matrix.

  • \(\mathbf{V}\) an \(m\times m\) unitary matrix, corresponding to SNP-level loading. \(\mathbf{U}\) is also called right-singular vectors.

Due to the transformation between \(\mathbf{U}\) and \(\mathbf{V}\), it has \[\begin{equation} \mathbf{U}=\mathbf{\tilde{X}V\Sigma}^{-1} \tag{1.10} \end{equation}\] The right side of (1.10) can be unfolded as \(\frac{\sqrt{m}}{\Sigma_k}\mathbf{\tilde{x}}_l\mathbf{v}_k\), following \(N(0,1)\), and taking square of it leads to

\[\begin{equation} (\frac{\sqrt{m}}{\Sigma_k}\mathbf{\tilde{x}}_l\mathbf{v}_k)^2 \sim \tilde{\chi}^2_{1,svd} \tag{1.11} \end{equation}\]

It brings out an interesting comparison between (1.8) and (1.10), and exists subtle difference, for instance \(E_1\), in their correction using \(\lambda_{GC}\) and \(\lambda_1\). When the population divergency is due to genetic drift and a small proportion of loci are under selection, \(\lambda_1 \gt \lambda_{GC}\). It is because \(\lambda_1\) is the mean of the \(\chi^2_1\) from (1.7), but \(\lambda_{GC}\) is the median of it. Given a population under selection, such as selection sweep, it does have \(\lambda_1 \gt \lambda_{GC}\), as demonstrated in simulation.

So, using linear model system such as EigenGWAS brings more flexibility, as well as improved statistical power comparing with direct left-to-right singular vector transformation in SVD (1.10).

However, in this transformation, eigenvalue is involved, as would be show. It will reduce the statistical power for EigenGWAS. In theory and practice, with type-I error rate well-controlled, \(\tilde{\chi}^2_1\) (1.8) is greater than \(\tilde{\chi}^2_{1.svd}\) (1.10).

1.4 Intepretation

1.4.1 \(F_{st}\)

EigenGWAS provides an unsupervised realization for \(F_{st}\) the conventional implementation of that assumes the underlying subgroups are known. In practice, however, it is often unclear how the possible subgroups should be defined. In contrast, EigenGWAS implicitly partitions the samples into two subgroups at each eigenvector at zero point on the coordinate. No guarantee that this strategy optimal, it seems a sound attempt when the group information is missing.

1.4.2 Classic mechanic intepretation

The early framework of population genetics indebts a lot for theoretical physics (Ronald Fisher studied astronomy). As a linear model system, EigenGWAS also finds its analogue in Newtown’s first and the second laws for classical mechanics. The first law states

“In an inertial frame of reference, an object either remains at rest or continues to move at a constant velocity, unless acted upon by a force.”

In population genetics, it can be seemed as genetic drift that is constantly driving the a pair of population apart from each other, and its velocity can be quantified by a binomial distribution as \(\frac{pq}{\tilde{n}_e}\).

The second law stats

“In an inertial frame of reference, the vector sum of the forces \(F\) on an object is equal to the mass m of that object multiplied by the acceleration a of the object: \(F = ma\).”

The selection can drive a genomic region run against its reference population at a velocity greater than \(\frac{pq}{\tilde{n}_e}\).

Although, this interpretation works even for domesticated populations, its proportion of the markers divergent by genetic drift and selection would be very different.