D: 29 Jan 2025 Main functions (--make-grm-bin...) Quick index search |
GWAS 1: RegressionsSignificanceThe goal of GWAS is to run large genotype-phenotype analyses with the intent of discovering predictive or causal genetic variants using a somewhat hypothesis free approach. That said, the data we use, how it is processed, and selection thresholds can bias results.
ObjectiveHere we will show the basic operations for running a GWAS analysis for quantitative traits (continuous variables) and qualitative traits (case-control, binary variables).
Since there are no completely public human GWAS datasets, that we know of, we will use a simulated dataset. This was generated by projecting genetic variant effects through the 1000 Genomes Phase 3 (1kGP3) data, thus capturing some real genomic structure across populations. This was generated using 1kGP3 biallelic, autosomal SNPs; LD-pruned (pairphase 100 0.8); MAF>0.05; HWE P-value>0.05; under the filename all_hg38_qcd_LE1. This was generated in the Linkage tutorial. If you do not have these files under ./data/processed/, then go back to this tutorial or Tutorial Shortcuts - Linkage. Phenotypes are generated using an ad hoc additive model with genomic effects and Gaussian noise (unexplained variance). Effects are randomly selected with probability k/p, where k are the number of desired non-zero genetic effects and p the total pool of variants to select from (here we used 10,000 random SNPs from the dataset above). Non-zero effects before normalization are 1.
The model is given by:
where, Qualitative traits are further derived by thresholding the Guassian-like y above per the Normal cumulative distribution function to give a prevalence t. This is consistent with the liability-threshold model in genetics. Cases are coded as 2 and controls as 1, which is recognized by Plink 2 and automatically run as a logistic regression.
In the examples below,
R:
Part 1: Quantitative PhenotypesHere, we'll run some association analyses using a quantitative phenotype. We will use the simulated-gwas simgwas_quant1 in the /GWAS/project1/data/raw/. This is part of the tutorial package.
Plink 2:
--pheno
--pfilter
--out ./results/gwas/all_hg38_qcd_LE1_simgwas_quant1a Let's plot the results. R:
This will produce two plots. For this instance, it will take a few minutes since there are many data points. The first plot is a classic Manhattan plot displaying the -log10(P) values for the SNPs with P-value<1e-6 (the somewhat arbitrary threshold that we set). Following this is a -log10(P) QQ plot, comparing the GLM returned P-value quantiles vs the uniform null.
![]() ![]() Wow, so nearly 250K SNPs in the dataset have very small P-values (strong associations). The QQ plot reflects the strong deviation from the null suggesting that these are true effects. As we know the groundtruth, we know that these are in fact many false positives and the true positive rate is far from 1. However, the classic statistical tests are deceptive, suggesting that we have a large yield of true positives. Let's investigate more. If we think more about the generating model (described above using the 1000 Genomes data), there is a lot of population structure. As you might recall from prior tutorials, part of this structure are varying MAFs across populations even to the point where some minor alleles in one population become major alleles in another. The regression model is identifying not only the trait-causal SNPs but also SNPs that distinguish between populations. Note that we did not explicitly add a population covariate when we simulated the phenotype. Instead, the natural population structure is enough to "distort" the causal signal. This happens because there is a bias in how often true positives occur across different populations. Population identity becomes a significant phenotype signal. We pickup some TPs that are part of that population signal, but also many other SNPs that are only signaling population differences. Let's test the hypothesis that population structure is the cause of many false positives. Plink 2:
--glm hide-covar --covar ./results/qc/qc3/all_hg38_qcd_LE1_pca.eigenvec
R:
![]() ![]() Good. Now there are about 1000-fold less SNPs that pass the 1e-6 cutoff. We even improve our yeild of known TPs, getting all but one. The QQ plot shows substantial deviation from the null, supporting that these are true effects. We will explore the source of the remaining false positives in another tutorial. We close out this tutorial by showing how to run the qualitative-trait GWAS.
Part 2: Qualitative PhenotypeWe have created a GWAS set using the same method above but we thresholded the continuous phenotype to give a case prevalence of 10% (typical case-control are closer to 1% or lower but we have a fairly small sample size, so we are using a higher prevalence). Here we give an example of the commands for running a case-control, logistic model. Plink 2:
--glm hide-covar no-firth
R:
![]() ![]() The yield is not as high as the above quantitative traits; however, we have very low false positives and a sizable return of true positives. As a further exercise, you can try balancing the cases and controls and seeing if that improves discovery. |