D: 6 Dec 2024 Main functions (--make-grm-bin...) Quick index search |
GWAS 2: Post-HocSignificanceThe goal of GWAS is to run large genotype-phenotype analyses with the intent of discovering predictive or causal genetic variants using a somewhat hypothesis free approach. GWAS often employs marginal regression (MR) (as with Plink 2's GLM) to estimate effect sizes and their P-values. MR breaks up multivariate regressions into many univariate (single) regressions. This brings up two important issues: 1) adjusting the univariate P-value for multiple hypotheses being tested (i.e., the multiple comparisons problem, MCP), and 2) accounting for correlations among genetic variants. A third factor that we also explore is unrelated to MR; accounting for (residual) population structure or relationships (aka genomic inflation or genomic control).ObjectiveHere we continue with the exercises from GWAS 1. You will need those results to run the exercises here. You can also skip that tutorial and generate the files via Tutorial Shortcuts - GWAS 1. We will explore a few post-hoc methods offered by Plink 2 to assess and mitigate spurious effects, namely: adjust and clump.source("../R/loadplink2HW.R")
source("../R/gwas1.R")
Part 1: Adjusting P-valuesAdjust allows you to adjust the GWAS statistics based on multiple comparison problem (MCP) considerations and (residual) population/kin structure aka genomic inflation (GI). There are two main ways to run this, post-hoc (--adjust-file) or during the GLM (--adjust). Let's try the former --adjust-file using results from a previous tutorial. Plink 2:time plink2 \
--adjust-file ./results/gwas/\all_hg38_qcd_LE1_simgwas_quant1b.simgwas_quant1.glm.linear 'test=ADD' \
--out ./results/gwas/all_hg38_qcd_LE1_simgwas_quant1b.simgwas_quant1_glm_adjust
The command for calculating the adjusted P-values. Required arguments are the GLM file name and "test" (see GLM file). For more details see adjust. When you run the command, you may notice in the terminal output and log file that the genomic inflation (GI) lambda is reported (66.5 in this case). This is an inflation factor used to adjust the GLM statistic (e.g., t-score). The t-score is shrunk by this factor and, thus, results in an larger (less significant) adjusted P-value. This is reported in the adjust file. Open the *.adjusted file in R. R: a<-loadplink2Rmat("./results/gwas/all_hg38_qcd_LE1_simgwas_quant1b.simgwas_quant1_glm_adjust.adjusted")
View(a)
time plink2 \
--pfile ./data/processed/all_hg38_qcd_LE1 \
--glm hide-covar --covar ./results/qc/qc3/all_hg38_qcd_LE1_pca.eigenvec \
--pheno ./data/raw/simgwas_quant1.pheno \
--pfilter 1e-1 \
--out ./results/gwas/all_hg38_qcd_LE1_simgwas_quant1c
Plink 2:
time plink2 \
--adjust-file ./results/gwas/all_hg38_qcd_LE1_simgwas_quant1c.simgwas_quant1.glm.linear 'test=ADD' \
--out ./results/gwas/all_hg38_qcd_LE1_simgwas_quant1c.simgwas_quant1_glm_adjust
a<-loadplink2Rmat("./results/gwas/all_hg38_qcd_LE1_simgwas_quant1c.simgwas_quant1_glm_adjust.adjusted")
View(a)
time plink2 \
--pfile ./data/processed/all_hg38_qcd_LE1 \
--glm hide-covar --covar ./results/qc/qc3/all_hg38_qcd_LE1_pca.eigenvec \
--pheno ./data/raw/simgwas_quant1.pheno \
--pfilter 1e-6 \
--adjust \
--out ./results/gwas/all_hg38_qcd_LE1_simgwas_quant1d
--adjust This runs the adjust command with the GLM instance, instead of running --adjust-file as a separate instance. Way better. GI lambda went from 66.5 to 1. There is no spurious population effect as expected since we included population covariates in the regression. The high lambda was an artifact of the post-hoc adjustment. On your own, you can run the above without the population PCs and see if GI (GC) picks up the population structure. Replace --glm hide-covar --covar ./results/qc/qc3/all_hg38_qcd_LE1_pca.eigenvec with --glm allow-no-covars. Plink 2 suggests using LD-score regression as an alternative to evaluate GI. (See LD-score and Lee et al. (2018).) However, note that LD-score regression will also be impacted by what variants are sampled. To estimate genomic inflation using GLM or LD-score regression, you need to randomly sample the genome. Otherwise you risk overly adjusting the P-values and losing true positives. Further note that any kind of variant filtering like pfilter impacts other adjusted P-values. Plink 2 can only adjust for the variants it knows about. You can also manually set --lambda Part 2: ClumpIn GWAS 1, we were left with a number of false positives (FPs). This was even after controlling for population structure. One hypothesis, is that the FPs are variants correlated to our known true positives (TPs). Let's use Plink 2's post-hoc clump methods to explore this. Plink 2time plink2 \
--pfile ./data/processed/all_hg38_qcd_LE1 \
--clump ./results/gwas/all_hg38_qcd_LE1_simgwas_quant1b.simgwas_quant1.glm.linear \
--clump-unphased \
--clump-r2 0.2 \
--clump-allow-overlap \
--out ./results/gwas/all_hg38_qcd_LE1_simgwas_quant1b_simgwas_quant1_clump
This a reference genotype set for estimating the correlations. It need not be the same set as used for the association analysis but, of course, it should contain the SNPs of interest and have similar structure. --clump Loads the named PLINK-format association report(s) (text files with a header line, a column containing variant IDs, and another column containing P-values) and groups results into LD-based clumps, writing a new report to plink2.clumps[.zst]. --clump-unphased Run --clump with -unphased genotypes as this is more consistent with the glm model used for the above association analysis. Recall from Linkage that there are multiple methods for measuring co-incidence of genetic variants. By default, --clump uses phasing. --clump-r2 One of two main parameters for clumping. Defines the r2 threshold for including a genetic variant in a parent clump. Please see clump reference for more options. R: clump_path = "./results/gwas/all_hg38_qcd_LE1_simgwas_quant1b_simgwas_quant1_clump.clumps"
effect_path = "./data/raw/simgwas_quant1.effects"
TPclumps(clump_path,effect_path)
|