Introduction, downloads

D: 6 Dec 2024

Recent version history

What's new?

Coming next

[Jump to search box]

General usage

Getting started

Flag usage summaries

Column set descriptors

Citation instructions

Standard data input

PLINK 1 binary (.bed)

PROVISIONAL_REF?

PLINK 2 binary (.pgen)

Autoconversion behavior

VCF/BCF (.vcf[.gz], .bcf)

Oxford genotype (.bgen)

Oxford haplotype (.haps)

PLINK 1 text (.ped, .tped)

PLINK 1 dosage

Sample ID conversion

Dosage import settings

Generate random

Unusual chromosome IDs

Allele frequencies

Phenotypes

Covariates

'Cluster' import

Reference genome (.fa)

Input filtering

Sample ID file

Variant ID file

Interval-BED file

--extract-col-cond

QUAL, FILTER, INFO

Chromosomes

SNPs only

Simple variant window

Multiple variant ranges

Deduplicate variants

Sample/variant thinning

Pheno./covar. condition

Missingness

Category subset

--keep-col-match

Missing genotypes

Number of distinct alleles

Allele frequencies/counts

Hardy-Weinberg

Imputation quality

Sex

Founder status

Main functions

Data management

--make-[b]pgen/--make-bed

--export

--output-chr

--split-par/--merge-par

--set-all-var-ids

--recover-var-ids

--update-map...

--update-ids...

--ref-allele

--ref-from-fa

--normalize

--indiv-sort

--write-covar

--variance-standardize

--quantile-normalize

--split-cat-pheno

--pheno-svd

--pmerge[-list]

--write-samples

Basic statistics

--freq

--geno-counts

--sample-counts

--missing

--genotyping-rate

--hardy

--het

--check-sex/--impute-sex

--fst

--pgen-info

Pairwise diffs

--pgen-diff

--sample-diff

Linkage disequilibrium

--indep...

--r[2]-[un]phased

--ld

Sample-distance matrices

Relationship/covariance

  (--make-grm-bin...)

--make-king...

--king-cutoff

Population stratification

--pca

PCA projection

Association analysis

--glm

--glm ERRCODE values

--gwas-ssf

--adjust-file

Report postprocessing

--clump

Linear scoring

--score[-list]

--variant-score

Distributed computation

Command-line help

Miscellaneous

Flag/parameter reuse

System resource usage

--loop-cats

.zst decompression

Pseudorandom numbers

Warnings as errors

.pgen validation

Resources

1000 Genomes phase 3

HGDP-CEPH

FASTA files

Errors and warnings

Output file list

Order of operations

Developer information

GitHub root

Python library

R library

Compilation

Adding new functionality

Discussion forums

Credits

File formats

Tutorials

Setup

Rules of Thumb

Data Exploration 1 — HWE, Allele Frequency Spectrum

Data Exploration 2 — Genomic Structure

Linkage

Relationship Matrix

Genome-Wide Assocation Analyses (GWAS)

Regressions

Post-Hoc

Formatting Files

bcftools

Variant IDs

Reference Alleles

Format for R

Shortcuts

Quick index search

GWAS 2: Post-Hoc

Significance

The goal of GWAS is to run large genotype-phenotype analyses with the intent of discovering predictive or causal genetic variants using a somewhat hypothesis free approach. GWAS often employs marginal regression (MR) (as with Plink 2's GLM) to estimate effect sizes and their P-values. MR breaks up multivariate regressions into many univariate (single) regressions. This brings up two important issues: 1) adjusting the univariate P-value for multiple hypotheses being tested (i.e., the multiple comparisons problem, MCP), and 2) accounting for correlations among genetic variants. A third factor that we also explore is unrelated to MR; accounting for (residual) population structure or relationships (aka genomic inflation or genomic control).

Objective

Here we continue with the exercises from GWAS 1. You will need those results to run the exercises here. You can also skip that tutorial and generate the files via Tutorial Shortcuts - GWAS 1. We will explore a few post-hoc methods offered by Plink 2 to assess and mitigate spurious effects, namely: adjust and clump.

Before running any of the exercises below, load this file in R.

R:

source("../R/loadplink2HW.R") source("../R/gwas1.R")

Part 1: Adjusting P-values

Adjust allows you to adjust the GWAS statistics based on multiple comparison problem (MCP) considerations and (residual) population/kin structure aka genomic inflation (GI). There are two main ways to run this, post-hoc (--adjust-file) or during the GLM (--adjust). Let's try the former --adjust-file using results from a previous tutorial.

Plink 2:

time plink2 \ --adjust-file ./results/gwas/\all_hg38_qcd_LE1_simgwas_quant1b.simgwas_quant1.glm.linear 'test=ADD' \ --out ./results/gwas/all_hg38_qcd_LE1_simgwas_quant1b.simgwas_quant1_glm_adjust

--adjust-file
The command for calculating the adjusted P-values. Required arguments are the GLM file name and "test" (see GLM file). For more details see adjust.


When you run the command, you may notice in the terminal output and log file that the genomic inflation (GI) lambda is reported (66.5 in this case). This is an inflation factor used to adjust the GLM statistic (e.g., t-score). The t-score is shrunk by this factor and, thus, results in an larger (less significant) adjusted P-value. This is reported in the adjust file. Open the *.adjusted file in R.

R:

a<-loadplink2Rmat("./results/gwas/all_hg38_qcd_LE1_simgwas_quant1b.simgwas_quant1_glm_adjust.adjusted") View(a)

Let's focus on 4:87071601G,A a true positive. We see the SNP ID and some other information, and then P-value columns: UADJ, GC, BONF, ... GC is the GI lambda adjusted P-value. Notice that the unadjusted P-value is 4.5e-21 but the GC-adjusted is 0.25. So, we went from very significant to not at all. As noted above, the estimated lambda is very large but why is this so high? GI (GC) adjusts for population structure. However, we used the 10 population PCs as covariates. Thus, the regression coefficients and their P-values should already account for this structure. At least, to a large degree. Let's investigate.

Usage Note: In large part, this is an artifact of using adjust post-hoc vs as part of the original GLM command. We will explain more later. For now we'll continue with this approach to illustrate a point about genomic control and other post-hoc P-value adjustments.

We will now run a similar regression as in GWAS 1, but increase the pfilter to 1e-1, and re-run the post-hoc --adjust-file on this.

Plink 2:

time plink2 \ --pfile ./data/processed/all_hg38_qcd_LE1 \ --glm hide-covar --covar ./results/qc/qc3/all_hg38_qcd_LE1_pca.eigenvec \ --pheno ./data/raw/simgwas_quant1.pheno \ --pfilter 1e-1 \ --out ./results/gwas/all_hg38_qcd_LE1_simgwas_quant1c

Plink 2:

time plink2 \ --adjust-file ./results/gwas/all_hg38_qcd_LE1_simgwas_quant1c.simgwas_quant1.glm.linear 'test=ADD' \ --out ./results/gwas/all_hg38_qcd_LE1_simgwas_quant1c.simgwas_quant1_glm_adjust

As you can see from ther terminal/log, the genomic inflation lambda dropped dramatically from 66.5 to 8.5. Let's look at 4:87071601G,A .

R:

a<-loadplink2Rmat("./results/gwas/all_hg38_qcd_LE1_simgwas_quant1c.simgwas_quant1_glm_adjust.adjusted") View(a)

Unadjusted P-value is 4.5e-21 (as expected, it's the same as above). The GC adjusted P-value is now 0.001. What happened? Same dataset, same GLM model...

GI lambda is an estimate of genome-wide "spurious" associations with the phenotype. It is estimated by randomly sampling the genome. If you recall, our quant1b GLM run used a --pfilter of 1e-6. This is far from random but instead is selecting SNPs that are strongly associated with our trait. So, of course, randomly sampling on true positives will result in GI being high and misleading. In our new GLM run, we increased the P-value cutoff to 1e-1; a more liberal threshold. So, we are now selecting more SNPs from the null (not trait specific) distribution.

This exercise mainly highlights the fact that estimating/adjusting for cryptic population structure (e.g., GC, LD-score regression) requires randomly sampling the genome. This is important to remember for post-hoc adjustments. If you run --adjust during the regression, then Plink 2 will estimate GI (GC) using a broader (more random) set of variants assuming that is what you feed it. Let's test this by running the original regression that gave a lambda of 66.5 but add --adjust.

Plink 2:

time plink2 \ --pfile ./data/processed/all_hg38_qcd_LE1 \ --glm hide-covar --covar ./results/qc/qc3/all_hg38_qcd_LE1_pca.eigenvec \ --pheno ./data/raw/simgwas_quant1.pheno \ --pfilter 1e-6 \ --adjust \ --out ./results/gwas/all_hg38_qcd_LE1_simgwas_quant1d

--adjust
This runs the adjust command with the GLM instance, instead of running --adjust-file as a separate instance.


Way better. GI lambda went from 66.5 to 1. There is no spurious population effect as expected since we included population covariates in the regression. The high lambda was an artifact of the post-hoc adjustment. On your own, you can run the above without the population PCs and see if GI (GC) picks up the population structure. Replace --glm hide-covar --covar ./results/qc/qc3/all_hg38_qcd_LE1_pca.eigenvec with --glm allow-no-covars.

Cautionary Notes:

Plink 2 suggests using LD-score regression as an alternative to evaluate GI. (See LD-score and Lee et al. (2018).) However, note that LD-score regression will also be impacted by what variants are sampled. To estimate genomic inflation using GLM or LD-score regression, you need to randomly sample the genome. Otherwise you risk overly adjusting the P-values and losing true positives.

Further note that any kind of variant filtering like pfilter impacts other adjusted P-values. Plink 2 can only adjust for the variants it knows about.

You can also manually set --lambda to adjust the GLM statistics.

We leave this section noting that the genomic inflation (genomic control) and MCP adjusted P-values like BONF are not combined by default. Run --adjust or --adjust-file with the gc argument to incorporate GI (GC) into the MCP calculations.

Part 2: Clump

In GWAS 1, we were left with a number of false positives (FPs). This was even after controlling for population structure. One hypothesis, is that the FPs are variants correlated to our known true positives (TPs). Let's use Plink 2's post-hoc clump methods to explore this.

Plink 2

time plink2 \ --pfile ./data/processed/all_hg38_qcd_LE1 \ --clump ./results/gwas/all_hg38_qcd_LE1_simgwas_quant1b.simgwas_quant1.glm.linear \ --clump-unphased \ --clump-r2 0.2 \ --clump-allow-overlap \ --out ./results/gwas/all_hg38_qcd_LE1_simgwas_quant1b_simgwas_quant1_clump

--pfile
This a reference genotype set for estimating the correlations. It need not be the same set as used for the association analysis but, of course, it should contain the SNPs of interest and have similar structure.

--clump
Loads the named PLINK-format association report(s) (text files with a header line, a column containing variant IDs, and another column containing P-values) and groups results into LD-based clumps, writing a new report to plink2.clumps[.zst].

--clump-unphased
Run --clump with -unphased genotypes as this is more consistent with the glm model used for the above association analysis. Recall from Linkage that there are multiple methods for measuring co-incidence of genetic variants. By default, --clump uses phasing.

--clump-r2
One of two main parameters for clumping. Defines the r2 threshold for including a genetic variant in a parent clump. Please see clump reference for more options.


R:

clump_path = "./results/gwas/all_hg38_qcd_LE1_simgwas_quant1b_simgwas_quant1_clump.clumps" effect_path = "./data/raw/simgwas_quant1.effects" TPclumps(clump_path,effect_path)

Of 341 variants that were returned for simgwas_quant1b, we previously saw that this consisted of 48 (of 49) true positives. We now see that 81% of all the 341 variants also fall into correlated clumps with true positives. This will change to some degree with the clumping parameters. Thus, if we recall the original GWAS from the GWAS 1 tutorial, much of the false positives (that passed statistical tests) are due to population structure and correlations among variants.

bcftools — Adding Functional Annotations >>