Rules of Thumb - PLINK 2.0

D: 2 Jul 2025

Plink 2 Rules of Thumb and Cautionary Notes

--out and log files

When choosing an --out prefix, note that different Plink commands such as --hardy and --freq write their statistics to files with different extensions, so reusing a prefix will not overwrite those output files. The log file, however, will be overwritten unless each run uses a unique --out prefix. It is generally good practice to keep log files. To avoid overwriting them, either add something informative to the prefix or append a unique identifier such as a random number or the epoch time. Shell commands can generate these automatically.
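As a minimal sketch of the naming idea above, the following Python snippet builds an --out prefix from a base name, an informative label, and the epoch time, then assembles (without running) a plink2 argument list. The helper names `unique_out_prefix` and `plink2_args` and the dataset name `mydata` are hypothetical and chosen only for illustration.

```python
import time


def unique_out_prefix(base, label, stamp=None):
    """Return an --out prefix of the form '<base>.<label>.<stamp>'.

    `stamp` defaults to the current epoch time in whole seconds.
    """
    if stamp is None:
        stamp = int(time.time())
    return f"{base}.{label}.{stamp}"


def plink2_args(pfile, flag, out_prefix):
    """Assemble a plink2 command as an argument list; nothing is executed here."""
    return ["plink2", "--pfile", pfile, flag, "--out", out_prefix]


# 'mydata' is a placeholder dataset prefix (assumption for this example).
freq_cmd = plink2_args("mydata", "--freq", unique_out_prefix("mydata", "freq"))
hardy_cmd = plink2_args("mydata", "--hardy", unique_out_prefix("mydata", "hardy"))

print(" ".join(freq_cmd))
print(" ".join(hardy_cmd))
```

Because the label differs between the two runs, each run gets its own log file. Two runs with the same label started within the same second would still collide, so a finer-grained timestamp or a random suffix may be preferable in batch settings.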

MAF Filtering

Empirical MAFs are fine down to about <sample size>^{-0.5}. In other words, if a MAF estimate of 1/n is supported by at least n different samples carrying the minor allele, it usually won't be off by enough to matter. For example, for 1000 Genomes Phase 3, with ~2,500 founders, this threshold is around 0.02 (that is, 1/50).
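The arithmetic behind this rule of thumb can be checked with a short Python sketch. The function name `maf_floor` is hypothetical. Passing the result to plink2's --maf filter is one possible use, not a recommendation taken from the text above.

```python
import math


def maf_floor(n_samples):
    """Rule-of-thumb lower bound for trustworthy empirical MAFs: n_samples ** -0.5."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return 1.0 / math.sqrt(n_samples)


# ~2,500 founders, as in the 1000 Genomes Phase 3 example.
threshold = maf_floor(2500)

# Format the threshold as an argument pair for the --maf filter.
maf_args = ["--maf", f"{threshold:.4g}"]
print(maf_args)
```

The threshold shrinks slowly with sample size: quadrupling the number of samples only halves it.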

REF and Plink 1.x

Plink 1.x does not preserve REF alleles, whereas Plink 2 does. This can also be an issue when moving data between Plink 2 and Plink 1.x. See the REF tutorial for how to address this.

Ancestry PCA

Watch out for PCA-based population stratification issues when working across different datasets. These can lead to underestimated population structure and to some genetic variant effects being discounted.

Controlling for population stratification and inferring the ancestry of out-of-sample (new) individuals using PCA is not straightforward. One potential issue is shrinkage: in high-dimensional problems, population structure can be underestimated. A second issue is that some PCs may represent LD structure rather than differences between individuals. In that case, PCs included as population covariates may lower the GWAS signal for the affected variants.
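The shrinkage issue can be illustrated with a toy simulation in plain Python. This is a sketch under strong assumptions, not Plink's --pca implementation and not a method from the references below: two artificial groups differ by a constant shift on every variable, the noise is Gaussian rather than genotype-like, and the function name and all parameter values are hypothetical. PC1 is estimated from the training samples only, then held-out samples drawn from the same model are projected onto it.

```python
import math
import random


def simulate_pc1_shrinkage(n_train=40, n_test=40, n_variants=1500, shift=0.3, seed=0):
    """Return the mean |PC1 score| of training and of projected held-out samples."""
    rng = random.Random(seed)

    def draw(n):
        # Alternate samples between two groups that differ by +/- shift on
        # every variable, plus independent unit-variance Gaussian noise.
        rows = []
        for i in range(n):
            sign = 1.0 if i % 2 == 0 else -1.0
            rows.append([sign * shift + rng.gauss(0.0, 1.0) for _ in range(n_variants)])
        return rows

    train, test = draw(n_train), draw(n_test)

    # Center each variable using the training means only.
    means = [sum(row[j] for row in train) / n_train for j in range(n_variants)]
    train = [[x - m for x, m in zip(row, means)] for row in train]
    test = [[x - m for x, m in zip(row, means)] for row in test]

    # Gram matrix X X^T of the training data; its top eigenvector yields PC1.
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in train] for ri in train]

    # Power iteration for the leading eigenvector of the Gram matrix.
    u = [rng.gauss(0.0, 1.0) for _ in range(n_train)]
    for _ in range(200):
        w = [sum(g * x for g, x in zip(row, u)) for row in gram]
        norm = math.sqrt(sum(x * x for x in w))
        u = [x / norm for x in w]
    eigenvalue = sum(ui * sum(g * x for g, x in zip(row, u)) for ui, row in zip(u, gram))

    # Unit-length PC1 loading vector in variable space: X^T u / sqrt(eigenvalue).
    scale = math.sqrt(eigenvalue)
    loading = [sum(u[i] * train[i][j] for i in range(n_train)) / scale
               for j in range(n_variants)]

    def pc1_scores(rows):
        return [sum(x * v for x, v in zip(row, loading)) for row in rows]

    train_spread = sum(abs(s) for s in pc1_scores(train)) / n_train
    test_spread = sum(abs(s) for s in pc1_scores(test)) / n_test
    return train_spread, test_spread


train_spread, test_spread = simulate_pc1_shrinkage()
print(f"mean |PC1| training: {train_spread:.2f}")
print(f"mean |PC1| held-out: {test_spread:.2f}")
```

With many more variables than samples, the estimated PC1 also fits noise in the training data, so the held-out scores sit closer to zero than the training scores even though both sets come from the same two groups. Taken at face value, the projected samples would look less differentiated than they are.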

For more details we refer you to these references:

Wang, C., Zhan, X., Liang, L., Abecasis, G. R., & Lin, X. (2015). Improved ancestry estimation for both genotyping and sequencing data using projection procrustes analysis and genotype imputation. The American Journal of Human Genetics, 96(6), 926-937.
Privé, F., Luu, K., Blum, M. G., McGrath, J. J., & Vilhjálmsson, B. J. (2020). Efficient toolkit implementing best practices for principal component analysis of population genetic data. Bioinformatics, 36(16), 4449-4457.
Zhang, D., Dey, R., & Lee, S. (2020). Fast and robust ancestry prediction using principal component analysis. Bioinformatics, 36(11), 3439-3446.
Dey, R., & Lee, S. (2019). Asymptotic properties of principal component analysis and shrinkage-bias adjustment under the generalized spiked population model. Journal of Multivariate Analysis, 173, 145-164.
Lee, S., Zou, F., & Wright, F. A. (2010). Convergence and prediction of principal component scores in high-dimensional settings. Annals of Statistics, 38(6), 3605.
