D: 20 Jan 2025

Input filtering

The following flags allow you to exclude samples and/or variants from an analysis batch based on a variety of criteria.

Two general notes:

  • When a filter type can apply to either samples or variants, the sample-filter flag names start with 'keep'/'remove', and the variant-filter flag names start with 'extract'/'exclude'.
  • Some of these criteria are based on statistics such as estimated MAF that may vary through multiple filtering passes. If variation is problematic, use "--freq counts" to export initial statistics, and then include --read-freq in all filtering passes where you want to refer back to the initial stats.

ID lists

--keep <filename(s)...>
--remove <filename(s)...>

--keep-fam <filename(s)...>
--remove-fam <filename(s)...>

--keep accepts one or more space/tab-delimited text files with sample IDs, and removes all unlisted samples from the current analysis; --remove does the same for all listed samples. Similarly, --keep-fam and --remove-fam accept text files with family IDs in the first column, and keep or remove entire families.

--keep/--remove now support a wider variety of sample ID file formats:

  • If the first line starts with '#FID' or '#IID', it will be treated as a header line. As long as the first columns are "#FID IID", "#FID IID SID", "#IID", or "#IID SID", PLINK 2 will do the right thing. (Note that when FID is undefined, it is treated as '0'.)
  • If there is no header line, one-column lines are treated as IIDs, and multicolumn lines are treated the same way as in PLINK 1.x (first two columns assumed to be FID/IID).
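
The header rules above can be illustrated with a short Python sketch. This is not PLINK 2's parser: the function name parse_sample_ids is hypothetical, SID columns are ignored, and only the cases described above are handled.

```python
def parse_sample_ids(lines):
    """Return (FID, IID) pairs from the lines of a sample ID file.

    Illustrative only: follows the header rules described in the text,
    ignores any SID column, and treats an undefined FID as '0'.
    """
    lines = [ln for ln in lines if ln.strip()]
    if not lines:
        return []
    first = lines[0].split()
    has_header = first[0] in ("#FID", "#IID")
    header_has_fid = first[0] == "#FID"
    body = lines[1:] if has_header else lines
    ids = []
    for ln in body:
        tok = ln.split()
        if has_header:
            # '#FID IID [SID]' or '#IID [SID]'
            fid, iid = (tok[0], tok[1]) if header_has_fid else ("0", tok[0])
        elif len(tok) == 1:
            # No header, one column: IID only.
            fid, iid = "0", tok[0]
        else:
            # No header, multiple columns: PLINK 1.x style FID/IID.
            fid, iid = tok[0], tok[1]
        ids.append((fid, iid))
    return ids
```

For instance, a file starting with "#IID SID" yields FID '0' for every sample, while a headerless two-column file is read as FID/IID.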

Single sample ID

--indv <sample ID>

--indv accepts a single 1-3 part sample ID, and removes all samples with different IDs. Separate sample ID parts with spaces.

--extract [{bed0 | bed1}] <filename(s)...>
--exclude [{bed0 | bed1}] <filename(s)...>

--extract-intersect [{bed0 | bed1}] <filename(s)...>

--extract normally accepts one or more text file(s) with variant IDs (usually one per line, but it's okay for them to just be separated by spaces), and removes all unlisted variants from the current analysis. With the 'bed0' or 'bed1' modifier, the input file should be in 0-based or 1-based interval-BED format instead. For backward compatibility, 'range' is an alias for 'bed1'.

--exclude does the same for all listed variants.

--extract-intersect is just like --extract, except that a variant must be in the intersection, rather than just the union, of the --extract-intersect files to be kept.

--bed-border-bp <#>
--bed-border-kb <#>

--bed-border-bp extends all the intervals in an input BED file (for e.g. "--extract bed0") by the given number of base-pairs on both sides. --bed-border-kb interprets its argument as a kilobase count, and is otherwise identical.

--extract-col-cond <filename> [value col. number] [ID col.] [skip]

--extract-col-cond-match <(sub)string(s)...>
--extract-col-cond-mismatch <(sub)string(s)...>
--extract-col-cond-substr

--extract-col-cond-min <min>
--extract-col-cond-max <max>

--extract-col-cond excludes all variants which either don't appear in the given input file, or are associated with a value which doesn't satisfy the given condition. (This is a generalization of PLINK 1.x's --qual-scores flag.) It is designed to support filtering on INFO-like values stored in a separate tab-delimited file.

  • By default, values are read from column 2 of the file, and variant IDs are read from column 1. These column numbers can be changed by providing additional parameters to --extract-col-cond.
  • If a fourth 'skip' parameter is provided to --extract-col-cond, it is interpreted as a number of initial lines to skip if it's a number; otherwise it must be a single (possibly quoted) character, and all lines starting with that character are skipped.
  • Three types of conditions are supported:
    • When --extract-col-cond-match is specified without --extract-col-cond-substr, the condition is "value (including capitalization) exactly matches one of the given strings". Similarly, --extract-col-cond-mismatch without --extract-col-cond-substr invokes the condition "value matches none of the given strings".
    • When --extract-col-cond-match and/or -mismatch are specified with --extract-col-cond-substr, the variant is kept iff none of the --extract-col-cond-mismatch substrings are contained in the value, and either --extract-col-cond-match was unspecified or at least one of its substrings is contained.
    • Otherwise, the value is interpreted as a number, and the variant is kept if the number is in [<min>, <max>]. These bounds default to min=0, max=1.79769e+308 (DBL_MAX), and can be changed with --extract-col-cond-min/--extract-col-cond-max.
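
As a rough illustration of the three condition types, here is a Python sketch of the per-value decision. The function keep_value and its parameters are hypothetical. Two behaviors are assumptions rather than documented facts: combining match and mismatch without the substring flag, and rejecting values that cannot be parsed as numbers.

```python
import sys

def keep_value(value, match=None, mismatch=None, substr=False,
               vmin=0.0, vmax=sys.float_info.max):
    """Return True if a variant's column value satisfies the condition."""
    if match is not None or mismatch is not None:
        if substr:
            # Substring mode: no mismatch substring may occur, and at
            # least one match substring must occur (if any were given).
            if mismatch and any(s in value for s in mismatch):
                return False
            return match is None or any(s in value for s in match)
        # Exact mode: case-sensitive whole-string comparison.
        # (Combining both lists here is an assumption.)
        if mismatch and value in mismatch:
            return False
        return match is None or value in match
    # Numeric mode: keep iff the value lies in [vmin, vmax].
    try:
        number = float(value)
    except ValueError:
        return False  # assumption: unparseable values fail the condition
    return vmin <= number <= vmax
```

With the default bounds, keep_value("0.5") passes and keep_value("-1") does not.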

QUAL, FILTER, INFO

--var-min-qual <value>

--var-min-qual causes all variants with QUAL value smaller than the given number, or with no QUAL value at all, to be skipped.

--var-filter [exception(s)...]

To skip variants which failed one or more filters tracked by the FILTER field, use --var-filter. This can be combined with one or more (space-delimited) filter names to ignore.

##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##contig=<ID=1,length=249250621>
#CHROM  POS    ID           REF  ALT  QUAL  FILTER
1       10583  rs58108140   G    A    25    PASS
1       10611  rs189107123  C    G    11    q10
1       13302  rs180734498  C    T    32    s50
1       13327  rs144762171  G    C    30    .
1       13957  rs201747181  TC   T    3     q10;s50

For example, given the .pvar file above:

  • --var-filter with no arguments would keep only rs58108140 and rs144762171;
  • "--var-filter q10" would keep rs58108140, rs189107123, and rs144762171 (PLINK matches against the 'q10' string, instead of checking the QUAL value here; use --var-min-qual to do the latter);
  • "--var-filter LowQual s50" would keep rs58108140, rs180734498, and rs144762171; and
  • "--var-filter q10 s50" would keep all five variants.
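
The FILTER logic can be modeled in a few lines of Python. This is a sketch of the documented behavior, not PLINK 2 code; passes_var_filter and kept are hypothetical names, and the table simply copies the ID and FILTER columns of the example .pvar.

```python
def passes_var_filter(filter_field, exceptions=()):
    """True if every failed filter in FILTER is on the exception list."""
    if filter_field in (".", "PASS"):
        return True
    return all(name in exceptions for name in filter_field.split(";"))

# (variant ID, FILTER) pairs from the example .pvar file.
PVAR = [
    ("rs58108140", "PASS"),
    ("rs189107123", "q10"),
    ("rs180734498", "s50"),
    ("rs144762171", "."),
    ("rs201747181", "q10;s50"),
]

def kept(exceptions=()):
    """IDs that survive --var-filter with the given filter names ignored."""
    return [vid for vid, filt in PVAR if passes_var_filter(filt, exceptions)]
```

Calling kept() models --var-filter with no arguments, and kept(("q10", "s50")) models "--var-filter q10 s50".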

--extract-if-info <key> <operator> <value>
  (alias: --extract-if)
--exclude-if-info <key> <operator> <value>
  (alias: --exclude-if)
--require-info <key(s)...>
--require-no-info <key(s)...>

--extract-if-info removes all variants which don't satisfy a comparison predicate on an INFO key. For numbers, the supported operators are '!=', '<', '<=', '==' (single '=' also ok), '>=', and '>'; for strings, only '!=' and '='/'==' can be used. If the key or value is missing, the predicate evaluates to true iff the operator is '!='.

As a special case, you can specify the empty-string value as ';'. (This was not supported before 8 Jan 2023.)

Note that the '<', '>', and ';' characters have special meanings in practically all shells; it is necessary to wrap them in quoted expressions. In bash, you can either quote the special characters individually, e.g.

--extract-if-info AFR_AF '>'= 0.05

or put the entire predicate in double-quotes (recommended):

--extract-if-info "AFR_AF >= 0.05"

Similarly, --exclude-if-info removes all variants which do satisfy such a comparison predicate.

--require-info removes all variants which don't have all of the listed INFO keys. (A key is treated as if it isn't present when the associated value is '.'.) Similarly, --require-no-info removes all variants which have any of the listed keys.
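
The comparison rules for --extract-if-info can be sketched in Python as follows. The function info_predicate is hypothetical and takes an already-parsed dict of INFO key/value strings. Treating a '.' value as missing is an assumption borrowed from the --require-info description, and the ';' empty-string special case is not modeled.

```python
import operator

OPS = {"!=": operator.ne, "<": operator.lt, "<=": operator.le,
       "==": operator.eq, "=": operator.eq,
       ">=": operator.ge, ">": operator.gt}

def info_predicate(info, key, op, value):
    """Evaluate '<key> <op> <value>' against a dict of INFO strings."""
    actual = info.get(key)
    if actual is None or actual == ".":
        # Missing key/value: true iff the operator is '!='.
        return op == "!="
    try:
        # Numeric comparison when both sides parse as numbers.
        return OPS[op](float(actual), float(value))
    except ValueError:
        # String comparison: only equality-type operators are allowed.
        if op not in ("!=", "=", "=="):
            raise ValueError("strings only support '!=' and '='/'=='")
        return OPS[op](actual, value)
```

--extract-if-info keeps a variant when this returns True; --exclude-if-info removes it in that case.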

Chromosomes

--chr <number(s)/range(s)...>
--not-chr <number(s)/range(s)...>

--chr excludes all variants not on the listed chromosome(s). Normally, valid choices for humans are 0 (i.e. unknown), 1-22, X, Y, MT, PAR1/PAR2 (pseudo-autosomal regions of X; see --split-par/--merge-par), and XY (deprecated PLINK 1.x code intended to refer to the pseudo-autosomal region). Separate multiple chromosomes with spaces or commas, and use dashes to specify ranges. Spaces are not permitted immediately before or after a range-denoting dash.

For example, the following are all valid and equivalent:

--chr 1-4, 22, xy
--chr 1-4 22 XY
--chr 1,2,3,4,22,25

You might wonder about the '25'. Several non-autosomal chromosomes can also be identified by numeric code: if there are n autosomes, n+1 is the X chromosome, n+2 is Y, n+3 is XY, and n+4 is MT. (However, no numeric codes are associated with PAR1/PAR2.)
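
To make the numbering rule concrete, here is a small Python sketch that expands a --chr style argument into numeric codes. The helper names are hypothetical, and the sketch deliberately does not handle PAR1/PAR2 (which have no numeric codes), 'chr' prefixes, or --allow-extra-chr names.

```python
def chr_codes(n_autosomes=22):
    """Numeric codes for the non-autosomal chromosomes."""
    n = n_autosomes
    return {"X": n + 1, "Y": n + 2, "XY": n + 3, "MT": n + 4}

def parse_chr(arg, n_autosomes=22):
    """Expand e.g. '1-4, 22, xy' into a set of numeric chromosome codes."""
    names = chr_codes(n_autosomes)

    def code(tok):
        tok = tok.upper()
        return names[tok] if tok in names else int(tok)

    selected = set()
    for item in arg.replace(",", " ").split():
        lo, _, hi = item.partition("-")
        # A bare code is treated as a one-element range.
        selected.update(range(code(lo), code(hi or lo) + 1))
    return selected
```

With 22 autosomes, all three example arguments above expand to the same set, {1, 2, 3, 4, 22, 25}.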

--not-chr is the reverse of --chr: variants on listed chromosome(s) are excluded. So

--not-chr 0 5-21 x y mt par1 par2

is equivalent to the three --chr examples above (assuming human data). (Yes, if your data uses PAR1/PAR2 codes, "--chr xy" will not cause them to be included. If this is problematic, see --autosome-par below.)

If you specified --allow-extra-chr, you can refer to the extra chromosome codes by name, e.g.

--allow-extra-chr --not-chr chr1_gl000191_random

--autosome

--autosome-par

--autosome excludes all unplaced and non-autosomal variants, while --autosome-par does not exclude XY/PAR1/PAR2. They can be combined with --not-chr, e.g.

--autosome-par --not-chr 5-21 xy

is also equivalent to the three --chr examples.

Keep only SNPs

--snps-only ['just-acgt']

--snps-only excludes all variants with one or more multi-character allele codes. With 'just-acgt', variants with single-character allele codes outside of {'A', 'C', 'G', 'T', 'a', 'c', 'g', 't', <missing code>} are also excluded.

Simple variant window

--from <variant ID>
--to <variant ID>

--from excludes all variants on different chromosomes than the named variant, as well as those with smaller base-pair position values. --to is similar, excluding variants with larger position values instead. If they are used together but the --from variant is after the --to variant, they are automatically swapped.

--snp <variant ID>
--window <total window size, in kb>
--exclude-snp <variant ID>

--snp specifies a single variant to load by name. If it's combined with --window, all variants with physical position no more than half the specified kb distance (decimal permitted) from the named variant are loaded as well.

Similarly, --exclude-snp specifies a single variant to exclude; this can also be combined with --window.
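
A minimal Python sketch of the --snp/--window selection, to show how the half-window rule works. The function snp_window and the (ID, chromosome, position) tuple layout are hypothetical; restricting the window to the named variant's chromosome is an assumption implied by "physical position".

```python
def snp_window(variants, snp_id, window_kb=None):
    """Return the variant IDs selected by --snp, optionally with --window.

    variants: list of (variant ID, chromosome, bp position) tuples.
    """
    positions = {vid: (chrom, bp) for vid, chrom, bp in variants}
    chrom0, bp0 = positions[snp_id]
    if window_kb is None:
        return [snp_id]
    # The window size is a total width, so each side gets half of it.
    half_bp = window_kb * 1000 / 2
    return [vid for vid, chrom, bp in variants
            if chrom == chrom0 and abs(bp - bp0) <= half_bp]
```

So "--snp rsA --window 10" would load variants within 5000 bp of rsA on either side.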

--from-bp <pos>
--to-bp <pos>
--from-kb <kb pos>
--to-kb <kb pos>
--from-mb <mb pos>
--to-mb <mb pos>

These flags let you use physical positions to specify a variant range to load. Kilobase and megabase values can include decimals. You are required to specify a single chromosome when using these.

Multiple ranges

--snps <variant ID(s)/range(s)...>
--exclude-snps <variant ID(s)/range(s)...>

--snps accepts a collection of individual variant IDs and variant ranges. For example,

--snps rs1111-rs2222, rs3333, rs4444

tells PLINK to load all variants between rs1111 and rs2222 inclusive, as well as rs3333 and rs4444. (Syntax works the same way as --chr. If your variant IDs contain dashes, you'll want to use the --d flag as well.) If rs1111 and rs2222 are on different chromosomes i < j, then all variants on chromosomes numbered between i and j are loaded, as well as the last variants on chromosome i and the first variants on chromosome j. (You can exclude some intermediate chromosomes by combining --snps with --not-chr.)

--exclude-snps excludes all the specified variants/ranges instead.

--force-intersect

To reduce the potential for confusion, PLINK 2 normally errors out when multiple variant-inclusion filters (--extract[-intersect], --extract-col-cond, --from/--to, --from-bp/--to-bp, --snp, --snps) are specified, since it may not be obvious whether the intersection or union will be taken. --force-intersect allows the run to proceed; the set intersection will be taken.

Deduplicate variants

--rm-dup [mode] ['list']

--rm-dup usually removes all but one instance of each duplicate-ID variant (ignoring the missing ID). With the 'list' modifier, the original duplicated IDs are written to plink2.rmdup.list.

The following modes of operation are supported:

  • 'error' (default): Check each group of duplicate-ID variants for equality. (Alleles are considered unequal even if the codes are the same, just in a different order; FILTER/INFO are considered unequal if the strings don't match exactly, even if they're semantically identical.) If any mismatches are found, this errors out, and writes a list of mismatching variant IDs to plink2.rmdup.mismatch.
  • 'retain-mismatch': When unequal duplicate-ID variants are found, keep every member of the group. The .rmdup.mismatch file is still written.
  • 'exclude-mismatch': When unequal duplicate-ID variants are found, exclude every member of the group.
  • 'exclude-all': Exclude all instances of all duplicate-ID variants.
  • 'force-first': Always keep just the first instance of each duplicate-ID variant.
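
The five modes can be compared side by side with a Python sketch. This is a simplified model, not PLINK 2's implementation: rm_dup is a hypothetical function, each variant is reduced to an (ID, record) pair where the record stands for everything compared for equality, and keeping the first member of an all-equal group is an assumption.

```python
from collections import defaultdict

def rm_dup(variants, mode="error", missing_id="."):
    """Apply a --rm-dup style mode to a list of (ID, record) pairs."""
    groups = defaultdict(list)
    for idx, (vid, _) in enumerate(variants):
        if vid != missing_id:  # the missing ID is never deduplicated
            groups[vid].append(idx)
    drop = set()
    for vid, idxs in groups.items():
        if len(idxs) == 1:
            continue
        first = variants[idxs[0]][1]
        all_equal = all(variants[i][1] == first for i in idxs)
        if mode == "exclude-all":
            drop.update(idxs)
        elif mode == "force-first" or all_equal:
            drop.update(idxs[1:])
        elif mode == "error":
            raise ValueError(f"mismatching duplicate ID: {vid}")
        elif mode == "exclude-mismatch":
            drop.update(idxs)
        elif mode != "retain-mismatch":
            raise ValueError(f"unknown mode: {mode}")
    return [v for i, v in enumerate(variants) if i not in drop]
```

Note that a record pair like "C/T" versus "T/C" counts as a mismatch here, mirroring the allele-order rule above.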

Arbitrary thinning

--thin <p>
--thin-count <n>
--bp-space <bp count>
--thin-indiv <p>
--thin-indiv-count <n>
  (alias: --max-indv)

--thin removes variants at random by retaining each variant with probability p, --thin-count removes variants at random until only n remain, and --bp-space excludes one variant from each pair closer than the given bp count. (Yes, --bp-space is equivalent to VCFtools --thin; we can't do much about this mixup without breaking backward compatibility.) Note that LD-based pruning also has a variant thinning effect, and is normally more useful than these three commands.

Similarly, --thin-indiv removes samples at random by retaining each sample with probability p, while --thin-indiv-count removes samples at random until only n remain.
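
The difference between probability-based and count-based thinning is easy to see in a short Python sketch. The functions thin and thin_count are hypothetical and use Python's random module, so the selected items will not match PLINK 2's own pseudorandom number generator.

```python
import random

def thin(items, p, rng):
    """Keep each item independently with probability p (--thin style)."""
    return [x for x in items if rng.random() < p]

def thin_count(items, n, rng):
    """Keep exactly n randomly chosen items, in their original order
    (--thin-count style). If n >= len(items), everything is kept."""
    if n >= len(items):
        return list(items)
    keep = set(rng.sample(range(len(items)), n))
    return [x for i, x in enumerate(items) if i in keep]
```

The first approach returns a variable number of items (about p times the input size), while the second always returns exactly n when enough items are available. Pass a seeded random.Random instance as rng for reproducible output.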

Phenotype/covariate-based

--keep-if <phenotype/covariate name> <operator> <value>
--remove-if <phenotype/covariate name> <operator> <value>

--keep-if removes all samples which don't satisfy a comparison predicate on a phenotype or covariate, while --remove-if does the reverse.

  • Syntax and treatment of missing values is the same as for --extract-if-info.
  • For binary phenotypes, either '2' or 'case' (any capitalization) can be used to refer to cases, and either '1', 'ctrl', or 'control' can be used to refer to controls.

--require-pheno [phenotype name(s)...]
--require-covar [covariate name(s)...]

When parameters are provided, --require-pheno removes samples missing any of the named phenotypes; otherwise, it removes samples missing any loaded phenotype. --require-covar does the same things for covariates.

--keep-cats <filename>
--keep-cat-names <name(s)...>
--remove-cats <filename>
--remove-cat-names <name(s)...>

--keep-cat-pheno <phenotype/covariate name>
--remove-cat-pheno <phenotype/covariate name>

If exactly one categorical phenotype/covariate is loaded, --keep-cats and --keep-cat-names can be used individually or in combination to define a list of categories to keep; all samples not in one of those categories are then removed from the current analysis. --keep-cats accepts a text file with one category name per line, and --keep-cat-names takes a space-delimited sequence of category names on the command line.

If multiple categorical phenotypes/covariates are loaded, use --keep-cat-pheno to specify which variable --keep-cats/--keep-cat-names should apply to. (This is still safe when only one categorical variable is present.)

Similarly, --remove-cats removes all samples in categories named in a file, --remove-cat-names removes all samples in categories named on the command line, and --remove-cat-pheno specifies which variable --remove-cats/--remove-cat-names should apply to.

String match

--keep-col-match <filename> <string(s) to match...>

--keep-col-match-name <column name>

--keep-col-match-num <n>

--keep-col-match accepts a space/tab-delimited text file with sample IDs in the first column(s) and a string to filter on in a later column. You can specify this column with either --keep-col-match-num or --keep-col-match-name (the latter requires a header line starting with #FID or #IID); with neither, "--keep-col-match-num 3" is assumed. All samples either missing from the file, or with a string value which doesn't match any of the strings you provided, are removed from the analysis. The string comparison is case-sensitive, and numbers are not parsed, so '9', '9.0', '9e0', and '9E0' all compare unequal.

(This is a minor extension of PLINK 1.x's --filter flag.)

Missing genotype rates

--geno [maximum per-variant] [{dosage | hh-missing}]
--mind [maximum per-sample] [{dosage | hh-missing}]

--geno filters out all variants with missing call rates exceeding the provided value (default 0.1), while --mind does the same for samples.

If any samples were removed by --mind, their IDs are written to plink2.mindrem.id.

By default, when a dosage is present but a hardcall is not, the genotype is treated as missing; add the 'dosage' modifier to treat this case as nonmissing. Alternatively, you can use 'hh-missing' to also treat heterozygous haploid calls as missing.
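
Here is a simplified Python sketch of the per-variant --geno decision, including the effect of the 'dosage' modifier. The names is_missing and geno_filter are hypothetical, each call is modeled as a (hardcall, dosage) pair with None for absent values, and details such as 'hh-missing' and chromosome-specific sample counts are left out.

```python
def is_missing(hardcall, dosage, dosage_mode=False):
    """A call is missing if it has no hardcall, unless dosage_mode is on
    and a dosage is present."""
    if hardcall is not None:
        return False
    return not (dosage_mode and dosage is not None)

def geno_filter(variants, max_missing=0.1, dosage_mode=False):
    """Return IDs of variants whose missing rate does not exceed max_missing.

    variants: list of (variant ID, [(hardcall, dosage), ...]) pairs.
    """
    kept = []
    for vid, calls in variants:
        n_missing = sum(is_missing(h, d, dosage_mode) for h, d in calls)
        if n_missing / len(calls) <= max_missing:
            kept.append(vid)
    return kept
```

A variant with a missing rate exactly equal to the threshold is kept, since only rates exceeding the value are filtered out.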

Number of distinct alleles

--min-alleles <count>
--max-alleles <count>

--min-alleles excludes variants with fewer than the given number of alleles in the .pvar/.bim file, while --max-alleles excludes variants with more. For example, "--max-alleles 2" filters out the multiallelic variants which would otherwise make --make-bed error out.

When a variant has exactly one ALT allele and it's a missing-code, these filters treat it as having only one allele.

--import-max-alleles <count>

--import-max-alleles is similar to --max-alleles, but applied during VCF/BCF/BGEN dataset import. This allows e.g. VCF/BCF files containing a few records with 255+ ALT alleles to be (partially) imported by PLINK 2 without a slow bcftools preprocessing step. Count must be at least 2.

Allele frequencies/counts

--maf [minimum freq] [mode]
  (alias: --min-af)
--max-maf <maximum freq> [mode]
  (alias: --max-af)
--mac <minimum count> [mode]
  (alias: --min-ac)
--max-mac <maximum count> [mode]
  (alias: --max-ac)

--maf filters out all variants with allele frequency below the provided threshold (default 0.01), while --max-maf imposes an upper bound. Similarly, --mac and --max-mac impose lower and upper allele count bounds, respectively.

By default, these flags operate on 'nonmajor' (i.e. sum of all but the largest value) allele frequencies/counts. Three other modes are supported: 'nref' (nonreference), 'alt1', and 'minor' (smallest). You can use bcftools-style freq:mode notation for this.
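
The four modes differ only in which allele counts are summed. The following Python sketch (hypothetical function filter_frequency, taking counts ordered as REF, ALT1, ALT2, ...) shows the value each mode would compare against the threshold.

```python
def filter_frequency(allele_counts, mode="nonmajor"):
    """Frequency used for filtering, given [REF, ALT1, ALT2, ...] counts."""
    total = sum(allele_counts)
    if total == 0:
        raise ValueError("no allele observations")
    if mode == "nonmajor":
        count = total - max(allele_counts)   # all but the largest
    elif mode == "nref":
        count = total - allele_counts[0]     # all but the reference
    elif mode == "alt1":
        count = allele_counts[1]             # first ALT allele only
    elif mode == "minor":
        count = min(allele_counts)           # smallest
    else:
        raise ValueError(f"unknown mode: {mode}")
    return count / total
```

For a triallelic variant with counts [30, 60, 10], the four modes give 0.4, 0.7, 0.6, and 0.1 respectively, so the choice of mode can change whether the variant passes.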

When pedigree information is present, --maf and --max-maf default to ignoring nonfounders when applying these filters; this can be changed with --nonfounders. There is no longer an analogous default for --mac/--max-mac; you now must explicitly specify how you want nonfounders to be handled (with --nonfounders or --ac-founders) when using those flags.

--af-pseudocount <x>

--af-pseudocount causes allele frequencies to be estimated as

   qhat := (x + <# of observations of current allele>) / (x · <# of distinct alleles> + <# of obs. of any allele>)

instead of the usual

   qhat := <# of observations of current allele> / <# of observations of any allele>.

When the --read-freq file contains observation counts, --af-pseudocount acts on those counts.
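
As a quick numeric check of the pseudocount formula, here is a direct Python transcription. The function name af_pseudocount mirrors the flag but is otherwise hypothetical; it takes the observation counts of every allele at one variant.

```python
def af_pseudocount(allele_counts, x):
    """Estimate allele frequencies with pseudocount x added to each allele.

    qhat = (x + count) / (x * n_alleles + total_count)
    """
    n_alleles = len(allele_counts)
    total = sum(allele_counts)
    return [(x + c) / (x * n_alleles + total) for c in allele_counts]
```

With x = 0 this reduces to the usual estimate. With counts [0, 10] and x = 1, the unobserved allele gets frequency 1/12 rather than 0, and the estimates still sum to 1.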

Hardy-Weinberg equilibrium tests

--hwe <p> [k] ['midp'] ['keep-fewhet']

--hwe filters out all variants which have Hardy-Weinberg equilibrium exact test p-value below p·10^(-nk), where n is the sample size, and k is 0 if unspecified.

  • The new k parameter is motivated by Greer PJ, et al. (2024) A reassessment of Hardy-Weinberg equilibrium filtering in large sample genomic studies, which echoes findings by us and others that --hwe has frequently been used with inappropriately strict settings (throwing out genuine SNP-trait associations) on large datasets, and reports that k=0.001 produces consistent and appropriate behavior across a wide range of large sample sizes. (The preprint does not investigate the small-sample limit; we expect something like p=1e-5 to work well.)
    To address the pattern of misuse, --hwe now prints a warning (to be upgraded to an error in a future build) when filtering settings are suspiciously strict for the sample size and neither k nor 'keep-fewhet' (see below) was specified. You can explicitly set k=0 to silence the warning.
  • When significant population stratification is present, the Wahlund effect drives some variants to lower heterozygosity than the Hardy-Weinberg equilibrium level; when using --hwe for quality control, you probably want to keep these variants. Conveniently, systematic genotyping errors are much more likely to show up as excess heterozygosity. So we provide the 'keep-fewhet' mode, which only filters out variants with excess heterozygosity.
  • On chrX, p-values are now computed using the method described in Graffelman J, Weir BS (2016) Testing for Hardy-Weinberg equilibrium at biallelic genetic markers on the X chromosome. In keep-fewhet mode, the ratio between the Graffelman/Weir p-value and the female-only p-value is considered.
  • The 'midp' modifier applies the mid-p adjustment described in Graffelman J, Moreno V (2013) The mid p-value in exact tests for Hardy-Weinberg equilibrium. The mid-p adjustment tends to bring the null rejection rate in line with the nominal p-value, and also reduces the filter's tendency to favor retention of variants with missing data. It's a small refinement, but we recommend its use.
  • For multiallelic variants, a separate biallelic test is performed for every allele, and the variant is filtered out iff any of the tests yields a [mid-]p-value below the threshold.
  • Only founders are considered by this test; use --nonfounders to change this.
  • There is currently no special handling of case/control phenotypes;
    --keep-if <phenotype name> == control
    is frequently a good idea when using --hwe in a genome-wide association analysis (and matches PLINK 1.x's behavior).
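
To see how the k parameter scales the cutoff with sample size, the threshold p·10^(-nk) can be computed on a log10 scale, which avoids floating-point underflow for large n. This Python helper is a hypothetical illustration, and the p and n values in the usage note are arbitrary examples rather than recommendations.

```python
import math

def hwe_log10_threshold(p, n, k=0.0):
    """log10 of the --hwe p-value cutoff p * 10**(-n*k).

    p: base threshold, n: sample size, k: per-sample exponent (0 if unspecified).
    """
    return math.log10(p) - n * k
```

For example, with p = 1e-6 and k unspecified the cutoff stays at 1e-6 for any n, while with k = 0.001 and n = 100000 it drops to 1e-106.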

Imputation quality

--mach-r2-filter [min] [max]
--minimac3-r2-filter <min> [max]

--mach-r2-filter excludes variants where the MaCH Rsq imputation quality metric (frequently labeled as 'INFO') is outside [0.1, 2.0]; change the bounds by providing parameters. Monomorphic variants, where Rsq == nan, are not excluded by this filter: the problem with them isn't imputation quality.

Similarly, --minimac3-r2-filter excludes variants where Minimac3's imputation quality metric is outside the given range. Note that this metric assumes that phased dosages have been imported with e.g. --vcf's dosage=HDS option; the computation still proceeds when unphased dosages are present, but the results will be underestimates. If you don't need phased dosages for any other reason, --{extract,exclude}-if-info is usually a more efficient way to do this properly.

"--minimac3-r2-filter 1" can be used to keep only perfectly-imputed-and-phased variants.

Sex

--keep-females

--keep-males

--keep-nosex
--remove-females
--remove-males
--remove-nosex

--keep-females excludes all male and unknown-sex samples, --keep-males excludes females and unknown-sex samples, and --keep-nosex excludes all known-sex samples. Conversely, --remove-females only excludes known females, --remove-males only excludes known males, and --remove-nosex only excludes unknown-sex samples.

Founder status

--keep-founders
--keep-nonfounders

--keep-founders excludes all samples with at least one known parental ID from the current analysis (note that it is not necessary for that parent to be in the current dataset), while --keep-nonfounders does the reverse.

--nonfounders

--ac-founders

By default, nonfounders are not counted by --freq or --maf/--max-maf/--mac/--max-mac/--hwe. Use the --nonfounders flag to include them.

Conversely, --ac-founders confirms that nonfounders should be excluded by --mac/--max-mac/"--freq counts". Why does this flag exist? Because we overlooked this detail when processing the preliminary 1000 Genomes hg38 callset. When nonfounders remain during execution of --mac/--max-mac/"--freq counts" and neither --ac-founders nor --nonfounders is specified, PLINK 2 now errors out.

--make-founders ['require-2-missing'] ['first']

By default, if parental IDs are provided for a sample, they are not treated as a founder even if neither parent is in the dataset. With no modifiers, --make-founders clears both parental IDs whenever at least one parent is not in the dataset, and the affected samples are now considered founders. The 'require-2-missing' modifier causes this to only happen when both parents are missing.

This normally happens after all sample-affecting filters have been applied (so it's too late to affect e.g. --filter-founders). If you want this to happen before all filters instead, add the 'first' modifier.
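
The parental-ID clearing rule can be sketched in Python as follows. The function make_founders and the dict layout are hypothetical. The sketch matches parents by ID alone, ignoring family IDs, and assumes '0' denotes an unknown parent, which therefore counts as absent from the dataset.

```python
def make_founders(samples, require_2_missing=False):
    """Clear parental IDs of samples whose parents are absent.

    samples: dict mapping sample ID -> (father ID, mother ID).
    Returns a new dict; cleared samples get ('0', '0').
    """
    needed = 2 if require_2_missing else 1
    result = {}
    for iid, (father, mother) in samples.items():
        n_absent = sum(1 for parent in (father, mother)
                       if parent not in samples)
        result[iid] = ("0", "0") if n_absent >= needed else (father, mother)
    return result
```

With require_2_missing=True, a sample with one parent present and one absent keeps both parental IDs and remains a nonfounder.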
