D: 14 Nov 2024

Linear scoring

Sample scores

--score <filename> [i] [j] [k] [{header | header-read}]
                   [{center | variance-standardize | dominant | recessive}]
                   ['no-mean-imputation'] ['se'] ['zs'] ['ignore-dup-ids']
                   [{list-variants | list-variants-zs}]
                   ['cols='<column set descriptor>]

--score-list <fnm> [i] [j] [k] [{header | header-read}]
                   [{center | variance-standardize | dominant | recessive}]
                   ['no-mean-imputation'] ['se'] ['zs'] ['ignore-dup-ids']
                   ['cols='<column set descriptor>]
--score-col-nums <number(s)/range(s)...>

--q-score-range <range file> <data file> [i] [j] ['header'] ['min']

--score and --score-list apply one or more linear scoring systems to each sample, and report results to plink2.sscore. More precisely, if G is the full genotype/dosage matrix (rows = alleles, columns = samples) and a is a scoring-system vector with one coefficient per allele, --score[-list] computes the vector-matrix product a^T G, and then divides by the number of allele observations (i.e. usually twice the number of variants; online documentation incorrectly said "variants" here before 16 May 2023) when reporting score-averages.
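The arithmetic described above can be sketched in a few lines of Python. This is an illustration of the formula only, not PLINK 2's implementation; the toy matrix, the coefficient values, and the function names are made up for the example, and it assumes every scored allele belongs to a different diploid variant with no missing genotypes.

```python
# Illustrative sketch only; not PLINK 2's implementation.
# G: rows = scored alleles, columns = samples (allelic dosages, 0..2).
G = [
    [2.0, 1.0, 0.0],  # dosages of allele 1 in samples s1..s3
    [0.0, 1.0, 1.0],  # dosages of allele 2 in samples s1..s3
]
a = [0.5, -0.2]       # hypothetical coefficients, one per scored allele


def score_sums(a, G):
    """Return a^T G: one weighted dosage sum per sample."""
    n_samples = len(G[0])
    return [sum(a[i] * G[i][j] for i in range(len(a))) for j in range(n_samples)]


def score_averages(a, G, ploidy=2):
    """Divide each sum by the number of allele observations.

    Assumption: one scored allele per variant, all diploid, nothing missing,
    so the denominator is ploidy * (number of variants).
    """
    allele_obs = ploidy * len(G)
    return [s / allele_obs for s in score_sums(a, G)]


sums = score_sums(a, G)
avgs = score_averages(a, G)
```

Here the denominator is 4 (two variants, two allele observations each), not 2, which is the distinction the parenthetical about the 16 May 2023 correction is drawing.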

For --score-list, the input file should contain a list of filenames, one per line; each of those files is then processed as if it were passed to --score, then results are merged together. Note that, if all your files contain the same variants and alleles, --score-list tends to be a lot slower than passing in a single wide file to --score and using --score-col-nums.
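If the per-score files really do list the same variants and alleles in the same order, they can be combined into one wide file before scoring. The helper below is a hypothetical sketch, not part of PLINK 2: it assumes each input is a headerless three-column table (variant ID, allele code, coefficient), and the header labels "ID" and "A1" are arbitrary placeholders.

```python
# Hypothetical helper; not part of PLINK 2.
# Assumes every table has the same (variant ID, allele) rows in the same order.
def merge_score_tables(tables):
    """tables: {score_name: text of a headerless 3-column score file}.

    Returns the text of one wide file: a header line, then one line per
    scored allele with one coefficient column per score.
    """
    names = list(tables)
    keys = None      # (variant ID, allele) pairs taken from the first table
    columns = []     # one list of coefficient strings per score
    for name in names:
        rows = [ln.split() for ln in tables[name].splitlines() if ln.strip()]
        these_keys = [(r[0], r[1]) for r in rows]
        if keys is None:
            keys = these_keys
        elif these_keys != keys:
            raise ValueError(f"{name}: variants/alleles differ from the first table")
        columns.append([r[2] for r in rows])
    lines = ["\t".join(["ID", "A1"] + names)]
    for i, (vid, allele) in enumerate(keys or []):
        lines.append("\t".join([vid, allele] + [col[i] for col in columns]))
    return "\n".join(lines) + "\n"


wide = merge_score_tables({
    "PRS_A": "rs1 A 0.10\nrs2 G -0.30\n",
    "PRS_B": "rs1 A 0.25\nrs2 G 0.05\n",
})
```

Going by the syntax summary above, a file like this (saved under some name such as wide.txt) would then be passed to --score with 'header-read', with --score-col-nums 3-4 selecting the two coefficient columns.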

The rest of this section describes --score.

  • The input file must have exactly one line per scored allele. Variant IDs are read from column #i and allele codes are read from column #j, where i defaults to 1 and j defaults to i+1.
  • By default, a single column of coefficients is read from column #k, where k defaults to j+1. To specify multiple columns, use --score-col-nums.
  • 'header-read' causes the first line of the input file to be treated as a header line containing score names, while 'header' just causes the first line to be ignored entirely. Without 'header-read', score(s) are assigned the names 'SCORE1', 'SCORE2', etc.
  • By default, copies of unnamed alleles contribute zero to the score, while missing genotypes contribute an amount proportional to the loaded (via --read-freq) or imputed allele frequency. To throw out missing observations instead (decreasing the denominator in the final average when this happens), use the 'no-mean-imputation' modifier.
  • By default, G contains basic allelic dosages (0..2 on diploid chromosomes, 0..1 on haploid, male chrX encoding controlled by --xchr-model). The following modifiers affect this:
    • 'center' translates all dosages to mean zero. (More precisely, they are translated based on allele frequencies, which you can control with --read-freq.)
    • 'variance-standardize' linearly transforms each variant's dosage vector to have mean zero, variance 1.
    • 'dominant' causes dosages greater than 1 to be treated as 1, while 'recessive' uses max(dosage - 1, 0) on diploid chromosomes.
    'dominant', 'recessive', and 'variance-standardize' cannot be used with chrX.
  • The 'se' modifier causes the input coefficients to be treated as independent standard errors; in this case, standard errors for the score average/sum are reported, under a Gaussian approximation. (This will of course tend to underestimate standard errors when scored variants are in LD.)
  • By default, --score errors out if a variant ID in the input file appears multiple times in the main dataset. Use the 'ignore-dup-ids' modifier to skip them instead (a warning is still printed if such variants are present).
  • The 'list-variants[-zs]' modifier causes variant IDs used for scoring to be written to plink2.sscore.vars[.zst].
  • Refer to the file format entry for a list of supported column sets.
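The dosage modifiers and the missing-genotype handling in the list above can be illustrated with a short Python sketch for a single sample. It is not PLINK 2's implementation. Two things are assumptions made for the example: that 'variance-standardize' scales by the binomial variance implied by the allele frequency, and that each skipped missing genotype removes one variant's worth of allele observations from the denominator. All function names and numbers are hypothetical.

```python
import math

# Illustrative sketch only; not PLINK 2's implementation.

def center(dosage, freq, ploidy=2):
    """'center': shift by the expected dosage under allele frequency freq."""
    return dosage - ploidy * freq


def variance_standardize(dosage, freq, ploidy=2):
    """'variance-standardize' (assumed form): mean zero, unit variance,
    using the binomial variance ploidy * freq * (1 - freq)."""
    return (dosage - ploidy * freq) / math.sqrt(ploidy * freq * (1.0 - freq))


def dominant(dosage):
    """'dominant': dosages greater than 1 are treated as 1."""
    return min(dosage, 1.0)


def recessive(dosage):
    """'recessive': max(dosage - 1, 0) on diploid chromosomes."""
    return max(dosage - 1.0, 0.0)


def score_average(dosages, coefs, freqs, mean_impute=True, ploidy=2):
    """Score average for one sample; a dosage of None marks a missing genotype."""
    total = 0.0
    allele_obs = 0
    for d, c, f in zip(dosages, coefs, freqs):
        if d is None:
            if not mean_impute:
                continue        # 'no-mean-imputation': drop from sum and denominator
            d = ploidy * f      # default: expected dosage from the allele frequency
        total += c * d
        allele_obs += ploidy
    return total / allele_obs if allele_obs else float("nan")


dosages = [2.0, None, 1.0]      # second variant is missing for this sample
coefs = [0.1, 0.4, -0.3]
freqs = [0.5, 0.25, 0.1]

with_imputation = score_average(dosages, coefs, freqs)
without_imputation = score_average(dosages, coefs, freqs, mean_impute=False)
```

With mean imputation the missing variant contributes 0.4 × 0.5 and the denominator stays at 6. Without it, both the contribution and two allele observations are dropped, so the denominator becomes 4.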

--q-score-range can be used to apply --score to many variant subsets at once, based on e.g. p-value ranges.

  • The "range file" should have range labels in the first column, p-value lower bounds in the second column, and upper bounds in the third column, e.g.
       S1  0.00 0.01
       S2  0.00 0.20
       S3  0.10 0.50

    (Lines with too few entries, or nonnumeric values in the second or third column, are ignored.) This would cause three sample-score reports to be generated: plink2.S1.sscore would only consider variants with p-values in [0, 0.01], plink2.S2.sscore would only consider [0, 0.2], and plink2.S3.sscore would only consider [0.1, 0.5].
  • The "data file" should contain a variant ID and a p-value on each line (except possibly the first). Variant IDs are read from column #i and p-values are read from column #j, where i defaults to 1 and j defaults to i+1. The 'header' modifier causes the first nonempty line of this file to be skipped.
  • By default, --q-score-range errors out when a variant ID appears multiple times in the data file (and is also present in the main dataset). To use the minimum p-value in this case instead, add the 'min' modifier.
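The selection logic described in the list above can be sketched in Python as follows. This is a simplified illustration rather than PLINK 2's actual parser: column positions are fixed at their defaults, the duplicate check ignores whether the variant is present in the main dataset, and the function names and variant IDs are hypothetical.

```python
# Illustrative sketch only; not PLINK 2's parser.

def parse_ranges(text):
    """Parse 'label lower upper' lines, ignoring short or nonnumeric lines."""
    ranges = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue
        try:
            lo, hi = float(parts[1]), float(parts[2])
        except ValueError:
            continue
        ranges.append((parts[0], lo, hi))
    return ranges


def parse_pvalues(text, header=False, use_min=False):
    """Parse 'variant_id p_value' lines (default columns 1 and 2)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if header:
        lines = lines[1:]           # 'header': skip the first nonempty line
    pvals = {}
    for line in lines:
        vid, p_text = line.split()[:2]
        p = float(p_text)
        if vid in pvals:
            if not use_min:
                raise ValueError(f"duplicate variant ID: {vid}")
            p = min(p, pvals[vid])  # 'min': keep the smallest p-value
        pvals[vid] = p
    return pvals


def variants_per_range(ranges, pvals):
    """Map each range label to the variant IDs whose p-value lies in [lo, hi]."""
    return {label: [v for v, p in pvals.items() if lo <= p <= hi]
            for label, lo, hi in ranges}


range_text = "S1  0.00 0.01\nS2  0.00 0.20\nS3  0.10 0.50\n"
data_text = "ID P\nrs1 0.005\nrs2 0.15\nrs3 0.3\nrs4 0.9\n"
subsets = variants_per_range(parse_ranges(range_text),
                             parse_pvalues(data_text, header=True))
```

Each label's variant list corresponds to the subset that would feed the matching plink2.<label>.sscore report. Ranges may overlap, so rs2 lands in both S2 and S3 here.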

For more sophisticated polygenic risk scoring, we recommend looking at the LDpred2 and PRSice-2 software packages.

PCA projection with --score

Since --score's new 'variance-standardize' modifier applies the same transformation to G as --pca does, --score can now execute the vector-matrix multiply corresponding to PCA projection.

The following command exports PCs to project onto, along with the allele frequencies needed to calibrate the 'variance-standardize' operation:

plink2 --pfile ref_data \
       --freq counts \
       --pca allele-wts vcols=chrom,ref,alt \
       --out ref_pcs

You can then project onto those PCs with

plink2 --pfile new_data \
       --read-freq ref_pcs.acount \
       --score ref_pcs.eigenvec.allele 2 5 header-read no-mean-imputation \
               variance-standardize \
       --score-col-nums 6-15 \
       --out new_projection

Note that these PCs will be scaled a bit differently from ref_pcs.eigenvec (the --pca output of the first command); you need to multiply or divide the PCs by a multiple of sqrt(eigenvalue) to put them on the same scale.

Also note that later PC coordinates for out-of-reference samples will tend to be shrunk toward zero; see e.g. Wang C, Zhan X, Liang L, Abecasis GR, Lin X (2015), "Improved Ancestry Estimation for both Genotyping and Sequencing Data using Projection Procrustes Analysis and Genotype Imputation", for discussion.

Variant scores

--variant-score <filename> ['bin' | 'bin4' | 'cols='<column set descriptor>]
                           ['zs'] ['single-prec']
  (alias: --vscore)
--vscore-col-nums <number(s)/range(s)...>

--variant-score is roughly the transpose of --score: it applies one or more linear scoring systems to each variant, and reports results to plink2.vscore[.zst]. More precisely, if G is the full genotype/dosage matrix (rows = variants, columns = samples) and s is a scoring-system vector with one coefficient per sample, --variant-score computes the matrix-vector product Gs. Some details differ from --score, however, since the main purpose of this command is different.

  • The input file should contain one line per sample, each starting with a sample ID and followed by scoring weight(s). It can also have a header line with the sample ID representation (e.g. "#FID IID") and the score name(s).
    • By default, all score columns are read. --vscore-col-nums lets you select a subset.
  • Each entry of G is the sum of all non-REF dosages for that (sample, variant) combination; i.e. all ALT alleles in multiallelic variants are effectively collapsed together. Scaling is the same as for --score (including chrX being affected by --xchr-model). MAF-based mean imputation is always applied to missing dosages, since there's no option for computing a score-average.
  • Refer to the file format entry for a list of column sets supported by the usual text report.
  • The 'bin' and 'bin4' modifiers request binary output instead. In this case, the main plink2.vscore.bin output file contains floating-point values (double-precision with 'bin', single-precision with 'bin4'), column (score) ID(s) are saved to plink2.vscore.cols, and variant IDs are saved to plink2.vscore.vars[.zst].
  • By default, the computation uses double-precision numbers internally (even when single-precision output is requested); you can use 'single-prec' to sacrifice some accuracy for speed.
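The product Gs, with ALT alleles collapsed and missing dosages mean-imputed, can be sketched in Python as below. This is an illustration under stated assumptions, not PLINK 2's implementation: the imputed value is taken to be the variant's mean observed non-REF dosage (one way of expressing frequency-based imputation), and the function names and toy data are hypothetical.

```python
# Illustrative sketch only; not PLINK 2's implementation.

def non_ref_dosage(alt_dosages):
    """Collapse a multiallelic call: sum the dosages of all ALT alleles."""
    return sum(alt_dosages)


def variant_scores(G, S):
    """Compute G S for several scoring systems at once.

    G: variants x samples matrix of non-REF dosages (None = missing).
    S: samples x scores matrix of per-sample weights.
    Returns a variants x scores matrix.
    """
    n_scores = len(S[0])
    result = []
    for row in G:
        observed = [d for d in row if d is not None]
        # Assumption: impute missing entries with the variant's mean dosage.
        mean = sum(observed) / len(observed) if observed else 0.0
        filled = [mean if d is None else d for d in row]
        result.append([sum(filled[j] * S[j][k] for j in range(len(S)))
                       for k in range(n_scores)])
    return result


G = [
    [0.0, 1.0, 2.0],    # variant 1, samples s1..s3
    [1.0, None, 1.0],   # variant 2, missing for s2
]
S = [
    [1.0, 0.5],         # s1 weights for two hypothetical scores
    [0.0, 0.5],         # s2
    [-1.0, 0.5],        # s3
]
vscores = variant_scores(G, S)
collapsed = non_ref_dosage([1.0, 0.5])  # two ALT alleles -> one dosage of 1.5
```

Because there is no score-average here, the imputed entry simply takes part in the sum like any other dosage.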
