Linear scoring
Sample scores
--score <filename> [i] [j] [k] [{header | header-read}]
        [{center | variance-standardize | dominant | recessive}]
        ['no-mean-imputation'] ['se'] ['zs'] ['ignore-dup-ids']
        [{list-variants | list-variants-zs}]
        ['cols='<column set descriptor>]
--score-list <fnm> [i] [j] [k] [{header | header-read}]
             [{center | variance-standardize | dominant | recessive}]
             ['no-mean-imputation'] ['se'] ['zs'] ['ignore-dup-ids']
             ['cols='<column set descriptor>]
--score-col-nums <number(s)/range(s)...>
--q-score-range <range file> <data file> [i] [j] ['header'] ['min']
--score and --score-list apply one or more linear scoring systems to each sample, and report results to plink2.sscore. More precisely, if G is the full genotype/dosage matrix (rows = alleles, columns = samples) and a is a scoring-system vector with one coefficient per allele, --score[-list] computes the vector-matrix product aᵀG, and then divides by the number of allele observations (i.e. usually twice the number of variants; online documentation incorrectly said "variants" here before 16 May 2023) when reporting score-averages.
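For example, a minimal sketch invocation (hypothetical filenames; weights.txt is assumed to have variant IDs in column 1, allele codes in column 2, and coefficients in column 3, under a header line with the score name) is

plink2 --pfile mydata \
       --score weights.txt 1 2 3 header-read \
       --out my_scores

This would write one row per sample to my_scores.sscore; the column numbers 1, 2, and 3 happen to match the defaults described below, so they could be omitted here.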
For --score-list, the input file should contain a list of filenames, one per line; each of those files is processed as if it had been passed to --score, and the results are merged. Note that, if all your files contain the same variants and alleles, --score-list tends to be a lot slower than passing a single wide file to --score and using --score-col-nums, as sketched below.
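For instance, if ten scoring systems are stored as columns 3-12 of a single wide file (hypothetical name wide_weights.txt, with variant IDs in column 1 and allele codes in column 2), a sketch of the single-file approach is

plink2 --pfile mydata \
       --score wide_weights.txt 1 2 header-read \
       --score-col-nums 3-12 \
       --out my_scores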
The rest of this section describes --score.
- The input file must have exactly one line per scored allele. Variant IDs are read from column #i and allele codes are read from column #j, where i defaults to 1 and j defaults to i+1.
- By default, a single column of coefficients is read from column #k, where k defaults to j+1. To specify multiple columns, use --score-col-nums.
- 'header-read' causes the first line of the input file to be treated as a header line containing score names. Otherwise, score(s) are assigned the names 'SCORE1', 'SCORE2', etc.; and 'header' just causes the first line to be entirely ignored.
- By default, copies of unnamed alleles contribute zero to score, while missing genotypes contribute an amount proportional to the loaded (via --read-freq) or imputed allele frequency. To throw out missing observations instead (decreasing the denominator in the final average when this happens), use the 'no-mean-imputation' modifier.
- By default, G contains basic allelic dosages (0..2 on diploid chromosomes, 0..1 on haploid, male chrX encoding controlled by --xchr-model). The following modifiers affect this:
- 'center' translates all dosages to mean zero. (More precisely, they are translated based on allele frequencies, which you can control with --read-freq; a sketch command using this appears after this list.)
- 'variance-standardize' linearly transforms each variant's dosage vector to have mean zero, variance 1.
- 'dominant' causes dosages greater than 1 to be treated as 1, while 'recessive' uses max(dosage - 1, 0) on diploid chromosomes. ('dominant', 'recessive', and 'variance-standardize' cannot be used with chrX.)
- The 'se' modifier causes the input coefficients to be treated as independent standard errors; in this case, standard errors for the score average/sum are reported, under a Gaussian approximation. (This will of course tend to underestimate standard errors when scored variants are in LD.)
- By default, --score errors out if a variant ID in the input file appears multiple times in the main dataset. Use the 'ignore-dup-ids' modifier to skip them instead (a warning is still printed if such variants are present).
- The 'list-variants[-zs]' modifier causes variant IDs used for scoring to be written to plink2.sscore.vars[.zst].
- Refer to the file format entry for a list of supported column sets.
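As referenced above, the following sketch (hypothetical filenames) combines several of these modifiers: allele frequencies are loaded from a reference panel so that 'center' and mean imputation are based on them, and the IDs of the scored variants are written out via 'list-variants'.

plink2 --pfile mydata \
       --read-freq ref_panel.afreq \
       --score weights.txt 1 2 3 header-read center list-variants \
       --out centered_scores

This would produce centered_scores.sscore and centered_scores.sscore.vars.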
--q-score-range can be used to apply --score to many variant subsets at once, based on e.g. p-value ranges.
- The "range file" should have range labels in the first column, p-value lower bounds in the second column, and upper bounds in the third column, e.g.
S1 0.00 0.01
S2 0.00 0.20
S3 0.10 0.50
(Lines with too few entries, or nonnumeric values in the second or third column, are ignored.) This would cause three sample-score reports to be generated: plink2.S1.sscore would only consider variants with p-values in [0, 0.01], plink2.S2.sscore would only consider [0, 0.2], and plink2.S3.sscore would only consider [0.1, 0.5].
- The "data file" should contain a variant ID and a p-value on each line (except possibly the first). Variant IDs are read from column #i and p-values are read from column #j, where i defaults to 1 and j defaults to i+1. The 'header' modifier causes the first nonempty line of this file to be skipped.
- By default, --q-score-range errors out when a variant ID appears multiple times in the data file (and is also present in the main dataset). To use the minimum p-value in this case instead, add the 'min' modifier.
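Putting this together, a sketch invocation (hypothetical filenames; pval_ranges.txt could contain the S1/S2/S3 lines above, and assoc.pvals is assumed to have variant IDs in column 1 and p-values in column 2 under a header line) is

plink2 --pfile mydata \
       --score weights.txt 1 2 3 header-read \
       --q-score-range pval_ranges.txt assoc.pvals 1 2 header \
       --out prs_by_pval

This would generate prs_by_pval.S1.sscore, prs_by_pval.S2.sscore, and prs_by_pval.S3.sscore.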
For more sophisticated polygenic risk scoring, we recommend looking at the LDpred2 and PRSice-2 software packages.
Since --score's new 'variance-standardize' modifier applies the same transformation to G as --pca does, --score can now execute the vector-matrix multiply corresponding to PCA projection.
The following command exports PCs to project onto, along with the allele frequencies needed to calibrate the 'variance-standardize' operation:
plink2 --pfile ref_data \
       --freq counts \
       --pca allele-wts vcols=chrom,ref,alt \
       --out ref_pcs
You can then project onto those PCs with
plink2 --pfile new_data \
       --read-freq ref_pcs.acount \
       --score ref_pcs.eigenvec.allele 2 5 header-read no-mean-imputation \
               variance-standardize \
       --score-col-nums 6-15 \
       --out new_projection
Note that these PCs will be scaled a bit differently from those in ref_pcs.eigenvec; you need to multiply or divide the PCs by a multiple of sqrt(eigenvalue) to put them on the same scale.
Also note that later PC coordinates for out-of-reference samples will tend to be shrunk toward zero; see e.g. Wang C, Zhan X, Liang L, Abecasis GR, Lin X (2015) "Improved Ancestry Estimation for both Genotyping and Sequencing Data using Projection Procrustes Analysis and Genotype Imputation" for discussion.
Variant scores
--variant-score <filename> ['bin' | 'bin4' | 'cols='<column set descriptor>]
                ['zs'] ['single-prec']
(alias: --vscore)
--vscore-col-nums <number(s)/range(s)...>
--variant-score is roughly the transpose of --score: it applies one or more linear scoring systems to each variant, and reports results to plink2.vscore[.zst]. More precisely, if G is the full genotype/dosage matrix (rows = variants, columns = samples) and s is a scoring-system vector with one coefficient per sample, --variant-score computes the matrix-vector product Gs. However, some details differ, since this command's main use case is different.
- The input file should contain one line per sample, each starting with a sample ID and followed by scoring weight(s). It can also have a header line with the sample ID representation (e.g. "#FID IID") and the score name(s).
- By default, all score columns are read. --vscore-col-nums lets you select a subset.
- Each entry of G is the sum of all non-REF dosages for that (sample, variant) combination; i.e. all ALT alleles in multiallelic variants are effectively collapsed together. Scaling is the same as for --score (including chrX being affected by --xchr-model). MAF-based mean imputation is always applied to missing dosages, since only score-sums are reported here and there is no score-average denominator to adjust instead.
- Refer to the file format entry for a list of column sets supported by the usual text report.
- The 'bin' and 'bin4' modifiers request binary output instead. In this case, the main plink2.vscore.bin output file contains floating-point values (double-precision with 'bin', single-precision with 'bin4'), column (score) ID(s) are saved to plink2.vscore.cols, and variant IDs are saved to plink2.vscore.vars[.zst].
- By default, the computation uses double-precision numbers internally (even when single-precision output is requested); you can use 'single-prec' to sacrifice some accuracy for speed.
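For example, a sketch invocation (hypothetical filenames; sample_weights.txt is assumed to have a header line followed by one line per sample containing a sample ID and one or more weights) requesting binary output is

plink2 --pfile mydata \
       --variant-score sample_weights.txt bin \
       --out my_vscores

This would write double-precision scores to my_vscores.vscore.bin, score IDs to my_vscores.vscore.cols, and variant IDs to my_vscores.vscore.vars.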