Introduction, downloads

S: 22 Oct 2024 (b.7.7)

D: 22 Oct 2024


Standard data input

Most of PLINK's calculations operate on tables of samples and variant calls. The following flags are available for defining the form and location of this input, and associated metadata.

Discrete calls

PLINK 1 binary

--bfile [prefix]

The --bfile flag causes the binary fileset plink.bed + plink.bim + plink.fam to be referenced. (The structure of these files is described in the file formats appendix.) If a prefix is given, it replaces all instances of 'plink'.

--bed <filename>
--bim <filename>
--fam <filename>

--bed, --bim, and --fam let you specify the full name of one part of the PLINK 1 binary fileset, taking precedence over --bfile. For example,

plink --bfile toy --bed bob --freq

would reference the binary fileset bob.bed + toy.bim + toy.fam.

(.fam files are also present in some other fileset types. The --fam flag has the same function when loading them.)

--keep-autoconv

PLINK 1 binary is PLINK 1.9's preferred input format. In fact, PLINK 1.9 automatically converts most other formats to PLINK 1 binary before the main loading sequence (see note 1 below). As a result, if you're performing multiple operations on the same otherwise-formatted files, you may want to keep the autoconversion products and work with them, instead of repeating the conversion on every run. PLINK 1.9 gives you several ways to handle this situation.

1. If you just want to convert your data, don't use any other flags besides --out. For example:

plink --file text_fileset --out binary_fileset

This entirely skips the main loading sequence, so filters like --extract, --hwe, and --snps-only are not permitted (you'll get an error if you attempt to use them).

2. You can produce a binary fileset which is a filtered version of your text data. Use --make-bed for this.

3. You can directly analyze the text fileset. In this case, the autoconversion products are silently deleted at the end of the run (see note 2 below), to avoid clogging your drive with unwanted files. For example, the following command writes an allele frequency report to results.frq, and doesn't leave any other files behind besides results.log:

plink --file text_fileset --freq --out results

4. You can analyze the text fileset while specifying (with --keep-autoconv) that you also want to keep the autoconversion products. So the following command leaves behind results.bed, results.bim and results.fam as well as results.frq and results.log:

plink --file text_fileset --freq --keep-autoconv --out results

Finally, note that PLINK 2 autoconverts to a different binary format, but it still has an efficient --make-bed implementation. Thus, if you want your script to be portable to PLINK 2 with a minimum of fuss, it's reasonable to stick with --make-bed.

1: Since binary files are so much smaller than the equivalent text files, we expect that this will not put undue pressure on your available disk space. This architectural choice allows PLINK's core to focus entirely on efficient streaming processing of binary data; we hope the memory usage, development speed, and performance benefits we're able to deliver as a result are worth any slight inconvenience.
2: If you interrupt PLINK with e.g. Ctrl-C, or the program crashes, the files will not be deleted. You can use "rm *-temporary.*" (or "del *-temporary.*" on Windows) to clean up the mess.

PLINK text

--file [prefix]

This sets the filename prefix of the .ped+.map fileset to reference. (The default is 'plink' if you do not specify a prefix.)

As discussed above, PLINK 1.9 will autoconvert the fileset to binary. Conversion is required because this is NOT a native format. Avoid it when possible, since it is both less efficient and lossier than e.g. VCF.

--ped <filename>
--map <filename>

These are analogous to --bed/--bim/--fam above. (You may no longer use "--ped -" to read the .ped from standard input; redirect it to a file and process it the normal way instead.)

--tfile [prefix]
--tped <filename>
--tfam <filename>

Similarly, these flags let you specify a transposed text fileset to load.

--lfile [prefix]
--lgen <filename>
--reference <filename>
--allele-count

--lfile/--lgen let you specify a long-format fileset to load. --reference lets you specify an associated list of reference alleles (variant IDs in the first column, reference alleles in the second column, optional non-reference alleles in the third column). --allele-count, when used with --reference, specifies that the .lgen file contains reference allele counts in the 4th column.

Irregularly-formatted PLINK text files

--no-fid
--no-parents
--no-sex
--no-pheno

These allow you to use .fam or .ped files which lack family ID, parental ID, sex, and/or phenotype columns.

Compound .ped genotypes and three-column .map files are automatically detected and handled correctly, so the --compound-genotypes and --map3 flags have been retired. (Note, however, that compound genotypes are not permitted in .tped files.)

--missing-genotype <char>
--missing-genotype2 <char>

Missing genotype calls are normally assumed to be represented by '0' in .ped and similar files; you can change this to most other (nonspace) characters with --missing-genotype. However, '1', '2', '3', '4', 'A', 'C', 'G', and 'T' are disallowed.

In .bim files, '.' is also treated as a missing code (this was not true before 16 Jan 2023); you can alter this second code with --missing-genotype2.

Variant Call Format

--vcf <filename>
--bcf <filename>

--vcf loads a (possibly gzipped) VCF file, extracting information which can be represented by the PLINK 1 binary format and ignoring everything else (after applying the load filters described below). For example, phase and dosage information are currently discarded. (This situation will improve in the future, but we do not have plans to try to handle everything in the file.)

VCF reference alleles are set to A2 by the autoconverter even when they appear to be minor. However, to maintain backwards compatibility with PLINK 1.07, PLINK 1.9 normally forces major alleles to A2 during its loading sequence. One workaround is permanently keeping the .bim file generated during initial conversion, for use as --a2-allele input whenever the reference sequence needs to be recovered. (If you use this method, note that, when your initial conversion step invokes --make-bed instead of just --out, you also need --keep-allele-order to avoid losing track of reference alleles before the very first write, because --make-bed triggers the regular loading sequence.)

--bcf loads a BCF2 file instead, and otherwise behaves identically to --vcf. It can be either uncompressed (as emitted by e.g. GATK 3) or BGZF-compressed (supported by htslib). The BCF1 output of old SAMtools builds is not supported; use "bcftools view" to convert such files to readable VCFs.

--double-id
--const-fid [FID]
--id-delim [delimiter]

VCF files just contain sample IDs, instead of the distinct family and within-family IDs tracked by PLINK. We offer three ways to convert these IDs:

  • --double-id causes both family and within-family IDs to be set to the sample ID.
  • --const-fid converts sample IDs to within-family IDs while setting all family IDs to a single value (default '0').
  • --id-delim causes sample IDs to be parsed as <FID><delimiter><IID>; the default delimiter is '_'. If any sample ID does not contain exactly one instance of the delimiter, an error is normally reported; however, if you have simultaneously specified --double-id or --const-fid, PLINK will fall back on that approach to handle zero-delimiter IDs.

If none of these three flags is present, the loader defaults to --double-id + --id-delim.
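The three conversion rules and the default can be sketched in Python as follows. This is only an illustration of the behavior described above, not PLINK's code: the function name and keyword arguments are hypothetical stand-ins for the flags, and it assumes --double-id and --const-fid are not given together.

```python
def split_sample_id(sample_id, double_id=False, const_fid=None, id_delim=None):
    """Return (FID, IID) for one VCF sample ID.

    Hypothetical helper: const_fid is the --const-fid value (pass "0" for the
    documented default), id_delim is the --id-delim character.
    """
    # With none of the three flags, the documented default is
    # --double-id + --id-delim (delimiter '_').
    if not double_id and const_fid is None and id_delim is None:
        double_id, id_delim = True, "_"

    if id_delim is not None:
        parts = sample_id.split(id_delim)
        if len(parts) == 2:
            return parts[0], parts[1]
        # The fallback only covers IDs with no delimiter at all.
        has_fallback = double_id or const_fid is not None
        if len(parts) > 2 or not has_fallback:
            raise ValueError(f"cannot split sample ID {sample_id!r} on {id_delim!r}")

    if double_id:
        return sample_id, sample_id
    return const_fid, sample_id
```

Under these assumptions, split_sample_id("FAM1_IND1") gives ("FAM1", "IND1"), while split_sample_id("HG00096") falls back to ("HG00096", "HG00096").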

--vcf-idspace-to <character>

Since PLINK sample IDs cannot contain spaces, an error is normally reported when there's a space in a VCF sample ID. To work around this, you can use --vcf-idspace-to to convert all spaces in sample IDs to another character. This happens before regular parsing, so when the --vcf-idspace-to and --id-delim characters are identical, both space and the original --id-delim character are interpreted as FID/IID delimiters.

If you only want the space character to function as a delimiter, use "--id-delim ' '". (This is not compatible with --rerun.)

--biallelic-only ['strict'] ['list']
--vcf-min-qual <val>

By default, all variants are loaded; when more than one alternate allele is present, the reference allele and the most common alternate are tracked (ties broken in favor of the lower-numbered allele) and the rest are coded as missing calls. To simply skip all variants where at least two alternate alleles are present in the dataset, use --biallelic-only. Add the 'strict' modifier if you want to indiscriminately skip variants with 2+ alternate alleles listed even when only one alternate allele actually shows up (this minimizes merge headaches down the line), and use 'list' to dump a list of skipped variant IDs to plink.skip.3allele.
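The default multiallelic handling can be pictured with a short Python sketch. It is an illustration under stated assumptions rather than PLINK's implementation: genotypes are given as pairs of VCF allele indices (0 = reference), None marks a missing call, and the function name is hypothetical.

```python
from collections import Counter


def recode_multiallelic(calls):
    """Keep REF and the most common ALT; turn every other call into missing.

    calls: list of (allele, allele) index pairs, or None for a missing call.
    Returns pairs recoded as 0 = REF, 1 = retained ALT, or None.
    """
    alt_counts = Counter(
        allele for call in calls if call is not None for allele in call if allele > 0
    )
    if not alt_counts:
        return list(calls)
    # Most common ALT; ties are broken in favor of the lower-numbered allele.
    kept_alt = min(alt_counts, key=lambda allele: (-alt_counts[allele], allele))
    recoded = []
    for call in calls:
        if call is None or any(allele not in (0, kept_alt) for allele in call):
            recoded.append(None)  # involves a dropped ALT allele
        else:
            recoded.append(tuple(1 if allele == kept_alt else 0 for allele in call))
    return recoded
```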

--vcf-min-qual causes all variants with QUAL value smaller than the given number, or with no QUAL value at all, to be skipped. (--qual-scores has similar functionality.)

--vcf-filter [exception(s)...]

To skip variants which failed one or more filters tracked by the FILTER field, use --vcf-filter. This can be combined with one or more (space-delimited) filter names to ignore.

##fileformat=VCFv4.1
##filedate=20140317
##source=myImputationProgramV3.14
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=1,length=249250621>
#CHROM POS   ID          REF ALT QUAL FILTER  INFO FORMAT HG00096 HG00097 HG00099
1      10583 rs58108140  G   A   25   PASS    .    GT     0/0     0/0     0/0
1      10611 rs189107123 C   G   9    q10     .    GT     0/0     0/1     0/0
1      13302 rs180734498 C   T   32   s50     .    GT     ./.     0/1     ./.
1      13327 rs144762171 G   C   30   .       .    GT     0/0     0/1     ./.
1      13957 rs201747181 TC  T   3    q10;s50 .    GT     0/0     ./.     ./.

For example, given the VCF file above:

  • --vcf-filter with no arguments would keep only rs58108140 and rs144762171;
  • "--vcf-filter q10" would keep rs58108140, rs189107123, and rs144762171 (PLINK matches against the 'q10' string, instead of checking the QUAL value here; use --vcf-min-qual to do the latter);
  • "--vcf-filter LowQual s50" would keep rs58108140, rs180734498, and rs144762171; and
  • "--vcf-filter q10 s50" would keep all five variants.
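The matching rule can be sketched in a few lines of Python, applied to the FILTER values of the five example records. This is a sketch of the documented behavior, not PLINK's parser; the names passes_vcf_filter, EXAMPLE, and kept_ids are hypothetical.

```python
def passes_vcf_filter(filter_field, ignored=()):
    """True if a variant survives --vcf-filter with the given ignored filter names."""
    if filter_field in (".", "PASS"):
        return True
    # Every failed filter must be on the ignore list (string match only).
    return all(name in ignored for name in filter_field.split(";"))


# (variant ID, FILTER) pairs copied from the example VCF above.
EXAMPLE = [
    ("rs58108140", "PASS"),
    ("rs189107123", "q10"),
    ("rs180734498", "s50"),
    ("rs144762171", "."),
    ("rs201747181", "q10;s50"),
]


def kept_ids(ignored=()):
    return [vid for vid, filt in EXAMPLE if passes_vcf_filter(filt, ignored)]
```

For instance, kept_ids() returns only the two variants whose FILTER is 'PASS' or '.', and kept_ids(("q10", "s50")) returns all five.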

--vcf-require-gt

By default, when the GT field is absent, the variant is kept and all genotypes are set to missing. To skip the variant instead, use --vcf-require-gt.

--vcf-min-gq <val>
--vcf-min-gp <val>

--vcf-min-gq excludes all genotype calls with GQ below the given (nonnegative, decimal values permitted) threshold. Missing GQ values are not treated as being below the threshold.

Similarly, --vcf-min-gp excludes all genotype calls with GP value below the given threshold, assuming GP is 0-1 scaled rather than phred-scaled. (This convention will probably be in the VCFv4.3 specification, and some programs already adhere to it when generating earlier-version VCF files. Due to the latter, we do not enforce a minimum VCF format version, but be aware that this flag is useless on fully-standards-compliant pre-v4.3 VCFs.)

--vcf-half-call <mode>

The current VCF standard does not specify how '0/.' and similar GT values should be interpreted. By default (mode 'error'/'e'), PLINK 1.9 errors out and reports the line number of the anomaly. Should the half-call be intentional, though (this can be the case with Complete Genomics data), you can request the following other modes:

  • 'haploid'/'h': Treat half-calls as haploid/homozygous (the PLINK 1 file format does not distinguish between the two). This maximizes similarity between the VCF and BCF2 parsers.
  • 'missing'/'m': Treat half-calls as missing.
  • 'reference'/'r': Treat the missing part as reference.
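
The four modes can be compared side by side with a small Python sketch. It only illustrates the descriptions above: the function name is hypothetical, phase is ignored, and placing the substituted reference allele first is an arbitrary choice of this sketch.

```python
def resolve_half_call(gt, mode="error"):
    """Return a pair of allele codes ('.' = missing) for a diploid GT string."""
    first, second = gt.replace("|", "/").split("/")
    if (first == ".") == (second == "."):
        return first, second  # fully called or fully missing: not a half-call
    called = second if first == "." else first
    if mode in ("haploid", "h"):
        return called, called  # stored like a homozygous call
    if mode in ("missing", "m"):
        return ".", "."
    if mode in ("reference", "r"):
        return "0", called  # missing part becomes the reference allele
    raise ValueError(f"half-call {gt!r} encountered in mode {mode!r}")
```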

Oxford format

--data [prefix]
--gen <filename>
--bgen <filename> ['snpid-chr']
--sample <filename>

--oxford-single-chr <chromosome code>
--oxford-pheno-name <column name>

--data causes the Oxford-format fileset plink.gen + plink.sample to be referenced. If a prefix is provided, it replaces 'plink'. --gen, --bgen, and --sample allow you to specify the filenames separately; --gen is necessary if your genomic data file has a .gen.gz extension, and --bgen is necessary for BGEN-format files.

  • The original .gen specification had 5 leading columns, but this was later amended to 6. Both flavors are now supported; PLINK 1.9 and 2.0 builds before 16 Apr 2021 did not support the 6-leading-column flavor.
  • With 5-leading-column .gen input, the first column is normally assumed to contain chromosome codes. To import a single-chromosome .gen file with ignorable chromosome/SNP ID column(s), use --oxford-single-chr.
  • With .bgen input, use the 'snpid-chr' modifier to specify that chromosome codes should be read from the "SNP ID" field. (Otherwise, the field is ignored.)
  • If sex information is in the .sample file, it must be in a column titled 'sex' (capital letters ok) of type 'D' (discrete covariate), and be coded in the usual 1=male/2=female/0=unknown manner, to be loaded.
  • By default, if there is at least one binary or continuous phenotype in the .sample file, the first such phenotype is loaded (and, if it is a binary phenotype, it is converted from 1/0 to 2/1 case/control coding). You can specify another binary/continuous phenotype by name with --oxford-pheno-name.

--hard-call-threshold <value>
--hard-call-threshold random
--missing-code [comma-separated list of values]
  (alias: --missing_code)

Since the PLINK 1 binary format cannot represent genotype probabilities, calls with uncertainty greater than 0.1 are normally treated as missing, and the rest are treated as hard calls. You can adjust this threshold by providing a numeric parameter to --hard-call-threshold.

Alternatively, when --hard-call-threshold is given the 'random' modifier, calls are independently randomized according to the probabilities in the file. (This is not ideal; it would be better to randomize in a haploblock-sensitive manner. But resampling a bunch of times with this and generating an empirical distribution of some statistic can still be more informative than applying a single threshold and calculating that statistic once.)
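The two modes can be contrasted with a short Python sketch. Note the assumption: the text above does not define "uncertainty", so this sketch takes it to be 1 minus the largest genotype probability, which may not match PLINK's exact rule. The function name is hypothetical, and any probability mass missing from the triplet is treated as a missing call in 'random' mode.

```python
import random


def hard_call(probs, threshold=0.1, rng=None):
    """probs: (P(hom1), P(het), P(hom2)). Returns 0, 1, 2, or None for missing."""
    if threshold == "random":
        rng = rng or random.Random()
        missing = max(0.0, 1.0 - sum(probs))
        # Draw one call according to the probabilities in the file.
        return rng.choices([0, 1, 2, None], weights=[*probs, missing])[0]
    best = max(range(3), key=lambda g: probs[g])
    # Assumed definition: uncertainty = 1 - largest probability.
    return best if 1.0 - probs[best] <= threshold else None
```

Resampling with 'random' many times and recomputing a statistic on each draw gives the kind of empirical distribution mentioned above.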

--hard-call-threshold can only be used on Oxford-format files for now; it will be extended to VCFs and other formats in the future.

--missing-code lets you specify the set of strings to interpret as missing phenotype values in a .sample file. For example, "--missing-code -9,0,NA,na" would cause '-9', '0', 'NA', and 'na' to all be interpreted as missing phenotypes. (Note that no spaces are currently permitted between the strings.) By default, only 'NA' is interpreted as missing.

23andMe text

--23file <filename> [family ID] [within-family ID] [sex] [phenotype] [paternal ID] [maternal ID]

--23file specifies an uncompressed 23andMe-formatted file to load (and convert to PLINK 1 binary before further processing).

  • The family and within-family IDs default to 'FAM001' and 'ID001' respectively if you don't provide them. Due to how the PLINK 1 binary fileset format is defined, they cannot contain spaces (see note 3 below). Since some PLINK commands merge the family ID and within-family ID with an underscore in their reports, we recommend using another character (such as '~') to separate compound name components. (If you don't have to distinguish between e.g. "Mac Donald" and 'MacDonald', upper CamelCase will also do.)
  • Sex should be 'm' or '1' for male, 'f' or '2' for female, 'i' to try to infer it from X/Y chromosome data (i.e. assume female unless there are single-allele calls on the X chromosome or nonmissing calls on the Y chromosome), or '0' to force to missing (this can be appropriate if there is no X/Y chromosome data). The default is 'i'.
  • The phenotype is set to missing (normally represented by -9) if unspecified. It must be a numeric value. Case/control phenotypes are normally coded as control = 1, case = 2.
  • The paternal ID and maternal ID are set to missing by default. They're irrelevant unless you are merging with a dataset which contains known relatives.

For example:

plink --23file genome.txt Chang Christopher --out plink_genome

Note that some variants in 23andMe files may have indel instead of SNP calls. If this is the case, you'll be notified during the loading process; you can then use --list-23-indels (preferably on a merged dataset, to minimize the impact of missing calls) to produce a list of the affected variant IDs.

The 23andMe file format also does not always mark the boundaries of the X chromosome pseudo-autosomal region, so PLINK does not convert the chromosome code for 2-allele markers on male X chromosomes to XY—instead, you'll get heterozygous haploid warnings down the line. Use --split-x to solve the problem.

3: As a general rule, handling of space-containing command-line parameters is undefined and subject to change without notice. Avoid them.

Randomized data

--dummy <sample count> <SNP count> [missing geno freq] [missing pheno freq] [{acgt | 1234 | 12}] ['scalar-pheno']

This tells PLINK to generate a simple dataset from scratch (useful for basic software testing).

All generated samples are females with random genotype and phenotype values. If the third parameter is a decimal value, it sets the frequency of missing genotype calls (default 0); if the fourth is also a decimal, it sets the frequency of missing phenotypes (which also defaults to 0). The 'acgt' modifier causes A/C/G/T genotype calls to be generated instead of the PLINK 1.07 default of A/B, while '1234' generates 1/2/3/4 genotypes, and '12' makes all calls 1/2. The 'scalar-pheno' modifier causes normally distributed (mean 0, stdev 1) rather than case/control phenotype values to be generated.

--simulate <simulation parameter file> [{tags | haps}] [{acgt | 1234 | 12}]

--simulate-ncases <number of cases>
--simulate-ncontrols <number of controls>
--simulate-prevalence <disease prevalence>
--simulate-label <name prefix>

--simulate-missing <missing geno freq>

--simulate generates a new dataset which contains some disease-associated SNPs. (If you want to simulate phenotypes based on real genotype data instead, use GCTA --simu-cc/--simu-qt.) For the basic version of the command, the simulation parameter file is expected to be a text file with one or more rows, where each row has six fields:

  1. Number of SNPs in set
  2. Label of this set of SNPs
  3. Reference allele frequency lower bound
  4. Reference allele frequency upper bound
  5. odds(case | heterozygote) / odds(case | homozygous for alternate allele)
  6. odds(case | homozygous for ref. allele) / odds(case | homozygous for alt. allele)

odds(X) := P(X) / (1 - P(X)). Note that PLINK 1.07's --simulate implementation actually interprets the last two fields as relative risks instead of odds ratios; while the difference is minimal for small values of P(X), we have changed the behavior to match the documentation to reduce future confusion.
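To see why the odds-ratio and relative-risk readings agree for small P(X) but diverge otherwise, here is a small Python sketch of the arithmetic. The baseline risks are made-up inputs and the function names are hypothetical; it does not model how PLINK combines these ratios with --simulate-prevalence.

```python
def odds(p):
    """odds(X) := P(X) / (1 - P(X))"""
    return p / (1.0 - p)


def prob_from_odds(o):
    return o / (1.0 + o)


def risk_from_odds_ratio(base_risk, ratio):
    """Risk when `ratio` multiplies the baseline odds."""
    return prob_from_odds(ratio * odds(base_risk))


def risk_from_relative_risk(base_risk, ratio):
    """Risk when `ratio` multiplies the baseline probability directly."""
    return ratio * base_risk


for base_risk in (0.01, 0.3):
    as_or = risk_from_odds_ratio(base_risk, 2.0)
    as_rr = risk_from_relative_risk(base_risk, 2.0)
    print(f"base={base_risk}: OR reading {as_or:.4f}, RR reading {as_rr:.4f}")
```

With a baseline risk of 0.01 the two readings give 0.0198 and 0.0200; with a baseline of 0.3 they give 0.4615 and 0.6000.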

If the 'tags' or 'haps' modifier is present, an extended nine-field simulation parameter file is expected instead:

  1. Number of SNPs in set
  2. Label of this set of SNPs
  3. Reference allele frequency lower bound, causal variant
  4. Reference allele frequency upper bound, causal variant
  5. Reference allele frequency lower bound, marker
  6. Reference allele frequency upper bound, marker
  7. Marker-causal variant LD
  8. odds(case | c.v. heterozygote) / odds(case | c.v. homozygous for alternate allele)
  9. odds(case | c.v. homozygous for ref. allele) / odds(case | c.v. homozygous for alt. allele)

With 'haps', both the causal variants and the markers are included in the dataset; 'tags' throws out the causal variants.

Normally, the reference allele is designated by 'D' and the alternate allele is 'd'. With 'haps', causal variants are labeled in that manner, while the linked marker reference and alternate alleles are instead designated by 'A' and 'B' respectively. You can use the 'acgt', '1234', or '12' modifier to replace this labeling with random bases.

By default, 1000 cases and 1000 controls are generated, population disease prevalence is 0.01, and missing genotype frequency is 0; you can change these numbers with --simulate-ncases, --simulate-ncontrols, --simulate-prevalence, and --simulate-missing, respectively.

--simulate-label attaches the given prefix, followed by a dash, to all FIDs and IIDs in the dataset. (This makes it easier to merge multiple simulated datasets.)

See the PLINK 1.07 documentation for further discussion.

--simulate-qt <simulation parameter file> [{tags | haps}] [{acgt | 1234 | 12}]

--simulate-n <number of samples>

--simulate-qt generates a new dataset with quantitative trait loci. For the basic version of the command, the simulation parameter file is expected to have the following six fields:

  1. Number of SNPs in set
  2. Label of this set of SNPs
  3. Reference allele frequency lower bound
  4. Reference allele frequency upper bound
  5. Additive genetic variance for each SNP
  6. Dominance deviation

All modifiers have essentially the same semantics as with --simulate. --simulate-label and --simulate-missing also act identically. We have fixed bugs in PLINK 1.07 --simulate-qt's phenotype generation.

By default, 1000 samples are generated; you can change this with --simulate-n.

Nonstandard chromosome IDs

--allow-extra-chr ['0']
  (alias: --aec)

Normally, PLINK reports an error if the input data contains unrecognized chromosome codes (such as hg19 haplotype chromosomes or unplaced contigs). If none of the additional codes start with a digit, you can permit them with the --allow-extra-chr flag. (These contigs are ignored by most analyses which skip unplaced regions.)

The '0' modifier causes these chromosome codes to be treated as if they had been set to zero. (This is sometimes necessary to produce reports readable by older software.)

--chr-set <autosome ct> ['no-x'] ['no-y'] ['no-xy'] ['no-mt']

--cow
--dog
--horse
--mouse
--rice
--sheep
--autosome-num <value>

--chr-set changes the chromosome set. The first parameter specifies the number of diploid autosome pairs if positive, or haploid chromosomes if negative. (Polyploid and aneuploid data are not supported, and there is currently no special handling of sex or mitochondrial chromosomes in all-haploid chromosome sets.)

Given diploid autosomes, the remaining modifiers let you indicate the absence of specific non-autosomal chromosomes, as an extra sanity check on the input data. Note that, when there are n autosome pairs, the X chromosome is assigned numeric code n+1, Y is n+2, XY (pseudo-autosomal region of X) is n+3, and MT (mitochondria) is n+4.
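The numbering rule is easy to express in Python. This is a minimal sketch for the diploid case only; the function name is hypothetical, and the modifiers are passed as the same 'no-x'/'no-y'/'no-xy'/'no-mt' strings used on the command line.

```python
MAX_AUTOSOMES = 95  # autosome limit stated in this documentation


def special_chr_codes(autosome_ct, modifiers=()):
    """Map X/Y/XY/MT to numeric codes for a diploid --chr-set."""
    if not 1 <= autosome_ct <= MAX_AUTOSOMES:
        raise ValueError("this sketch only covers 1..95 diploid autosome pairs")
    absent = {modifier[len("no-"):].upper() for modifier in modifiers}
    codes = {}
    for offset, name in enumerate(("X", "Y", "XY", "MT"), start=1):
        if name not in absent:
            codes[name] = autosome_ct + offset
    return codes
```

For the human default of 22 autosome pairs this yields X=23, Y=24, XY=25, and MT=26.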

n is currently limited to 95, so if you're working with adder's-tongue fern genomes, you're out of luck (see note 4 below).

The other flags support PLINK 1.07 and GCTA semantics:

  • --cow = --chr-set 29 no-xy
  • --dog = --chr-set 38
  • --horse = --chr-set 31 no-xy no-mt
  • --mouse = --chr-set 19 no-xy no-mt
  • --rice = --chr-set -12 (see note 5 below)
  • --sheep = --chr-set 26 no-xy no-mt
  • --autosome-num <value> = --chr-set <value> no-y no-xy no-mt

4: Just kidding. Contact us, and we'll send you a build supporting a higher autosome limit. Note that this isn't necessary if you're dealing with a draft assembly with lots of contigs, rather than actual autosomes—the standard build can handle that if you name your contigs 'contig1', 'contig2', etc. and use the --allow-extra-chr flag.
5: Rice genomes are actually diploid, but breeding programs frequently work with doubled haploids.

SHAPEIT recombination map

--cm-map <filename pattern> [chromosome code]
--zero-cms

--cm-map uses SHAPEIT-format recombination map file(s) to set centimorgan positions of all variants on either a single chromosome or every autosome. In the former case, the first parameter should be the exact name of the recombination map file, and the second parameter should be the chromosome code. In the latter case, the filename pattern should contain a '@' where the chromosome number would go, e.g.

plink --bfile binary_fileset --cm-map genetic_map_chr@_combined_b37.txt --make-bed --out fileset_with_cms
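If you want to check in advance which map files such a pattern refers to, the '@' substitution can be sketched in Python. The function name is hypothetical, and the default of autosomes 1-22 is an assumption for human data.

```python
def cm_map_files(pattern, chromosomes=range(1, 23)):
    """Expand a --cm-map filename pattern into {chromosome: filename}."""
    if "@" not in pattern:
        raise ValueError("pattern must contain '@' where the chromosome number goes")
    return {chrom: pattern.replace("@", str(chrom)) for chrom in chromosomes}
```

With the pattern from the command above, chromosome 7 maps to genetic_map_chr7_combined_b37.txt.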

Conversely, --zero-cms can be used with --make-bed or --recode to zero out all centimorgan positions in the output fileset. This saves disk space and speeds up file I/O when you don't need the centimorgan values. (If they were originally set via --cm-map, you can always use --cm-map to recalculate them when needed.)

--zero-cms and --cm-map can be used simultaneously; in this case, --zero-cms acts first.

No-genotype-data corner cases

--allow-no-samples
--allow-no-vars

If the input fileset contains no samples or no variants, PLINK normally errors out. However, you can force it to proceed with --allow-no-samples/--allow-no-vars. (Most commands won't do anything useful with such a fileset, of course, and many will be silently skipped.)

Allele frequencies

When allele frequency estimates are needed, PLINK defaults to using empirical frequencies from the immediate dataset (with a pseudocount of 1 added when --maf-succ is specified). This is unsatisfactory when processing a small subset of a larger dataset or population.

--read-freq <.freq/.frq/.frq.count/.frqx filename>
  (alias: --update-freq)

--read-freq loads a PLINK 1.07, PLINK 1.9, or GCTA allele frequency report, and estimates MAFs (and heterozygote frequencies, if the report is from --freqx) from the file instead of the current genomic data table. It can be combined with --maf-succ if the file contains observation counts.

When a minor allele code is missing from the main dataset but present in the --read-freq file, it is now loaded.

Phenotypes

Loading from an alternate phenotype file

--pheno <filename>

--mpheno <n>
--pheno-name <column name>
--all-pheno

--pheno-merge

--pheno causes phenotype values to be read from the 3rd column of the specified space- or tab-delimited file, instead of the .fam or .ped file. The first and second columns of that file must contain family and within-family IDs, respectively.

In combination with --pheno, --mpheno lets you use the (n+2)th column instead of the 3rd column, while --pheno-name lets you select a column by title. (In order to use --pheno-name, there must be a header row with first two entries 'FID' and 'IID'.) The new --pheno-merge flag tells PLINK to use the phenotype value in the .fam/.ped file when no value is present in the --pheno file; without it, the phenotype is always treated as missing in this case.
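The column selection and --pheno-merge fallback can be sketched in Python as follows. This is an illustration, not PLINK's loader: the function names are hypothetical, the file is passed in as a list of lines, and missing phenotypes are represented by the string "-9".

```python
def load_pheno(lines, mpheno=1, pheno_name=None):
    """Return {(FID, IID): phenotype string} from --pheno-style lines."""
    rows = [line.split() for line in lines if line.strip()]
    col = mpheno + 1  # zero-based index of the (n+2)th column
    has_header = bool(rows) and rows[0][:2] == ["FID", "IID"]
    if pheno_name is not None:
        if not has_header:
            raise ValueError("--pheno-name needs a header row starting with FID IID")
        col = rows[0].index(pheno_name)
    if has_header:
        rows = rows[1:]
    return {(row[0], row[1]): row[col] for row in rows}


def resolve_pheno(sample, fam_value, table, pheno_merge=False, missing="-9"):
    """Phenotype for one sample; fam_value is the .fam/.ped phenotype."""
    if sample in table:
        return table[sample]
    # Without --pheno-merge, samples absent from the file become missing.
    return fam_value if pheno_merge else missing
```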

--allow-no-sex is now required if you want to retain phenotype values for missing-sex samples. This is a change from PLINK 1.07; we believe it would be more confusing to continue treating regular and --pheno phenotypes differently, and apologize for any temporary inconvenience we've caused.

--all-pheno causes all phenotypes present in the --pheno file to be subject to the association tests you've requested. (--pheno-merge then applies to every phenotype.) Note that, when dealing with a very large number of phenotypes, specialized software is usually more appropriate than --all-pheno; we recommend Matrix eQTL or FastQTL, which process thousands of phenotypes simultaneously and achieve a level of efficiency not possible with --all-pheno + --assoc/--linear. (Update, 1 Apr 2019: PLINK 2.0 also handles this case efficiently now.)

Phenotype encoding

--missing-phenotype <integer>

--1

Missing phenotypes are normally expected to be encoded as -9. You can change this to another integer with --missing-phenotype. (This is a slight change from PLINK 1.07: floating point values are now disallowed due to rounding issues, and nonnumeric values such as 'NA' are rejected since they're treated as missing phenotypes no matter what. Note that --output-missing-phenotype can be given a nonnumeric string.)

Case/control phenotypes are expected to be encoded as 1=unaffected (control), 2=affected (case); 0 is accepted as an alternate missing value encoding. If you use the --1 flag, 0 is interpreted as unaffected status instead, while 1 maps to affected. This also forces phenotypes to be interpreted as case/control.
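The two codings can be summarized in a short Python sketch. The function name is hypothetical, and raising an error for other values is a simplification of this sketch (it does not model quantitative phenotypes).

```python
def decode_case_control(code, one_flag=False, missing_code=-9):
    """Return 'case', 'control', or None (missing) for an integer phenotype code."""
    if code == missing_code:
        return None
    if one_flag:
        table = {0: "control", 1: "case"}            # --1 coding
    else:
        table = {0: None, 1: "control", 2: "case"}   # default coding
    if code not in table:
        raise ValueError(f"{code} is not a case/control code in this sketch")
    return table[code]
```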

Case/control phenotype generation

--make-pheno <filename> <value>

Given a text file listing family and individual IDs in the first two columns, "--make-pheno [filename] '*'" designates all samples listed in the named file as cases, and all other samples as controls.

If the named file has a third column, and a value other than '*' is given, --make-pheno will designate all samples with third column entry equal to the given value as cases, all other samples mentioned in the file as controls, and all samples missing from the file as having missing phenotypes.
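Both forms of --make-pheno can be sketched in Python. This is an illustration of the rules above with a hypothetical function name; the file is passed as already-split rows, and phenotypes are returned in the usual 2 = case, 1 = control, -9 = missing coding.

```python
def make_pheno(all_samples, file_rows, value="*"):
    """all_samples: iterable of (FID, IID); file_rows: lists of column strings."""
    if value == "*":
        listed = {(row[0], row[1]) for row in file_rows}
        # Everyone in the file is a case, everyone else a control.
        return {sample: 2 if sample in listed else 1 for sample in all_samples}
    third = {(row[0], row[1]): row[2] for row in file_rows}
    result = {}
    for sample in all_samples:
        if sample not in third:
            result[sample] = -9          # not mentioned in the file
        elif third[sample] == value:
            result[sample] = 2           # third column matches the value
        else:
            result[sample] = 1
    return result
```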

--tail-pheno <lower ceiling> [upper minimum]

--tail-pheno converts a scalar phenotype into a case/control phenotype. Samples with phenotype values less than or equal to the given lower ceiling are treated as controls, samples with phenotypes strictly greater than the upper minimum are treated as cases, and all other samples are treated as having missing phenotypes. If no upper minimum is provided, it is assumed to be equal to the lower ceiling.

You can combine this with e.g. --make-bed to save the new case/control phenotype.
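The boundary handling of --tail-pheno (inclusive on the control side, strict on the case side) can be sketched as below. This is a minimal Python illustration of the rule described above, not PLINK's implementation; the function name and return labels are assumptions.

```python
def tail_pheno(value, lower_ceiling, upper_minimum=None):
    """Classify a scalar phenotype as 'control', 'case', or None (missing)."""
    if upper_minimum is None:
        # With one argument, the upper minimum equals the lower ceiling.
        upper_minimum = lower_ceiling
    if value <= lower_ceiling:
        return "control"
    if value > upper_minimum:
        return "case"
    return None  # falls between the two thresholds
```

With thresholds 2.0 and 5.0, a value of exactly 5.0 is neither a control nor a case, so it becomes missing; with a single threshold, no sample is set to missing by this rule.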

Covariates

--covar <filename> ['keep-pheno-on-missing-cov']

--covar-name <column ID(s)/range(s)...>
--covar-number <column number(s)/range(s)...>

--no-const-covar
--allow-no-covars

--covar designates the file to load covariates from. The file format is the same as for --pheno (optional header line, FID and IID in first two columns, covariates in remaining columns). By default, the main phenotype is set to missing if any covariate is missing; you can disable this with the 'keep-pheno-on-missing-cov' modifier.

--covar-name lets you specify a subset of covariates to load, by column name; separate multiple column names with spaces or commas, and use dashes to designate ranges. (Spaces are not permitted immediately before or after a range-denoting dash.) --covar-number lets you use column numbers instead.

For example, if the first row of the covariate file is

FID IID SITE AGE DOB BMI ETH SMOKE STATUS ALC

then the following two expressions have the same effect:

--covar-name AGE, BMI-SMOKE, ALC
--covar-number 2, 4-6, 8
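To see why the two expressions match, note that covariate numbers start at 1 with the first column after FID and IID. The Python sketch below resolves a --covar-name style list against the example header; `resolve_covar_names` is a hypothetical helper, and the way it parses dashes (exact column name first, then range) is an assumption of the sketch rather than documented PLINK behavior.

```python
def resolve_covar_names(header, spec):
    """Translate a name/range list into 1-based covariate numbers."""
    covars = header.split()[2:]  # drop FID and IID
    index = {name: i + 1 for i, name in enumerate(covars)}
    numbers = []
    for token in spec.replace(",", " ").split():
        if token in index:  # plain column name
            numbers.append(index[token])
            continue
        first, dash, last = token.partition("-")
        if not dash or first not in index or last not in index:
            raise ValueError(f"unrecognized column or range: {token}")
        numbers.extend(range(index[first], index[last] + 1))
    return numbers

header = "FID IID SITE AGE DOB BMI ETH SMOKE STATUS ALC"
print(resolve_covar_names(header, "AGE, BMI-SMOKE, ALC"))
```

This prints `[2, 4, 5, 6, 8]`, which is the expansion of `2, 4-6, 8`.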

--no-const-covar excludes all constant covariates. PLINK normally errors out if this causes all covariates to be excluded (or if the --covar file contained no covariates in the first place), but you can use the --allow-no-covars flag to make it try to proceed.

Clusters of samples

--within <filename> ['keep-NA']

--mwithin <n>
--family

--within lets you define disjoint clusters/strata of samples for permutation procedures and stratified analyses. It normally accepts a file with FIDs in the first column, IIDs in the second column, and cluster names in the third column; --mwithin causes cluster names to be read from column (n+2) instead.

Alternatively, you can use --family to create a cluster for each family ID.

By default, --write-cluster generates a file with 'NA' in the cluster name field for all samples not in any cluster, and if such a file is reloaded with --within, they will remain unassigned. To actually create an 'NA' cluster (this is PLINK 1.07's behavior), use the 'keep-NA' modifier.
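The interaction between --within, --mwithin, and 'keep-NA' can be sketched in Python as follows. This is an illustration of the column arithmetic and the 'NA' rule described above, not PLINK's parser; `read_clusters` and its parameters are hypothetical, and whitespace-delimited input without a header line is assumed.

```python
def read_clusters(lines, mwithin=1, keep_na=False):
    """Return {(FID, IID): cluster name} from a --within style file."""
    clusters = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        # Cluster name is in 1-based column (n+2), i.e. 0-based index n+1.
        name = fields[mwithin + 1]
        if name == "NA" and not keep_na:
            continue  # sample stays unassigned
        clusters[(fields[0], fields[1])] = name
    return clusters
```

With the default `mwithin=1` this reads the third column; `mwithin=2` reads the fourth.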

--loop-assoc <filename> ['keep-NA']

Given a cluster file, this runs each specified case/control association command once for each cluster, using membership in the cluster as the phenotype. This can be combined with --mwithin.
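Conceptually, --loop-assoc derives one temporary phenotype per cluster. The Python sketch below shows that idea only; coding members as cases and non-members as controls is an assumption of the sketch (the text above says only that membership is used as the phenotype), and it covers only samples that are assigned to some cluster. `loop_assoc_phenotypes` is a hypothetical name.

```python
def loop_assoc_phenotypes(clusters):
    """Yield (cluster name, {sample: 'case' | 'control'}) for each cluster.

    clusters: {(FID, IID): cluster name} for assigned samples.
    """
    for name in sorted(set(clusters.values())):
        # Assumed coding: in-cluster = case, every other assigned sample = control.
        yield name, {
            sample: ("case" if cluster == name else "control")
            for sample, cluster in clusters.items()
        }
```

Each yielded phenotype would then be passed to the requested association test in turn.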

Variant sets

--set <filename>

--set-names <space/comma-delimited name(s)...>

--subset <filename>
--set-collapse-all <new set name>
  (alias: --make-set-collapse-all)

--complement-sets

--make-set-complement-all <new set name>

--set defines possibly overlapping sets of variants for set-based tests, given a .set file. To only keep some of the sets in the --set file, you can add --set-names (followed by a list of set names to load) and/or --subset (followed by the name of a text file containing the list). If both --set-names and --subset are present, all sets named in either list are loaded.

To merge all sets, use the --set-collapse-all flag. You're required to provide the merged set's name.

To invert every set, add the --complement-sets flag. All inverted sets will have 'C_' prefixes attached to their names.

--make-set-complement-all <name> defines a single set containing all variants not mentioned in the --set file; it's essentially identical to --complement-sets + --set-collapse-all <name>, except the set name doesn't have the 'C_' prefix.
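The relationship between these three flags can be checked with ordinary set arithmetic. In the Python sketch below, the function names mirror the flags but are otherwise hypothetical, and applying the collapse before the complement is an inference from the equivalence stated above, not something the text spells out.

```python
def collapse_all(sets, new_name):
    """Merge every set into one set called new_name."""
    return {new_name: set().union(*sets.values())}

def complement_sets(sets, all_variants):
    """Invert each set against the full variant list, adding a 'C_' prefix."""
    return {"C_" + name: set(all_variants) - members
            for name, members in sets.items()}

def make_set_complement_all(sets, all_variants, new_name):
    """One set holding every variant not mentioned in any input set."""
    mentioned = set().union(*sets.values())
    return {new_name: set(all_variants) - mentioned}

all_variants = {"rs1", "rs2", "rs3", "rs4"}
sets = {"A": {"rs1", "rs2"}, "B": {"rs2", "rs3"}}
two_step = complement_sets(collapse_all(sets, "X"), all_variants)
one_step = make_set_complement_all(sets, all_variants, "X")
```

Here `two_step` holds a set named `C_X` and `one_step` a set named `X`; both contain only `rs4`, so they differ in name alone.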

--make-set <filename>

--make-set-border <kbs>
  (alias: --border)
--make-set-collapse-group

With --make-set, you can define sets from a list of named bp ranges instead. Each line of the --make-set input file is expected to have the following 4-5 fields in front:

  1. Chromosome code
  2. Start of range (base-pair units)
  3. End of range (this position is included in the interval)
  4. Set ID
  5. Group label (only needed with --make-set-collapse-group/--make-set-complement-group)

Additional notes:

  • A single set can contain multiple ranges.
  • --set-names, --subset, --set-collapse-all, --complement-sets, and --make-set-complement-all work with --make-set.
  • You can extend each bound out by a given number of kilobases with the --make-set-border flag.
  • --make-set-collapse-group causes final set IDs to be determined by column 5 instead of column 4. However, if --subset is present, it still applies to column 4.
  • The gene range lists on the resources page are in this format.
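The range-to-set logic above, including the inclusive end position, the border extension, and the group column, can be sketched in Python. This is an illustration under stated assumptions, not PLINK's implementation: `make_sets` is a hypothetical helper, variants are passed as a list of `(ID, chromosome, bp)` tuples, and one kilobase is taken to be 1000 base pairs.

```python
def make_sets(range_lines, variants, border_kb=0, collapse_group=False):
    """Build {set ID: set of variant IDs} from --make-set style range lines.

    variants: list of (variant_id, chrom, bp) tuples.
    """
    border_bp = int(border_kb * 1000)  # assumed: 1 kb = 1000 bp
    sets = {}
    for line in range_lines:
        fields = line.split()
        if not fields:
            continue
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # Column 5 names the set with --make-set-collapse-group, else column 4.
        name = fields[4] if collapse_group else fields[3]
        members = sets.setdefault(name, set())  # a set may span several ranges
        for variant_id, variant_chrom, bp in variants:
            # Both range ends are inclusive, widened by the border.
            if variant_chrom == chrom and start - border_bp <= bp <= end + border_bp:
                members.add(variant_id)
    return sets
```

A set ID that appears on several lines accumulates the variants from all of its ranges, matching the first note above.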

Compatibility note
When --set or --make-set was used in PLINK 1.07, variants not included in any set were often automatically excluded for all purposes (e.g. --make-bed), rather than just some set-based tests. This is no longer default behavior, but you can still invoke it with the --gene-all flag.
