PLINK 1.9 (22 Oct 2024, b.7.7)
Standard data input

Most of PLINK's calculations operate on tables of samples and variant calls. The following flags are available for defining the form and location of this input, and associated metadata.

Discrete calls

PLINK 1 binary

--bfile [prefix]

  --bfile causes the binary fileset plink.bed + plink.bim + plink.fam to be referenced. (The structure of these files is described in the file formats appendix.) If a prefix is given, it replaces all instances of 'plink'.

--bed <filename>
--bim <filename>
--fam <filename>

  --bed, --bim, and --fam let you specify the full name of one part of the PLINK 1 binary fileset, taking precedence over --bfile. For example,

    plink --bfile toy --bed bob --freq

  would reference the binary fileset bob.bed + toy.bim + toy.fam. (.fam files are also present in some other fileset types; the --fam flag has the same function when loading them.)

  PLINK 1 binary is PLINK 1.9's preferred input format. In fact, PLINK 1.9 automatically converts most other formats to PLINK 1 binary before the main loading sequence[1]. As a result, if you're performing multiple operations on the same otherwise-formatted files, you may want to keep the autoconversion products and work with them, instead of repeating the conversion on every run. PLINK 1.9 gives you several ways to handle this situation.

  1. If you just want to convert your data, don't use any other flags besides --out. For example:

       plink --file text_fileset --out binary_fileset

     This entirely skips the main loading sequence, so filters like --extract, --hwe, and --snps-only are not permitted (you'll get an error if you attempt to use them).

  2. You can produce a binary fileset which is a filtered version of your text data. Use --make-bed for this.

  3. You can directly analyze the text fileset. In this case, the autoconversion products are silently deleted at the end of the run[2], to avoid clogging your drive with unwanted files.
     For example, the following command writes an allele frequency report to results.frq, and doesn't leave any other files behind besides results.log:

       plink --file text_fileset --freq --out results

  4. You can analyze the text fileset while specifying (with --keep-autoconv) that you also want to keep the autoconversion products. So the following command leaves behind results.bed, results.bim, and results.fam as well as results.frq and results.log:

       plink --file text_fileset --freq --keep-autoconv --out results

  Finally, note that PLINK 2 autoconverts to a different binary format, but it still has an efficient --make-bed implementation. Thus, if you want your script to be portable to PLINK 2 with a minimum of fuss, it's reasonable to stick with --make-bed.

[1]: Since binary files are so much smaller than the equivalent text files, we expect that this will not put undue pressure on your available disk space. This architectural choice allows PLINK's core to focus entirely on efficient streaming processing of binary data; we hope the memory usage, development speed, and performance benefits we're able to deliver as a result are worth any slight inconvenience.

PLINK text

--file [prefix]

  This sets the filename prefix of the .ped + .map fileset to reference. (The default is 'plink' if you do not specify a prefix.) As discussed above, PLINK 1.9 autoconverts this fileset to binary; conversion is required because this is NOT a native format, and you should avoid it when possible because it's simultaneously less efficient AND more lossy than e.g. VCF.

--ped <filename>
--map <filename>

  These are analogous to --bed/--bim/--fam above. (You may no longer use "--ped -" to read the .ped from standard input; redirect it to a file and process it the normal way instead.)

--tfile [prefix]
--tped <filename>
--tfam <filename>

  Similarly, these flags let you specify a transposed text fileset to load.

--lfile [prefix]
--lgen <filename>

  --lfile/--lgen let you specify a long-format fileset to load.
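Conceptually, loading a long-format fileset means pivoting one-call-per-line records into the usual sample-by-variant table. A minimal sketch of that idea (the function name, record layout, and '00' missing code are ours, not PLINK's):

```python
def pivot_lgen(records, missing="00"):
    """Pivot long-format records (FID, IID, variant ID, genotype) into a
    per-sample dict of variant -> genotype call, defaulting to missing."""
    samples = {}
    variants = set()
    for fid, iid, var, geno in records:
        samples.setdefault((fid, iid), {})[var] = geno
        variants.add(var)
    # Calls absent from the file are filled in as missing.
    return {key: {v: calls.get(v, missing) for v in sorted(variants)}
            for key, calls in samples.items()}
```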
--reference <filename>
--allele-count

  --reference lets you specify an associated list of reference alleles (variant IDs in the first column, reference alleles in the second column, optional non-reference alleles in the third column). --allele-count, when used with --reference, specifies that the .lgen file contains reference allele counts in the 4th column.

Irregularly-formatted PLINK text files

--no-fid
--no-parents
--no-sex
--no-pheno

  These allow you to use .fam or .ped files which lack family ID, parental ID, sex, and/or phenotype columns. Compound .ped genotypes and three-column .map files are automatically detected and handled correctly, so the --compound-genotypes and --map3 flags have been retired. (Note, however, that compound genotypes are not permitted in .tped files.)

--missing-genotype <character>
--missing-genotype2 <character>

  Missing genotype calls are normally assumed to be represented by '0' in .ped and similar files; you can change this to most other (nonspace) characters with --missing-genotype. However, '1', '2', '3', '4', 'A', 'C', 'G', and 'T' are disallowed. In .bim files, '.' is also treated as a missing code (this was not true before 16 Jan 2023); you can alter this second code with --missing-genotype2.

Variant Call Format

--vcf <filename>

  --vcf loads a (possibly gzipped) VCF file, extracting information which can be represented by the PLINK 1 binary format and ignoring everything else (after applying the load filters described below). For example, phase and dosage information are currently discarded. (This situation will improve in the future, but we do not have plans to try to handle everything in the file.)

  VCF reference alleles are set to A2 by the autoconverter even when they appear to be minor. However, to maintain backwards compatibility with PLINK 1.07, PLINK 1.9 normally forces major alleles to A2 during its loading sequence. One workaround is to permanently keep the .bim file generated during initial conversion, for use as --a2-allele input whenever the reference sequence needs to be recovered.
  (If you use this method, note that, when your initial conversion step invokes --make-bed instead of just --out, you also need --keep-allele-order to avoid losing track of reference alleles before the very first write, because --make-bed triggers the regular loading sequence.)

--bcf <filename>

  --bcf loads a BCF2 file instead, and otherwise behaves identically to --vcf. The file can be either uncompressed (as emitted by e.g. GATK 3) or BGZF-compressed (supported by htslib). The BCF1 output of old SAMtools builds is not supported; use "bcftools view" to convert such files to readable VCFs.

--double-id
--const-fid [FID]
--id-delim [delimiter]

  VCF files just contain sample IDs, instead of the distinct family and within-family IDs tracked by PLINK. We offer three ways to convert these IDs: --double-id causes both the family and within-family IDs to be set to the sample ID, --const-fid converts sample IDs to within-family IDs while setting all family IDs to a single constant (default '0'), and --id-delim causes sample IDs to be parsed as <FID><delimiter><IID> (default delimiter '_').
  If none of these three flags is present, the loader defaults to --double-id + --id-delim.

--vcf-idspace-to <character>

  Since PLINK sample IDs cannot contain spaces, an error is normally reported when there's a space in a VCF sample ID. To work around this, you can use --vcf-idspace-to to convert all spaces in sample IDs to another character. This happens before regular parsing, so when the --vcf-idspace-to and --id-delim characters are identical, both the space and the original --id-delim character are interpreted as FID/IID delimiters. If you only want the space character to function as a delimiter, use "--id-delim ' '". (This is not compatible with --rerun.)

--biallelic-only ['strict'] ['list']

  By default, all variants are loaded; when more than one alternate allele is present, the reference allele and the most common alternate are tracked (ties broken in favor of the lower-numbered allele) and the rest are coded as missing calls. To simply skip all variants where at least two alternate alleles are present in the dataset, use --biallelic-only. Add the 'strict' modifier if you want to indiscriminately skip variants with 2+ alternate alleles listed, even when only one alternate allele actually shows up (this minimizes merge headaches down the line), and use 'list' to dump a list of skipped variant IDs to plink.skip.3allele.

--vcf-min-qual <value>

  --vcf-min-qual causes all variants with QUAL value smaller than the given number, or with no QUAL value at all, to be skipped. (--qual-scores has similar functionality.)

--vcf-filter [exception(s)...]

  To skip variants which failed one or more filters tracked by the FILTER field, use --vcf-filter. This can be combined with one or more (space-delimited) filter names to ignore.
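The variant-level screens just described can be sketched in a few lines. This is an illustrative helper of our own (not PLINK's implementation), assuming QUAL is parsed to a float or None when absent:

```python
def keep_variant(qual, filters, min_qual=None, exceptions=()):
    """Apply --vcf-min-qual / --vcf-filter style screening to one VCF record.

    qual:    QUAL field as a float, or None when it is '.'.
    filters: FILTER field string ('PASS', '.', or ';'-joined filter names).
    Returns True if the variant survives."""
    # --vcf-min-qual: drop low-QUAL records AND records with no QUAL at all.
    if min_qual is not None and (qual is None or qual < min_qual):
        return False
    # --vcf-filter: drop records with failed filters, minus named exceptions.
    if filters not in ("PASS", "."):
        failed = set(filters.split(";")) - set(exceptions)
        if failed:
            return False
    return True
```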
--vcf-require-gt

  By default, when the GT field is absent, the variant is kept and all genotypes are set to missing. To skip such variants instead, use --vcf-require-gt.

--vcf-min-gq <val>
--vcf-min-gp <val>

  --vcf-min-gq excludes all genotype calls with GQ below the given (nonnegative; decimal values permitted) threshold. Missing GQ values are not treated as being below the threshold. Similarly, --vcf-min-gp excludes all genotype calls with GP value below the given threshold, assuming GP is 0-1 scaled rather than phred-scaled. (This convention will probably be in the VCFv4.3 specification, and some programs already adhere to it when generating earlier-version VCF files. Due to the latter, we do not enforce a minimum VCF format version, but be aware that this flag is useless on fully standards-compliant pre-v4.3 VCFs.)

--vcf-half-call <mode>

  The current VCF standard does not specify how '0/.' and similar GT values should be interpreted. By default (mode 'error'/'e'), PLINK 1.9 errors out and reports the line number of the anomaly. Should the half-call be intentional, though (this can be the case with Complete Genomics data), you can request one of the following other modes: 'haploid'/'h' treats half-calls as haploid/homozygous, 'missing'/'m' treats them as missing, and 'reference'/'r' treats the missing part as reference.
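The per-call GQ/GP thresholds above reduce to a simple predicate. A sketch (our own helper, not PLINK code), encoding the rule that a missing GQ is not treated as below threshold, and assuming the same leniency for a missing GP:

```python
def keep_call(gq=None, gp=None, min_gq=None, min_gp=None):
    """Apply --vcf-min-gq / --vcf-min-gp style filters to one genotype call.

    gq and gp are None when the corresponding field is absent; GP is assumed
    to be 0-1 scaled, per the text above.  Returns True if the call survives."""
    if min_gq is not None and gq is not None and gq < min_gq:
        return False
    if min_gp is not None and gp is not None and gp < min_gp:
        return False
    return True
```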
Oxford format

--data [prefix]
--gen <filename>
--bgen <filename>
--sample <filename>
--oxford-single-chr <chromosome code>

  --data causes the Oxford-format fileset plink.gen + plink.sample to be referenced. If a prefix is provided, it replaces 'plink'. --gen, --bgen, and --sample allow you to specify the filenames separately; --gen is necessary if your genomic data file has a .gen.gz extension, and --bgen is necessary for BGEN-format files. If the genomic data file does not contain usable chromosome codes, you can use --oxford-single-chr to specify the single chromosome all of its variants are on.
--hard-call-threshold <value>
--hard-call-threshold random

  Since the PLINK 1 binary format cannot represent genotype probabilities, calls with uncertainty greater than 0.1 are normally treated as missing, and the rest are treated as hard calls. You can adjust this threshold by providing a numeric parameter to --hard-call-threshold. Alternatively, when --hard-call-threshold is given the 'random' modifier, calls are independently randomized according to the probabilities in the file. (This is not ideal; it would be better to randomize in a haploblock-sensitive manner. But resampling a number of times with this, and generating an empirical distribution of some statistic, can still be more informative than applying a single threshold and calculating that statistic once.) --hard-call-threshold can only be used on Oxford-format files for now; it will be extended to VCFs and other formats in the future.

--missing-code [comma-delimited list of values]

  --missing-code lets you specify the set of strings to interpret as missing phenotype values in a .sample file. For example, "--missing-code -9,0,NA,na" would cause '-9', '0', 'NA', and 'na' to all be interpreted as missing phenotypes. (Note that no spaces are currently permitted between the strings.) By default, only 'NA' is interpreted as missing.

23andMe text

--23file <filename> [family ID] [within-family ID] [sex] [phenotype] [paternal ID] [maternal ID]

  --23file specifies an uncompressed 23andMe-formatted file to load (and convert to PLINK 1 binary before further processing).
  For example:

    plink --23file genome.txt Chang Christopher --out plink_genome

  Note that some variants in 23andMe files may have indel instead of SNP calls. If this is the case, you'll be notified during the loading process; you can then use --list-23-indels (preferably on a merged dataset, to minimize the impact of missing calls) to produce a list of the affected variant IDs.

  The 23andMe file format also does not always mark the boundaries of the X chromosome pseudo-autosomal region, so PLINK does not convert the chromosome code for 2-allele markers on male X chromosomes to XY; instead, you'll get heterozygous haploid warnings down the line. Use --split-x to solve this problem.

[3]: As a general rule, handling of space-containing command-line parameters is undefined and subject to change without notice. Avoid them.

Randomized data

--dummy <sample count> <SNP count> [missing geno freq] [missing pheno freq] [{acgt | 1234 | 12}] ['scalar-pheno']

  This tells PLINK to generate a simple dataset from scratch (useful for basic software testing). All generated samples are females with random genotype and phenotype values. If the third parameter is a decimal value, it sets the frequency of missing genotype calls (default 0); if the fourth is also a decimal, it sets the frequency of missing phenotypes (which also defaults to 0). The 'acgt' modifier causes A/C/G/T genotype calls to be generated instead of the PLINK 1.07 default of A/B, while '1234' generates 1/2/3/4 genotypes, and '12' makes all calls 1/2. The 'scalar-pheno' modifier causes normally distributed (mean 0, stdev 1) rather than case/control phenotype values to be generated.

--simulate <simulation parameter file> [{tags | haps}] [{acgt | 1234 | 12}]
--simulate-ncases <number of cases>
--simulate-ncontrols <number of controls>
--simulate-prevalence <disease prevalence>
--simulate-missing <missing geno freq>

  --simulate generates a new dataset which contains some disease-associated SNPs. (If you want to simulate phenotypes based on real genotype data instead, use GCTA --simu-cc/--simu-qt.)
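The --dummy generation logic is easy to mimic for your own test fixtures. A sketch under our own conventions ('00' as the missing call, allele pairs as two-character strings; PLINK's actual output is a binary fileset):

```python
import random

def dummy_genotypes(n_samples, n_snps, miss_freq=0.0, alleles="AB", rng=random):
    """Generate a matrix of random biallelic calls, with each call set to
    missing ('00') independently at the given frequency, loosely mirroring
    the --dummy behavior described above."""
    a1, a2 = alleles[0], alleles[1]

    def one_call():
        if rng.random() < miss_freq:
            return "00"
        return rng.choice([a1, a2]) + rng.choice([a1, a2])

    return [[one_call() for _ in range(n_snps)] for _ in range(n_samples)]
```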
For the basic version of the command, the simulation parameter file is expected to be a text file with one or more rows, where each row has six fields:

  1. Number of SNPs in the set
  2. Set label
  3. Minimum allele frequency
  4. Maximum allele frequency
  5. Odds ratio for the disease, heterozygote
  6. Odds ratio for the disease, homozygote (or 'mult', for the square of the heterozygote odds ratio)
Here, odds(X) := P(X) / (1 - P(X)). Note that PLINK 1.07's --simulate implementation actually interprets the last two fields as relative risks rather than odds ratios; while the difference is minimal for small values of P(X), we have changed the behavior to match the documentation, to reduce future confusion.

If the 'tags' or 'haps' modifier is present, an extended nine-field simulation parameter file is expected instead:

  1. Number of SNPs in the set
  2. Set label
  3. Minimum causal variant allele frequency
  4. Maximum causal variant allele frequency
  5. Minimum marker allele frequency
  6. Maximum marker allele frequency
  7. Marker-causal variant D'
  8. Odds ratio for the disease, heterozygote
  9. Odds ratio for the disease, homozygote
With 'haps', both the causal variants and the markers are included in the dataset; 'tags' throws out the causal variants.

Normally, the reference allele is designated by 'D' and the alternate allele by 'd'. With 'haps', causal variants are labeled in that manner, while the linked marker reference and alternate alleles are instead designated by 'A' and 'B' respectively. You can use the 'acgt', '1234', or '12' modifier to replace this labeling with random bases.

By default, 1000 cases and 1000 controls are generated, population disease prevalence is 0.01, and the missing genotype frequency is 0; you can change these numbers with --simulate-ncases, --simulate-ncontrols, --simulate-prevalence, and --simulate-missing, respectively. --simulate-label attaches the given prefix, followed by a dash, to all FIDs and IIDs in the dataset. (This makes it easier to merge multiple simulated datasets.) See the PLINK 1.07 documentation for further discussion.

--simulate-qt <simulation parameter file> [{tags | haps}] [{acgt | 1234 | 12}]
--simulate-n <number of samples>

  --simulate-qt generates a new dataset with quantitative trait loci. For the basic version of the command, the simulation parameter file is expected to have the following six fields:

  1. Number of SNPs in the set
  2. Set label
  3. Minimum allele frequency
  4. Maximum allele frequency
  5. Additive genetic effect size
  6. Dominance deviation
All modifiers have essentially the same semantics as with --simulate, and --simulate-label and --simulate-missing also act identically. We have fixed bugs in PLINK 1.07 --simulate-qt's phenotype generation. By default, 1000 samples are generated; you can change this with --simulate-n.

Nonstandard chromosome IDs

--allow-extra-chr ['0']

  Normally, PLINK reports an error if the input data contains unrecognized chromosome codes (such as hg19 haplotype chromosomes or unplaced contigs). If none of the additional codes start with a digit, you can permit them with the --allow-extra-chr flag. (These contigs are ignored by most analyses which skip unplaced regions.) The '0' modifier causes these chromosome codes to be treated as if they had been set to zero. (This is sometimes necessary to produce reports readable by older software.)

--chr-set <autosome ct> ['no-x'] ['no-y'] ['no-xy'] ['no-mt']
--cow
--dog
--horse
--mouse
--rice
--sheep
--autosome-num <value>

  --chr-set changes the chromosome set. The first parameter specifies the number of diploid autosome pairs if positive, or haploid chromosomes if negative. (Polyploid and aneuploid data are not supported, and there is currently no special handling of sex or mitochondrial chromosomes in all-haploid chromosome sets.) Given diploid autosomes, the remaining modifiers let you indicate the absence of specific non-autosomal chromosomes, as an extra sanity check on the input data.

  Note that, when there are n autosome pairs, the X chromosome is assigned numeric code n+1, Y is n+2, XY (the pseudo-autosomal region of X) is n+3, and MT (mitochondria) is n+4. n is currently limited to 95, so if you're working with adder's-tongue fern genomes, you're out of luck[4].

  The other flags provide PLINK 1.07- and GCTA-compatible shorthands: --cow, --dog, --horse, --mouse, --rice, and --sheep select the chromosome set of the corresponding species, while --autosome-num provides GCTA-style control of the autosome count.
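The numeric-code rule above (X = n+1, Y = n+2, XY = n+3, MT = n+4) can be written down as a small checkable helper (ours, for illustration):

```python
def chr_codes(n_autosomes):
    """Numeric codes for the non-autosomal chromosomes under --chr-set n,
    following the rule stated above."""
    if not 1 <= n_autosomes <= 95:  # documented upper limit
        raise ValueError("autosome count must be in 1..95")
    n = n_autosomes
    return {"X": n + 1, "Y": n + 2, "XY": n + 3, "MT": n + 4}
```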
[4]: Just kidding. Contact us, and we'll send you a build supporting a higher autosome limit. Note that this isn't necessary if you're dealing with a draft assembly with lots of contigs, rather than actual autosomes; the standard build can handle that if you name your contigs 'contig1', 'contig2', etc. and use the --allow-extra-chr flag.

SHAPEIT recombination map

--cm-map <filename pattern> [chromosome code]

  --cm-map uses SHAPEIT-format recombination map file(s) to set the centimorgan positions of all variants on either a single chromosome or every autosome. In the former case, the first parameter should be the exact name of the recombination map file, and the second parameter should be the chromosome code. In the latter case, the filename pattern should contain a '@' where the chromosome number would go, e.g.

    plink --bfile binary_fileset --cm-map genetic_map_chr@_combined_b37.txt --make-bed --out fileset_with_cms

--zero-cms

  Conversely, --zero-cms can be used with --make-bed or --recode to zero out all centimorgan positions in the output fileset. This saves disk space and speeds up file I/O when you don't need the centimorgan values. (If they were originally set via --cm-map, you can always use --cm-map to recalculate them when needed.) --zero-cms and --cm-map can be used simultaneously; in this case, --zero-cms acts first.

No-genotype-data corner cases

--allow-no-samples
--allow-no-vars

  If the input fileset contains no samples or no variants, PLINK normally errors out. However, you can force it to proceed with --allow-no-samples/--allow-no-vars. (Most commands won't do anything useful with such a fileset, of course, and many will be silently skipped.)

Allele frequencies

When allele frequency estimates are needed, PLINK defaults to using empirical frequencies from the immediate dataset (with a pseudocount of 1 added when --maf-succ is specified). This is unsatisfactory when processing a small subset of a larger dataset or population.
--read-freq <.freq/.frq/.frq.count/.frqx filename>

  --read-freq loads a PLINK 1.07, PLINK 1.9, or GCTA allele frequency report, and estimates MAFs (and heterozygote frequencies, if the report is from --freqx) from the file instead of the current genomic data table. It can be combined with --maf-succ if the file contains observation counts. When a minor allele code is missing from the main dataset but present in the --read-freq file, it is now loaded.

Phenotypes

Loading from an alternate phenotype file

--pheno <filename>
--mpheno <n>
--pheno-name <column name>
--pheno-merge

  --pheno causes phenotype values to be read from the 3rd column of the specified space- or tab-delimited file, instead of the .fam or .ped file. The first and second columns of that file must contain family and within-family IDs, respectively.

  In combination with --pheno, --mpheno lets you use the (n+2)th column instead of the 3rd, while --pheno-name lets you select a column by title. (In order to use --pheno-name, there must be a header row whose first two entries are 'FID' and 'IID'.) The new --pheno-merge flag tells PLINK to use the phenotype value in the .fam/.ped file when no value is present in the --pheno file; without it, the phenotype is always treated as missing in this case.

  --allow-no-sex is now required if you want to retain phenotype values for missing-sex samples. This is a change from PLINK 1.07; we believe it would be more confusing to continue treating regular and --pheno phenotypes differently, and we apologize for any temporary inconvenience we've caused.

--all-pheno

  --all-pheno causes all phenotypes present in the --pheno file to be subject to the association tests you've requested. (--pheno-merge then applies to every phenotype.)
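The column-selection rules for --pheno/--mpheno/--pheno-name reduce to a small lookup. A sketch (our own helper; 0-based column indices are a convention of this example):

```python
def pheno_column(header, mpheno=None, pheno_name=None):
    """Resolve which 0-based column of a --pheno file to read: the 3rd
    column by default, the (n+2)th with --mpheno n, or a named column
    with --pheno-name (header must start with FID, IID)."""
    if pheno_name is not None:
        if header[:2] != ["FID", "IID"]:
            raise ValueError("--pheno-name requires a header starting FID IID")
        return header.index(pheno_name)
    n = 1 if mpheno is None else mpheno
    return (n + 2) - 1  # (n+2)th column, converted to 0-based
```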
  Note that, when dealing with a very large number of phenotypes, specialized software is usually more appropriate than --all-pheno; we recommend Matrix eQTL or FastQTL, which process thousands of phenotypes simultaneously and achieve a level of efficiency not possible with --all-pheno + --assoc/--linear. (Update, 1 Apr 2019: PLINK 2.0 also handles this case efficiently now.)

Phenotype encoding

--missing-phenotype <integer>
--1

  Missing phenotypes are normally expected to be encoded as -9. You can change this to another integer with --missing-phenotype. (This is a slight change from PLINK 1.07: floating-point values are now disallowed due to rounding issues, and nonnumeric values such as 'NA' are rejected, since they're treated as missing phenotypes no matter what. Note that --output-missing-phenotype can be given a nonnumeric string.)

  Case/control phenotypes are expected to be encoded as 1=unaffected (control) and 2=affected (case); 0 is accepted as an alternate missing-value encoding. If you use the --1 flag, 0 is interpreted as unaffected status instead, while 1 maps to affected. This also forces phenotypes to be interpreted as case/control.

Case/control phenotype generation

--make-pheno <filename> <value>

  Given a text file listing family and individual IDs in the first two columns, "--make-pheno [filename] '*'" designates all samples listed in the named file as cases, and all other samples as controls. If the named file has a third column, and a value other than '*' is given, --make-pheno will designate all samples with a third-column entry equal to the given value as cases, all other samples mentioned in the file as controls, and all samples missing from the file as having missing phenotypes.

--tail-pheno <lower ceiling> [upper minimum]

  --tail-pheno converts a scalar phenotype into a case/control phenotype.
  Samples with phenotype values less than or equal to the given lower ceiling are treated as controls, samples with phenotypes strictly greater than the upper minimum are treated as cases, and all other samples are treated as having missing phenotypes. If no upper minimum is provided, it is assumed to be equal to the lower ceiling. You can combine this with e.g. --make-bed to save the new case/control phenotype.

Covariates

--covar <filename> ['keep-pheno-on-missing-cov']
--covar-name <column ID(s)/range(s)...>
--covar-number <column number(s)/range(s)...>
--no-const-covar

  --covar designates the file to load covariates from. The file format is the same as for --pheno (optional header line, FID and IID in the first two columns, covariates in the remaining columns). By default, the main phenotype is set to missing if any covariate is missing; you can disable this with the 'keep-pheno-on-missing-cov' modifier.

  --covar-name lets you specify a subset of covariates to load, by column name; separate multiple column names with spaces or commas, and use dashes to designate ranges. (Spaces are not permitted immediately before or after a range-denoting dash.) --covar-number lets you use column numbers instead. For example, if the first row of the covariate file is

    FID IID SITE AGE DOB BMI ETH SMOKE STATUS ALC

  then the following two expressions have the same effect:

    --covar-name AGE, BMI-SMOKE, ALC
    --covar-number 2, 4-6, 8

  --no-const-covar excludes all constant covariates. PLINK normally errors out if this causes all covariates to be excluded (or if the --covar file contained no covariates in the first place), but you can use the --allow-no-covars flag to make it try to proceed.

Clusters of samples

--within <filename> ['keep-NA']
--mwithin <n>

  --within lets you define disjoint clusters/strata of samples for permutation procedures and stratified analyses. It normally accepts a file with FIDs in the first column, IIDs in the second column, and cluster names in the third column; --mwithin causes cluster names to be read from column (n+2) instead.
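The cluster-file loading rules above can be sketched as follows (our own helper; the treatment of 'NA' follows the 'keep-NA' modifier noted in the usage line):

```python
def load_clusters(rows, keep_na=False, mwithin=1):
    """Build clusters from split lines of a --within style file.

    rows: lists of fields per line (FID, IID, ..., cluster name).
    Cluster names come from column mwithin+2 (1-based).  Samples named
    'NA' stay unassigned unless keep_na is set."""
    clusters = {}
    for fields in rows:
        fid, iid = fields[0], fields[1]
        name = fields[mwithin + 1]  # 0-based index of column mwithin+2
        if name == "NA" and not keep_na:
            continue  # leave this sample unassigned
        clusters.setdefault(name, set()).add((fid, iid))
    return clusters
```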
  Alternatively, you can use --family to create a cluster for each family ID.

  By default, --write-cluster generates a file with 'NA' in the cluster-name field for all samples not in any cluster, and if such a file is reloaded with --within, those samples remain unassigned. To actually create a 'NA' cluster (this is PLINK 1.07's behavior), use the 'keep-NA' modifier.

--loop-assoc <filename> ['keep-NA']

  Given a cluster file, this runs each specified case/control association command once for each cluster, using membership in the cluster as the phenotype. This can be combined with --mwithin.

Variant sets

--set <filename>
--set-names <space/comma-delimited name(s)...>
--subset <filename>
--set-collapse-all <new set name>
--complement-sets
--make-set-complement-all <new set name>

  --set defines possibly overlapping sets of variants for set-based tests, given a .set file. To keep only some of the sets in the --set file, you can add --set-names (followed by a list of set names to load) and/or --subset (followed by the name of a text file containing the list). If both --set-names and --subset are present, all sets named in either list are loaded.

  To merge all sets, use the --set-collapse-all flag; you're required to provide the merged set's name. To invert every set, add the --complement-sets flag. All inverted sets have 'C_' prefixes attached to their names. --make-set-complement-all <name> defines a single set containing all variants not mentioned in the --set file; it's essentially identical to --complement-sets + --set-collapse-all <name>, except that the set name doesn't get the 'C_' prefix.

--make-set <filename>
--make-set-border <kbs>

  With --make-set, you can define sets from a list of named bp ranges instead. Each line of the --make-set input file is expected to have 4-5 leading fields; the first four are the chromosome code, the start of the range (in bp), the end of the range, and the set ID.
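The set-merging and set-inversion operations above are plain set algebra. A sketch (hypothetical helpers; PLINK operates on its internal variant table, not Python sets):

```python
def collapse_all(sets, name):
    """--set-collapse-all style merge: union every set under one new name."""
    merged = set().union(*sets.values()) if sets else set()
    return {name: merged}

def complement_sets(sets, all_variants):
    """--complement-sets style inversion: complement each set against the
    full variant list, attaching the documented 'C_' prefix."""
    return {"C_" + k: set(all_variants) - v for k, v in sets.items()}
```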