This page describes specialized PLINK 2.0 input and output file formats which are identifiable by file extension. (Most extensions not listed here have very simple one-entry-per-line or two-entry-per-line text formats.)
Unless otherwise specified, all multicolumn text files generated by PLINK 2.0 are tab-delimited, with one header line starting with '#'. In the column summaries, columns which are present unless removed by the column set descriptor are boldface, and columns which only appear under some data/flag/modifier combination(s) are italicized.
Comma-separated freqs/dosages for all alts; 'eq' requests '1=<ALT1 value>,2=<ALT2 value>,...' formatting with zero-values omitted, 'eqz' includes zeroes
'ALT_NUM_{FREQS,CTS}' (column set: altnumeq): Comma-separated freqs/dosages for all alts
'FREQS'/'CTS' (column sets: freq, eq, eqz): Comma-separated freqs/dosages for all alleles
'NUM_FREQS'/'NUM_CTS' (column set: numeq): Comma-separated freqs/dosages for all alleles
MACH_R2 (column set: machr2): MaCH imputation quality metric
MINIMAC3_R2 (column set: minimac3r2): Minimac3 phased-dosage imputation quality metric; inaccurate unless phased dosages were imported with e.g. "--vcf dosage=HDS" (dosage=DS is not enough)
Produced by --update-alleles when there are too many mismatches between the loaded alleles for a variant and the old-allele column(s) of the --update-alleles input file.
A text file with no header line, and one line per mismatching variant with the following three fields:
PLINK 1's preferred way to represent genotype calls. Must be accompanied by .bim and .fam files. Loaded with --bfile, and generated by --make-bed.
Do not confuse this with the UCSC Genome Browser's BED format, which is totally different. (It is safe to change a PLINK 1 .bed file's extension to .pgen and use --bpfile to load it.)
See the PLINK 1.9 documentation for a detailed description of the usual variant-major form, along with an example. PLINK 2 can also efficiently export the sample-major form ("--export ind-major-bed"); it has third byte equal to zero instead of one, but is otherwise analogous.
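As a rough illustration of the third-byte distinction, the sketch below classifies a .bed file from its first three bytes. The two leading magic bytes (0x6c, 0x1b) are taken from the PLINK 1.9 format description rather than from this page, and the function name is hypothetical.

```python
def bed_layout(header: bytes) -> str:
    """Classify a PLINK 1 .bed file from its first three bytes."""
    # Magic bytes 0x6c 0x1b come from the PLINK 1.9 documentation.
    if len(header) < 3 or header[:2] != b"\x6c\x1b":
        raise ValueError("not a PLINK 1 .bed file")
    if header[2] == 1:
        return "variant-major"  # the usual form
    if header[2] == 0:
        return "sample-major"   # e.g. from "--export ind-major-bed"
    raise ValueError("unrecognized third byte")


print(bed_layout(bytes([0x6C, 0x1B, 0x01])))
```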
Native binary file format for Oxford statistical genetics tools, such as IMPUTE2 and SNPTEST. BGEN v1.1 files should always be accompanied by a .sample file. Loaded with --bgen, and produced by "--export bgen-1.{1,2,3}".
Variant information file accompanying a .bed or biallelic .pgen binary genotype table. (--make-just-bim can be used to update just this file.)
A text file with no header line, and one line per variant with the following six fields:
Chromosome code
Variant ID
Position in centimorgans (safe to use dummy value of '0')
Base-pair coordinate (1-based; limited to 2^31-2)
ALT ("A1" in PLINK 1.x) allele code
REF ("A2" in PLINK 1.x) allele code
A few notes:
Yes, the ALT column comes before the REF column in a .bim file.
When .bed files are involved, the ALT and REF allele codes will sometimes be swapped, since that's PLINK 1.x's default behavior whenever the true REF allele is less common than the ALT allele in the current dataset. If that's a problem, you can use --ref-allele to swap them back.
It is safe to change a .bim file's extension to .pvar and use --pfile to load it.
Variants with negative bp coordinates are ignored by PLINK.
PLINK 1.9 and 2.0 permit the centimorgan column to be omitted. (However, omission is not recommended if the .bim file needs to be read by other software.)
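The field order above, including the ALT-before-REF quirk and the optional centimorgan column, can be sketched as a small parser. This is an illustrative helper, not PLINK code: the record type and function name are hypothetical, and splitting on arbitrary whitespace is an assumption.

```python
from typing import NamedTuple


class BimRecord(NamedTuple):
    chrom: str
    variant_id: str
    cm: float
    pos: int
    alt: str  # "A1" in PLINK 1.x
    ref: str  # "A2" in PLINK 1.x


def parse_bim_line(line: str) -> BimRecord:
    """Parse one .bim line with six fields, or five if CM is omitted."""
    fields = line.split()  # assumption: whitespace-delimited
    if len(fields) == 6:
        chrom, variant_id, cm, pos, alt, ref = fields
    elif len(fields) == 5:
        chrom, variant_id, pos, alt, ref = fields
        cm = "0"  # dummy value when the centimorgan column is absent
    else:
        raise ValueError(f"expected 5 or 6 fields, got {len(fields)}")
    return BimRecord(chrom, variant_id, float(cm), int(pos), alt, ref)


print(parse_bim_line("1\trs123\t0\t1000\tA\tG"))
```

Note that ALT is read before REF, matching the column order described above.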
Produced by --pca. Accompanied by an .eigenval file, which contains one eigenvalue per line.
The .eigenvec file is a text file with a header line and between 1+V and 3+V columns per sample, where V is the number of requested principal components. The first columns contain the sample ID, and the rest are principal component scores in the same order as the .eigenval values (with column headers 'PC1', 'PC2', ...).
With the 'allele-wts' modifier, an .eigenvec.allele file is also generated. It's a text file with a header line, followed by one line per allele with the following columns:
Alternatively, with the 'biallelic-var-wts' modifier, an old-style .eigenvec.var file is generated. It's a text file with a header line, followed by one line per variant with the following columns:
A text file with no header line, and one line per variant with either 3N+5 or 3N+6 fields where N is the number of samples. Each line stores information for a single SNP.
In the 3N+5 case (corresponding to the original specification), the first five fields are:
"SNP ID"
rsID (treated by PLINK as the main variant ID)
Base-pair coordinate
Allele 1 (usually minor, use 'ref-first' when importing to treat as REF)
Allele 2 (usually major, use 'ref-last' when importing to treat as REF)
Unless the chromosome code was declared with --oxford-single-chr (in which case the SNP ID column is ignored), PLINK has no choice but to assume that the "SNP ID" column actually stores chromosome codes. (This is the convention when PLINK exports a 5-leading-column .gen file.)
The newer 3N+6 column flavor has a dedicated chromosome column in front. This was not supported by PLINK 1.9 or 2.0 before 16 Apr 2021.
Each subsequent triplet of values then indicates likelihoods of homozygote A1, heterozygote, and homozygote A2 genotypes at this variant, respectively, for one sample. If they add up to less than one, the remainder is a no-call probability weight.
The PLINK 2 binary format can represent allele count expected values, but it does not distinguish between e.g. {P(hom-ref)=0.28, P(het)=0.52, P(hom-alt)=0.2} and {P(hom-ref)=0.08, P(het)=0.92, P(hom-alt)=0}, and it ignores the no-call probability weight (though "0 0 0" will be correctly converted to a missing call). The --import-dosage-certainty flag can be used during import to replace some of the most uncertain genotype calls with missing values.
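To make the loss of information concrete, the following sketch computes the expected ALT allele count for the two probability triplets quoted above. The function name is hypothetical, and the sketch deliberately does not model how a nonzero no-call weight is treated; it only handles the all-zero case.

```python
import math


def expected_alt_count(p_hom_ref, p_het, p_hom_alt):
    """Expected ALT allele count (0..2) for one genotype probability triplet."""
    if p_hom_ref == 0 and p_het == 0 and p_hom_alt == 0:
        return None  # "0 0 0" is treated as a missing call
    return p_het + 2.0 * p_hom_alt


a = expected_alt_count(0.28, 0.52, 0.2)
b = expected_alt_count(0.08, 0.92, 0.0)
# Both triplets collapse to (approximately) the same expected count, 0.92.
print(math.isclose(a, b))  # True
```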
All statistics are computed across just the samples used in the regression.
1: For multiallelic variants, this column may contain multiple comma-separated alleles when the result doesn't depend on which allele is A1. 2: For males on chrX, these values are normally computed as if males were diploid, since that's the encoding used in the regression. The exception is when "--xchr-model 1" is specified, where male 0..1 values coexist with female 0..2 values in the regression. In that case, these columns will also be based on the mixed male 0..1, female 0..2 scaling.
To be clear, --glm only uses this 0..2 haploid coding on chrX, to put males and females on an equal footing in a world where X-inactivation is common. chrY/chrM use 0..1 coding.
These files contain single-precision (4-byte) floating point values. Using 1-based matrix indices, the first value in each file is the (1, 1) relationship value (.grm.bin) or observation count (.grm.N.bin); the second and third values are the (2, 1) and (2, 2) relationships/counts; the fourth through sixth values are the (3, 1), (3, 2) and (3, 3) relationships/counts in that order; and so on.
Note that .grm.bin files generated by GCTA versions before 1.1 have a different format.
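The lower-triangular ordering described above can be sketched as follows. This is a minimal reader using only the standard library; the function name is hypothetical, and little-endian byte order is an assumption that is not stated on this page.

```python
import struct


def read_grm_bin(data: bytes, n: int) -> list:
    """Expand lower-triangular float32 values into a symmetric n x n matrix."""
    count = n * (n + 1) // 2
    if len(data) != 4 * count:
        raise ValueError("byte count does not match the sample count")
    # Assumption: values are stored little-endian.
    values = struct.unpack(f"<{count}f", data)
    matrix = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(n):          # row (0-based)
        for j in range(i + 1):  # columns up to and including the diagonal
            matrix[i][j] = matrix[j][i] = values[k]
            k += 1
    return matrix


demo = struct.pack("<6f", 1.0, 0.25, 1.0, 0.5, 0.125, 1.0)
print(read_grm_bin(demo, 3))
```

The sample count n would normally come from the accompanying .id file.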
Reference panel haplotype file format for IMPUTE2. Must be accompanied by a .legend file when no variant info header columns are present. Imported with --haps, and produced by "--export haps[legend]".
A text file with no header line, and either 2N+5 or 2N fields where N is the number of samples. In the former case, the first five columns are:
Chromosome code
Variant ID
Base-pair coordinate
Allele 0 (usually minor, use 'ref-first' when importing to treat as REF)
Allele 1 (usually major, use 'ref-last' when importing to treat as REF)
This is followed by a pair of 0/1-valued haplotype columns for the first sample, then a pair of haplotype columns for the second sample, etc. (For male samples on chrX, the second column may contain dummy '-' entries; otherwise, missing genotype calls are not permitted.)
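A line in the 2N+5 layout could be decoded along these lines. The helper names are hypothetical, whitespace splitting is an assumption, and the bare 2N layout (which needs a .legend file) is not handled.

```python
def parse_haps_line(line: str) -> dict:
    """Parse one .haps line that carries the five leading variant columns."""
    fields = line.split()
    n_hap = len(fields) - 5
    if n_hap < 2 or n_hap % 2:
        raise ValueError("expected 2N+5 fields")
    chrom, variant_id, pos, allele0, allele1 = fields[:5]

    def decode(code: str, allow_dummy: bool):
        if code in ("0", "1"):
            return int(code)
        if allow_dummy and code == "-":
            return None  # dummy second haplotype, e.g. males on chrX
        raise ValueError(f"invalid haplotype code {code!r}")

    pairs = [
        (decode(fields[i], False), decode(fields[i + 1], True))
        for i in range(5, len(fields), 2)
    ]
    return {"chrom": chrom, "id": variant_id, "pos": int(pos),
            "alleles": (allele0, allele1), "haplotypes": pairs}


print(parse_haps_line("X rs1 100 A G 0 1 1 -")["haplotypes"])
```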
When generated by PLINK 2, this is a text file which may or may not have a header line. If there's no header line (default with .grm.id files, can be forced for other .id files with --no-id-header), a single column contains IIDs, and two columns contain FID/IID. Otherwise, there's one line per sample after the header line with the following columns:
Header: Contents
FID: Family ID (present iff .psam or --update-ids file has it)
IID: Individual ID (always present)
SID: Source ID (present iff .psam or --update-ids file has it)
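The header and no-header cases can be handled by one small reader, sketched below. The function name is hypothetical, and tab delimiting follows the general convention stated at the top of this page.

```python
def read_id_lines(lines):
    """Return one dict per sample from the lines of a PLINK 2 .id file."""
    rows = [line.rstrip("\r\n").split("\t") for line in lines if line.strip()]
    if not rows:
        return []
    if rows[0][0].startswith("#"):
        # Header line: strip the leading '#' from the first column name.
        columns = [rows[0][0][1:]] + rows[0][1:]
        rows = rows[1:]
    elif len(rows[0]) == 1:
        columns = ["IID"]
    elif len(rows[0]) == 2:
        columns = ["FID", "IID"]
    else:
        raise ValueError("headerless .id files have one or two columns")
    return [dict(zip(columns, row)) for row in rows]


print(read_id_lines(["#IID\tSID\n", "s1\ta\n"]))
```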
A text file with a header line, and one line per sample pair with kinship coefficient no smaller than the --king-table-filter value. When --king-table-filter is not specified, all sample pairs are included. The following columns are present:
Produced by --make-king. Accompanied by a .king[.bin].id file containing sample IDs.
If text, a tab-delimited file that is either lower-triangular (excluding the diagonal) or square. If it's square, the upper-right triangle may be either zeroed out or the mirror-image of the lower-left triangle, depending on whether the 'square0' or 'square' modifier was used.
The binary format is semantically identical; it just has nothing but single- (4-byte) or double-precision (8-byte) floating point values, instead of text+delimiters+linebreaks.
Single-chromosome variant information file accompanying a bare .haps reference panel haplotype file. Imported with --legend, and produced by "--export hapslegend".
A text file with a header line, and one line per variant with the following four columns:
Header: Contents
id: Variant ID
position: Base-pair coordinate
a0: Allele 0 (usually minor, use 'ref-first' to treat as REF)
a1: Allele 1 (usually major, use 'ref-last' to treat as REF)
Pedigree information + genotype call text file. Must be accompanied by a .map file. Loaded with --pedmap, and produced by "--export ped". This format is simultaneously highly inefficient, even relative to other text formats, and limited in scope (unobserved minor allele codes can't be stored); continued use is strongly discouraged.
Contains no header line, and one line per sample with 2V+6 fields where V is the number of variants. The first six fields are the same as those in a .fam file. The seventh and eighth fields are allele calls for the first variant in the .map file ('0' = no call); the 9th and 10th are allele calls for the second variant; and so on. All variants must be biallelic (or monomorphic, or all-missing).
If all alleles are single-character, PLINK 1.9 and 2.0 will correctly parse the more compact "compound genotype" variant of this format, where each genotype call is represented as a single two-character string. This does not require the use of an additional loading flag. You can produce such a file with "--export compound-genotypes".
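The two layouts (separate allele fields versus compound genotypes) can be told apart by field count when the number of variants is known from the .map file. The sketch below assumes that count is passed in; the function name is hypothetical.

```python
def parse_ped_line(line: str, variant_ct: int):
    """Split a .ped line into its six sample fields and per-variant calls."""
    fields = line.split()
    info, calls = fields[:6], fields[6:]
    if len(info) < 6:
        raise ValueError("expected at least six leading fields")
    if len(calls) == 2 * variant_ct:
        # Standard layout: two allele fields per variant.
        genotypes = [(calls[i], calls[i + 1]) for i in range(0, len(calls), 2)]
    elif len(calls) == variant_ct and all(len(c) == 2 for c in calls):
        # Compound genotypes: one two-character string per variant.
        genotypes = [(c[0], c[1]) for c in calls]
    else:
        raise ValueError("field count does not match the variant count")
    return info, genotypes


print(parse_ped_line("f1 s1 0 0 1 2 AG 00", 2))
```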
Most .pgen files have an embedded index, and do not have an accompanying .pgen.pgi file. When the index is not embedded, PLINK 2 expects it to be stored in "<.pgen filename>.pgi".
A draft specification of these formats is available. The first version will be finalized around the beginning of PLINK 2.0 beta testing.
Sample information file accompanying a .pgen binary genotype table. (--make-just-psam can be used to update just this file.)
A text file which usually has at least one header line, where only the last header line starts with '#FID' or '#IID'. This final header line specifies the columns in the .psam file; the following intermediate column headers are recognized:
IID (individual ID; required)
SID (source ID, when there are multiple samples for the same individual)
PAT (individual ID of father, '0' if unknown)
MAT (individual ID of mother, '0' if unknown)
SEX ('1' = male, '2' = female, 'NA'/'0' = unknown)
(FID must either be the first column, or absent. If it's absent, all FID values are now assumed to be '0'.) Any other value is treated as a phenotype/covariate name; see the phenotype/covariate documentation for column encoding details.
If no header line is present, the columns are assumed to be in .fam file order (FID, IID, PAT, MAT, SEX, PHENO1).
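A header-detection routine following these rules might look like the sketch below. The names are hypothetical, and it only inspects column names; value encodings are left to the phenotype/covariate documentation.

```python
FAM_ORDER = ["FID", "IID", "PAT", "MAT", "SEX", "PHENO1"]
RECOGNIZED = {"FID", "IID", "SID", "PAT", "MAT", "SEX"}


def psam_columns(lines):
    """Return (columns, phenotype_or_covariate_names) for .psam lines."""
    for line in lines:
        if line.startswith("#FID") or line.startswith("#IID"):
            columns = line.rstrip("\r\n")[1:].split()
            if "IID" not in columns:
                raise ValueError("IID column is required")
            if "FID" in columns[1:]:
                raise ValueError("FID must be the first column, or absent")
            extras = [c for c in columns if c not in RECOGNIZED]
            return columns, extras
        if not line.startswith("#"):
            break  # reached sample lines without finding a header
    # No header line: fall back to .fam column order.
    return list(FAM_ORDER), ["PHENO1"]


print(psam_columns(["#IID\tSEX\tBMI\n", "s1\t1\t22.5\n"]))
```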
Multiple sequence alignment text file, produced by "--export phylip[-phased]", and recognized by FastTree, IQ-TREE, and several other phylogenetic tools. This format cannot be loaded by PLINK.
The header line contains two numbers, the number of sequences followed by the number of nucleotide codes per sequence.
Each subsequent line contains two fields. The first field contains the sample ID, and is padded by spaces to a fixed width, such that the longest sample ID is followed by exactly 3 spaces. (This imitates the behavior of vcf2phylip.) The second field contains IUPAC nucleotide codes.
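The padding rule can be illustrated with a short formatter. This is a sketch rather than PLINK's writer: the function name is hypothetical, and separating the two header numbers with a single space is an assumption.

```python
def format_phylip(sequences: dict) -> str:
    """Render {sample ID: IUPAC sequence} in the layout described above."""
    lengths = {len(seq) for seq in sequences.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    # The longest sample ID is followed by exactly 3 spaces.
    width = max(len(name) for name in sequences) + 3
    lines = [f"{len(sequences)} {lengths.pop()}"]
    for name, seq in sequences.items():
        lines.append(name.ljust(width) + seq)
    return "\n".join(lines) + "\n"


print(format_phylip({"s1": "ACGT", "sample2": "ACGR"}), end="")
```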
Variant information file accompanying a .pgen binary genotype table. (--make-just-pvar can be used to update just this file.)
A text file which usually has at least one header line, where only the last header line starts with '#CHROM'. This final header line specifies the columns in the .pvar file; the following intermediate column headers are recognized:
POS (base-pair coordinate)
ID (variant ID; required)
REF (reference allele)
ALT (alternate alleles, comma-separated)
QUAL (phred-scaled quality score for whether the locus is variable at all)
FILTER ('PASS', '.', or semicolon-separated list of failing filter codes)
INFO (semicolon-separated list of flags and key-value pairs, with types declared in header)
FORMAT (terminates header line parsing)
CM (centimorgan position)
In particular, a VCF file, or a trimmed VCF file with all columns past the 5th (or 6th, etc.) removed, is valid input for anything expecting a .pvar-format file.
The following VCF-style header lines are also recognized:
"##INFO=<ID=PR,Number=0,Type=Flag...": Indicates the INFO/PR flag, which marks 'provisional' reference alleles (i.e. imported from a file which does not consistently track which allele is reference and which are alternates), is present. (This information is also present in .pgen files, and the loader reports an error when the .pvar and .pgen flags don't match.)
"##chrSet=...": Explicitly specifies the chromosome set. E.g. --make-pgen + --dog will cause "##chrSet=<ID=1,autosomePairCt=38,X,Y,XY,M>" to be written to the .pvar header, and as a consequence it isn't necessary to include the --dog flag when loading the new fileset.
When no header line is present, the columns are assumed to be in .bim file order (CHROM, ID, CM, POS, ALT, REF; or if only 5 columns are present, CM is assumed to be omitted).
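The column-resolution logic can be sketched as follows, including the rule that FORMAT ends header parsing (which is why a full VCF header line is acceptable). The function name is hypothetical, and the 5-column headerless case is not handled here.

```python
BIM_ORDER = ["CHROM", "ID", "CM", "POS", "ALT", "REF"]


def pvar_columns(lines):
    """Return the column names that apply to the given .pvar lines."""
    for line in lines:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\r\n")[1:].split()
            if "FORMAT" in columns:
                # FORMAT terminates header line parsing; drop it and the rest.
                columns = columns[:columns.index("FORMAT")]
            return columns
        if not line.startswith("#"):
            break  # reached variant lines without finding a header
    return list(BIM_ORDER)  # no header: assume .bim order (6-column case)


vcf_header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n"
print(pvar_columns(["##fileformat=VCFv4.3\n", vcf_header]))
```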
Produced by "--export {A,AD}"; suitable for loading from R. This format cannot be loaded by PLINK.
A text file with a header line, and then one line per sample with V+6 (for "--export A") or 2V+6 (for "--export AD") fields, where V is the number of variants. The header line does not contain a preceding '#'. The first six fields are:
FID: Family ID
IID: Individual ID
PAT: Paternal individual ID
MAT: Maternal individual ID
SEX: Sex (1 = male, 2 = female, 0 = unknown)
PHENOTYPE: First active non-categorical phenotype (missing value if none)
This is followed by one or two fields per variant:
<Variant ID>_<counted allele>: Allelic dosage (missing = 'NA', haploid scaled to 0..2)
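Although the format is aimed at R, a row can be decoded in a few lines of Python as well. The sketch below pairs a header line with one sample line and maps 'NA' to None; the function name is hypothetical, and whitespace splitting is an assumption.

```python
def parse_raw_row(header_line: str, row_line: str):
    """Return (sample fields, per-column dosages) for one .raw sample line."""
    names = header_line.split()
    values = row_line.split()
    if len(names) < 6 or len(names) != len(values):
        raise ValueError("header and row must have the same 6+ fields")
    sample = dict(zip(names[:6], values[:6]))
    dosages = {
        name: (None if value == "NA" else float(value))
        for name, value in zip(names[6:], values[6:])
    }
    return sample, dosages


header = "FID IID PAT MAT SEX PHENOTYPE rs1_A rs2_G"
print(parse_raw_row(header, "f1 s1 0 0 1 -9 2 NA"))
```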
Produced by --make-rel. Accompanied by a .rel[.bin].id file containing sample IDs.
Contents are identical to those of a .grm/.grm.bin file. Possible shapes are essentially the same as for .king files; the only difference is that .king files have an omitted or constant-0.5 diagonal while .rel files do not.
Sample information file accompanying a .gen or .bgen genotype dosage file, or a .haps phased reference panel. Loaded with --data/--sample, and produced by --export in several cases.
By default, the .sample space-delimited files emitted by --export have two header lines, and then one line per sample with 4+ fields:
Number of singletons relative to this dataset, considering just diploid calls (see note 3)
SINGLETON_CT (column set: single): Number of singletons relative to this dataset
HAP_REF_INCL_FEMALE_Y_CT (column set: haprefwfemaley): Haploid REF count, counting chrY for everyone
HAP_REF_CT (column set: hapref): Haploid REF count, excluding chrY for nonmales
HAP_ALT_INCL_FEMALE_Y_CT (column set: hapaltwfemaley): Haploid ALT count, counting chrY for everyone
HAP_ALT_CT (column set: hapalt): Haploid ALT count, excluding chrY for nonmales
MISSING_INCL_FEMALE_Y_CT (column set: missingwfemaley): Missing call count, counting chrY for everyone
MISSING_CT (column set: missing): Missing call count, excluding chrY for nonmales
The 'hetsnp', 'dipts'/'ts'/'diptv'/'tv', 'dipnonsnpsymb'/'nonsnpsymb', 'symbolic', and 'nonsnp' columns count each ALT allele in a heterozygous ALTx-ALTy genotype separately, since they can be of different subtypes. (I.e. if they are of the same subtype, the corresponding count is incremented by 2.) As a consequence, these columns are unaffected by variant split/join.
3: If the ALT allele in a chrX biallelic variant appears in exactly one female and one male, that counts as a singleton in this column for just the female.
Number of missing hardcalls, not counting het haploids
MISSING_AND_HETHAP_CT (column set: nmisshh): Number of missing hardcalls, counting het haploids
HETHAP_CT (column set: hethap): Number of heterozygous haploid hardcalls
OBS_CT (column set: nobs): Denominator (# samples, females excluded on chrY)
F_MISS_DOSAGE (column set: fmissdosage): Missing dosage rate
F_MISS (column set: fmiss): Missing hardcall rate, not counting het haploids
F_MISS_AND_HETHAP (column set: fmisshh): Missing hardcall rate, counting het haploids
When dosages are present, MISSING_DOSAGE_CT will typically be slightly lower than MISSING_CT, since hardcalls normally aren't saved for dosages in (0.1, 0.9) or (1.1, 1.9).
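The relationship between dosages and saved hardcalls can be sketched as below. The 0.1 distance is read off the intervals quoted above and exposed as a parameter; the function name is hypothetical, and exact behavior at the interval endpoints is not modeled carefully because of floating-point rounding.

```python
def dosage_to_hardcall(dosage: float, threshold: float = 0.1):
    """Return 0, 1 or 2 when the dosage is close enough to an integer, else None."""
    nearest = round(dosage)
    if abs(dosage - nearest) <= threshold:
        return nearest
    return None  # dosage present, but no hardcall saved


for value in (0.05, 0.5, 1.2, 1.95):
    print(value, dosage_to_hardcall(value))
```

A sample with dosage 0.5 therefore counts toward MISSING_CT but not toward MISSING_DOSAGE_CT.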
Variant information + genotype call text file. Must be accompanied by a .tfam file. Loaded with --tfile, and produced by "--export tped".
Contains no header line, and one line per variant with 2N+4 fields where N is the number of samples. The first four fields are the same as those in a .map file. The fifth and sixth fields are allele calls for the first sample in the .tfam file ('0' = no call); the 7th and 8th are allele calls for the second sample; and so on. All variants must be biallelic (or monomorphic, or all-missing).
Variant information + sample ID + genotype call file; text if .vcf, binary if .bcf. Imported with --vcf/--bcf, and produced by "--export {b,v}cf".
Note that, while PLINK 2.0 supports a much larger subset of the VCF standard than PLINK 1.9, it still isn't appropriate for general-purpose VCF handling. Instead, the goal is to provide a very useful complement to bcftools. For example, PLINK 2.0 does not save per-call read depths, so any data management or analysis which requires them to be kept around should be done with bcftools or a similarly general tool; but once you're done with variant calling/imputation and are ready to treat your data as a single matrix of hardcalls or dosages (possibly with missing entries), PLINK 2.0 is much more efficient.
The VCFv4.3 files emitted by "--export vcf" start with the following three header lines:
##fileformat=VCFv4.3
##fileDate=<yyyymmdd date>
##source=PLINKv2.00
This is usually followed by all the VCF header lines (if any) present in the loaded .pvar file, a "##chrSet=" chromosome set description when appropriate, and additional "##contig=", INFO/PR, and FORMAT header lines when necessary to make the file conform to the VCF standard.
Next comes a tab-delimited header line with the following N+9 fields (where N is the number of samples), and one tab-delimited line per variant with the same fields:
#CHROM: Chromosome code
POS: Base-pair coordinate
ID: Variant identifier
REF: Reference allele (missing = 'N')
ALT: All alternate alleles, comma-separated (missing = '.')
QUAL: Phred-scaled quality score for whether the locus is variable at all
FILTER: 'PASS', '.', or semicolon-separated list of failing filter codes
INFO: Semicolon-separated list of flags and key-value pairs, with types declared in header
FORMAT: 'GT', 'DS', 'HDS', and/or 'GP' can be emitted by PLINK 2
<Sample ID>, ...: Genotype/dosage calls
Allele codes are supposed to either start with '<', only contain characters in the set {A,C,G,T,N,a,c,g,t,n}, be an isolated '*', or represent a breakend. --export issues a warning if an allele code does not satisfy this restriction.
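A check along the lines of this restriction could be written as follows. It is a simplified sketch, not the test PLINK applies: the function name is hypothetical, and treating any code containing a square bracket as a breakend is an assumption based on VCF breakend notation.

```python
import re

_BASES = re.compile(r"[ACGTNacgtn]+")


def is_vcf_style_allele(code: str) -> bool:
    """Return True if the allele code fits one of the four permitted forms."""
    if code == "*":
        return True  # isolated '*'
    if code.startswith("<"):
        return True  # symbolic allele such as <DEL>
    if _BASES.fullmatch(code):
        return True  # plain base string
    # Assumption: breakends are recognized by their square brackets.
    return "[" in code or "]" in code


for allele in ("A", "<DEL>", "*", "G]17:198982]", "I"):
    print(allele, is_vcf_style_allele(allele))
```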
The full VCFv4.3 specification is in the hts-specs GitHub repository; this includes details on the BCF binary encoding.
Comma-separated nonmajor alleles for second variant in pair
NONMAJ_FREQ_B (column set: freq): (1 - <major-allele frequency>) for second variant in pair
[UN]PHASED_R[2] (required): Variant correlation coefficient
D (column set: d): Linkage disequilibrium D (phased only)
DPRIME (column set: dprime): Lewontin's D' (phased only)
ABS_DPRIME (column set: dprimeabs): Absolute value of Lewontin's D' (phased only)
Sign of [UN]PHASED_R, D, and DPRIME is positive when the major (or, with 'ref-based', REF) alleles are positively correlated.
4: The 'maj' (or 'ref' when the 'ref-based' modifier is specified) column-set is included by default in --r-phased and --r-unphased's tabular output, but excluded by default for --r2-phased and --r2-unphased.
Produced by --r[2]-[un]phased when in matrix-output mode; the exact file extension distinguishes phased vs. unphased (which appears in the component before '.vcor1' or '.vcor2'), r vs. r2, and text vs. binary format. Accompanied by a <matrix filename>.vars file containing variant IDs.
Possible shapes are the same as for .king files, except that triangular files include the diagonal.
Number of missing hardcalls, not counting het haploids
MISSING_AND_HETHAP_CT (column set: nmisshh): Number of missing hardcalls, counting het haploids
HETHAP_CT (column set: hethap): Number of heterozygous haploid hardcalls
OBS_CT (column set: nobs): Denominator (# variants for males, excludes chrY for females)
F_MISS_DOSAGE (column set: fmissdosage): Missing dosage rate
F_MISS (column set: fmiss): Missing hardcall rate, not counting het haploids
F_MISS_AND_HETHAP (column set: fmisshh): Missing hardcall rate, counting het haploids
F_HETHAP (column set: fhethap): Heterozygous haploid rate
When dosages are present, MISSING_DOSAGE_CT will typically be slightly lower than MISSING_CT, since hardcalls normally aren't saved for dosages in (0.1, 0.9) or (1.1, 1.9).
Produced by "--variant-score bin". Accompanied by .vscore.cols and .vscore.vars text files containing column (score) and row (variant ID) labels, respectively.
A matrix of double-precision (8-byte) floating point variant scores.