Introduction, downloads

D: 2 Jul 2025


Resources

This page is under construction. If there's something you consider to be an essential PLINK resource which is not mentioned on this page, contact us, comment in the plink2-users Google group, or open a GitHub issue.

The linked files are currently hosted by Dropbox. If you are unable to download them, contact us for access to an alternate source; we understand that Dropbox is blocked in some locations.

Genotype data

1000 Genomes phase 3, phased and (optionally) annotated

Callset: (main source, chrY/chrM/contigs source) (source)

Split by chromosome?
Keep singleton variants?
rsIDs from dbSNP 156?
INFO annotations?
KING-based pedigree corrections?

all_hg38.pgen.zst (3.16 GiB, requires --allow-extra-chr)
all_phase3.pgen.zst (2.25 GiB)
all_phase3_ns.pgen.zst (2.13 GiB)

all_hg38_rs.pvar.zst (2.68 GiB, >75% of this is INFO annotations) (rename to "all_hg38.pvar.zst" before use)
all_hg38_rs_noannot.pvar.zst (556 MiB) (rename to "all_hg38.pvar.zst" before use)
all_hg38.pvar.zst (4.41 GiB, >90% of this is annotations)
all_hg38.pvar.zst (2.52 GiB, >80% of this is annotations)
all_hg38_noannot.pvar.zst (359 MiB) (rename to "all_hg38.pvar.zst" before use)
all_hg38_noannot.pvar.zst (349 MiB) (rename to "all_hg38.pvar.zst" before use)
all_phase3.pvar.zst (1.26 GiB)
all_phase3_noannot.pvar.zst (614 MiB) (rename to "all_phase3.pvar.zst" before use)
all_phase3_ns.pvar.zst (812 MiB)
all_phase3_ns_noannot.pvar.zst (362 MiB) (rename to "all_phase3_ns.pvar.zst" before use)

hg38_corrected.psam or hg38_orig.psam (rename to "all_hg38.psam" before use)
phase3_corrected.psam or phase3_orig.psam (rename to "all_phase3.psam" or "all_phase3_ns.psam" before use)

Common sample information file (not for chrY/chrM): one of hg38_corrected.psam, hg38_orig.psam, phase3_corrected.psam, or phase3_orig.psam. Create symlinks from chr1_hg38.psam, chr2_hg38.psam, chr1_phase3.psam, chr2_phase3.psam, etc. to this file (or make a bunch of copies).
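The symlink step can be scripted. The following is only a minimal sketch: link_psam is a hypothetical helper name, and it assumes the per-chromosome filesets cover chr1 to chr22 plus chrX and follow the naming pattern shown above.

```shell
# Hypothetical helper: point per-chromosome .psam names at one common file.
# $1 = common .psam file, $2 = filename prefix, $3 = filename suffix
# Assumption: per-chromosome filesets exist for chromosomes 1..22 and X only.
link_psam() {
  i=1
  while [ "$i" -le 22 ]; do
    ln -sf "$1" "${2}chr${i}${3}.psam"
    i=$((i + 1))
  done
  ln -sf "$1" "${2}chrX${3}.psam"
}

# Example calls (uncomment the one matching your download):
# link_psam hg38_corrected.psam "" _hg38       # chr1_hg38.psam, chr2_hg38.psam, ...
# link_psam phase3_corrected.psam "" _phase3   # chr1_phase3.psam, chr2_phase3.psam, ...
```

The symlinks are relative, so the common .psam file should sit in the same directory as the per-chromosome files.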

Remove "_rs", "_rs_noannot", or "_noannot" from the .pvar.zst filenames before use.

Notes:

Due to a header line and an INFO annotation quirk, PLINK 2 builds older than 8 Jan 2023 are unable to convert this dataset to or from BCF.

.pgen.zst file(s) must be decompressed before use. (This isn't necessary for .pvar.zst files: see --pfile's 'vzs' modifier.) If you don't have another .zst decompressor installed, you can use PLINK 2 for this purpose:
plink2 --zst-decompress all_hg38.pgen.zst all_hg38.pgen
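Putting the decompression and the 'vzs' modifier together, a first run might look like the sketch below. It assumes the .pvar.zst and .psam files have already been renamed to share the fileset prefix; the function name prepare_pfile and the choice of --freq as a first command are illustrative only.

```shell
# Path to the PLINK 2 binary; override if it is not on your PATH.
PLINK2="${PLINK2:-plink2}"

# Hypothetical helper: decompress <prefix>.pgen.zst, then run a first command.
# $1 = fileset prefix, e.g. all_hg38
prepare_pfile() {
  # The .pgen must be decompressed before use.
  "$PLINK2" --zst-decompress "$1.pgen.zst" "$1.pgen" || return 1
  # 'vzs' makes --pfile read <prefix>.pvar.zst directly, so the .pvar stays compressed.
  # --allow-extra-chr is required for all_hg38; it is assumed harmless elsewhere.
  "$PLINK2" --pfile "$1" vzs --allow-extra-chr --freq --out "$1"
}

# Example call:
# prepare_pfile all_hg38
```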

In addition to ~600 trios which were intentionally included, this dataset contains a few close relations which are not described in the .psam file, e.g. sibships where neither parent was sequenced. Use --remove with one of the following ID lists when you don't want close relations:

1st degree: deg1_hg38.king.cutoff.out.id (621 samples)
1st+2nd degree: deg2_hg38.king.cutoff.out.id (629 samples)

These lists were generated from the original dataset with "--king-cutoff 0.177" and "--king-cutoff 0.0884", respectively. If you're curious, here's the --make-king-table + --king-table-filter report listing all 1st/2nd-degree related sample pairs: deg2_hg38.kin0
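As a sketch of how these lists might be used or regenerated, assuming the fileset has been prepared under the prefix all_hg38 (the function names and output prefixes below are hypothetical):

```shell
# Path to the PLINK 2 binary; override if it is not on your PATH.
PLINK2="${PLINK2:-plink2}"

# Drop the samples named in an ID list and write a new fileset.
# $1 = input prefix, $2 = ID list, $3 = output prefix
remove_relatives() {
  "$PLINK2" --pfile "$1" vzs --allow-extra-chr --remove "$2" --make-pgen --out "$3"
}

# Regenerate an exclusion list; this should write <out>.king.cutoff.out.id.
# $1 = input prefix, $2 = KING kinship cutoff, $3 = output prefix
list_relatives() {
  "$PLINK2" --pfile "$1" vzs --allow-extra-chr --king-cutoff "$2" --out "$3"
}

# Example calls:
# remove_relatives all_hg38 deg2_hg38.king.cutoff.out.id all_hg38_unrel
# list_relatives all_hg38 0.0884 deg2_hg38
```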

1st degree: deg1_hg38.king.cutoff.out.id (621 samples)
1st+2nd degree: deg2_hg38.king.cutoff.out.id (627 samples)

This dataset was intended to contain only unrelated samples; unfortunately, a few parent-child pairs, sibships, and second-degree relationships snuck in. Use --remove with one of the following ID lists when you don't want close relations:

1st degree: deg1_phase3.king.cutoff.out.id (11 samples)
1st+2nd degree: deg2_phase3.king.cutoff.out.id (14 samples)

Coverage and heterozygosity statistics indicate that the sequenced HG03511 cell line has only one copy of chrX, and no copies of chrY. This can be a consequence of either mosaic loss of chrX in a female or mosaic loss of chrY in a male. The pedigree provided by 1000 Genomes labels this sample as female, so that is almost certainly the true sex, but (as of 22 Jun 2024) we have decided to label sex as NA in the KING-corrected .psam file because its chrX variant calls are much more representative of male data than female data. (Between 9 Jan 2023 and 21 Jun 2024, the sex was "corrected" to male; then we took a closer look at the original sequence reads.)

This dataset fuses results from two different pipelines. The primary chr1..chrX genotypes are phased, contain no missing calls, and only have biallelic left-normalized variants (multiallelic variants were "split"). The chrY/chrM/contigs genotypes are unphased and contain some missing calls; multiallelic variants there are unsplit, and a few variants aren't left-normalized.

All relevant information in the original phased chr1..chrX callset is preserved. The chrY/chrM/contigs source material contains per-genotype AD, DP, GQ, and PL fields which cannot be represented by the .pgen file format, and are consequently not preserved.

There was previously an option to download "no-singleton" files. This is no longer available, since the Byrska-Bishop et al. quality-control pipeline removed almost all genuine singletons on chr1..chrX.

This dataset contains (unsplit) multiallelic variants, and a few variants which aren't left-normalized.

Refer to the 1000 Genomes website for additional sample information, data usage rules, and citation instructions.

Human Genome Diversity Project (HGDP) - CEPH Panel

Callset: (source) (source)

Split by chromosome?
INFO annotations?

hgdp_all.pgen.zst (2.26 GiB)
hgdp_all.pvar.zst (3.33 GiB)
hgdp_all_noannot.pvar.zst (573 MiB) (rename to "hgdp_all.pvar.zst" before use)
hgdp.psam (rename to "hgdp_all.psam" before use)

Common sample information file: hgdp.psam. Create symlinks from hgdp_chr1.psam, hgdp_chr2.psam, etc. to this (or make a bunch of copies).

Remove "_noannot" from the .pvar.zst filenames before use.

hgdp_statphase.pgen.zst (1.61 GiB)
hgdp_statphase.pvar.zst (474 MiB)
hgdp.psam (rename to "hgdp_statphase.psam" before use)

Notes:

.pgen.zst file(s) must be decompressed before use. (This isn't necessary for .pvar.zst files: see --pfile's 'vzs' modifier.) If you don't have another .zst decompressor installed, you can use PLINK 2 for this purpose:
plink2 --zst-decompress hgdp_all.pgen.zst hgdp_all.pgen

This dataset was aligned to GRCh38, and variant calls were made on the autosomes, chrX, and chrY. There are 929 samples, with no 1st-degree relations. Samples have been sorted by ID.

The dataset contains (unsplit) multiallelic variants, and one variant on chrY which isn't left-normalized. ~6.57% of genotype calls are missing.

The source material contains per-genotype AD, DP, GQ, and PL fields which cannot be represented by the .pgen file format, and are consequently not preserved.

This dataset was aligned to GRCh38, and variant calls were made on only the autosomes. There are 929 samples, with no 1st-degree relations. Samples have been sorted by ID.

The dataset contains (unsplit) multiallelic variants, and ~4.27% of genotype calls are missing. All variants are left-normalized.

This data is freely available with no restrictions on the types of analyses that can be carried out. If you use this data in a publication, please cite Bergström et al. (2020) Insights into human genetic variation and population history from 929 diverse genomes. Science, 367.
Refer to the paper and the Fondation Jean Dausset-CEPH website for more information about this dataset.

Reference genomes

These are the reference genomes that the aforementioned 1000 Genomes and HGDP samples were aligned against. Note that --fa can directly read these compressed files.

GRCh38_full_analysis_set_plus_decoy_hla.fa.zst (716 MiB, contains a few non-A/C/G/T/N codes)
hs37d5.fa.zst (703 MiB)
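One possible use of these files is left-normalizing the few variants noted above as not left-normalized. The sketch below assumes --normalize can be combined with --fa and --make-pgen in a single run; the function name and the output prefix are hypothetical, and the non-A/C/G/T/N codes in the GRCh38 file may need separate handling.

```shell
# Path to the PLINK 2 binary; override if it is not on your PATH.
PLINK2="${PLINK2:-plink2}"

# Left-normalize a fileset against a reference genome and write a new fileset.
# $1 = input prefix, $2 = reference FASTA (--fa reads the .zst directly), $3 = output prefix
left_normalize() {
  "$PLINK2" --pfile "$1" vzs --fa "$2" --normalize --make-pgen --out "$3"
}

# Example call:
# left_normalize hgdp_all GRCh38_full_analysis_set_plus_decoy_hla.fa.zst hgdp_all_norm
```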
