
D: 14 Nov 2024


Resources

This page is under construction. If there's something you consider to be an essential PLINK resource which is not mentioned on this page, contact us, comment in the plink2-users Google group, or open a GitHub issue.

The linked files are currently hosted by Dropbox. If you are unable to download them, contact us for access to an alternate source; we understand that Dropbox is blocked in some locations.

Genotype data

1000 Genomes phase 3, phased and (optionally) annotated

Callset: (main source, chrY/chrM/contigs source) (main source, chrY/chrM/contigs source) (source)

  Split by chromosome?   Keep singleton variants? (more info...)

  rsIDs from dbSNP 156? (more info...)

  INFO annotations?

  KING-based pedigree corrections? (more info...)

all_hg38.pgen.zst (3.16 GiB, requires --allow-extra-chr)
all_phase3.pgen.zst (2.25 GiB)
all_phase3_ns.pgen.zst (2.13 GiB)

all_hg38_rs.pvar.zst (2.68 GiB, >75% of this is INFO annotations) (rename to "all_hg38.pvar.zst" before use)
all_hg38_rs_noannot.pvar.zst (556 MiB) (rename to "all_hg38.pvar.zst" before use)
all_hg38.pvar.zst (4.41 GiB, >90% of this is annotations)
all_hg38.pvar.zst (2.52 GiB, >80% of this is annotations)
all_hg38_noannot.pvar.zst (359 MiB) (rename to "all_hg38.pvar.zst" before use)
all_hg38_noannot.pvar.zst (349 MiB) (rename to "all_hg38.pvar.zst" before use)
all_phase3.pvar.zst (1.26 GiB)
all_phase3_noannot.pvar.zst (614 MiB) (rename to "all_phase3.pvar.zst" before use)
all_phase3_ns.pvar.zst (812 MiB)
all_phase3_ns_noannot.pvar.zst (362 MiB) (rename to "all_phase3_ns.pvar.zst" before use)

hg38_corrected.psam, hg38_orig.psam (rename to "all_hg38.psam" before use)
phase3_corrected.psam, phase3_orig.psam (rename to "all_phase3.psam" or "all_phase3_ns.psam" before use)

Common sample information file (not for chrY/chrM): hg38_corrected.psam, hg38_orig.psam, phase3_corrected.psam, or phase3_orig.psam, depending on the callset and pedigree-correction option selected. Create symlinks from chr1_hg38.psam, chr2_hg38.psam, etc. (or chr1_phase3.psam, chr2_phase3.psam, etc.) to this file, or make copies of it.
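
The symlink step can be scripted. The following is a minimal sketch, assuming hg38_corrected.psam sits in the current directory and the per-chromosome hg38 files were downloaded; the variable names are illustrative, and the loop only covers chr1..chr22 and chrX because the common file is stated not to apply to chrY/chrM.

```shell
#!/bin/sh
# Link each per-chromosome .psam name to the one shared sample file.
# Assumption: hg38_corrected.psam is in the current directory. Substitute
# hg38_orig.psam, or a phase3_*.psam file with suffix=phase3, as appropriate.
psam=hg38_corrected.psam
suffix=hg38

# chrY/chrM are skipped: the common sample file is not meant for them.
for chr in $(seq 1 22) X; do
    ln -sf "$psam" "chr${chr}_${suffix}.psam"
done
```

Copies work equally well (replace `ln -sf` with `cp`); symlinks simply avoid keeping many identical files in sync.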

Remove "_rs" from the .pvar.zst filenames before use.

Remove "_rs_noannot" from the .pvar.zst filenames before use.

Remove "_noannot" from the .pvar.zst filenames before use.
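
With one file per chromosome, renaming by hand gets tedious. Below is a small sketch of a helper that strips one of the tags above from every matching .pvar.zst file in the current directory. The function name `strip_pvar_tag` is made up for this example, and it assumes you downloaded only one .pvar.zst variant per chromosome (so no two files collapse onto the same name).

```shell
#!/bin/sh
# strip_pvar_tag TAG: rename every "<name>_TAG.pvar.zst" in the current
# directory to "<name>.pvar.zst". Illustrative helper, not part of PLINK.
strip_pvar_tag() {
    tag=$1
    for f in *_"$tag".pvar.zst; do
        [ -e "$f" ] || continue          # glob matched nothing; skip
        mv -- "$f" "${f%_${tag}.pvar.zst}.pvar.zst"
    done
}

# Pick the call that matches the files you downloaded:
# strip_pvar_tag rs
# strip_pvar_tag rs_noannot
# strip_pvar_tag noannot
```

Note that the `noannot` pattern also matches "_rs_noannot" files, so use the `rs_noannot` call for those.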

chr1_hg38.pgen.zst (236 MiB), chr1_hg38_rs.pvar.zst (205 MiB) chr1_hg38_rs_noannot.pvar.zst (42.7 MiB) chr1_hg38.pvar.zst (193 MiB) chr1_hg38_noannot.pvar.zst (27.2 MiB)
chr2_hg38.pgen.zst (247 MiB), chr2_hg38_rs.pvar.zst (216 MiB) chr2_hg38_rs_noannot.pvar.zst (45.5 MiB) chr2_hg38.pvar.zst (204 MiB) chr2_hg38_noannot.pvar.zst (29.1 MiB)
chr3_hg38.pgen.zst (205 MiB), chr3_hg38_rs.pvar.zst (177 MiB) chr3_hg38_rs_noannot.pvar.zst (37.2 MiB) chr3_hg38.pvar.zst (167 MiB) chr3_hg38_noannot.pvar.zst (24.1 MiB)
chr4_hg38.pgen.zst (196 MiB), chr4_hg38_rs.pvar.zst (172 MiB) chr4_hg38_rs_noannot.pvar.zst (36.5 MiB) chr4_hg38.pvar.zst (162 MiB) chr4_hg38_noannot.pvar.zst (23.7 MiB)
chr5_hg38.pgen.zst (184 MiB), chr5_hg38_rs.pvar.zst (160 MiB) chr5_hg38_rs_noannot.pvar.zst (33.9 MiB) chr5_hg38.pvar.zst (151 MiB) chr5_hg38_noannot.pvar.zst (21.5 MiB)
chr6_hg38.pgen.zst (178 MiB), chr6_hg38_rs.pvar.zst (154 MiB) chr6_hg38_rs_noannot.pvar.zst (32.1 MiB) chr6_hg38.pvar.zst (145 MiB) chr6_hg38_noannot.pvar.zst (20.7 MiB)
chr7_hg38.pgen.zst (176 MiB), chr7_hg38_rs.pvar.zst (151 MiB) chr7_hg38_rs_noannot.pvar.zst (30.9 MiB) chr7_hg38.pvar.zst (143 MiB) chr7_hg38_noannot.pvar.zst (20.2 MiB)
chr8_hg38.pgen.zst (159 MiB), chr8_hg38_rs.pvar.zst (138 MiB) chr8_hg38_rs_noannot.pvar.zst (28.9 MiB) chr8_hg38.pvar.zst (130 MiB) chr8_hg38_noannot.pvar.zst (18.4 MiB)
chr9_hg38.pgen.zst (140 MiB), chr9_hg38_rs.pvar.zst (118 MiB) chr9_hg38_rs_noannot.pvar.zst (23.6 MiB) chr9_hg38.pvar.zst (111 MiB) chr9_hg38_noannot.pvar.zst (15.2 MiB)
chr10_hg38.pgen.zst (150 MiB), chr10_hg38_rs.pvar.zst (128 MiB) chr10_hg38_rs_noannot.pvar.zst (26.1 MiB) chr10_hg38.pvar.zst (121 MiB) chr10_hg38_noannot.pvar.zst (17.2 MiB)
chr11_hg38.pgen.zst (141 MiB), chr11_hg38_rs.pvar.zst (123 MiB) chr11_hg38_rs_noannot.pvar.zst (25.5 MiB) chr11_hg38.pvar.zst (116 MiB) chr11_hg38_noannot.pvar.zst (16.5 MiB)
chr12_hg38.pgen.zst (143 MiB), chr12_hg38_rs.pvar.zst (121 MiB) chr12_hg38_rs_noannot.pvar.zst (24.9 MiB) chr12_hg38.pvar.zst (115 MiB) chr12_hg38_noannot.pvar.zst (16.5 MiB)
chr13_hg38.pgen.zst (107 MiB), chr13_hg38_rs.pvar.zst (91.6 MiB) chr13_hg38_rs_noannot.pvar.zst (18.7 MiB) chr13_hg38.pvar.zst (86.6 MiB) chr13_hg38_noannot.pvar.zst (12.1 MiB)
chr14_hg38.pgen.zst (98.5 MiB), chr14_hg38_rs.pvar.zst (84.0 MiB) chr14_hg38_rs_noannot.pvar.zst (17.1 MiB) chr14_hg38.pvar.zst (79.5 MiB) chr14_hg38_noannot.pvar.zst (11.5 MiB)
chr15_hg38.pgen.zst (97.0 MiB), chr15_hg38_rs.pvar.zst (79.7 MiB) chr15_hg38_rs_noannot.pvar.zst (15.8 MiB) chr15_hg38.pvar.zst (75.5 MiB) chr15_hg38_noannot.pvar.zst (10.2 MiB)
chr16_hg38.pgen.zst (107 MiB), chr16_hg38_rs.pvar.zst (88.8 MiB) chr16_hg38_rs_noannot.pvar.zst (17.6 MiB) chr16_hg38.pvar.zst (84.0 MiB) chr16_hg38_noannot.pvar.zst (11.7 MiB)
chr17_hg38.pgen.zst (94.9 MiB), chr17_hg38_rs.pvar.zst (78.2 MiB) chr17_hg38_rs_noannot.pvar.zst (15.6 MiB) chr17_hg38.pvar.zst (74.1 MiB) chr17_hg38_noannot.pvar.zst (10.1 MiB)
chr18_hg38.pgen.zst (87.5 MiB), chr18_hg38_rs.pvar.zst (73.0 MiB) chr18_hg38_rs_noannot.pvar.zst (14.7 MiB) chr18_hg38.pvar.zst (69.1 MiB) chr18_hg38_noannot.pvar.zst (9.85 MiB)
chr19_hg38.pgen.zst (80.6 MiB), chr19_hg38_rs.pvar.zst (65.1 MiB) chr19_hg38_rs_noannot.pvar.zst (12.5 MiB) chr19_hg38.pvar.zst (61.9 MiB) chr19_hg38_noannot.pvar.zst (7.92 MiB)
chr20_hg38.pgen.zst (74.0 MiB), chr20_hg38_rs.pvar.zst (61.7 MiB) chr20_hg38_rs_noannot.pvar.zst (12.2 MiB) chr20_hg38.pvar.zst (58.5 MiB) chr20_hg38_noannot.pvar.zst (7.89 MiB)
chr21_hg38.pgen.zst (46.5 MiB), chr21_hg38_rs.pvar.zst (38.3 MiB) chr21_hg38_rs_noannot.pvar.zst (7.47 MiB) chr21_hg38.pvar.zst (36.3 MiB) chr21_hg38_noannot.pvar.zst (4.97 MiB)
chr22_hg38.pgen.zst (50.5 MiB), chr22_hg38_rs.pvar.zst (41.3 MiB) chr22_hg38_rs_noannot.pvar.zst (7.91 MiB) chr22_hg38.pvar.zst (39.2 MiB) chr22_hg38_noannot.pvar.zst (5.01 MiB)
chrX_hg38.pgen.zst (96.0 MiB), chrX_hg38_rs.pvar.zst (108 MiB) chrX_hg38_rs_noannot.pvar.zst (21.5 MiB) chrX_hg38.pvar.zst (94.5 MiB) chrX_hg38_noannot.pvar.zst (9.06 MiB)
chrY_hg38.pgen.zst (7.83 MiB), chrY_hg38_rs.pvar.zst (8.77 MiB) chrY_hg38_rs_noannot.pvar.zst (1.34 MiB) chrY_hg38.pvar.zst (8.07 MiB) chrY_hg38_noannot.pvar.zst (734 KiB)
chrM_hg38.pgen.zst (69.1 KiB), chrM_hg38_rs.pvar.zst (202 KiB) chrM_hg38_rs_noannot.pvar.zst (29.7 KiB) chrM_hg38.pvar.zst (188 KiB) chrM_hg38_noannot.pvar.zst (18.0 KiB)
contigs_hg38.pgen.zst (63.3 MiB), contigs_hg38_rs.pvar.zst (64.6 MiB) contigs_hg38_rs_noannot.pvar.zst (6.36 MiB) contigs_hg38.pvar.zst (137 MiB) contigs_hg38_noannot.pvar.zst (5.16 MiB)

chr1_hg38.pgen.zst (236 MiB), chr1_hg38.pvar.zst (347 MiB) chr1_hg38_noannot.pvar.zst (27.1 MiB)
chr2_hg38.pgen.zst (247 MiB), chr2_hg38.pvar.zst (365 MiB) chr2_hg38_noannot.pvar.zst (29.4 MiB)
chr3_hg38.pgen.zst (204 MiB), chr3_hg38.pvar.zst (298 MiB) chr3_hg38_noannot.pvar.zst (23.8 MiB)
chr4_hg38.pgen.zst (196 MiB), chr4_hg38.pvar.zst (290 MiB) chr4_hg38_noannot.pvar.zst (23.3 MiB)
chr5_hg38.pgen.zst (183 MiB), chr5_hg38.pvar.zst (271 MiB) chr5_hg38_noannot.pvar.zst (21.6 MiB)
chr6_hg38.pgen.zst (178 MiB), chr6_hg38.pvar.zst (259 MiB) chr6_hg38_noannot.pvar.zst (20.6 MiB)
chr7_hg38.pgen.zst (176 MiB), chr7_hg38.pvar.zst (252 MiB) chr7_hg38_noannot.pvar.zst (19.8 MiB)
chr8_hg38.pgen.zst (159 MiB), chr8_hg38.pvar.zst (232 MiB) chr8_hg38_noannot.pvar.zst (18.5 MiB)
chr9_hg38.pgen.zst (140 MiB), chr9_hg38.pvar.zst (195 MiB) chr9_hg38_noannot.pvar.zst (15.0 MiB)
chr10_hg38.pgen.zst (150 MiB), chr10_hg38.pvar.zst (213 MiB) chr10_hg38_noannot.pvar.zst (17.5 MiB)
chr11_hg38.pgen.zst (140 MiB), chr11_hg38.pvar.zst (205 MiB) chr11_hg38_noannot.pvar.zst (16.5 MiB)
chr12_hg38.pgen.zst (143 MiB), chr12_hg38.pvar.zst (202 MiB) chr12_hg38_noannot.pvar.zst (16.3 MiB)
chr13_hg38.pgen.zst (106 MiB), chr13_hg38.pvar.zst (152 MiB) chr13_hg38_noannot.pvar.zst (12.3 MiB)
chr14_hg38.pgen.zst (98.4 MiB), chr14_hg38.pvar.zst (139 MiB) chr14_hg38_noannot.pvar.zst (11.5 MiB)
chr15_hg38.pgen.zst (97.0 MiB), chr15_hg38.pvar.zst (131 MiB) chr15_hg38_noannot.pvar.zst (10.6 MiB)
chr16_hg38.pgen.zst (107 MiB), chr16_hg38.pvar.zst (146 MiB) chr16_hg38_noannot.pvar.zst (11.7 MiB)
chr17_hg38.pgen.zst (94.7 MiB), chr17_hg38.pvar.zst (129 MiB) chr17_hg38_noannot.pvar.zst (10.2 MiB)
chr18_hg38.pgen.zst (87.4 MiB), chr18_hg38.pvar.zst (120 MiB) chr18_hg38_noannot.pvar.zst (9.86 MiB)
chr19_hg38.pgen.zst (80.4 MiB), chr19_hg38.pvar.zst (106 MiB) chr19_hg38_noannot.pvar.zst (8.10 MiB)
chr20_hg38.pgen.zst (73.9 MiB), chr20_hg38.pvar.zst (101 MiB) chr20_hg38_noannot.pvar.zst (7.99 MiB)
chr21_hg38.pgen.zst (46.4 MiB), chr21_hg38.pvar.zst (62.4 MiB) chr21_hg38_noannot.pvar.zst (5.01 MiB)
chr22_hg38.pgen.zst (50.4 MiB), chr22_hg38.pvar.zst (67.7 MiB) chr22_hg38_noannot.pvar.zst (5.03 MiB)
chrX_hg38.pgen.zst (95.8 MiB), chrX_hg38.pvar.zst (161 MiB) chrX_hg38_noannot.pvar.zst (9.05 MiB)
chrY_hg38.pgen.zst (7.83 MiB), chrY_hg38.pvar.zst (8.07 MiB) chrY_hg38_noannot.pvar.zst (734 KiB)
chrM_hg38.pgen.zst (69.1 KiB), chrM_hg38.pvar.zst (188 KiB) chrM_hg38_noannot.pvar.zst (18.0 KiB)
contigs_hg38.pgen.zst (63.3 MiB), contigs_hg38.pvar.zst (137 MiB) contigs_hg38_noannot.pvar.zst (5.16 MiB)

chr1_phase3.pgen.zst (172 MiB), chr1_phase3.pvar.zst (100 MiB) chr1_phase3_noannot.pvar.zst (47.5 MiB)
chr2_phase3.pgen.zst (185 MiB), chr2_phase3.pvar.zst (110 MiB) chr2_phase3_noannot.pvar.zst (52.0 MiB)
chr3_phase3.pgen.zst (153 MiB), chr3_phase3.pvar.zst (90.6 MiB) chr3_phase3_noannot.pvar.zst (42.9 MiB)
chr4_phase3.pgen.zst (150 MiB), chr4_phase3.pvar.zst (89.1 MiB) chr4_phase3_noannot.pvar.zst (42.2 MiB)
chr5_phase3.pgen.zst (136 MiB), chr5_phase3.pvar.zst (81.6 MiB) chr5_phase3_noannot.pvar.zst (38.8 MiB)
chr6_phase3.pgen.zst (136 MiB), chr6_phase3.pvar.zst (78.4 MiB) chr6_phase3_noannot.pvar.zst (36.9 MiB)
chr7_phase3.pgen.zst (131 MiB), chr7_phase3.pvar.zst (73.5 MiB) chr7_phase3_noannot.pvar.zst (34.6 MiB)
chr8_phase3.pgen.zst (121 MiB), chr8_phase3.pvar.zst (71.2 MiB) chr8_phase3_noannot.pvar.zst (33.7 MiB)
chr9_phase3.pgen.zst (103 MiB), chr9_phase3.pvar.zst (55.6 MiB) chr9_phase3_noannot.pvar.zst (26.2 MiB)
chr10_phase3.pgen.zst (111 MiB), chr10_phase3.pvar.zst (62.3 MiB) chr10_phase3_noannot.pvar.zst (29.3 MiB)
chr11_phase3.pgen.zst (107 MiB), chr11_phase3.pvar.zst (62.7 MiB) chr11_phase3_noannot.pvar.zst (29.7 MiB)
chr12_phase3.pgen.zst (106 MiB), chr12_phase3.pvar.zst (59.8 MiB) chr12_phase3_noannot.pvar.zst (28.2 MiB)
chr13_phase3.pgen.zst (78.4 MiB), chr13_phase3.pvar.zst (44.6 MiB) chr13_phase3_noannot.pvar.zst (21.0 MiB)
chr14_phase3.pgen.zst (73.6 MiB), chr14_phase3.pvar.zst (41.4 MiB) chr14_phase3_noannot.pvar.zst (19.5 MiB)
chr15_phase3.pgen.zst (71.6 MiB), chr15_phase3.pvar.zst (38.0 MiB) chr15_phase3_noannot.pvar.zst (17.9 MiB)
chr16_phase3.pgen.zst (79.8 MiB), chr16_phase3.pvar.zst (42.0 MiB) chr16_phase3_noannot.pvar.zst (19.8 MiB)
chr17_phase3.pgen.zst (68.2 MiB), chr17_phase3.pvar.zst (36.4 MiB) chr17_phase3_noannot.pvar.zst (17.1 MiB)
chr18_phase3.pgen.zst (65.2 MiB), chr18_phase3.pvar.zst (35.4 MiB) chr18_phase3_noannot.pvar.zst (16.8 MiB)
chr19_phase3.pgen.zst (57.6 MiB), chr19_phase3.pvar.zst (28.9 MiB) chr19_phase3_noannot.pvar.zst (13.5 MiB)
chr20_phase3.pgen.zst (52.5 MiB), chr20_phase3.pvar.zst (28.2 MiB) chr20_phase3_noannot.pvar.zst (13.3 MiB)
chr21_phase3.pgen.zst (34.6 MiB), chr21_phase3.pvar.zst (17.4 MiB) chr21_phase3_noannot.pvar.zst (8.08 MiB)
chr22_phase3.pgen.zst (35.8 MiB), chr22_phase3.pvar.zst (17.4 MiB) chr22_phase3_noannot.pvar.zst (8.20 MiB)
chrX_phase3.pgen.zst (73.0 MiB), chrX_phase3.pvar.zst (44.7 MiB) chrX_phase3_noannot.pvar.zst (18.3 MiB)
chrY_phase3.pgen.zst (325 KiB), chrY_phase3.pvar.zst (605 KiB), chrY_phase3_noannot.pvar.zst (241 KiB), chrY_phase3.psam (1233 samples)
chrM_phase3.pgen.zst (50.4 KiB), chrM_phase3.pvar.zst (15.7 KiB), chrM_phase3_noannot.pvar.zst (10.4 KiB), chrM_phase3_corrected.psam chrM_phase3_orig.psam (2534 samples, rename to "chrM_phase3.psam" before use)

Notes:

  • Due to a header line and an INFO annotation quirk, PLINK 2 builds older than 8 Jan 2023 are unable to convert this dataset to or from BCF.
  • .pgen.zst file(s) must be decompressed before use. (This isn't necessary for .pvar.zst files: see --pfile's 'vzs' modifier.) If you don't have another .zst decompressor installed, you can use PLINK 2 for this purpose:
    plink2 --zst-decompress all_hg38.pgen.zst all_hg38.pgen
  • In addition to ~600 trios which were intentionally included, this dataset contains a few close relations which are not described in the .psam file, e.g. sibships where neither parent was sequenced. Use --remove with one of the following ID lists when you don't want close relations:
    These lists were generated from the original dataset with "--king-cutoff 0.177" and "--king-cutoff 0.0884", respectively. If you're curious, here's the --make-king-table + --king-table-filter report listing all 1st/2nd-degree related sample pairs: deg2_hg38.kin0
  • This dataset was intended to contain only unrelated samples; unfortunately, a few parent-child pairs, sibships, and second-degree relationships snuck in. Use --remove with one of the following ID lists when you don't want close relations:
    These lists were generated from the original dataset with "--king-cutoff 0.177" and "--king-cutoff 0.0884", respectively. If you're curious, here's the --make-king-table + --king-table-filter report listing all 1st/2nd-degree related sample pairs: deg2_phase3.kin0
  • Coverage and heterozygosity statistics indicate that the sequenced HG03511 cell line has only one copy of chrX, and no copies of chrY. This can be a consequence of either mosaic loss of chrX in a female or mosaic loss of chrY in a male. The pedigree provided by 1000 Genomes labels this sample as female, so that is almost certainly the true sex, but (as of 22 Jun 2024) we have decided to label sex as NA in the KING-corrected .psam file because its chrX variant calls are much more representative of male data than female data. (Between 9 Jan 2023 and 21 Jun 2024, the sex was "corrected" to male; then we took a closer look at the original sequence reads.)
  • This dataset fuses results from two different pipelines. The primary chr1..chrX genotypes are phased, contain no missing calls, and only have biallelic left-normalized variants (multiallelic variants were "split"). The chrY/chrM/contigs genotypes are unphased and contain some missing calls; multiallelic variants there are unsplit, and a few variants aren't left-normalized.
  • All relevant information in the original phased chr1..chrX callset is preserved. The chrY/chrM/contigs source material contains per-genotype AD, DP, GQ, and PL fields which cannot be represented by the .pgen file format, and are consequently not preserved.
  • There was previously an option to download "no-singleton" files. This is no longer available, since the Byrska-Bishop et al. quality-control pipeline removed almost all genuine singletons on chr1..chrX.
  • This dataset contains (unsplit) multiallelic variants, and a few variants which aren't left-normalized.
  • Refer to the 1000 Genomes website for additional sample information, data usage rules, and citation instructions.
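
As a rough sketch of how the decompression and --remove notes fit together for the merged hg38 files: the wrapper below first decompresses the .pgen.zst, then writes a copy of the dataset without the listed samples. The function name, the output prefix `all_hg38_unrelated`, and the ID-list filename are placeholders chosen for this example; it also assumes the .pvar.zst and .psam files have already been renamed to all_hg38.pvar.zst and all_hg38.psam.

```shell
#!/bin/sh
# drop_related ID_LIST: decompress the merged hg38 .pgen, then write a new
# fileset that excludes the samples named in ID_LIST.
# Assumptions: all_hg38.pgen.zst, all_hg38.pvar.zst and all_hg38.psam are in
# the current directory; ID_LIST is one of the downloaded related-sample lists.
drop_related() {
    id_list=$1
    # The .pgen.zst must be decompressed; the .pvar.zst can stay compressed.
    plink2 --zst-decompress all_hg38.pgen.zst all_hg38.pgen || return 1
    # 'vzs' tells --pfile to look for all_hg38.pvar.zst.
    plink2 --pfile all_hg38 vzs \
           --allow-extra-chr \
           --remove "$id_list" \
           --make-pgen \
           --out all_hg38_unrelated
}

# Usage (placeholder filename):
# drop_related related_samples.txt
```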

Human Genome Diversity Project (HGDP) - CEPH Panel

Callset: (source) (source)

  Split by chromosome?
  INFO annotations?

hgdp_all.pgen.zst (2.26 GiB)
hgdp_all.pvar.zst (3.33 GiB) hgdp_all_noannot.pvar.zst (573 MiB) (rename to "hgdp_all.pvar.zst" before use)
hgdp.psam (rename to "hgdp_all.psam" before use)

Common sample information file: hgdp.psam. Create symlinks from hgdp_chr1.psam, hgdp_chr2.psam, etc. to this (or make a bunch of copies).

Remove "_noannot" from the .pvar.zst filenames before use.

hgdp_chr1.pgen.zst (210 MiB), hgdp_chr1.pvar.zst (280 MiB) hgdp_chr1_noannot.pvar.zst (46.0 MiB)
hgdp_chr2.pgen.zst (177 MiB), hgdp_chr2.pvar.zst (276 MiB) hgdp_chr2_noannot.pvar.zst (47.0 MiB)
hgdp_chr3.pgen.zst (141 MiB), hgdp_chr3.pvar.zst (228 MiB) hgdp_chr3_noannot.pvar.zst (38.8 MiB)
hgdp_chr4.pgen.zst (132 MiB), hgdp_chr4.pvar.zst (223 MiB) hgdp_chr4_noannot.pvar.zst (37.8 MiB)
hgdp_chr5.pgen.zst (126 MiB), hgdp_chr5.pvar.zst (207 MiB) hgdp_chr5_noannot.pvar.zst (35.0 MiB)
hgdp_chr6.pgen.zst (119 MiB), hgdp_chr6.pvar.zst (195 MiB) hgdp_chr6_noannot.pvar.zst (33.0 MiB)
hgdp_chr7.pgen.zst (129 MiB), hgdp_chr7.pvar.zst (192 MiB) hgdp_chr7_noannot.pvar.zst (32.2 MiB)
hgdp_chr8.pgen.zst (105 MiB), hgdp_chr8.pvar.zst (176 MiB) hgdp_chr8_noannot.pvar.zst (29.8 MiB)
hgdp_chr9.pgen.zst (110 MiB), hgdp_chr9.pvar.zst (151 MiB) hgdp_chr9_noannot.pvar.zst (25.3 MiB)
hgdp_chr10.pgen.zst (110 MiB), hgdp_chr10.pvar.zst (163 MiB) hgdp_chr10_noannot.pvar.zst (27.4 MiB)
hgdp_chr11.pgen.zst (97.7 MiB), hgdp_chr11.pvar.zst (158 MiB) hgdp_chr11_noannot.pvar.zst (26.9 MiB)
hgdp_chr12.pgen.zst (108 MiB), hgdp_chr12.pvar.zst (157 MiB) hgdp_chr12_noannot.pvar.zst (26.3 MiB)
hgdp_chr13.pgen.zst (89.6 MiB), hgdp_chr13.pvar.zst (124 MiB) hgdp_chr13_noannot.pvar.zst (20.5 MiB)
hgdp_chr14.pgen.zst (70.6 MiB), hgdp_chr14.pvar.zst (105 MiB) hgdp_chr14_noannot.pvar.zst (17.8 MiB)
hgdp_chr15.pgen.zst (70.6 MiB), hgdp_chr15.pvar.zst (98.4 MiB) hgdp_chr15_noannot.pvar.zst (16.6 MiB)
hgdp_chr16.pgen.zst (81.4 MiB), hgdp_chr16.pvar.zst (110 MiB) hgdp_chr16_noannot.pvar.zst (18.4 MiB)
hgdp_chr17.pgen.zst (77.8 MiB), hgdp_chr17.pvar.zst (100 MiB) hgdp_chr17_noannot.pvar.zst (16.6 MiB)
hgdp_chr18.pgen.zst (60.8 MiB), hgdp_chr18.pvar.zst (92.7 MiB) hgdp_chr18_noannot.pvar.zst (15.7 MiB)
hgdp_chr19.pgen.zst (67.3 MiB), hgdp_chr19.pvar.zst (80.6 MiB) hgdp_chr19_noannot.pvar.zst (13.2 MiB)
hgdp_chr20.pgen.zst (61.1 MiB), hgdp_chr20.pvar.zst (81.1 MiB) hgdp_chr20_noannot.pvar.zst (13.4 MiB)
hgdp_chr21.pgen.zst (39.6 MiB), hgdp_chr21.pvar.zst (48.8 MiB) hgdp_chr21_noannot.pvar.zst (8.13 MiB)
hgdp_chr22.pgen.zst (49.9 MiB), hgdp_chr22.pvar.zst (54.5 MiB) hgdp_chr22_noannot.pvar.zst (8.85 MiB)
hgdp_chrX.pgen.zst (80.1 MiB), hgdp_chrX.pvar.zst (105 MiB) hgdp_chrX_noannot.pvar.zst (19.6 MiB)
hgdp_chrY.pgen.zst (4.20 MiB), hgdp_chrY.pvar.zst (4.08 MiB) hgdp_chrY_noannot.pvar.zst (823 KiB)

hgdp_statphase.pgen.zst (1.61 GiB)
hgdp_statphase.pvar.zst (474 MiB)
hgdp.psam (rename to "hgdp_statphase.psam" before use)

Notes:

  • .pgen.zst file(s) must be decompressed before use. (This isn't necessary for .pvar.zst files: see --pfile's 'vzs' modifier.) If you don't have another .zst decompressor installed, you can use PLINK 2 for this purpose:
    plink2 --zst-decompress hgdp_all.pgen.zst hgdp_all.pgen
  • This dataset was aligned to GRCh38, and variant calls were made on the autosomes, chrX, and chrY. There are 929 samples, with no 1st-degree relations. Samples have been sorted by ID.
  • The dataset contains (unsplit) multiallelic variants, and one variant on chrY which isn't left-normalized. ~6.57% of genotype calls are missing.
  • The source material contains per-genotype AD, DP, GQ, and PL fields which cannot be represented by the .pgen file format, and are consequently not preserved.
  • This dataset was aligned to GRCh38, and variant calls were made on only the autosomes. There are 929 samples, with no 1st-degree relations. Samples have been sorted by ID.
  • The dataset contains (unsplit) multiallelic variants, and ~4.27% of genotype calls are missing. All variants are left-normalized.
  • This data is freely available with no restrictions on the types of analyses that can be carried out. If you use this data in a publication, please cite Bergström et al. (2020) Insights into human genetic variation and population history from 929 diverse genomes. Science, 367.
  • Refer to the paper and the Fondation Jean Dausset-CEPH website for more information about this dataset.
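
Since the notes quote an overall missing-call rate, one natural first check after downloading is a per-sample missingness report. The sketch below assumes the merged files (hgdp_all.pgen.zst, hgdp_all.pvar.zst, and hgdp.psam renamed to hgdp_all.psam) are in the current directory; the function name is illustrative only.

```shell
#!/bin/sh
# hgdp_missingness: decompress the merged HGDP .pgen and ask PLINK 2 for a
# per-sample missingness report.
# Assumptions: hgdp_all.pgen.zst, hgdp_all.pvar.zst and hgdp_all.psam are in
# the current directory (hgdp.psam renamed as instructed above).
hgdp_missingness() {
    plink2 --zst-decompress hgdp_all.pgen.zst hgdp_all.pgen || return 1
    # 'vzs' lets --pfile read the still-compressed hgdp_all.pvar.zst;
    # 'sample-only' skips the (much larger) per-variant report.
    plink2 --pfile hgdp_all vzs \
           --missing sample-only \
           --out hgdp_all
}

# Usage:
# hgdp_missingness
```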

Reference genomes

These are the reference genomes that the aforementioned 1000 Genomes and HGDP samples were aligned against. Note that --fa can directly read these compressed files.
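
One place a reference FASTA comes in handy is left-normalizing the few variants noted above as not left-normalized. The following is a hedged sketch rather than a tested recipe for these exact files: the function name, the "_norm" output suffix, and the FASTA path are placeholders, and it assumes a decompressed .pgen plus a compressed .pvar.zst under the given prefix. Check the PLINK 2 log afterwards, since variants whose positions shift may leave the output needing a re-sort.

```shell
#!/bin/sh
# normalize_with_ref PREFIX REF_FASTA: left-normalize variants against a
# reference FASTA and write the result to PREFIX_norm.
# Assumptions: PREFIX.pgen, PREFIX.pvar.zst and PREFIX.psam exist, and
# REF_FASTA is the matching reference (the compressed download is fine,
# since --fa can read it directly).
normalize_with_ref() {
    prefix=$1
    ref_fa=$2
    plink2 --pfile "$prefix" vzs \
           --fa "$ref_fa" \
           --normalize \
           --make-pgen \
           --out "${prefix}_norm"
}

# Usage (placeholder FASTA path):
# normalize_with_ref hgdp_all path/to/reference_fasta
```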
