This page is under construction. If there's something you consider to be an essential PLINK resource which is not mentioned on this page, contact us, comment in the plink2-users Google group, or open a GitHub issue.
The linked files are currently hosted by Dropbox. If you are unable to download them, contact us for access to an alternate source; we understand that Dropbox is blocked in some locations.
The no-singleton dataset can be a good starting point if you were planning to filter out low-MAF variants anyway, or if you're constrained to ≤ 8 GiB of workspace memory.
There are three kinds of variant IDs in the source VCFs: <chrom>:<pos>:<ref>:<alt> for short variants on chr1..chrX, HGSV IDs for structural variants on chr1..chrX, and "." for all chrY/chrM/contig variants. No rsIDs are present, partly because the dataset contains short variants that were submitted for inclusion in dbSNP 156 but were absent from the then-current dbSNP 155.
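As a rough illustration of the three ID kinds described above, the following sketch sorts an ID string into one of them. The regular expression is an assumption about what a chrom:pos:ref:alt ID looks like (plain ACGTN alleles only), and the function name is hypothetical. Anything that is neither "." nor a match is simply presumed to be a structural-variant ID, since the exact HGSV ID format is not specified here.

```python
import re

# Assumed shape of a short-variant ID: chrom:pos:ref:alt with ACGTN alleles.
SHORT_VARIANT_ID = re.compile(r"^[^:\s]+:[0-9]+:[ACGTN]+:[ACGTN]+$")


def classify_variant_id(variant_id):
    """Return 'missing', 'short_variant', or 'other' for a source-VCF ID."""
    if variant_id == ".":
        return "missing"
    if SHORT_VARIANT_ID.match(variant_id):
        return "short_variant"
    # Anything else is presumed (not verified) to be a structural-variant ID.
    return "other"
```

A tally of these three classes over a .pvar ID column is a quick way to confirm that a file still carries the original IDs rather than rsIDs.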
dbSNP 156 has since been released, which allows >99.95% of the non-HGSV variants on chr1..chrX to be assigned rsIDs. When multiple rsIDs are available for a variant, the smallest one has been chosen.
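The tie-breaking rule can be sketched as follows. This assumes that "smallest" means the smallest numeric part of the rsID rather than the lexicographically first string; the function name is hypothetical and this is not the actual annotation script.

```python
def smallest_rsid(candidates):
    """Return the rsID with the smallest numeric part, or None if there is none."""
    # Keep only well-formed IDs of the form rs<digits>.
    rsids = [c for c in candidates if c.startswith("rs") and c[2:].isdigit()]
    if not rsids:
        return None
    # Compare numerically: a plain string comparison would rank rs100000 before rs99.
    return min(rsids, key=lambda rsid: int(rsid[2:]))
```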
We have also assigned rsIDs to some chrY/chrM/contig variants when the records match up perfectly, but this part is less complete.
The KING-robust algorithm is very effective at identifying 1st-degree relations within a population, and its output can also be used to distinguish parent-child from sibling relationships: IBS0 is much higher for the latter, all other things being equal. One relationship previously flagged by KING-robust (NA20317-NA20318) was formally acknowledged to be a probable clerical error, and added to the official pedigree on 2020-07-31. (phase3_orig.psam does not contain this relationship, since it is based on the 2016-05-05 snapshot of the official pedigree.)
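The IBS0-based distinction can be sketched as below. This assumes tab-delimited .kin0-style text with IID1 (or #IID1), IID2, IBS0, and KINSHIP columns, with IBS0 reported as a proportion. The ibs0_threshold argument is a hypothetical, dataset-specific cutoff that you would choose by inspecting your own report; no particular value is implied by the text above. The default kinship bounds are the conventional first-degree range.

```python
import csv
import io


def split_first_degree(kin0_text, ibs0_threshold, kinship_min=0.177, kinship_max=0.354):
    """Split first-degree pairs into (parent_child, siblings) lists of ID pairs."""
    parent_child, siblings = [], []
    reader = csv.DictReader(io.StringIO(kin0_text), delimiter="\t")
    for row in reader:
        kinship = float(row["KINSHIP"])
        if not (kinship_min <= kinship <= kinship_max):
            continue  # outside the assumed first-degree kinship range
        # The first ID column carries a leading '#' when no FID column precedes it.
        iid1 = row.get("IID1") or row.get("#IID1")
        pair = (iid1, row["IID2"])
        # Parent-child pairs share an allele at every site, so IBS0 stays near zero.
        if float(row["IBS0"]) <= ibs0_threshold:
            parent_child.append(pair)
        else:
            siblings.append(pair)
    return parent_child, siblings
```

Pairs near the threshold deserve a manual look; genotyping error inflates IBS0 slightly even for true parent-child pairs.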
The KING-corrected .psam files contain this relationship, along with a few others with similarly strong supporting evidence (see the .kin0 file below). The sex of sample HG02300 was corrected from male to female in the official pedigree after 2021-07-12. This was propagated to the KING-corrected file on 8 Jan 2023. See also the note on sample HG03511 below.
Due to a header line and an INFO annotation quirk, PLINK 2 builds older than 8 Jan 2023 are unable to convert this dataset to or from BCF.
.pgen.zst file(s) must be decompressed before use. (This isn't necessary for .pvar.zst files: see --pfile's 'vzs' modifier.) If you don't have another .zst decompressor installed, you can use PLINK 2 for this purpose: plink2 --zst-decompress all_hg38.pgen.zst all_hg38.pgen
In addition to ~600 trios which were intentionally included, this dataset contains a few close relations which are not described in the .psam file, e.g. sibships where neither parent was sequenced. Use --remove with one of the following ID lists when you don't want close relations:
These lists were generated from the original dataset with "--king-cutoff 0.177" and "--king-cutoff 0.0884", respectively. If you're curious, here's the --make-king-table + --king-table-filter report listing all 1st/2nd-degree related sample pairs: deg2_hg38.kin0
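The two cutoffs quoted above match the conventional KING boundaries at 2^-2.5 (about 0.177, between 1st and 2nd degree) and 2^-3.5 (about 0.0884, between 2nd and 3rd degree). The sketch below maps a kinship coefficient to a relationship class using those powers of two; the function name and class labels are illustrative only, and the handling of values exactly on a boundary is an arbitrary choice.

```python
def king_degree(kinship):
    """Map a KING-robust kinship coefficient to a conventional relationship class."""
    if kinship > 2 ** -1.5:   # about 0.354
        return "duplicate_or_mz_twin"
    if kinship > 2 ** -2.5:   # about 0.177
        return "first_degree"
    if kinship > 2 ** -3.5:   # about 0.0884
        return "second_degree"
    if kinship > 2 ** -4.5:   # about 0.0442
        return "third_degree"
    return "unrelated_or_distant"
```

Under this reading, the 0.177 list targets duplicates and 1st-degree pairs, while the 0.0884 list also targets 2nd-degree pairs.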
This dataset was intended to contain only unrelated samples; unfortunately, a few parent-child pairs, sibships, and second-degree relationships snuck in. Use --remove with one of the following ID lists when you don't want close relations:
These lists were generated from the original dataset with "--king-cutoff 0.177" and "--king-cutoff 0.0884", respectively. If you're curious, here's the --make-king-table + --king-table-filter report listing all 1st/2nd-degree related sample pairs: deg2_phase3.kin0
Coverage and heterozygosity statistics indicate that the sequenced HG03511 cell line has only one copy of chrX, and no copies of chrY. This can be a consequence of either mosaic loss of chrX in a female or mosaic loss of chrY in a male. The pedigree provided by 1000 Genomes labels this sample as female, so that is almost certainly the true sex, but (as of 22 Jun 2024) we have decided to label sex as NA in the KING-corrected .psam file because its chrX variant calls are much more representative of male data than female data. (Between 9 Jan 2023 and 21 Jun 2024, the sex was "corrected" to male; then we took a closer look at the original sequence reads.)
This dataset fuses results from two different pipelines. The primary chr1..chrX genotypes are phased, contain no missing calls, and only have biallelic left-normalized variants (multiallelic variants were "split"). The chrY/chrM/contig genotypes are unphased and contain some missing calls; multiallelic variants there are unsplit, and a few variants aren't left-normalized.
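For readers unfamiliar with left-normalization, the following sketch checks a biallelic REF/ALT pair against the commonly cited criterion from Tan et al. (2015): the alleles must end with different nucleotides, and they may share a first nucleotide only when the shorter allele has length 1. It covers plain nucleotide strings only (no symbolic alleles such as those used for structural variants), and whether any given tool applies exactly this rule is an assumption.

```python
def is_normalized(ref, alt):
    """Check a biallelic pair of plain nucleotide strings for normalization."""
    if not ref or not alt or ref == alt:
        return False
    # A shared last base means the pair can be trimmed on the right
    # or shifted further to the left.
    if ref[-1] == alt[-1]:
        return False
    # A shared first base is only acceptable as the single anchor base of an indel.
    if ref[0] == alt[0] and min(len(ref), len(alt)) > 1:
        return False
    return True
```

For example, REF=CA with ALT=CAA fails the check because both alleles end in A; the normalized form of that insertion is REF=C, ALT=CA at the same position.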
There was previously an option to download "no-singleton" files. This is no longer available, since the Byrska-Bishop et al. quality-control pipeline removed almost all genuine singletons on chr1..chrX.
This dataset contains (unsplit) multiallelic variants, and a few variants which aren't left-normalized.
.pgen.zst file(s) must be decompressed before use. (This isn't necessary for .pvar.zst files: see --pfile's 'vzs' modifier.) If you don't have another .zst decompressor installed, you can use PLINK 2 for this purpose: plink2 --zst-decompress hgdp_all.pgen.zst hgdp_all.pgen
This dataset was aligned to GRCh38, and variant calls were made on the autosomes, chrX, and chrY. There are 929 samples, with no 1st-degree relations. Samples have been sorted by ID.
The dataset contains (unsplit) multiallelic variants, and one variant on chrY which isn't left-normalized. ~6.57% of genotype calls are missing.
The source material contains per-genotype AD, DP, GQ, and PL fields which cannot be represented by the .pgen file format, and are consequently not preserved.
This dataset was aligned to GRCh38, and variant calls were made on only the autosomes. There are 929 samples, with no 1st-degree relations. Samples have been sorted by ID.
The dataset contains (unsplit) multiallelic variants, and ~4.27% of genotype calls are missing. All variants are left-normalized.
These are the reference genomes that the aforementioned 1000 Genomes and HGDP samples were aligned against. Note that --fa can directly read these compressed files.