D: 6 Dec 2024


Reference alleles

Significance

The reference allele (REF) is the allele against which the other alleles at a variant are compared and on which many operations are based. Early GWAS had no set convention, and the reference was often simply the major allele within the sampled dataset. That choice could change with sampling, which caused problems when comparing studies. Today, REF is normally taken from a well-characterized reference genome, such as the human reference genome GRCh38, to ensure consistency.

Objective

Appreciate how older Plink files do not preserve REF.

Reset REF using the GRCh38 FASTA file.

Accidentally swapped REF and ALT alleles

Plink 2:

time plink2 \
  --pfile 'vzs' ./data/raw/all_hg38 \
  --make-bed \
  --thin-count 10 --seed 111 --threads 1 --memory 8000 require \
  --out ./data/processed/all_hg38_thin10

Plink 1.x:

time plink19 \
  --bfile ./data/processed/all_hg38_thin10 \
  --make-bed \
  --out ./data/processed/all_hg38_thin10_plink19

We first converted the GRCh38-referenced data to the bed format that older versions (Plink 1.9 and 1) can read. Then we used Plink 1.9 to rewrite the bed files. By default, Plink 1.x applies a sample-based major-allele rule: the major allele in the dataset is stored as A2, the column that Plink 2 reads as REF. Let's compare the results. Open the resulting *.bim files: ./data/processed/all_hg38_thin10.bim and ./data/processed/all_hg38_thin10_plink19.bim.

Plink 2 pfile-->bed (original data)

CHR  ID            CM  BP         ALT  REF
4    rs548499580   0   9970397    T    C
6    rs189327745   0   169876868  G    C
8    rs11993439    0   143242766  G    A
8    rs182875526   0   144703073  T    C
9    rs1307812694  0   64039847   T    C
10   rs1838765422  0   5078561    C    G
12   rs1493770     0   91727497   A    G
13   rs7332542     0   110188033  T    C
15   rs559747534   0   40486304   G    T
X    rs184776415   0   54756962   G    C

Plink 2 bed --> Plink 1.9 bed (Plink 1.x formatted data)

CHR  ID            CM  BP         ALT  REF
4    rs548499580   0   9970397    T    C
6    rs189327745   0   169876868  G    C
8    rs11993439    0   143242766  A    G
8    rs182875526   0   144703073  T    C
9    rs1307812694  0   64039847   T    C
10   rs1838765422  0   5078561    C    G
12   rs1493770     0   91727497   G    A
13   rs7332542     0   110188033  T    C
15   rs559747534   0   40486304   G    T
23   rs184776415   0   54756962   G    C

As we walk down the REF column of the original data and of the Plink 1.x processed data, we clearly see that some of the REF assignments are flipped: rs11993439 and rs1493770 have REF and ALT exchanged.
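
For larger files, comparing the columns by eye is impractical. The following is a minimal sketch, not part of Plink, of how the same comparison might be scripted with awk. The function name list_swapped_alleles is hypothetical, and the sketch assumes both .bim files are whitespace-delimited, have no header line, and share variant IDs in column 2.

```shell
#!/bin/sh
# list_swapped_alleles BIM_A BIM_B   (hypothetical helper, not a Plink command)
# Prints the ID of every variant found in both .bim files whose two
# allele columns (5 and 6) are exchanged between the files.
list_swapped_alleles() {
    awk '
        # First file: remember both alleles, keyed by variant ID (column 2).
        NR == FNR { a1[$2] = $5; a2[$2] = $6; next }
        # Second file: report IDs whose allele columns are exchanged.
        ($2 in a1) && $5 != $6 && $5 == a2[$2] && $6 == a1[$2] { print $2 }
    ' "$1" "$2"
}
```

It could be called as list_swapped_alleles ./data/processed/all_hg38_thin10.bim ./data/processed/all_hg38_thin10_plink19.bim. For the two tables above, this should report rs11993439 and rs1493770.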

What happens when we try to go back to Plink 2?

Plink 1.9 bed --> Plink 2 pvar

Plink 2:

time plink2 \
  --bfile ./data/processed/all_hg38_thin10_plink19 \
  --make-pfile \
  --out ./data/processed/all_hg38_thin10_plink19_plink2

Check the resulting pvar file: ./data/processed/all_hg38_thin10_plink19_plink2.pvar

#CHROM  POS        ID            REF  ALT
4       9970397    rs548499580   C    T
6       169876868  rs189327745   C    G
8       143242766  rs11993439    G    A
8       144703073  rs182875526   C    T
9       64039847   rs1307812694  C    T
10      5078561    rs1838765422  G    C
12      91727497   rs1493770     A    G
13      110188033  rs7332542     C    T
15      40486304   rs559747534   T    G
X       54756962   rs184776415   C    G

Note that the pvar file lists REF before ALT, whereas the bim file lists ALT before REF. With that in mind, compare the pvar to the bim files above.

Walking down the REF column, we see that the flipped allele assignments from Plink 1.x are carried over into the pvar. Not good.

Cautionary Note: Be careful when moving data between Plink 2 and older Plink versions. By default, Plink 1.x does not preserve REF (Plink 1.9 only keeps the allele order when run with --keep-allele-order).
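
To check a round trip like this without reading the files side by side, a small awk helper can account for the different column orders. This is only a sketch under stated assumptions: the function name ref_mismatches is hypothetical, the .bim file is assumed to hold REF in column 6 and the ID in column 2, and the .pvar file is assumed to use the column order shown above (#CHROM, POS, ID, REF, ALT), with header lines starting with '#'.

```shell
#!/bin/sh
# ref_mismatches BIM PVAR   (hypothetical helper, not a Plink command)
# For every variant ID found in both files, prints a line when the REF
# allele in the .bim (column 6) differs from the REF in the .pvar (column 4).
ref_mismatches() {
    awk '
        FNR == 1 { file_no++ }                  # count which input file we are in
        /^#/ { next }                           # skip .pvar header lines
        file_no == 1 { ref[$2] = $6; next }     # .bim: ID in column 2, REF in column 6
        ($3 in ref) && ref[$3] != $4 {          # .pvar: ID in column 3, REF in column 4
            print $3, "bim_REF=" ref[$3], "pvar_REF=" $4
        }
    ' "$1" "$2"
}
```

For example, ref_mismatches ./data/processed/all_hg38_thin10.bim ./data/processed/all_hg38_thin10_plink19_plink2.pvar compares the original bim with the round-tripped pvar. The same call can be repeated on any later pvar file to confirm that no mismatches remain.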

(Re-)referencing GWAS data

Fortunately, Plink 2 has a simple method to set or re-reference the data set.

Plink 2:

time plink2 \
  --pfile ./data/processed/all_hg38_thin10_plink19_plink2 \
  --make-pfile \
  --ref-from-fa \
  --fa ./data/raw/GRCh38_full_analysis_set_plus_decoy_hla.fa.zst \
  --out ./data/processed/all_hg38_thin10_plink19_plink2_reref

--ref-from-fa sets the REF allele from the --fa file whenever this can be done unambiguously. This is never possible for deletions and for some insertions; for these, see --ref-allele instead. For more information, see ref_allele.


Check the pvar file. You will see that the variants are now referenced to the original build assignment that we started with. (Keep in mind that the REF and ALT column order is swapped between pvar and bim.)

#CHROM  POS        ID            REF  ALT
4       9970397    rs548499580   C    T
6       169876868  rs189327745   C    G
8       143242766  rs11993439    A    G
8       144703073  rs182875526   C    T
9       64039847   rs1307812694  C    T
10      5078561    rs1838765422  G    C
12      91727497   rs1493770     G    A
13      110188033  rs7332542     C    T
15      40486304   rs559747534   T    G
X       54756962   rs184776415   C    G
