Introduction, downloads

D: 22 Dec 2024

Recent version history

What's new?

Coming next

[Jump to search box]

General usage

Getting started

Flag usage summaries

Column set descriptors

Citation instructions

Standard data input

PLINK 1 binary (.bed)

PROVISIONAL_REF?

PLINK 2 binary (.pgen)

Autoconversion behavior

VCF/BCF (.vcf[.gz], .bcf)

Oxford genotype (.bgen)

Oxford haplotype (.haps)

PLINK 1 text (.ped, .tped)

PLINK 1 dosage

Sample ID conversion

Dosage import settings

Generate random

Unusual chromosome IDs

Allele frequencies

Phenotypes

Covariates

'Cluster' import

Reference genome (.fa)

Input filtering

Sample ID file

Variant ID file

Interval-BED file

--extract-col-cond

QUAL, FILTER, INFO

Chromosomes

SNPs only

Simple variant window

Multiple variant ranges

Deduplicate variants

Sample/variant thinning

Pheno./covar. condition

Missingness

Category subset

--keep-col-match

Missing genotypes

Number of distinct alleles

Allele frequencies/counts

Hardy-Weinberg

Imputation quality

Sex

Founder status

Main functions

Data management

--make-[b]pgen/--make-bed

--export

--output-chr

--split-par/--merge-par

--set-all-var-ids

--recover-var-ids

--update-map...

--update-ids...

--ref-allele

--ref-from-fa

--normalize

--indiv-sort

--write-covar

--variance-standardize

--quantile-normalize

--split-cat-pheno

--pheno-svd

--pmerge[-list]

--write-samples

Basic statistics

--freq

--geno-counts

--sample-counts

--missing

--genotyping-rate

--hardy

--het

--check-sex/--impute-sex

--fst

--pgen-info

Pairwise diffs

--pgen-diff

--sample-diff

Linkage disequilibrium

--indep...

--r[2]-[un]phased

--ld

Sample-distance matrices

Relationship/covariance

  (--make-grm-bin...)

--make-king...

--king-cutoff

Population stratification

--pca

PCA projection

Association analysis

--glm

--glm ERRCODE values

--gwas-ssf

--adjust-file

Report postprocessing

--clump

Linear scoring

--score[-list]

--variant-score

Distributed computation

Command-line help

Miscellaneous

Flag/parameter reuse

System resource usage

--loop-cats

.zst decompression

Pseudorandom numbers

Warnings as errors

.pgen validation

Resources

1000 Genomes phase 3

HGDP-CEPH

FASTA files

Errors and warnings

Output file list

Order of operations

Developer information

GitHub root

Python library

R library

Compilation

Adding new functionality

Discussion forums

Credits

File formats

Tutorials

Setup

Rules of Thumb

Data Exploration 1 — HWE, Allele Frequency Spectrum

Data Exploration 2 — Genomic Structure

Linkage

Relationship Matrix

Genome-Wide Assocation Analyses (GWAS)

Regressions

Post-Hoc

Formatting Files

bcftools

Variant IDs

Reference Alleles

Format for R

Shortcuts

Quick index search

bcftools - Adding Functional Annotations

Significance

Plink 2 includes functions to work with bcftools. This adds functionality such as variant calling, annotation, and filtering.

Objective

Show how Plink 2 and bcftools can be used to add functional annotations and to filter by these annotations.

Demonstrate export/import commands between these platforms.

Part 1: Setup bcftools (and samtools)

See bcftools installation.
Add this to your PATH similar to what you did for Plink 2 (path).

You may want to add samtools at this point using a similar install method.

Part 2: Adding External Annotations

As our starting point, we will export a QCd 1000 Genomes dataset - all_hg38_qcd_LE1 - from prior examples. If you do not have this in your ./data/processed/ folder then go ahead and make it, see Tutorial Shortcuts.

Run this Plink 2 command to export into a bcf format.

Plink 2:

time plink2 \ --pfile ./data/processed/all_hg38_qcd_LE1 \ --chr 22 \ --export bcf \ --out ./data/export/all_hg38_qcd_LE1_chr22_2bcf

--export bcf
This converts the pfile set into a binary (variant) call format (bcf) that bcftools can read.


To add the annotation, we will use bcftools consequence analysis. It uses the reference genome and a genomic feature file to add annotations. The bcftools algorithm also accounts for adjacent variant effects using phase information (assuming this is part of the input file).

For the 1kGP3, we need to use the GRCh38 build for our reference. The compressed file is included in the tutorial material, ./data/raw/GGRCh38_full_analysis_set_plus_decoy_hla.fa.zst.

Decompress the zstd file.

Plink 2:

time plink2 --zst-decompress ./data/raw/GRCh38_full_analysis_set_plus_decoy_hla.fa.zst ./data/raw/GRCh38_full_analysis_set_plus_decoy_hla.fa

For the genomic features file, we will use the Ensembl Homo_sapiens.GRCh38.113.chromosome.22.gff3. This is part of our tutorial materials and should be found in ./misc/.
Decompress the gzipped file.

If you want other chromosome files, see FTP-curl instructions.

Cautionary Note - bcftools only reads GFF3 files and will give a warning or error if it is not in this format.

At this point you should have these three necessary files in these locations.

./data/export/all_hg38_qcd_LE1_chr22_2bcf

./data/raw/GRCh38_full_analysis_set_plus_decoy_hla.fa

./misc/Homo_sapiens.GRCh38.113.chromosome.22.gff3


Run bcftools (as shell command):

time bcftools csq \ -f ./data/raw/GRCh38_full_analysis_set_plus_decoy_hla.fa \ -g ./misc/Homo_sapiens.GRCh38.113.chromosome.22.gff3 ./data/export/all_hg38_qcd_LE1_chr22_2bcf.bcf \ -o ./data/export/all_hg38_qcd_LE1_chr22_annot_ensembl.vcf

bcftools csq
The command for the consequence analysis, which performs annotation.

-f <genomic reference data>
The genomic reference file that corresponds to your genomics data; all_hg38 (1000 Genomes) data in this case.

-g <genomic feature annotation file> <file to be annotated>
The genomic feature data used for annotations. This should match the build of the data and annotation file; h(g)38.

Note that by default this will create a human readable vcf file and not a binary. You can use the option -Ob for a bcf file. We are using vcf here, so that we can read the outputs.


Let's take a look at what was added.

bcftools:

time bcftools query -f '%CHROM\t%POS\t%ID\t%INFO/BCSQ\n' ./data/export/all_hg38_qcd_LE1_chr22_annot_ensembl.vcf | sed -n '50,60p'

time bcftools query
Run a query on the vcf/bcf file.

-f '%CHROM\t%POS\t%ID\t%INFO/BCSQ\n' <vcf file>
Fields to return from the given <vcf file>. Note, %INFO/BCSQ returns BCSQ subfield under INFO. BCSQ is specific to the command that we ran, and won't exist (for this example) if that command did not run correctly.

sed
The sed -n '50,60p' command is used to print lines 50 to 60 from a file or standard input.


Description of Image

As you can see there are now annotation added to relevant variants.

The added field names are:

Consequence|gene|transcript|biotype|strand|amino_acid_change|dna_change

The last three fields may not be there for all annotations. You can read more about this on the bcftools site (link above).

Let's explore further.

bcftools:

time bcftools query -f '%INFO/BCSQ\n' ./data/export/all_hg38_qcd_LE1_chr22_annot_ensembl.vcf | awk -F'|' '{print $1}' | sort | uniq

| awk -F'|' '{print $1}' | sort | uniq
To extract the first column from a pipe-separated list, sort it, and then remove duplicates.


.
3_prime_utr
3_prime_utr&NMD_transcript
5_prime_utr
5_prime_utr&NMD_transcript
intron
missense
missense&NMD_transcript
non_coding
splice_acceptor
splice_donor
splice_region
splice_region&NMD_transcript
stop_gained
synonymous
synonymous&NMD_transcript

This is the set of Consequences. You can run similar commands to get the set of genes, transcripts, etc. specific to the Ensembl genomic feature set. You will need to go to that reference material to learn more about how these are coded.

Part 3: Filtering Variants

One use case may be to run Plink 2 association or other operations on a filtered subset of variants.

bcftools:

time bcftools view -i 'INFO/BCSQ~"missense"' ./data/export/all_hg38_qcd_LE1_chr22_annot_ensembl.vcf -o ./data/export/all_hg38_qcd_LE1_chr22_annot_ensembl_missense.vcf

view -i 'INFO/BCSQ~"missense"'
This filters in (-i) INFO/BCSQ fields with the substring "missense" from the newly annotated vcf file.

Note - This is inclusive to any INFO/BCSQ with the substring "missense". There can be multiple annotations for a given site. For example, missense and synonymous. In this case, the variant will be kept. You can play with other regular expressions to be more specific.

-o
Using the above filter, we generate a new subset of variants in the vcf format.


Let's check the results using a similar query above, pulling up the first three variants. Note that they all have the "missense" tag.

bcftools query -f '%CHROM\t%POS\t%ID\t%INFO/BCSQ\n' ./data/export/all_hg38_qcd_LE1_chr22_annot_ensembl_missense.vcf | head -n 3

22      15528913        22:15528913C,T  missense|OR11H1|ENST00000643195|protein_coding|+|241A>241V|15528913C>T
22      15690112        22:15690112C,T  missense|POTEH|ENST00000343518|protein_coding|+|12S>12F|15690112C>T
22      15690591        22:15690591G,A  missense&NMD_transcript|POTEH|ENST00000452800|NMD|+|116V>116I|15690591G>A,missense|POTEH|ENST00000343518|protein_coding|+|172V>172I|15690591G>A

Part 4: Going Back to Plink 2

Let us now go back to Plink 2 with this filtered subset of chromosome 22 missense variants.

Plink 2:

time plink2 \ --vcf ./data/export/all_hg38_qcd_LE1_chr22_annot_ensembl_missense.vcf \ --make-pgen \ --out ./data/processed/all_hg38_qcd_LE1_chr22_annot_ensembl_missense

Go ahead and open ./data/processed/all_hg38_qcd_LE1_chr22_annot_ensembl_missense.pvar (contains general information for each variant). Search for 19963748 .
You will see the functional annotation appended to the end of the description.

22	19963748	22:19963748G,A	G	A	AC=2370;AF=0.370081;CM=21.2869;AN=6404;AN_EAS=1170;AN_AMR=980;AN_EUR=1266;AN_AFR=1786;AN_SAS=1202;AN_EUR_unrel=1006;AN_EAS_unrel=1008;AN_AMR_unrel=694;AN_SAS_unrel=978;AN_AFR_unrel=1322;AF_EAS=0.282051;AF_AMR=0.386735;AF_EUR=0.489731;AF_AFR=0.282195;AF_SAS=0.446755;AF_EUR_unrel=0.5;MAF_EUR_unrel=0.5;AF_EAS_unrel=0.279762;MAF_EAS_unrel=0.279762;AF_AMR_unrel=0.377522;MAF_AMR_unrel=0.377522;AF_SAS_unrel=0.440695;MAF_SAS_unrel=0.440695;AF_AFR_unrel=0.280635;MAF_AFR_unrel=0.280635;AC_EAS=330;AC_AMR=379;AC_EUR=620;AC_AFR=504;AC_SAS=537;AC_EUR_unrel=503;AC_EAS_unrel=282;AC_AMR_unrel=262;AC_SAS_unrel=431;AC_AFR_unrel=371;AC_Het_EAS=234;AC_Het_AMR=261;AC_Het_EUR=310;AC_Het_AFR=356;AC_Het_SAS=285;AC_Het_EUR_unrel=237;AC_Het_EAS_unrel=198;AC_Het_AMR_unrel=176;AC_Het_SAS_unrel=225;AC_Het_AFR_unrel=255;AC_Het=1446;AC_Hom_EAS=96;AC_Hom_AMR=118;AC_Hom_EUR=310;AC_Hom_AFR=148;AC_Hom_SAS=252;AC_Hom_EUR_unrel=266;AC_Hom_EAS_unrel=84;AC_Hom_AMR_unrel=86;AC_Hom_SAS_unrel=206;AC_Hom_AFR_unrel=116;AC_Hom=924;HWE_EAS=0.759847;ExcHet_EAS=0.665216;HWE_AMR=0.00762833;ExcHet_AMR=0.00441707;HWE_EUR=0.633307;ExcHet_EUR=0.727715;HWE_AFR=0.62088;ExcHet_AFR=0.719037;HWE_SAS=0.322565;ExcHet_SAS=0.864597;HWE=0.0749371;ExcHet=0.965854;BCSQ=missense|COMT|ENST00000403184|protein_coding|+|158V>158M|19963748G>A

Variant IDs >>