time bcftools query -f '%INFO/BCSQ\n' ./data/export/all_hg38_qcd_LE1_chr22_annot_ensembl.vcf | awk -F'|' '{print $1}' | sort | uniq
| awk -F'|' '{print $1}' | sort | uniq
To extract the first column from a pipe-separated list, sort it, and then remove duplicates.
.
3_prime_utr
3_prime_utr&NMD_transcript
5_prime_utr
5_prime_utr&NMD_transcript
intron
missense
missense&NMD_transcript
non_coding
splice_acceptor
splice_donor
splice_region
splice_region&NMD_transcript
stop_gained
synonymous
synonymous&NMD_transcript
This is the set of Consequences. You can run similar commands to get the set of genes, transcripts, etc. specific to the Ensembl genomic feature set. You will need to go to that reference material to learn more about how these are coded.
Part 3: Filtering Variants
One use case may be to run Plink 2 association or other operations on a filtered subset of variants.
bcftools:
time bcftools view -i 'INFO/BCSQ~"missense"' ./data/export/all_hg38_qcd_LE1_chr22_annot_ensembl.vcf -o ./data/export/all_hg38_qcd_LE1_chr22_annot_ensembl_missense.vcf
view -i 'INFO/BCSQ~"missense"'
This filters in (-i) INFO/BCSQ fields with the substring "missense" from the newly annotated vcf file.
Note - This is inclusive to any INFO/BCSQ with the substring "missense". There can be multiple annotations for a given site. For example, missense and synonymous. In this case, the variant will be kept. You can play with other regular expressions to be more specific.
-o
Using the above filter, we generate a new subset of variants in the vcf format.
Let's check the results using a similar query above, pulling up the first three variants. Note that they all have the "missense" tag.
bcftools query -f '%CHROM\t%POS\t%ID\t%INFO/BCSQ\n' ./data/export/all_hg38_qcd_LE1_chr22_annot_ensembl_missense.vcf | head -n 3
22 15528913 22:15528913C,T missense|OR11H1|ENST00000643195|protein_coding|+|241A>241V|15528913C>T
22 15690112 22:15690112C,T missense|POTEH|ENST00000343518|protein_coding|+|12S>12F|15690112C>T
22 15690591 22:15690591G,A missense&NMD_transcript|POTEH|ENST00000452800|NMD|+|116V>116I|15690591G>A,missense|POTEH|ENST00000343518|protein_coding|+|172V>172I|15690591G>A
Part 4: Going Back to Plink 2
Let us now go back to Plink 2 with this filtered subset of chromosome 22 missense variants.
Plink 2:
time plink2 \
--vcf ./data/export/all_hg38_qcd_LE1_chr22_annot_ensembl_missense.vcf \
--make-pgen \
--out ./data/processed/all_hg38_qcd_LE1_chr22_annot_ensembl_missense
Go ahead and open
./data/processed/all_hg38_qcd_LE1_chr22_annot_ensembl_missense.pvar (contains general information for each variant). Search for
19963748 .
You will see the functional annotation appended to the end of the description.
22 19963748 22:19963748G,A G A AC=2370;AF=0.370081;CM=21.2869;AN=6404;AN_EAS=1170;AN_AMR=980;AN_EUR=1266;AN_AFR=1786;AN_SAS=1202;AN_EUR_unrel=1006;AN_EAS_unrel=1008;AN_AMR_unrel=694;AN_SAS_unrel=978;AN_AFR_unrel=1322;AF_EAS=0.282051;AF_AMR=0.386735;AF_EUR=0.489731;AF_AFR=0.282195;AF_SAS=0.446755;AF_EUR_unrel=0.5;MAF_EUR_unrel=0.5;AF_EAS_unrel=0.279762;MAF_EAS_unrel=0.279762;AF_AMR_unrel=0.377522;MAF_AMR_unrel=0.377522;AF_SAS_unrel=0.440695;MAF_SAS_unrel=0.440695;AF_AFR_unrel=0.280635;MAF_AFR_unrel=0.280635;AC_EAS=330;AC_AMR=379;AC_EUR=620;AC_AFR=504;AC_SAS=537;AC_EUR_unrel=503;AC_EAS_unrel=282;AC_AMR_unrel=262;AC_SAS_unrel=431;AC_AFR_unrel=371;AC_Het_EAS=234;AC_Het_AMR=261;AC_Het_EUR=310;AC_Het_AFR=356;AC_Het_SAS=285;AC_Het_EUR_unrel=237;AC_Het_EAS_unrel=198;AC_Het_AMR_unrel=176;AC_Het_SAS_unrel=225;AC_Het_AFR_unrel=255;AC_Het=1446;AC_Hom_EAS=96;AC_Hom_AMR=118;AC_Hom_EUR=310;AC_Hom_AFR=148;AC_Hom_SAS=252;AC_Hom_EUR_unrel=266;AC_Hom_EAS_unrel=84;AC_Hom_AMR_unrel=86;AC_Hom_SAS_unrel=206;AC_Hom_AFR_unrel=116;AC_Hom=924;HWE_EAS=0.759847;ExcHet_EAS=0.665216;HWE_AMR=0.00762833;ExcHet_AMR=0.00441707;HWE_EUR=0.633307;ExcHet_EUR=0.727715;HWE_AFR=0.62088;ExcHet_AFR=0.719037;HWE_SAS=0.322565;ExcHet_SAS=0.864597;HWE=0.0749371;ExcHet=0.965854;BCSQ=missense|COMT|ENST00000403184|protein_coding|+|158V>158M|19963748G>A
Variant IDs >>