D: 6 Dec 2024
Reference alleles

Significance
The reference allele (REF) is the allele (variant) against which other alleles are compared and on which operations are based. Originally, GWAS had no set convention, and the reference was often the major allele within the sampled dataset. That assignment could change from sample to sample, which caused problems when comparing studies. Today, REF is fixed by a population-level standard to ensure consistency, typically a well-characterized reference genome such as the human reference genome (GRCh38).

Objective
Appreciate how older Plink files do not preserve REF. Reset REF using the GRCh38 FASTA file.

Accidentally swapped REF and ALT alleles
Plink 2:
time plink2 \
--pfile 'vzs' ./data/raw/all_hg38 \
--make-bed \
--thin-count 10 --seed 111 --threads 1 --memory 8000 require \
--out ./data/processed/all_hg38_thin10
Plink 1.x:
time plink19 \
--bfile ./data/processed/all_hg38_thin10 \
--make-bed \
--out ./data/processed/all_hg38_thin10_plink19
Here we first converted the GRCh38-referenced data to the bed format that older versions (Plink 1.9 and 1) can read, then used Plink 1.9 to remake the bed files. By default, Plink 1.x assigns alleles by a sample-based rule that treats the major allele as REF. Let's compare the results. Open the resulting .bim files: ./data/processed/all_hg38_thin10.bim and ./data/processed/all_hg38_thin10_plink19.bim.
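The same comparison can be done from the shell instead of by eye. The following is a minimal sketch, assuming both .bim files contain the same variant IDs in column 2 and no header line; the function name bim_allele_diff is hypothetical and not part of Plink.

```shell
# Hypothetical helper (not a Plink command): list variants whose allele
# columns (5 and 6 of a .bim file) differ between two .bim files.
# Variants are matched on the ID in column 2.
bim_allele_diff() {
  awk '
    NR == FNR { a[$2] = $5 " " $6; next }   # first file: remember alleles per ID
    ($2 in a) && a[$2] != $5 " " $6 {       # second file: report mismatches
      print $2, a[$2], "->", $5, $6
    }
  ' "$1" "$2"
}

# Usage, with the file names produced by the commands above:
# bim_allele_diff ./data/processed/all_hg38_thin10.bim \
#                 ./data/processed/all_hg38_thin10_plink19.bim
```

Each output line gives the variant ID, the allele pair in the first file, and the allele pair in the second file. Variants whose alleles agree in both files are not printed.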
Plink 2 pfile-->bed (original data)
CHR  ID            CM  BP         ALT  REF
4    rs548499580   0   9970397    T    C
6    rs189327745   0   169876868  G    C
8    rs11993439    0   143242766  G    A
8    rs182875526   0   144703073  T    C
9    rs1307812694  0   64039847   T    C
10   rs1838765422  0   5078561    C    G
12   rs1493770     0   91727497   A    G
13   rs7332542     0   110188033  T    C
15   rs559747534   0   40486304   G    T
X    rs184776415   0   54756962   G    C

Plink 2 bed --> Plink 1.9 bed (Plink 1.x formatted data)
CHR  ID            CM  BP         ALT  REF
4    rs548499580   0   9970397    T    C
6    rs189327745   0   169876868  G    C
8    rs11993439    0   143242766  A    G
8    rs182875526   0   144703073  T    C
9    rs1307812694  0   64039847   T    C
10   rs1838765422  0   5078561    C    G
12   rs1493770     0   91727497   G    A
13   rs7332542     0   110188033  T    C
15   rs559747534   0   40486304   G    T
23   rs184776415   0   54756962   G    C

Walking down the REF column of the original data and of the Plink 1.x processed data, we can see that some of the REF assignments are flipped (here rs11993439 and rs1493770). What happens when we try to go back to Plink 2?

Plink 1.9 bed --> Plink 2 pvar
Plink 2:
time plink2 \
--bfile ./data/processed/all_hg38_thin10_plink19 \
--make-pfile \
--out ./data/processed/all_hg38_thin10_plink19_plink2
Check the resulting pvar file: ./data/processed/all_hg38_thin10_plink19_plink2.pvar
#CHROM  POS        ID            REF  ALT
4       9970397    rs548499580   C    T
6       169876868  rs189327745   C    G
8       143242766  rs11993439    G    A
8       144703073  rs182875526   C    T
9       64039847   rs1307812694  C    T
10      5078561    rs1838765422  G    C
12      91727497   rs1493770     A    G
13      110188033  rs7332542     C    T
15      40486304   rs559747534   T    G
X       54756962   rs184776415   C    G

Note that pvar and bim files list the REF and ALT columns in opposite order (REF then ALT in pvar, ALT then REF in bim). With that in mind, compare the pvar to the bim files above. Walking down the REF column, we see that the flipped allele assignments from Plink 1.9 are retained in the pvar. Not good.

(Re-)referencing GWAS data
Fortunately, Plink 2 has a simple method to set or re-reference the data set.
Plink 2:
time plink2 \
--pfile ./data/processed/all_hg38_thin10_plink19_plink2 \
--make-pfile \
--ref-from-fa \
--fa ./data/raw/GRCh38_full_analysis_set_plus_decoy_hla.fa.zst \
--out ./data/processed/all_hg38_thin10_plink19_plink2_reref
--ref-from-fa   Command to set the REF allele using the FASTA file given by --fa

Check the pvar file. You will see that the variants are now referenced to the original build assignment that we started with. (Keep in mind that the REF and ALT column order is swapped between pvar and bim.)

#CHROM  POS        ID            REF  ALT
4       9970397    rs548499580   C    T
6       169876868  rs189327745   C    G
8       143242766  rs11993439    A    G
8       144703073  rs182875526   C    T
9       64039847   rs1307812694  C    T
10      5078561    rs1838765422  G    C
12      91727497   rs1493770     G    A
13      110188033  rs7332542     C    T
15      40486304   rs559747534   T    G
X       54756962   rs184776415   C    G
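To confirm the re-referencing without reading the table by eye, the REF column of a pvar file can be checked against the REF column of the original bim file. This is a minimal sketch, assuming an uncompressed .pvar with ID in column 3 and REF in column 4, and a .bim with ID in column 2 and REF in column 6 (the column labels used in this tutorial); the function name pvar_ref_mismatch is hypothetical and not part of Plink.

```shell
# Hypothetical helper (not a Plink command): report variants whose REF in a
# .pvar file differs from the REF column of a .bim file, matched on variant ID.
# Arguments: $1 = .pvar file, $2 = .bim file.
pvar_ref_mismatch() {
  awk '
    NR == FNR { ref[$2] = $6; next }   # .bim is read first: ID col 2, REF col 6
    /^#/ { next }                      # skip .pvar header lines
    ($3 in ref) && ref[$3] != $4 {     # .pvar: ID col 3, REF col 4
      print $3, "bim REF=" ref[$3], "pvar REF=" $4
    }
  ' "$2" "$1"
}

# Usage, with the file names produced by the commands above:
# pvar_ref_mismatch ./data/processed/all_hg38_thin10_plink19_plink2_reref.pvar \
#                   ./data/processed/all_hg38_thin10.bim
```

No output means every shared variant has the same REF in both files. Running the same check on the pvar produced before re-referencing should list the variants whose REF was flipped by the Plink 1.9 round trip.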