A fault-tolerant method for HLA typing with PacBio data
- Chia-Jung Chang†1,
- Pei-Lung Chen†2, 3, 4, 5,
- Wei-Shiung Yang2, 3, 4 and
- Kun-Mao Chao1, 6Email author
© Chang et al.; licensee BioMed Central Ltd. 2014
Received: 3 June 2014
Accepted: 26 August 2014
Published: 3 September 2014
Human leukocyte antigen (HLA) genes are critical genes involved in important biomedical aspects, including organ transplantation, autoimmune diseases and infectious diseases. The gene family contains the most polymorphic genes in humans and the difference between two alleles is only a single base pair substitution in many cases. The next generation sequencing (NGS) technologies could be used for high throughput HLA typing but in silico methods are still needed to correctly assign the alleles of a sample. Computer scientists have developed such methods for various NGS platforms, such as Illumina, Roche 454 and Ion Torrent, based on the characteristics of the reads they generate. However, the method for PacBio reads was less addressed, probably owing to its high error rates. The PacBio system has the longest read length among available NGS platforms, and therefore is the only platform capable of having exon 2 and exon 3 of HLA genes on the same read to unequivocally solve the ambiguity problem caused by the “phasing” issue.
We proposed a new method BayesTyping1 to assign HLA alleles for PacBio circular consensus sequencing reads using Bayes’ theorem. The method was applied to simulated data of the three loci HLA-A, HLA-B and HLA-DRB1. The experimental results showed its capability to tolerate the disturbance of sequencing errors and external noise reads.
The BayesTyping1 method could overcome the problems of HLA typing using PacBio reads, which mostly arise from sequencing errors of PacBio reads and the divergence of HLA genes, to some extent.
Human leukocyte antigen (HLA) system contains a set of genes that encode for major histocompatibility complex (MHC) in humans. The main function of MHC molecules is to mediate interactions between antigen-presenting cells, various lymphocytes and other body cells; therefore, malfunctions of HLA may associate with certain disorders in the immune system, for example, drug hypersensitivity reactions  and some autoimmune diseases, e.g., type 1 diabetes and systemic lupus erythematosus . HLA also plays an important role in transplantation of organs or stem cells [3, 4] and is associated with infectious diseases such as HIV .
There are 10,533 HLA alleles in the IMGT/HLA Database  and the number is still increasing. The HLA genes are the most polymorphic genes in humans and the difference between two alleles is often only a single base pair substitution. There are two main classes of HLA genes. The class I HLA genes (HLA-A, -B, and -C) each encodes a glycoprotein chain in association with the monomorphic molecule β2-microglobulin on the cell surface of most somatic cells, and the class II HLA genes (HLA-DP, -DQ and -DR) each encodes an α or a β glycoprotein chain associated as heterodimers on the cell surface of antigen-presenting cells . The exon 2 and exon 3 sequence of class I HLA genes and the exon 2 sequence of class II HLA genes form the critical peptide-binding groove responsible for the specificity of peptide recognition and binding .
It has been shown that high-resolution HLA matching improves survival rates of marrow transplantation . Therefore, to identify the alleles of a sample, it is better to use DNA-based methods instead of serological approaches . In addition, with the advance of the next generation sequencing (NGS) technologies, HLA typing by NGS seems to be a promising approach for HLA sequencing and allele assignment owing to its efficiency and cost effectiveness.  reviewed the latest approaches of template preparation, sequencing platforms and data-analysis for HLA typing by NGS. It was showed that the four major NGS platforms, Roche GS 454 FLX, Illumina MiSeq/HiSeq, PGM Ion Torrent and Pacific Biosciences SMRT (PacBio), were all capable of producing sequences suitable for the resolution of HLA genotyping.
However, among the four platforms, HLA typing by PacBio was the least addressed. For example, the module HLA Typing in the software Omixon Target from Omixon only works with Illumina, Roche 454 and Ion Torrent. The software NGSengine from GenDx is platform independent but is optimized for Roche 454, Ion Torrent, MiSeq. It might be due to the high error rate of PacBio (about 15-20%), which makes it more difficult to genotype polymorphic regions such as HLA. Moreover, to sequence multiple samples simultaneously, sequences could be labelled with barcodes for identification of samples . With a higher error rate, there are more barcode-calling errors and reads are more likely identified as of wrong samples. To our knowledge, the troublesome issue related to wrong barcode assignment in PacBio HLA typing has not been addressed previously.
Characteristics of CLR and CCS reads for targeted sequencing applications excerpted from 
≥98% (3 pass)
Maximum Mean Readlength
To solve the challenging “ambiguity” related to genotyping of class I HLA genes, the targeted sequences have to cover the whole region of exon 2 and exon 3 (and also intron 2 in between the two exons), which is about 1 Kb, and the read lengths of both types fit the requirement. Both types of reads are also proved to be effective in detecting genomic variants . However, in order to save cost for diagnosis applications of HLA typing, it is better to apply the barcode multiplexing technology . With a high error rate, the barcodes of CLR reads are much more likely to be mis-mapped and the reads are more likely identified as of wrong samples. Therefore, we choose CCS reads as the target of our method.
One possibility to genotype HLA is to use assembly methods such as CAP3 and PCAP [15, 16] to recover the two genomic sequences of a sample and compare them with the alleles. However, there are still problems using CCS reads. First, the number of reads for each amplicon of each sample is not high (depending on the number of barcodes used). Take the experiment in  for example. There were 2,352 distinct sequence products (49 amplicons × 48 barcode pairs). When the insert size is 1 Kb, the average number of reads for each distinct product was only 6.38. Second, the error rate of CCS is still too high to distinguish two HLA alleles having only little base-pair difference. Third, there are still barcode-calling errors for CCS reads (data not shown), which could induce noise reads and increase the obstacles of HLA typing.
To address these problems, here we propose a method using Bayes’ theorem. Given a few CCS reads generated from the target sequence (of the regions containing exon 2 and exon 3 for class I HLA genes or of the regions containing exon 2 for class II genes) of a sample, our method is able to correctly assign the pair of alleles of the sample. We simulated the alleles for each sample and the CCS reads generated according to the alleles. Different levels of reads from wrong samples were added to disturb the experiments. The experimental results showed that our method can stand for a high percentage of noise reads.
The simulation follows the setting of the multiplexing targeted sequencing technology . In each run, there are multiple samples with multiple amplicons sequenced simultaneously. We assume that the reads have been grouped by their samples and amplicons (loci of HLA) and the reads might be identified as of wrong samples due to barcode-calling errors.
Alleles of the samples
The alleles of the samples were assigned following the distribution of the Taiwan Minnan population . We obtained the frequencies of alleles from the allele frequency net database [18, 19]. For the alleles with zero frequency, we gave them 0.1% frequencies of appearance. In their study, only HLA-A, B and DRB1 are involved, so we only simulated HLA on these three loci. The table of frequencies for the three loci can be found in Additional file 1: Table S1. When implementing our methods, the frequencies were normalized to make the summation of each loci equal 1.
Linkage disequilibrium was not concerned, which means alleles on different loci are assigned independently. In reality, given the probability density function of the pairs of alleles in a population, we can adjust our methods by setting p(a i ,a j ) in Equation (1) accordingly. In addition, since the frequencies were censused for alleles with 2-digit resolution, the alleles of higher resolution were selected with uniform distribution. To observe the impacts of homozygous samples on genotyping, 30% of the samples have two identical alleles.
Target sequences of the alleles
The HLA sequences were downloaded from the IGMT/HLA Database Release 3.15.0 . Since there are only CDSs instead of genomic sequences for most alleles in the database, we created the genomic sequences and corresponding target sequences of our own.
Three HLA loci and their corresponding reference alleles
Reads generated from the target sequences
The CCS reads for the target sequence of an allele were produced with PBSIM , which is the only simulator that generates PacBio libraries as far as we know. We adopted its model-based method and the default settings for CCS reads (length-mean=450, length-sd=170, accuracy-mean=0.98, accuracy-sd=0.02).
Types of runs
Three types of runs with the same total number of reads: 960
Bayes’ theorem for HLA typing
Given the set of reads assigned to a sample (see the smaller rectangles in the rightmost block of Figure 2), we used Bayes’ theorem to infer the pair of alleles (a p ,a q ) of the sample.
The probability p(a i ,a j ), which is the probability of a random sample having the allele pair (a i ,a j ), depends on the population of the sample and we assumed all p(a i ,a j )’s are the same when the population is unknown. To find the pair of alleles (a p ,a q ) that maximize formula (1) when the r1,…,r n are fixed, it is sufficient to compare p(r1,…,r n |a i ,a j ).
Under this definition, p(r k |a i ) stands for the probability when a sequence of length |r k | generated by a i equals the sequence of r k . The summation of this function for all possible sequences of length |r k | would be 1 and therefore, it is a legal probability function.
Using this method, the number of reads produced by a p (or a q ) can be estimated as the number of reads whose |r k =a p | is greater than |r k =a q |. When the number of reads of one allele is far less than that of the other (e.g. 50%), the sample is regarded as having two identical alleles.
This method works well when the given set of reads are all from the alleles of the sample (i.e. the correct reads). However, the barcode-calling errors might result in mixing reads from different samples (i.e., the noise reads). Alleles that are close to both the correct reads and the noise reads are more likely to be predicted as the answers. To deal with the problem, we assumed there are a few number of noise reads before selecting the pair (a p ,a q ).
In the equation, ρ k =0 means the read r k is a correct read and ρ k =1 means r k is a noise read. Note that equation 7 is the same as equation 6 when m=0. We name the methods based on equation 6 and equation 7 as B a y e s T y p i n g0 and B a y e s T y p i n g1, respectively.
The value |r k =a i | can be calculated by aligning the read r k and the allele a i . We use the score of the alignment instead of the number of matches in our program because the score catches more information (e.g. indels).
To accelerate, we excluded those unlikely alleles, which had less than ten percent of reads satisfying |r k =a i |= maxall x|r k =a x | because we assume the correct alleles tend to have more number of best matching reads. To reduce those impacts of noise reads, we excluded the unlikely reads, which had less than 96% of identities to all the remained alleles. The pre-processing steps are illustrated in Figure 3.
At last, we used the methods B a y e s T y p i n g0 and B a y e s T y p i n g1 to infer the pair of alleles and their corresponding reads. The original sequences of the corresponding reads can further be used to assemble the genomic sequences of the alleles.
Results and discussion
Accuracies of BayesTyping0 for experiments without noise reads
For experiments with noise reads, we compared our methods with a method MaxTwo, which gives the first two alleles by comparing the number of reads having the maximum alignment scores with them. For all the three methods, when the number of reads of one allele is less than 50% of the number of the other allele, the sample is regarded as having two identical alleles.
It could be presumed that the error rates would increase as the number of noise reads increases, e.g., when there are more barcode-calling errors, or the number of correct reads decrease, e.g., when multiplexing more samples. For example, for HLA-A, the error rate is more than 10% for type 3 experiments when only 10 noise reads are induced. BayesTyping1 showed the best capability to tolerate the disturbance of the noise reads. Even when there were no noise reads, which conflicted with the assumption of BayesTyping1, BayesTyping1 also performed well. One the other hand, BayesTyping0 usually performed best when there are no noise reads, but it suffered as the number of noise reads increased. It even performed worse than MaxTwo when the noise reads outnumbered the correct reads.
The difference error rates between the three loci might reflect the characteristics of the sequences of alleles currently gathered. Although the number of HLA-B alleles is more than that of HLA-A, the HLA-B sequences seem more distinguishable because the error rates of typing were lower. It seems that HLA-DRB1 has the best distinguishable alleles.
It showed that a higher level of m worked better when noise reads were added, i.e., given any vertical line, a higher level of m has a fewer error rate. The difference of error rates also becomes larger as the number of noise reads increases. However, the error rates would converge at some point and a larger m would make little effect, i.e., given any vertical line, the difference of error rates shrinks as the level of m increases. When m is too large and there are only few numbers of noise reads, BayesTyping1 will perform worse than BayesTyping0 (data not shown). Theoretically, all pairs of alleles have the same probability when m is 100% of the input reads.
Ambiguous allele combinations
When the read length is not long enough to cover the region of exon 2 + intron + exon 3, for the samples with allele a and allele b (or allele c and allele d), there is no way to distinguish which combination of alleles is correct.
The results of BayesTyping1 on 1 Kb and 450 bp reads for 24 samples with ambiguous pairs of alleles
1 Kb (correct)
1 Kb (ambiguous)
450 bp (correct)
450 bp (ambiguous)
It showed that using 1 Kb reads, BayesTyping1 could correctly assign the pairs of alleles without ambiguity in most cases, even when the number of the noise reads equalled the number of correct reads. On the other hand, using 450 bp reads, BayesTyping1 could also achieve good accuracies. This might be due to the variation of PacBio read length. With higher depth of coverage, there are still a few reads that are long enough to cover exon 2 and exon 3. It is surprising that BayesTyping1 caused much more ambiguities when the number of noise reads was fewer. It might be because BayesTyping1 tends to treat longer reads as noise reads when there are no noise reads in fact.
Diversity of noise reads
The error rates of these three experiments showed not much difference. The number of correct reads, the number of noise reads and the difference between the two numbers play a much more important role.
Homozygous and heterozygous samples
Four numbers for Fisher’s exact test of type 2 experiments
Results of Fisher’s exact test
The experimental results showed that BayesTyping1 can identify HLA alleles accurately using reasonably low number of PacBio CCS reads. BayesTyping1 can tolerate sequencing errors, which are introduced by the PacBio sequencing technology, and noise reads, which are introduced by barcode-calling errors, to some degree. The three types of experiments suggest it is better to multiplex 12 or 24 samples instead of 48 samples to maintain a high accuracy, since the number of reads for each sample in a 48-sample example might be too few for HLA typing.
This research was supported in part by Ministry of Education, R.O.C, under the Grant AE-00-00-06. KMC and CJC were supported in part by NSC grants 100-2221-E-002-131-MY3 and 101-2221-E-002-063-MY3. PLC and WSY were supported in part by an NTU-NTUH grant UN102-067.
- Mallal S, Nolan D, Witt C, Masel G, Martin A, Moore C, Sayer D, Castley A, Mamotte C, Maxwell D, James I, Christiansen FT: Association between presence of HLA-B* 5701, HLA-DR7, and HLA-DQ3 and hypersensitivity to HIV-1 reverse-transcriptase inhibitor abacavir. The Lancet. 2002, 359 (9308): 727-732. 10.1016/S0140-6736(02)07873-X.View ArticleGoogle Scholar
- Lie BA, Thorsby E: Several genes in the extended human MHC contribute to predisposition to autoimmune diseases. Curr Opin Immunol. 2005, 17 (5): 526-531. 10.1016/j.coi.2005.07.001.View ArticlePubMedGoogle Scholar
- Tiercy J-M: Molecular basis of HLA polymorphism: implications in clinical transplantation. Transpl Immunol. 2002, 9 (2): 173-180.View ArticlePubMedGoogle Scholar
- Tait BD: The ever-expanding list of HLA alleles: changing HLA nomenclature and its relevance to clinical transplantation. Transplant Rev. 2011, 25 (1): 1-8. 10.1016/j.trre.2010.08.001.View ArticleGoogle Scholar
- Gao X, Bashirova A, Iversen AK, Phair J, Goedert JJ, Buchbinder S, Hoots K, Vlahov D, Altfeld M, O’Brien SJ, Carrington M: AIDS restriction HLA allotypes target distinct intervals of HIV-1 pathogenesis. Nat Med. 2005, 11 (12): 1290-1292. 10.1038/nm1333.View ArticlePubMedGoogle Scholar
- Robinson J, Halliwell JA, McWilliam H, Lopez R, Parham P, Marsh SG: The IMGT/HLA database. Nucleic Acids Res. 2013, 41 (D1): 1222-1227. 10.1093/nar/gks949.View ArticleGoogle Scholar
- Erlich H, Opelz G, Hansen J: HLA DNA typing and transplantation. Immunity. 2001, 14 (4): 347-356. 10.1016/S1074-7613(01)00115-7.View ArticlePubMedGoogle Scholar
- Lee S. J, Klein J, Haagenson M, Baxter-Lowe LA, Confer DL, Eapen M, Fernandez-Vina M, Flomenberg N, Horowitz M, Hurley CK, Noreen H, Oudshoorn M, Petersdorf E, Setterholm M, Spellman S, Weisdorf D, Williams TM, Anasetti C: High-resolution donor-recipient HLA matching contributes to the success of unrelated donor marrow transplantation. Blood. 2007, 110 (13): 4576-4583. 10.1182/blood-2007-06-097386.View ArticlePubMedGoogle Scholar
- Middleton D: History of DNA typing for the human MHC. Rev Immunogenet. 1998, 1 (2): 135-156.Google Scholar
- De Santis D, Dinauer D, Duke J, Erlich H, Holcomb C, Lind C, Mackiewicz K, Monos D, Moudgil A, Norman P, Parham P, Sasson A: Allcock RJ: 16th ihiw: Review of hla typing by ngs. Int J Immunogenet. 2013, 40 (1): 72-76.View ArticlePubMedGoogle Scholar
- PacBio: Multiplexing Targeted Sequencing using Barcodes. Technical Note PN 100-114-500-01, Pacific Biosciences 2012, [http://www.pacificbiosciences.com/pdf/TN_Multiplexing_Targeted_Sequencing_Using_Barcodes.pdf],
- Adams SD, Barracchini KC, Chen D, Robbins F, Wang L, Larsen P, Luhm R, Stroncek DF: Ambiguous allele combinations in HLA Class I and Class II sequence-based typing: when precise nucleotide sequencing leads to imprecise allele identification. J Transl Med. 2004, 2 (1): 30-10.1186/1479-5876-2-30.View ArticlePubMed CentralPubMedGoogle Scholar
- Roberts RJ, Carneiro MO, Schatz MC: The advantages of SMRT sequencing. Genome Biol. 2013, 14: 405-10.1186/gb-2013-14-6-405.View ArticlePubMedGoogle Scholar
- PacBio: Targeted Sequencing – SNP Detection and Validation. Technical Note PN 100-092-600-03, Pacific Biosciences 2012, ,Google Scholar
- Huang X, Madan A: CAP3: A DNA sequence assembly program. Genome Res. 1999, 9 (9): 868-877. 10.1101/gr.9.9.868.View ArticlePubMed CentralPubMedGoogle Scholar
- Huang X, Wang J, Aluru S, Yang S-P, Hillier L: PCAP: a whole-genome assembly program. Genome Res. 2003, 13 (9): 2164-2170. 10.1101/gr.1390403.View ArticlePubMed CentralPubMedGoogle Scholar
- Shaw CK, Chen LL, Lee A, Lee TD: Distribution of HLA gene and haplotype frequencies in Taiwan: a comparative study among Min-nan, Hakka, Aborigines and mainland Chinese. Tissue Antigens. 1999, 53 (1): 51-64. 10.1034/j.1399-0039.1999.530106.x.View ArticlePubMedGoogle Scholar
- The Allele Frequency Net Database. [http://www.allelefrequencies.net],
- Gonzalez-Galarza FF, Christmas S, Middleton D, Jones AR: Allele frequency net: a database and online repository for immune gene frequencies in worldwide populations. Nucleic Acids Res. 2011, 39: 913-919. 10.1093/nar/gkq911.View ArticleGoogle Scholar
- IGMT/HLA Database. [http://www.ebi.ac.uk/ipd/imgt/hla/],
- Ono Y, Asai K, Hamada M: PBSIM: PacBio reads simulator–toward accurate genome assembly. Bioinformatics. 2013, 29 (1): 119-121. 10.1093/bioinformatics/bts649.View ArticlePubMedGoogle Scholar
- Harris RS: Improved pairwise alignment of genomic dna. PhD thesis. The Pennsylvania State University; 2007,Google Scholar
- Agresti A: Categorical Data Analysis, vol 359. 2002, Gainesvilla, Florida: John Wiley & Sons, 91–101View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.