Bloom filters in bioinformatics

Bloom filters are space-efficient probabilistic data structures used to test whether an element is a part of a set. Bloom filters require much less space than other data structures for representing sets, however the downside of Bloom filters is that there is a false positive rate when querying the data structure. Since multiple elements may have the same hash values for a number of hash functions, then there is a probability that querying for a non-existent element may return a positive if another element with the same hash values has been added to the Bloom filter. Assuming that the hash function has equal probability of selecting any index of the Bloom filter, the false positive rate of querying a Bloom filter is a function of the number of bits, number of hash functions and number of elements of the Bloom filter. This allows the user to manage the risk of a getting a false positive by compromising on the space benefits of the Bloom filter.

Bloom filters are primarily used in bioinformatics to test the existence of a k-mer in a sequence or set of sequences. The k-mers of the sequence are indexed in a Bloom filter, and any k-mer of the same size can be queried against the Bloom filter. This is a preferable alternative to hashing the k-mers of a sequence with a hash table, particularly when the sequence is very long, since it is very demanding to store large numbers of k-mers in memory.

Applications edit

Sequence characterization edit

 
A visualization of querying a Bloom filter of k-mers of a DNA sequence. The first step is storing the k-mers of a sequence into a Bloom filter. Querying is done similarly where the query sequence is broken down into its corresponding k-mers, and the k-mers are used to query the Bloom filter.

The preprocessing step in many bioinformatics applications involves classifying sequences, primarily classifying reads from a DNA sequencing experiment. For example, in metagenomic studies it is important to be able to tell if a sequencing read belongs to a new species.[1] and in clinical sequencing projects it is vital to filter out reads from the genomes of contaminating organisms. There are many bioinformatics tools that use Bloom filters to classify reads by querying k-mers of a read to a set of Bloom filters generated from known reference genomes. Some tools that use this method are FACS[2] and BioBloom tools.[3] While these methods may not outclass other bioinformatics classification tools like Kraken,[4] they offer a memory-efficient alternative.

A recent area of research with Bloom filters in sequence characterization is in developing ways to query raw reads from sequencing experiments. For example, how can one determine which reads contain a specific 30-mer in the entire NCBI Sequence Read Archive? This task is similar to that which is accomplished by BLAST, however it involves querying a much larger dataset; while BLAST queries against a database of reference genomes, this task demands that specific reads that contain the k-mer are returned. BLAST and similar tools cannot handle this problem efficiently, therefore Bloom filter based data structures have been implemented to this end. Binary bloom trees[5] are binary trees of Bloom filters that facilitates querying transcripts in large RNA-seq experiments. BIGSI[6] borrows bitsliced signatures from the field of document retrieval to index and query the entirety of microbial and viral sequencing data in the European Nucleotide Archive. The signature of a given dataset is encoded as a set of Bloom filters from that dataset.

Genome assembly edit

 
Comparison between a hash table and a Bloom filter to store a de Bruijn graph in memory. Note that while edge information may be preserved in a hash table, it is not stored in a Bloom filter, which complicates graph traversal. A Bloom filter of the same size as a hash table will still use less space due to not storing the values of the k-mers themselves.

The memory efficiency of Bloom filters has been used in genome assembly as a way to reduce the space footprint of k-mers from sequencing data. The contribution of Bloom filter based assembly methods is combining Bloom filters and de Bruijn graphs into a structure called a probabilistic de Bruijn graph,[7] which optimizes memory usage at the cost of the false positive rate inherent to Bloom filters. Instead of storing the de Bruijn graph in a hash table, it is stored in a Bloom filter.

Using a Bloom filter to store the de Bruijn graph complicates the graph traversal step to build the assembly, since edge information is not encoded in the Bloom filter. Graph traversal is accomplished by querying the Bloom filter for any of the four possible subsequent k-mers from the current node. For example, if the current node is for the k-mer ACT, then the next node must be for one of the k-mers CTA, CTG, CTC or CTT. If a query k-mer exists in the Bloom filter, then the k-mer is added to the path. Therefore, there are two sources for false positives in querying the Bloom filter when traversing the de Bruijn graph. There is the probability that one or more of the three false k-mers exist elsewhere in the sequencing set to return a false positive, and there is the aforementioned inherent false positive rate of the Bloom filter itself. The assembly tools that use Bloom filters must account for these sources of false positives in their methods. ABySS 2[8] and Minia[9] are examples of assemblers that uses this approach for de novo assembly.

Sequencing error correction edit

Next-generation sequencing (NGS) methods have allowed the generation of new genome sequences much faster and cheaper than the previous Sanger sequencing methods. However, these methods have a higher error rate,[10][11] which complicates downstream analysis of the sequence and can even give rise to erroneous conclusions. Many methods have been developed to correct the errors in NGS reads, but they use large amounts of memory which makes them impractical for large genomes, such as the human genome. Therefore, tools using Bloom filters have been developed to address these limitations, taking advantage of their efficient memory usage. Musket[12] and  BLESS[13] are examples of such tools. Both methods use the k-mer spectrum approach for error correction. The first step of this approach is to count the multiplicity of k-mers, however while BLESS only uses Bloom filters to store the counts, Musket uses Bloom filters only to count unique k-mers, and stores non-unique k-mers in a hash table, as described in a previous work[14]

RNA-Seq edit

Bloom filters are also employed in some RNA-Seq pipelines. RNA-Skim[15] clusters RNA transcripts and then uses Bloom filters to find sig-mers: k-mers that are only found in one of the clusters. These sig-mers are then used to estimate the transcript abundance levels. Therefore, it does not analyze every possible k-mer which results in performance and memory-usage improvements, and has been shown to work as well as previous methods.

References edit

  1. ^ Lundeberg, Joakim; Arvestad, Lars; Andersson, Björn; Allander, Tobias; Käller, Max; Stranneheim, Henrik (2010-07-01). "Classification of DNA sequences using Bloom filters". Bioinformatics. 26 (13): 1595–1600. doi:10.1093/bioinformatics/btq230. ISSN 1367-4803. PMC 2887045. PMID 20472541.
  2. ^ Lundeberg, Joakim; Arvestad, Lars; Andersson, Björn; Allander, Tobias; Käller, Max; Stranneheim, Henrik (2010-07-01). "Classification of DNA sequences using Bloom filters". Bioinformatics. 26 (13): 1595–1600. doi:10.1093/bioinformatics/btq230. ISSN 1367-4803. PMC 2887045. PMID 20472541.
  3. ^ Chu, Justin; Sadeghi, Sara; Raymond, Anthony; Jackman, Shaun D.; Nip, Ka Ming; Mar, Richard; Mohamadi, Hamid; Butterfield, Yaron S.; Robertson, A. Gordon (2014-12-01). "BioBloom tools: fast, accurate and memory-efficient host species sequence screening using bloom filters". Bioinformatics. 30 (23): 3402–3404. doi:10.1093/bioinformatics/btu558. ISSN 1367-4811. PMC 4816029. PMID 25143290.
  4. ^ Wood, Derrick E.; Salzberg, Steven L. (2014-03-03). "Kraken: ultrafast metagenomic sequence classification using exact alignments". Genome Biology. 15 (3): R46. doi:10.1186/gb-2014-15-3-r46. ISSN 1474-760X. PMC 4053813. PMID 24580807.
  5. ^ Carl Kingsford; Solomon, Brad (March 2016). "Fast search of thousands of short-read sequencing experiments". Nature Biotechnology. 34 (3): 300–302. doi:10.1038/nbt.3442. ISSN 1546-1696. PMC 4804353. PMID 26854477.
  6. ^ Iqbal, Zamin; McVean, Gil; Rocha, Eduardo P. C.; Bakker, Henk C. den; Bradley, Phelim (February 2019). "Ultrafast search of all deposited bacterial and viral genomic data". Nature Biotechnology. 37 (2): 152–159. doi:10.1038/s41587-018-0010-1. ISSN 1546-1696. PMC 6420049. PMID 30718882.
  7. ^ Brown, C. Titus; Tiedje, James M.; Howe, Adina; Canino-Koning, Rosangela; Hintze, Arend; Pell, Jason (2012-08-14). "Scaling metagenome sequence assembly with probabilistic de Bruijn graphs". Proceedings of the National Academy of Sciences. 109 (33): 13272–13277. arXiv:1112.4193. Bibcode:2012PNAS..10913272P. doi:10.1073/pnas.1121464109. ISSN 0027-8424. PMC 3421212. PMID 22847406.
  8. ^ Birol, Inanc; Warren, Rene L.; Coombe, Lauren; Khan, Hamza; Jahesh, Golnaz; Hammond, S. Austin; Yeo, Sarah; Chu, Justin; Mohamadi, Hamid (2017-05-01). "ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter". Genome Research. 27 (5): 768–777. doi:10.1101/gr.214346.116. ISSN 1088-9051. PMC 5411771. PMID 28232478.
  9. ^ Chikhi, Rayan; Rizk, Guillaume (2013-09-16). "Space-efficient and exact de Bruijn graph representation based on a Bloom filter". Algorithms for Molecular Biology. 8 (1): 22. doi:10.1186/1748-7188-8-22. ISSN 1748-7188. PMC 3848682. PMID 24040893.
  10. ^ Loman, Nicholas J.; Misra, Raju V.; Dallman, Timothy J.; Constantinidou, Chrystala; Gharbia, Saheer E.; Wain, John; Pallen, Mark J. (May 2012). "Performance comparison of benchtop high-throughput sequencing platforms". Nature Biotechnology. 30 (5): 434–439. doi:10.1038/nbt.2198. ISSN 1546-1696. PMID 22522955. S2CID 5300923.
  11. ^ Wang, Xin Victoria; Blades, Natalie; Ding, Jie; Sultana, Razvan; Parmigiani, Giovanni (2012-07-30). "Estimation of sequencing error rates in short reads". BMC Bioinformatics. 13: 185. doi:10.1186/1471-2105-13-185. ISSN 1471-2105. PMC 3495688. PMID 22846331.
  12. ^ Schmidt, Bertil; Schröder, Jan; Liu, Yongchao (2013-02-01). "Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data". Bioinformatics. 29 (3): 308–315. doi:10.1093/bioinformatics/bts690. ISSN 1367-4803. PMID 23202746.
  13. ^ Hwu, Wen-Mei; Ma, Jian; Chen, Deming; Wu, Xiao-Long; Heo, Yun (2014-05-15). "BLESS: Bloom filter-based error correction solution for high-throughput sequencing reads". Bioinformatics. 30 (10): 1354–1362. doi:10.1093/bioinformatics/btu030. ISSN 1367-4803. PMC 6365934. PMID 24451628.
  14. ^ Pellow, David; Filippova, Darya; Kingsford, Carl (2017-06-01). "Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters". Journal of Computational Biology. 24 (6): 547–557. doi:10.1089/cmb.2016.0155. ISSN 1066-5277. PMC 5467106. PMID 27828710.
  15. ^ Zhang, Zhaojun; Wang, Wei (2014-06-15). "RNA-Skim: a rapid method for RNA-Seq quantification at transcript level". Bioinformatics. 30 (12): i283–i292. doi:10.1093/bioinformatics/btu288. ISSN 1367-4803. PMC 4058932. PMID 24931995.