BioMart is a community-driven project to provide a single point of access to distributed research data. The BioMart project contributes open source software and data services to the international scientific community. Although the BioMart software is primarily used by the biomedical research community, it is designed in such a way that any type of data can be incorporated into the BioMart framework. The BioMart project originated at the European Bioinformatics Institute as a data management solution[1] for the Human Genome Project.[2] Since then, BioMart has grown to become a multi-institute collaboration involving various database projects on five continents.[3][4][5][6]

Written in: Java
Operating system: Unix-like
Available in: English
Type: Federated database system
License: LGPL
Website: useast.ensembl.org/info/data/biomart/index.html

Integration with Ensembl

BioMart allows researchers and bioinformaticians to export data from Ensembl, including gene IDs, gene positions, associated variations, protein domains, and sequences. The data can be exported in convenient file formats such as FASTA, XLS, CSV, TSV, and HTML. Researchers can use the exported data in a variety of applications, including genomic studies, gene expression analysis, and comparative genomics. BioMart's interface lets users customize queries to access specific datasets or features of interest.[7]

Software

BioMart is a freely available, open-source, federated database system that provides unified access to disparate, geographically distributed data sources.[8] BioMart allows databases hosted on different servers to be presented seamlessly to users, facilitating collaborative projects. BioMart contains several levels of query optimization to efficiently manage large data sets, and offers a diverse selection of graphical user interfaces and application programming interfaces to allow queries to be performed in whatever manner is most convenient for the user. BioMart's capabilities are extended by integration with several widely used software packages such as Bioconductor,[9] Galaxy,[10] Cytoscape,[11] and Taverna.[12]

Data sources and community

There are around 40 BioMart data sources, including the Atlas of UTR Regulatory Activity (AURA), the COSMIC cancer database, Ensembl Genomes, HapMap, InterPro, Mouse Genome Informatics (MGI), Rfam and UniProt. Access is provided by institutions including the European Bioinformatics Institute (EBI) and the Wellcome Trust Sanger Institute in the UK, Cold Spring Harbor Laboratory and the National Center for Biotechnology Information (NCBI) in the United States, and the French National Centre for Scientific Research (CNRS).[13] The BioMart Central Portal was established to provide a convenient single point of access to this growing pool of data sources.[3][5][6]

Bulk data retrieval

The BioMart tool is designed to save users time by querying data for multiple genes or variants at once. It returns a large amount of focused data in a single query, which makes it useful for data mining, although it is recommended to limit queries to fewer than 500 inputs at a time.[14] Bioinformatics researchers often need to run many queries against multiple data sources. Using BioMart saves a significant amount of time compared with manually entering each search into multiple tools.[15]

These queries can be performed through the web interface, without requiring any programming experience, or programmatically, through tools such as biomaRt, an R interface to BioMart. Common uses include ID conversions, retrieval of gene locations, and downloading sequences.[14]
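When a list of inputs exceeds the recommended limit, one option for programmatic use is to split it into batches and submit one query per batch. The following Python sketch illustrates the idea. The function name `batch_ids`, the default batch size of 500, and the placeholder identifiers are assumptions made for illustration and are not part of BioMart itself.

```python
from typing import Iterator, List, Sequence


def batch_ids(ids: Sequence[str], batch_size: int = 500) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each holding at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(ids), batch_size):
        yield list(ids[start:start + batch_size])


if __name__ == "__main__":
    # Placeholder identifiers, made up for illustration only.
    example_ids = [f"GENE{i}" for i in range(1200)]
    for number, batch in enumerate(batch_ids(example_ids), start=1):
        print(f"batch {number}: {len(batch)} IDs")
```

Each batch can then be used as the input list of a separate query, and the per-batch results concatenated afterwards.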

There are four main steps to using BioMart through Ensembl, as listed below.

  1. Select a dataset
    • Select which database is to be used, as well as which species is the target.
  2. Narrow the dataset with filters
    • Narrow down the dataset by applying filters based on region, a list of IDs, function, phenotypes, and more.
    • Multiple filters can be set; all conditions must be satisfied for a record to be returned.
  3. Select the attributes to be returned
    • Specify which category of data to query and the specific attributes desired.
    • Categories: Features, Structures, Homologues, Variant (Germline), Variant (Somatic), Sequences.
  4. Retrieve the results
    • Clicking the results button displays a preview of the results and provides download options.
    • Available download formats depend on the selected attributes and include HTML, TSV, CSV, XLS, and FASTA.
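The same dataset, filter, and attribute choices can also be written down as an XML query document of the kind accepted by BioMart web services. The Python sketch below builds such a document with the standard library. The helper name `build_query`, the XML attribute values, and the example dataset, filter, and attribute names (`hsapiens_gene_ensembl`, `chromosome_name`, `ensembl_gene_id`, and so on) are assumptions chosen for illustration; the names valid on a given server should be checked against that server.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def build_query(dataset: str, filters: Dict[str, str], attributes: List[str]) -> str:
    """Return a BioMart-style XML query string for a single dataset."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",   # assumed output format
        "header": "1",        # ask for a header row
        "uniqueRows": "1",    # drop duplicate rows
    })
    # Step 1: select the dataset.
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Step 2: add filters; a record must satisfy all of them.
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    # Step 3: add attributes; their order is the order of the result columns.
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


if __name__ == "__main__":
    # Example names are assumptions, not taken from the article.
    print(build_query(
        "hsapiens_gene_ensembl",
        {"chromosome_name": "13"},
        ["ensembl_gene_id", "external_gene_name", "start_position", "end_position"],
    ))
```

Step 4, retrieving the results, would consist of submitting this string to a BioMart server; that step is left out here so that the sketch runs without network access.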

The associated video, Figure 1, demonstrates an example of navigating through these steps.

 
Figure 1: Demonstration of performing a BioMart query, as of April 2024

Tips[14]

  • To make subsequent queries from a different attribute category, select the Attributes tab directly; this skips steps 1 and 2.
  • When selecting attributes, the order of selection determines the order of the columns in the results table.
  • Sequence output is always reported in FASTA format.
  • BioMart is available for vertebrates, plants, fungi, metazoa, and protists.

Gene ID conversion

BioMart provides gene ID conversion functionality. With a single BioMart query, users can collect many different IDs and other identifiers for hundreds of genes simultaneously, saving significant time compared with other tools.[16] When making a new BioMart query, users can filter on external reference IDs.[17] With this filtering option, users can input external identifiers for genes, including IDs from the databases of other organizations, gene names, and more. To decide which results are displayed, users select attributes corresponding to their desired identifiers. Attributes include IDs from many Ensembl databases, IDs from external databases such as those of NCBI, and many other identifiers.[17] Once the desired attributes are selected, the results of the BioMart query display the corresponding identifiers for the input gene reference IDs.[16]

Guide

  1. Selecting a database and a dataset
    • Begin a new BioMart query, and click the “Dataset” link in the column on the left side of the page. Here select the desired database and dataset from their dropdowns.
  2. Inputting gene reference IDs
    • Click the “Filters” link in the column on the left side of the page.
    • Expand the “GENE” section.
    • Select the option to input external reference IDs.
    • Select the type of reference input from the dropdown.
    • Input your references either directly in the provided textbox, or upload a file. The recommended limit is 500.[17]
  3. Selecting attributes
    • Click on the “Attributes” link in the column on the left side of the page.
    • Expand sections, such as “GENE” and “EXTERNAL”, and select the desired identifiers.
  4. Obtaining results
    • Click the “Results” button above the column on the left side of the page.
    • The corresponding identifiers found for the input gene references will now be displayed.
    • The results can be displayed as HTML, TSV, or CSV.
    • The results can be exported as HTML, TSV, CSV, or XLS. They can also be compressed and exported, or compressed and emailed.
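Once a conversion result has been exported as TSV, it can be loaded into a lookup table for further processing. The Python sketch below shows one way to do this. The column headers and the two sample rows were typed in by hand for illustration and are not retrieved BioMart output; the headers in a real export depend on the attributes selected, and the function name `ids_by_gene_name` is hypothetical.

```python
import csv
import io
from typing import Dict, List

# Hand-written sample in TSV layout; headers and rows are illustrative assumptions.
SAMPLE_TSV = (
    "Gene name\tGene stable ID\tNCBI gene ID\n"
    "BRCA2\tENSG00000139618\t675\n"
    "TP53\tENSG00000141510\t7157\n"
)


def ids_by_gene_name(tsv_text: str, key_column: str = "Gene name") -> Dict[str, List[Dict[str, str]]]:
    """Group the rows of a TSV export by the value in `key_column`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    table: Dict[str, List[Dict[str, str]]] = {}
    for row in reader:
        key = row[key_column]
        # Keep every other column as the identifiers found for this input.
        others = {col: val for col, val in row.items() if col != key_column}
        table.setdefault(key, []).append(others)
    return table


if __name__ == "__main__":
    for gene, matches in ids_by_gene_name(SAMPLE_TSV).items():
        print(gene, matches)
```

Each key maps to a list rather than a single record because one input reference may correspond to more than one row in the results.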

Below is a video that provides a demonstration of gene ID conversion with BioMart.

 
Figure 2: Demonstration of gene ID conversion with BioMart, as of April 2024. Uses gene names to obtain Ensembl and NCBI gene IDs.

Combining species datasets

The BioMart interface on Ensembl can combine multiple datasets so that a user sees the combined results of all selected datasets. It simplifies data integration by providing a single interface for querying and combining data from various biological data sources, and it saves researchers considerable time and effort by automatically transforming data into standardized formats. BioMart can also apply filters to each dataset and offers options to customize queries. Without BioMart, researchers would have to gather data manually from multiple sources, integrate it, and develop a way to process the combined data. The following step-by-step guide describes the process.

  1. Navigate to BioMart:
    • Start by visiting the Ensembl website and locating the BioMart tab. Click on the BioMart tab to access the BioMart interface.
  2. Select first dataset:
    • Once on the BioMart page, you'll see options to select datasets. Click on the "Dataset" button to choose your first dataset of interest. This allows you to select the database and dataset you want to examine.
  3. Apply Filters and Attributes:
    • After selecting your desired database and dataset, additional options will appear on the left-hand side of the page. Here, you can apply filters and select attributes you'd like to examine within your chosen dataset.
  4. Select Additional Dataset:
    • Next, scroll down to the bottom of the page and click on the "Dataset" button again. This time, choose the dataset you want to combine with your initially selected dataset. You'll have the opportunity to apply filters and select attributes for this new dataset as well.
  5. View Combined Results:
    • Once you've selected both datasets, you'll see the combined results on the page. You can explore the data and ensure that it meets your requirements.
  6. Export Results:
    • In the results section, you'll find various options for exporting your data. You can export your results to a file format of your choice, such as HTML, CSV, TSV, or XLS. Select the desired export format and destination, then click on the "Go" button to export your data.
 
Figure 3: Combining multiple datasets with BioMart


By following these steps (Figure 3), you'll be able to combine multiple species datasets using BioMart in Ensembl and export the combined results for further analysis.[18]

Displaying introns and exons

 
Figure 6: An example of viewing the exons and introns for the chicken MAN2A2 transcript ENSGALT00010000068.1

Users can view translated sequences, flanking sequences, introns, exons, and untranslated regions (UTRs) of genes in BioMart.[19] Sequences can be tailored to the use case, for example by including introns and UTRs when studying homology or building phylogenetic trees. The table output also includes start and end positions and other helpful information. See Figure 6 for the screens used in this process and the output.

To display introns and exons for a transcript[20]

  1. In the "Results" section, click on a Transcript ID to access the desired transcript
    • Alternatively, if you're already viewing a gene, select a Transcript ID from there
  2. Click "Exons" on the left side of the page, under "Sequence"
    • To see introns as well, click "Configure this page" and enable the option "Show full intronic sequence:"
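Because the exon table reports start and end positions, intron coordinates can also be derived from it directly as the gaps between consecutive exons. The Python sketch below illustrates the calculation. It assumes 1-based, inclusive coordinates and non-overlapping exons belonging to a single transcript; the function name `introns_from_exons` and the coordinates in the example are placeholders, not values for any real transcript.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), assumed 1-based and inclusive


def introns_from_exons(exons: List[Interval]) -> List[Interval]:
    """Return the gaps between consecutive exons, in ascending genomic order."""
    ordered = sorted(exons)
    introns: List[Interval] = []
    # Pair each exon with the next one and record the gap between them, if any.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


if __name__ == "__main__":
    # Placeholder exon coordinates, made up for illustration.
    example_exons = [(100, 200), (301, 400), (550, 600)]
    print(introns_from_exons(example_exons))
```

For the placeholder exons above, the gaps are positions 201 to 300 and 401 to 549. Sorting first means the input can be given in transcript order for genes on either strand.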

Exporting sequences

 
Figure 7: An example of exporting the coding sequence of the chicken MAN2A2 gene in FASTA format

When viewing genes and transcripts, users can export a sequence in FASTA format, with multiple options for what to include. The user can choose to include introns, exons, coding sequences, and/or untranslated regions, among other options.[21] This gives control over which parts of a sequence are exported, depending on what is being studied. The FASTA output can then be used in other tools that accept that format, such as sequence alignment tools, multiple sequence alignment tools, or phylogenetic tree building software.

To export the coding sequence[20]

  1. Navigate to the gene or transcript of interest and select the "Export Data" button on the left side of the page.
  2. Select or deselect options as desired for FASTA output.
  3. Click the "Next >" button.
  4. Select the desired output format from the available options (HTML, Text, or Compressed text).
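An exported FASTA file is plain text in which each record begins with a header line starting with ">" followed by one or more lines of sequence. The Python sketch below reads such text into a dictionary before the sequences are passed to another tool. The function name `parse_fasta` and the sample record are illustrative assumptions; established libraries offer more complete parsers.

```python
from typing import Dict, List


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each FASTA header (without the leading '>') to its joined sequence."""
    records: Dict[str, List[str]] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines are wrapped, so collect them and join at the end.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


if __name__ == "__main__":
    # Made-up record for illustration; not a real exported sequence.
    sample = ">example_transcript cds\nATGGCC\nTTTTAA\n"
    for name, sequence in parse_fasta(sample).items():
        print(name, len(sequence), sequence)
```

If two records share the same header, the later one replaces the earlier one in this sketch, so headers are assumed to be unique.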

References

  1. Kasprzyk A, Keefe D, Smedley D, London D, Spooner W, Melsopp C, et al. (January 2004). "EnsMart: a generic system for fast and flexible access to biological data". Genome Research. 14 (1): 160–169. doi:10.1101/gr.1645104. PMC 314293. PMID 14707178.
  2. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. (February 2001). "Initial sequencing and analysis of the human genome". Nature. 409 (6822): 860–921. Bibcode:2001Natur.409..860L. doi:10.1038/35057062. PMID 11237011.
  3. Smedley D, Haider S, Ballester B, Holland R, London D, Thorisson G, et al. (January 2009). "BioMart--biological queries made easy". BMC Genomics. 10: 22. doi:10.1186/1471-2164-10-22. PMC 2649164. PMID 19144180.
  4. Kasprzyk A (2011). "BioMart: driving a paradigm change in biological data management". Database. 2011: bar049. doi:10.1093/database/bar049. PMC 3215098. PMID 22083790.
  5. Haider S, Ballester B, Smedley D, Zhang J, Rice P, Kasprzyk A (July 2009). "BioMart Central Portal--unified access to biological data". Nucleic Acids Research. 37 (Web Server issue): W23–W27. doi:10.1093/nar/gkp265. PMC 2703988. PMID 19420058.
  6. Guberman JM, Ai J, Arnaiz O, Baran J, Blake A, Baldock R, et al. (2011). "BioMart Central Portal: an open database network for the biological community". Database. 2011: bar041. doi:10.1093/database/bar041. PMC 3263598. PMID 21930507.
  7. An Introduction to BioMart. Retrieved 2024-04-18 – via www.youtube.com.
  8. Zhang J, Haider S, Baran J, Cros A, Guberman JM, Hsu J, et al. (2011). "BioMart: a data federation framework for large collaborative projects". Database. 2011: bar038. doi:10.1093/database/bar038. PMC 3175789. PMID 21930506.
  9. Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A, et al. (August 2005). "BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis". Bioinformatics. 21 (16): 3439–3440. doi:10.1093/bioinformatics/bti525. PMID 16082012.
  10. Liu B, Madduri RK, Sotomayor B, Chard K, Lacinski L, Dave UJ, et al. (June 2014). "Cloud-based bioinformatics workflow platform for large-scale next-generation sequencing analyses". Journal of Biomedical Informatics. 49: 119–133. doi:10.1016/j.jbi.2014.01.005. PMC 4203338. PMID 24462600.
  11. Lopes CT, Franz M, Kazi F, Donaldson SL, Morris Q, Bader GD (September 2010). "Cytoscape Web: an interactive web-based network browser". Bioinformatics. 26 (18): 2347–2348. doi:10.1093/bioinformatics/btq430. PMC 2935447. PMID 20656902.
  12. Wolstencroft K, Haines R, Fellows D, Williams A, Withers D, Owen S, et al. (July 2013). "The Taverna workflow suite: designing and executing workflows of Web Services on the desktop, web or in the cloud". Nucleic Acids Research. 41 (Web Server issue): W557–W561. doi:10.1093/nar/gkt328. PMC 3692062. PMID 23640334.
  13. "BioMart". www.biomart.org. Retrieved 14 July 2016.
  14. Virtual Workshop - The Ensembl Genome Browser - (2021): Webinar 6 - BioMart and Data Visualisation. Retrieved 2024-04-16 – via www.youtube.com.
  15. Smedley D, Haider S, Ballester B, Holland R, London D, Thorisson G, et al. (January 2009). "BioMart--biological queries made easy". BMC Genomics. 10 (1): 22. doi:10.1186/1471-2164-10-22. PMC 2649164. PMID 19144180.
  16. Joshua B, Francis B, David K, Steven H (July 2022). "GeneToList: A Web Application to Assist with Gene Identifiers for the Non-Bioinformatics-Savvy Scientist". Biology (Basel). 11 (8): 1113. doi:10.3390/biology11081113. PMC 9332626. PMID 35892968.
  17. Clip: Gene ID conversion with BioMart. Retrieved 2024-04-16 – via www.youtube.com.
  18. "Combining multiple species datasets". www.ensembl.org. 2024-04-18.
  19. Clip: Exons and Introns. Retrieved 2024-04-17 – via www.youtube.com.
  20. "Tutorials". ensembl.org. Retrieved 2024-04-20.
  21. Clip: Export Sequence. Retrieved 2024-04-17 – via www.youtube.com.

External links