User:Ekspiulo/Sandbox

Abstract Three hormone receptors, the estrogen receptor (ER), the androgen receptor (AR), and the glucocorticoid receptor (GR), play an important role in regulating cellular differentiation and tissue development in skin, bone, the brain, and the endocrine system; there is therefore a strong scientific need to identify and characterize hormone receptor transcriptional regulation. The large volume of hormone receptor regulatory data produced by ChIP-based high-throughput experiments is scattered across disparate, poorly cross-indexed data stores, so a flexible platform for organizing and relating these data would provide significant value. To address this problem, we created a data management system called the Hormone Receptor Target Binding Loci, HRTBLDb <http://motif.bmi.ohio-state.edu/hrtbldb>. The database contains hormone receptor binding regions (binding loci) from in vivo ChIP-based high-throughput experiments, together with computationally predicted (in silico) binding motifs and cis-regulatory modules for the co-occurring transcription factor binding motifs within each binding locus. It also contains individual binding sites whose regulatory action has been verified by in vitro experiments. The current version contains 44,673 binding elements: 114 hormone response elements verified by in vitro experiments; 75 binding motifs that co-occur with a hormone response element and whose co-regulatory action is verified by in vitro experiments; 18,472 binding loci from in vivo experiments; and 26,012 computationally predicted binding motifs.

Introduction Hormone receptors (HRs) are steroid hormone-activated transcription factors that belong to the nuclear receptor superfamily, which consists of more than 60 members. Three hormone receptors, the estrogen receptor (ER), the androgen receptor (AR), and the glucocorticoid receptor (GR), play an important role in regulating cellular differentiation and tissue development in skin, bone, the brain, and the endocrine system (1). Many studies (2-4) have shown that all three HRs regulate gene expression underlying many significant phenotypes. Dysregulation of any of these hormone signaling pathways is linked to many different types of disease. For example, aberrant ER activity has been associated with endocrine diseases such as breast and prostate cancers, and the action of ERs is linked to the growth promotion and invasion of breast cancer cells (5). Dysregulation of androgen production can affect different organ systems, causing diseases such as androgen insensitivity syndromes, prostate cancer, hepatocellular carcinomas, acne, and male pattern alopecia (6,7). GR activation is often associated with inducing apoptosis in lymphocytes and lymphomas and, paradoxically, with inhibiting the apoptotic response to several cell stressors, including growth factor withdrawal (8), in breast cancer cell lines. HRs modulate the transcription of target genes upon hormone activation through one of three modes: a) direct binding, in which HRs bind directly to hormone response elements; b) indirect binding, in which HRs interact with other transcription factors (TFs) that themselves bind the DNA; or c) co-occurrent binding, in which HRs bind with lower affinity to DNA binding sequences and interact with other TFs through co-regulators. These modes of binding are illustrated at <http://motif.bmi.ohio-state.edu/hrtbldb/>.
ARs bind to the androgen response element 5'-GGAACAnnnTGTTCT-3' (9), ERs bind with the highest affinity to the estrogen response element consensus sequence 5'-AGGTCAnnnTGACCT-3' (10), and GRs bind to the glucocorticoid response element 5'-GTTACAnnnTGTTCT-3' (11). Because hormone receptors are key regulators of growth, differentiation, and metabolism in multiple organs, characterizing the transcriptional regulation of their target genes is important. Several high-throughput technologies, such as ChIP-chip (12,13), ChIP-seq (14,15), and ChIP-PET (16), are producing large amounts of genome-wide hormone receptor regulatory data in many different cell types and tissues (see Supplementary Table 1 for all literature, as well as (17)). Integrating these data into a single resource expedites cross-referencing and literature discovery and aids the validation of novel data. To that end, we have created a database called the Hormone Receptor Target Binding Loci, HRTBLDb. The database mainly contains in vivo binding loci identified by ChIP-based high-throughput technologies, together with computationally predicted binding motifs and cis-regulatory modules for those loci. It also includes binding elements defined experimentally in vitro. The data visualization covers the whole genome rather than only traditional promoter regions, and the data are accessed through simple yet expressive filtering.
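The three consensus sequences quoted above can be turned into a simple sequence scan. The following is a minimal sketch only: it performs exact matching of the consensus half-sites on one strand, with 'n' standing for any nucleotide, and it is not the motif prediction method used by ChIPMotifs. The function names are hypothetical and chosen for illustration.

```python
import re

# Consensus sequences exactly as quoted in the text; 'n' means any nucleotide.
CONSENSUS = {
    "AR": "GGAACAnnnTGTTCT",
    "ER": "AGGTCAnnnTGACCT",
    "GR": "GTTACAnnnTGTTCT",
}


def consensus_to_regex(consensus):
    """Compile a consensus string into a regex, mapping each 'n' to [ACGT]."""
    pattern = "".join("[ACGT]" if base == "n" else base for base in consensus)
    return re.compile(pattern)


def find_response_elements(sequence):
    """Return (receptor, start, matched_sequence) tuples for exact consensus
    matches on the given strand, sorted by 0-based start position."""
    sequence = sequence.upper()
    hits = []
    for receptor, consensus in CONSENSUS.items():
        for match in consensus_to_regex(consensus).finditer(sequence):
            hits.append((receptor, match.start(), match.group()))
    return sorted(hits, key=lambda hit: hit[1])
```

Real binding sites often deviate from the consensus, so an exact scan like this one will miss many functional elements; it is meant only to make the notation concrete.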

Database Content The current version of HRTBLDb contains two types of data: a) in vitro binding elements, which were experimentally verified in vitro to bind one of the three hormone receptors using either EMSA or ChIP assays; and b) in vivo binding loci identified from ChIP-based high-throughput data using our peak detection programs, fdrPeak (18) for ChIP-chip data and WGWPeak (19) for ChIP-seq data, together with the binding motifs within those loci predicted by our ChIPMotifs program (20) and the cis-regulatory modules identified by our ChIPModules program (21). The database contains the nucleotide sequence for every indexed binding site. For experimentally verified binding sites, the sequence and coordinates given in the paper are used when possible. When a paper does not give the sequence, the coordinates given in the paper were used to obtain the sequence from the genome assembly specified in the paper. For predicted binding sites, the coordinates from the paper were first converted from the assembly used in the paper to the assembly used in the database for that species, and the predictive algorithm then took the sequence from the database assembly. For simplicity, the database contains annotation for only one genome assembly per species. The human assembly is HG18; human data were taken from papers that used either the HG18 or the HG17 assembly, and HG17 data were converted to HG18 using UCSC's liftOver program. Mouse data are in the MM8 assembly, and rat data are in the RN4 assembly. HRTBLDb contains AR, ER, and GR in vitro data for all three species; however, all of the in vivo data are human. The total number of data entries and sources for each hormone receptor is shown in Figure 1.

Database Implementation The database is designed to map closely onto the real objects that it represents (see Supplementary Materials and <http://motif.bmi.ohio-state.edu/hrtbldb/about#schema>). One benefit is that entries in the database express the natural relationships between their real-world counterparts, which makes complex queries simpler and more intuitive. Because the database stores both in vivo ChIP-based data with computationally predicted binding motifs and in vitro experimentally verified individual binding motifs, the data were split into two separate entities: binding loci and binding motifs. ChIP-based experiments produce many larger regions on the chromosome without fine interior detail, aside from predicted binding sites; these regions are therefore stored in a separate table, Loci. Both the predicted binding motifs within binding loci and the experimentally verified binding motifs are stored in the Binding Sites table. Each locus and binding site is also associated with an entry in the Experiments table, which contains the meta-data common to all data from a given source: reference information for the scientific paper, cell line, hormone receptor, etc. The database supports the separation of all data by species; however, for speed and efficiency, no meta-data on the species is stored in the database. Binding loci have a one-to-many relationship with binding sites: a locus may contain multiple binding sites, but a binding site belongs to at most one locus. Overlapping loci from different experiments may share common predicted binding sites; however, these are represented as separate entities in the database. Each binding site is a collection of information about a single predicted or experimentally verified transcription factor binding site: its coordinates, its name, and its exact sequence.
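The relationships described above (Experiments, Loci, and Binding Sites, with each binding site belonging to at most one locus) can be sketched as follows. This is an illustration under stated assumptions, not the published schema: all column names are hypothetical, the rows are made-up placeholders rather than database records, and Python's built-in sqlite3 module stands in for the MySQL server so that the sketch runs without any setup.

```python
import sqlite3

# Hypothetical column names; the real schema is documented on the project site.
SCHEMA = """
CREATE TABLE experiments (
    id          INTEGER PRIMARY KEY,
    publication TEXT NOT NULL,
    cell_line   TEXT,
    receptor    TEXT NOT NULL,   -- 'AR', 'ER', or 'GR'
    method      TEXT NOT NULL    -- e.g. 'ChIP-chip', 'ChIP-seq', 'EMSA'
);
CREATE TABLE loci (
    id            INTEGER PRIMARY KEY,
    experiment_id INTEGER NOT NULL REFERENCES experiments(id),
    chrom         TEXT NOT NULL,
    start_pos     INTEGER NOT NULL,
    end_pos       INTEGER NOT NULL
);
CREATE TABLE binding_sites (
    id            INTEGER PRIMARY KEY,
    experiment_id INTEGER NOT NULL REFERENCES experiments(id),
    locus_id      INTEGER REFERENCES loci(id),  -- NULL when the site is in no locus
    name          TEXT NOT NULL,
    chrom         TEXT NOT NULL,
    start_pos     INTEGER NOT NULL,
    end_pos       INTEGER NOT NULL,
    sequence      TEXT NOT NULL
);
"""


def build_demo():
    """Create an in-memory database holding a few placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO experiments VALUES (?, ?, ?, ?, ?)",
        [
            (1, "placeholder in vivo paper", "placeholder cell line", "AR", "ChIP-chip"),
            (2, "placeholder in vitro paper", "placeholder cell line", "AR", "EMSA"),
        ],
    )
    conn.execute("INSERT INTO loci VALUES (1, 1, 'chr19', 1000, 2000)")
    conn.executemany(
        "INSERT INTO binding_sites VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        [
            # Two predicted sites inside locus 1 (one-to-many relationship).
            (1, 1, 1, "predicted ARE", "chr19", 1100, 1115, "GGAACATTTTGTTCT"),
            (2, 1, 1, "predicted co-factor motif", "chr19", 1500, 1508, "ACGTACGT"),
            # An in vitro verified site that belongs to no locus.
            (3, 2, None, "verified ARE", "chr19", 1100, 1115, "GGAACATTTTGTTCT"),
        ],
    )
    return conn


def sites_per_locus(conn):
    """Return (locus_id, number_of_binding_sites) for every locus."""
    return conn.execute(
        """
        SELECT l.id, COUNT(b.id)
        FROM loci AS l
        LEFT JOIN binding_sites AS b ON b.locus_id = l.id
        GROUP BY l.id
        ORDER BY l.id
        """
    ).fetchall()
```

The nullable `locus_id` column is one way to express "at most one locus": a predicted motif points at its locus, while an in vitro verified site can stand alone. Two overlapping loci that share a predicted site would each get their own row, matching the statement that such sites are stored as separate entities.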
Loci and binding sites are each associated with a single “nearest gene,” the gene whose 5' or 3' end is nearest to the mid-point of that locus or site. Based on its proximity to this gene, each element is labeled with a region, a short description of its location relative to the gene, such as “within gene”, “5’ promoter”, “3’ distal”, etc. For details of the distances from the gene that each region represents and how they are calculated, see Supplementary Materials and our website <http://motif.bmi.ohio-state.edu/hrtbldb/about#regions>. The publications used as data sources are recorded in the Experiments table; however, there is a one-to-many mapping between source publications and experiment entries. Each experiment entry represents a combination of a publication, a species, a cell line, and an experimental method. If one publication has data sets for which these characteristics differ, there is one experiment entry for each data set. The database is implemented in MySQL using the MyISAM engine. All of the data are available for download as either a MySQL dump or a collection of tab-separated text files, along with the database schema and the queries used to implement the schema, at <http://motif.bmi.ohio-state.edu/hrtbldb/downloads> (once the manuscript is accepted).
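The nearest-gene rule stated above can be sketched in a few lines. This is an illustration under assumptions: the `Gene` record and the function name are hypothetical, the gene coordinates in the usage note are invented, and the region labels are deliberately left out because their distance cut-offs are defined only in the Supplementary Materials.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    symbol: str
    chrom: str
    start: int  # lower genomic coordinate of the gene
    end: int    # higher genomic coordinate of the gene


def nearest_gene(chrom, element_start, element_end, genes):
    """Return (gene, distance) for the gene whose 5' or 3' end lies closest
    to the mid-point of the element, or (None, None) if the chromosome has
    no genes in the list.

    Both gene ends are considered, so the strand does not affect the distance.
    """
    midpoint = (element_start + element_end) // 2
    best, best_distance = None, None
    for gene in genes:
        if gene.chrom != chrom:
            continue
        distance = min(abs(midpoint - gene.start), abs(midpoint - gene.end))
        if best_distance is None or distance < best_distance:
            best, best_distance = gene, distance
    return best, best_distance
```

For an element at chr1:4000-4400 (mid-point 4200) and two invented genes spanning 1000-2000 and 5000-9000, the second gene is selected because its nearer end is 800 bp away, against 2200 bp for the first.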

Web interface and data visualization We have developed a data visualization tool, the Portable Epigenetic Regulator Framework (PEGRF, Figure 2), a comprehensive, highly modular application for storing, searching, and graphically displaying genomic and epigenomic data over the internet through a web browser. The graphical rendering is done by the PHP Light-weight Regulator Visualization Tool (PLRVT), a PEGRF module that we developed in tandem. Both are implemented entirely in the PHP scripting language and run on a web server. The primary benefit of this approach is that the application is both operating system and database agnostic. As a result, PEGRF is highly portable across a wide variety of systems and is easily integrated into the existing web presence of many organizations (Supplementary Materials). There are multiple ways to search the data in HRTBLDb. The simplest, 'Basic Search', is similar to the search of the UCSC Genome Browser: a genomic location in the format chrX:0000-9999 is entered to view all data in the database within that region. Basic Search also supports the display of all data near the transcription start site of a gene, searched for by its gene symbol, RefSeq ID, or GenBank Gene ID. For example, the search result for the region chr19:56047091-56052091 shows that the KLK3 (PSA) gene has both an in vitro validated androgen response element and a predicted androgen response element from in vivo ChIP-chip AR binding loci at the same location. The more complex search, 'Advanced Search', filters the data in the database by various criteria. The data are filtered by species and experimental type in all searches. The experimental types are divided first into 'in vivo' and 'in vitro' techniques. Within in vivo, one can further filter the results by the type of in vivo technique used.
The results may then be further filtered by the cell line used in the experiment and/or the hormone receptor whose binding was measured. A 'Basic Search' takes the user directly to a visualization of the target region of the genome and all of the data in the database within that region, whereas an 'Advanced Search' shows a list of all the binding elements that meet the specified criteria, which may be browsed and then selected for visualization of the target element and all elements near it. The data visualization shows a region on a chromosome in a single genome assembly. All of the data in the database for that region are shown graphically to convey their relative spatial relationships, and the details of each element are tabulated below the graphic for further examination or export. This visualization is the only mechanism for displaying the binding elements spatially near a given element on the chromosome. Each data source is given a separate track in the visualization, and its data elements are described in detail in a source-specific table below the visualization, in the order in which they are displayed. The visualization also provides a way to browse the genome for target data by moving the displayed region up or down the chromosome and zooming in and out to see more or less data (Figure 3).
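The Basic Search input format described in this section (chrX:0000-9999) can be parsed and applied as a filter along the following lines. This is a minimal sketch, not the PEGRF implementation, which is written in PHP: the function names and the tuple layout are hypothetical, and treating "within that region" as "overlapping the region" is an assumption.

```python
import re

# Matches locations such as "chr19:56047091-56052091".
REGION_RE = re.compile(r"^(chr[0-9A-Za-z_]+):(\d+)-(\d+)$")


def parse_region(text):
    """Parse 'chrom:start-end' into (chrom, start, end); commas are ignored."""
    match = REGION_RE.match(text.strip().replace(",", ""))
    if match is None:
        raise ValueError(f"not a genomic region: {text!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError(f"region start exceeds end: {text!r}")
    return chrom, start, end


def elements_in_region(elements, region):
    """Keep the (chrom, start, end, label) tuples that overlap the region.

    Assumption: an element is shown if any part of it falls inside the region.
    """
    chrom, start, end = region
    return [e for e in elements if e[0] == chrom and e[1] <= end and e[2] >= start]
```

With the KLK3 example from the text, `parse_region("chr19:56047091-56052091")` yields the chromosome and the two coordinates, which can then be passed to `elements_in_region` together with any list of elements.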

Discussion HRTBLDb is a comprehensive data management system for the aggregation, discovery, and analysis of hormone receptor binding sites in mammalian genomes, including human, mouse, and rat. Its primary purpose is to allow other researchers to build upon the information generated by high-throughput experiments. The current version contains 44,673 binding elements: 114 hormone response elements verified by in vitro experiments; 75 binding motifs that co-occur with a hormone response element and whose co-regulatory action is verified by in vitro experiments; 18,472 binding loci from in vivo experiments; and 26,012 computationally predicted binding motifs. The first of several distinguishing features is that HRTBLDb covers HR regulatory information on a genome-wide scale and classifies the binding sites by their proximity to genes from RefGene databases, such as HG18 RefSeq, MM8 RefSeq, and RN4 RefSeq <http://genome.ucsc.edu/>. This feature is important because it accommodates the conclusion of the ENCODE consortium (22) that transcription factor binding sites may be located anywhere in the human genome (see the Supplementary Materials and <http://motif.bmi.ohio-state.edu/hrtbldb/about#regions> for the classification scheme used). The second feature is that we collected not only the binding loci identified from in vivo high-throughput data but also computationally predicted binding motifs to which the TFs could bind, as well as cis-regulatory modules for co-occurrent binding with other TFs. The third feature is that, to help verify the regulatory mechanisms at work on several genes in our database, we included direct and indirect binding patterns from in vitro EMSA and ChIP assay experiments. These patterns show the regulatory action, along with the reported motifs where available and the publication references.
For example, cathepsin D (CTSD) is reported to be regulated through a direct-binding estrogen response element in (23) and through an androgen response element with SP1 indirect binding in (24); these data were therefore added to our database for comparison with the in vivo binding data. In summary, we hope that the abundance of data in HRTBLDb will contribute to the ongoing investigation into HR gene regulation and its mechanisms. Although we have migrated all the data from our previous ERTargetDB (Jin et al., 2005) into the new database, there are significant differences between the two databases: 1) HRTBLDb focuses on genome-wide binding sites and visualization, whereas ERTargetDB contained only promoter regions; 2) the new database contains data for three major hormone receptors (estrogen receptor, androgen receptor, and glucocorticoid receptor), including binding loci from in vivo ChIP-based high-throughput data and in vitro experimentally validated binding motifs from EMSA and ChIP assays; 3) it displays all possible binding sites in the queried region, across the different types of experiment, HRs, and cell lines; 4) the visualization is done by a modular binding element visualization tool that we developed in tandem, the PHP Light-weight Regulator Visualization Tool (PLRVT), whereas ERTargetDB uses the GDVTK tool written in Java; 5) the internal database schemas also differ, in that HRTBLDb is designed to be generic and to store many different forms of DNA binding elements for present and future research.
HRTBLDb also differs from other similar publicly available databases, such as MRbase and NuReBase (25). Both contain the chemical and biological properties of hormones and their receptors and cover all properties of nuclear receptors, but their data are collected manually, so they include a limited amount of information. In contrast, our database focuses on a subset of hormone receptors and their targets and uses computational processing to predict additional binding motifs and transcriptional regulatory modules. These predictions can help direct future research toward promising targets. The software used to make these predictions has been published in Genome Research: ChIPMotifs (20) for identifying binding motifs and ChIPModules (21) for identifying cis-regulatory modules. HRTBLDb also differs from ERGDB (26) and KBERG (27), both developed by one group; those databases focus only on ER-responsive genes, functional classifications, and gene expression data. We believe that ChIP-based data, which identify direct targets, can help determine transcriptional regulatory networks, and we created HRTBLDb for that reason. As part of the Integrative Cancer Biology Program of the National Cancer Institute (28), we have developed a ChIP-seq based computational method to characterize the epigenetic mechanisms involved in the target genes of the ER pathway (29). The resulting ER target binding loci have been deposited in this database. In the future, we will add epigenetic regulatory data, such as DNA methylation and histone modification patterns across the loci for these HRs. We also plan to add data for other hormone receptors, such as the progesterone receptor, in the near future.