Database and Software Names Bibliography

	Bioconductor - aCGH. [ bib \| http ]
	Bioconductor - affy. [ bib \| http ]
	Bioconductor - graph. [ bib \| http ]
	Bioconductor - ROC. [ bib \| http ]
	Free Phylogenetic Network Software. [ bib \| http ]
	Home - PubMed - NCBI. [ bib \| http ]
	Medical Subject Headings - Home Page. [ bib \| http ]
	MySQL :: The world's most popular open source database. [ bib \| http ]
	Q - Analysis Software for Market Research. [ bib \| http ]
	software based on libsequence. [ bib \| http ]
	Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, David J Lipman, et al. Basic local alignment search tool. Journal of molecular biology, 215(3):403-410, 1990. [ bib ]
	M Ashburner, C A Ball, J A Blake, D Botstein, H Butler, J M Cherry, A P Davis, K Dolinski, S S Dwight, J T Eppig, M A Harris, D P Hill, L Issel-Tarver, A Kasarskis, S Lewis, J C Matese, J E Richardson, M Ringwald, G M Rubin, and G Sherlock. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics, 25(1):25-9, May 2000. [ bib \| http ] Keywords: Animals,Computer Communication Networks,Databases, Factual,Eukaryotic Cells,Eukaryotic Cells: physiology,Genes,Humans,Metaphysics,Mice,Molecular Biology,Molecular Biology: trends,Sequence Analysis, DNA,Terminology as Topic
	A Bairoch and R Apweiler. The SWISS-PROT protein sequence data bank and its new supplement TREMBL. Nucleic acids research, 24(1):21-5, January 1996. [ bib \| http ] SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc), a minimal level of redundancy and a high level of integration with other databases. Recent developments of the database include: an increase in the number and scope of model organisms; cross-references to seven additional databases; a variety of new documentation files; the creation of TREMBL, an unannotated supplement to SWISS-PROT. This supplement consists of entries in SWISS-PROT-like format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except CDS already included in SWISS-PROT. Keywords: Amino Acid Sequence,Animals,CD-ROM,Computer Communication Networks,Databases, Factual,Genome, Bacterial,Genome, Fungal,Genome, Plant,Humans,Protein Processing, Post-Translational,Proteins,Proteins: chemistry,Proteins: metabolism,Systems Integration
	A Bairoch and B Boeckmann. The SWISS-PROT protein sequence data bank: current status. Nucleic acids research, 22(17):3578-80, September 1994. [ bib \| http ] SWISS-PROT is an annotated protein sequence database established in 1986 and maintained collaboratively, since 1988, by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library. The SWISS-PROT protein sequence data bank consists of sequence entries. Sequence entries are composed of different line types, each with their own format. For standardization purposes the format of SWISS-PROT follows as closely as possible that of the EMBL Nucleotide Sequence Database. A sample SWISS-PROT entry is shown in Figure 1. Keywords: Amino Acid Sequence,Animals,Computer Communication Networks,Databases, Factual,Genetic Diseases, Inborn,Genetic Diseases, Inborn: genetics,Humans,Molecular Sequence Data,Polymorphism, Genetic,Proteins,Proteins: chemistry,Proteins: genetics
	Tanya Barrett, Dennis B Troup, Stephen E Wilhite, Pierre Ledoux, Carlos Evangelista, Irene F Kim, Maxim Tomashevsky, Kimberly A Marshall, Katherine H Phillippy, Patti M Sherman, Rolf N Muertter, Michelle Holko, Oluwabukunmi Ayanbule, Andrey Yefanov, and Alexandra Soboleva. NCBI GEO: archive for functional genomics data sets-10 years on. Nucleic acids research, 39(Database issue):D1005-10, January 2011. [ bib \| http ] A decade ago, the Gene Expression Omnibus (GEO) database was established at the National Center for Biotechnology Information (NCBI). The original objective of GEO was to serve as a public repository for high-throughput gene expression data generated mostly by microarray technology. However, the research community quickly applied microarrays to non-gene-expression studies, including examination of genome copy number variation and genome-wide profiling of DNA-binding proteins. Because the GEO database was designed with a flexible structure, it was possible to quickly adapt the repository to store these data types. More recently, as the microarray community switches to next-generation sequencing technologies, GEO has again adapted to host these data sets. Today, GEO stores over 20,000 microarray- and sequence-based functional genomics studies, and continues to handle the majority of direct high-throughput data submissions from the research community. Multiple mechanisms are provided to help users effectively search, browse, download and visualize the data at the level of individual genes or entire studies. This paper describes recent database enhancements, including new search and data representation tools, as well as a brief review of how the community uses GEO data. GEO is freely accessible at http://www.ncbi.nlm.nih.gov/geo/. Keywords: Databases, Genetic,Gene Expression Profiling,Genomics,Oligonucleotide Array Sequence Analysis,User-Computer Interface
	Dennis A Benson, Ilene Karsch-Mizrachi, David J Lipman, James Ostell, and Eric W Sayers. GenBank. Nucleic acids research, 39(Database issue):D32-7, January 2011. [ bib \| http ] GenBank® is a comprehensive database that contains publicly available nucleotide sequences for more than 380,000 organisms named at the genus level or lower, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole genome shotgun (WGS) and environmental sampling projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. GenBank is accessible through the NCBI Entrez retrieval system that integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage: www.ncbi.nlm.nih.gov. Keywords: Databases, Nucleic Acid,Expressed Sequence Tags,Genomics,High-Throughput Nucleotide Sequencing,Metagenomics,Molecular Sequence Annotation,Software
	H. M. Berman. The Protein Data Bank. Nucleic Acids Research, 28(1):235-242, January 2000. [ bib \| http ] The Protein Data Bank (PDB; http://www.rcsb.org/pdb/ ) is the single worldwide archive of structural data of biological macromolecules. This paper describes the goals of the PDB, the systems in place for data deposition and access, how to obtain further information, and near-term plans for the future development of the resource.
	Jiten Bhagat, Franck Tanoh, Eric Nzuobontane, Thomas Laurent, Jerzy Orlowski, Marco Roos, Katy Wolstencroft, Sergejs Aleksejevs, Robert Stevens, Steve Pettifer, Rodrigo Lopez, and Carole A Goble. BioCatalogue: a universal catalogue of web services for the life sciences. Nucleic acids research, 38(Web Server issue):W689-94, July 2010. [ bib \| http ] The use of Web Services to enable programmatic access to on-line bioinformatics is becoming increasingly important in the Life Sciences. However, their number, distribution and the variable quality of their documentation can make their discovery and subsequent use difficult. A Web Services registry with information on available services will help to bring together service providers and their users. The BioCatalogue (http://www.biocatalogue.org/) provides a common interface for registering, browsing and annotating Web Services to the Life Science community. Services in the BioCatalogue can be described and searched in multiple ways based upon their technical types, bioinformatics categories, user tags, service providers or data inputs and outputs. They are also subject to constant monitoring, allowing the identification of service problems and changes and the filtering-out of unavailable or unreliable resources. The system is accessible via a human-readable 'Web 2.0'-style interface and a programmatic Web Service interface. The BioCatalogue follows a community approach in which all services can be registered, browsed and incrementally documented with annotations by any member of the scientific community. Keywords: Biological Science Disciplines,Catalogs as Topic,Computational Biology,Internet,Software,User-Computer Interface
	Daniel Blankenberg, Gregory Von Kuster, Nathaniel Coraor, Guruprasad Ananda, Ross Lazarus, Mary Mangan, Anton Nekrutenko, and James Taylor. Galaxy: a web-based genome analysis tool for experimentalists. Current protocols in molecular biology / edited by Frederick M. Ausubel ... [et al.], Chapter 19:Unit 19.10.1-21, January 2010. [ bib \| http ] High-throughput data production has revolutionized molecular biology. However, massive increases in data generation capacity require analysis approaches that are more sophisticated, and often very computationally intensive. Thus, making sense of high-throughput data requires informatics support. Galaxy (http://galaxyproject.org) is a software system that provides this support through a framework that gives experimentalists simple interfaces to powerful tools, while automatically managing the computational details. Galaxy is distributed both as a publicly available Web service, which provides tools for the analysis of genomic, comparative genomic, and functional genomic data, or a downloadable package that can be deployed in individual laboratories. Either way, it allows experimentalists without informatics or programming expertise to perform complex large-scale analysis with just a Web browser. Keywords: Animals,Computational Biology,Computational Biology: methods,Genetic Techniques,Genome,Humans,Internet,Software Design
	James Casbon and Mansoor A S Saqi. S4: structure-based sequence alignments of SCOP superfamilies. Nucleic acids research, 33(Database issue):D219-22, January 2005. [ bib \| http ] S4 is an automatically generated database of multiple structure-based sequence alignments of protein superfamilies in the SCOP database. All structural domains that do not share more than 40% sequence identity as defined by the ASTRAL compendium of protein structures are included. The alignments are constructed using pairwise structural alignments to generate residue equivalences that are then integrated into multiple alignments using sequence alignment tools. We describe the database and give examples showing how the automatically generated S4 alignments compare favourably to hand-crafted alignments. Available at: http://compbio.mds.qmw.ac.uk/S4.html. Keywords: Algorithms,Databases, Protein,Protein Structure, Tertiary,Proteins,Proteins: chemistry,Proteins: classification,Sequence Alignment,Sequence Analysis, Protein,Software
	Peter J A Cock, Tiago Antao, Jeffrey T Chang, Brad A Chapman, Cymon J Cox, Andrew Dalke, Iddo Friedberg, Thomas Hamelryck, Frank Kauff, Bartek Wilczynski, and Michiel J L de Hoon. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics (Oxford, England), 25(11):1422-3, June 2009. [ bib \| http ] SUMMARY: The Biopython project is a mature open source international collaboration of volunteer developers, providing Python libraries for a wide range of bioinformatics problems. Biopython includes modules for reading and writing different sequence file formats and multiple sequence alignments, dealing with 3D macro molecular structures, interacting with common tools such as BLAST, ClustalW and EMBOSS, accessing key online databases, as well as providing numerical methods for statistical learning. AVAILABILITY: Biopython is freely available, with documentation and source code at (www.biopython.org) under the Biopython license. Keywords: Computational Biology,Computational Biology: methods,Databases, Factual,Internet,Programming Languages,Software
	Robert C Edgar. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic acids research, 32(5):1792-7, January 2004. [ bib \| http ] We describe MUSCLE, a new computer program for creating multiple alignments of protein sequences. Elements of the algorithm include fast distance estimation using kmer counting, progressive alignment using a new profile function we call the log-expectation score, and refinement using tree-dependent restricted partitioning. The speed and accuracy of MUSCLE are compared with T-Coffee, MAFFT and CLUSTALW on four test sets of reference alignments: BAliBASE, SABmark, SMART and a new benchmark, PREFAB. MUSCLE achieves the highest, or joint highest, rank in accuracy on each of these sets. Without refinement, MUSCLE achieves average accuracy statistically indistinguishable from T-Coffee and MAFFT, and is the fastest of the tested methods for large numbers of sequences, aligning 5000 sequences of average length 350 in 7 min on a current desktop computer. The MUSCLE program, source code and PREFAB test data are freely available at http://www.drive5.com/muscle. Keywords: Algorithms,Amino Acid Motifs,Amino Acid Sequence,Internet,Molecular Sequence Data,Reproducibility of Results,Sequence Alignment,Sequence Alignment: methods,Sequence Analysis, Protein,Sequence Analysis, Protein: methods,Software,Time Factors
	Robert C Gentleman, Vincent J Carey, Douglas M Bates, Ben Bolstad, Marcel Dettling, Sandrine Dudoit, Byron Ellis, Laurent Gautier, Yongchao Ge, Jeff Gentry, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome biology, 5(10):R80, 2004. [ bib \| http ] The Bioconductor project is an initiative for the collaborative creation of extensible software for computational biology and bioinformatics. The goals of the project include: fostering collaborative development and widespread use of innovative software, reducing barriers to entry into interdisciplinary scientific research, and promoting the achievement of remote reproducibility of research results. We describe details of our aims and methods, identify current challenges, compare Bioconductor to other open bioinformatics projects, and provide working examples.
	Belinda Giardine, Cathy Riemer, Ross C Hardison, Richard Burhans, Laura Elnitski, Prachi Shah, Yi Zhang, Daniel Blankenberg, Istvan Albert, James Taylor, Webb Miller, W James Kent, and Anton Nekrutenko. Galaxy: a platform for interactive large-scale genome analysis. Genome research, 15(10):1451-5, October 2005. [ bib \| http ] Accessing and analyzing the exponentially expanding genomic sequence and functional data pose a challenge for biomedical researchers. Here we describe an interactive system, Galaxy, that combines the power of existing genome annotation databases with a simple Web portal to enable users to search remote resources, combine data from independent queries, and visualize the results. The heart of Galaxy is a flexible history system that stores the queries from each user; performs operations such as intersections, unions, and subtractions; and links to other computational tools. Galaxy can be accessed at http://g2.bx.psu.edu. Keywords: Biological Evolution,Databases, Genetic,Genome,Internet,Promoter Regions, Genetic
	Jeremy Goecks, Anton Nekrutenko, and James Taylor. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome biology, 11(8):R86, January 2010. [ bib \| http ] Increased reliance on computational approaches in the life sciences has revealed grave concerns about how accessible and reproducible computation-reliant results truly are. Galaxy http://usegalaxy.org, an open web-based platform for genomic research, addresses these problems. Galaxy automatically tracks and manages data provenance and provides support for capturing the context and intent of computational methods. Galaxy Pages are interactive, web-based documents that provide users with a medium to communicate a complete computational analysis. Keywords: Algorithms,Animals,Computational Biology,Computational Biology: methods,Databases, Nucleic Acid,Genomics,Genomics: methods,Humans,Internet
	T Hubbard, Daniel Barker, Ewan Birney, Graham Cameron, Yuan Chen, L Clark, T Cox, J Cuff, Val Curwen, Thomas Down, et al. The ensembl genome database project. Nucleic acids research, 30(1):38-41, 2002. [ bib \| http ] The Ensembl (http://www.ensembl.org/) database project provides a bioinformatics framework to organise biology around the sequences of large genomes. It is a comprehensive source of stable automatic annotation of the human genome sequence, with confirmed gene predictions that have been integrated with external data sources, and is available as either an interactive web site or as flat files. It is also an open source software engineering project to develop a portable system able to handle very large genomes and associated requirements from sequence analysis to data storage and visualisation. The Ensembl site is one of the leading sources of human genome sequence annotation and provided much of the analysis for publication by the international human genome project of the draft genome. The Ensembl system is being installed around the world in both companies and academic sites on machines ranging from supercomputers to laptops.
	Minoru Kanehisa, Susumu Goto, Yoko Sato, Miho Furumichi, and Mao Tanabe. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic acids research, 40(Database issue):D109-14, January 2012. [ bib \| http ] Kyoto Encyclopedia of Genes and Genomes (KEGG, http://www.genome.jp/kegg/ or http://www.kegg.jp/) is a database resource that integrates genomic, chemical and systemic functional information. In particular, gene catalogs from completely sequenced genomes are linked to higher-level systemic functions of the cell, the organism and the ecosystem. Major efforts have been undertaken to manually create a knowledge base for such systemic functions by capturing and organizing experimental knowledge in computable forms; namely, in the forms of KEGG pathway maps, BRITE functional hierarchies and KEGG modules. Continuous efforts have also been made to develop and improve the cross-species annotation procedure for linking genomes to the molecular networks through the KEGG Orthology system. Here we report KEGG Mapper, a collection of tools for KEGG PATHWAY, BRITE and MODULE mapping, enabling integration and interpretation of large-scale data sets. We also report a variant of the KEGG mapping procedure to extend the knowledge base, where different types of data and knowledge, such as disease genes and drug targets, are integrated as part of the KEGG molecular networks. Finally, we describe recent enhancements to the KEGG content, especially the incorporation of disease and drug information used in practice and in society, to support translational bioinformatics.
	M A Larkin, G Blackshields, N P Brown, R Chenna, P A McGettigan, H McWilliam, F Valentin, I M Wallace, A Wilm, R Lopez, J D Thompson, T J Gibson, and D G Higgins. Clustal W and Clustal X version 2.0. Bioinformatics (Oxford, England), 23(21):2947-8, November 2007. [ bib \| http ] SUMMARY: The Clustal W and Clustal X multiple sequence alignment programs have been completely rewritten in C++. This will facilitate the further development of the alignment algorithms in the future and has allowed proper porting of the programs to the latest versions of Linux, Macintosh and Windows operating systems. AVAILABILITY: The programs can be run on-line from the EBI web server: http://www.ebi.ac.uk/tools/clustalw2. The source code and executables for Windows, Linux and Macintosh computers are available from the EBI ftp site ftp://ftp.ebi.ac.uk/pub/software/clustalw2/ Keywords: Algorithms,Amino Acid Sequence,Cluster Analysis,Computer Graphics,Molecular Sequence Data,Programming Languages,Sequence Alignment,Sequence Alignment: methods,Sequence Analysis, Protein,Sequence Analysis, Protein: methods,Software,User-Computer Interface
	A G Murzin, S E Brenner, T Hubbard, and C Chothia. SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of molecular biology, 247(4):536-40, April 1995. [ bib \| http ] To facilitate understanding of, and access to, the information available for protein structures, we have constructed the Structural Classification of Proteins (scop) database. This database provides a detailed and comprehensive description of the structural and evolutionary relationships of the proteins of known structure. It also provides for each entry links to co-ordinates, images of the structure, interactive viewers, sequence data and literature references. Two search facilities are available. The homology search permits users to enter a sequence and obtain a list of any structures to which it has significant levels of sequence similarity. The key word search finds, for a word entered by the user, matches from both the text of the scop database and the headers of Brookhaven Protein Databank structure files. The database is freely accessible on World Wide Web (WWW) with an entry point to URL http://scop.mrc-lmb.cam.ac.uk/scop/. Keywords: Amino Acid Sequence,Databases, Factual,Protein Folding,Proteins,Proteins: chemistry,Proteins: classification,Sequence Analysis
	R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2011. ISBN 3-900051-07-0. [ bib \| http ]
	Mathias Sprinzl and Konstantin S Vassilenko. Compilation of tRNA sequences and sequences of tRNA genes. Nucleic acids research, 33(Database issue):D139-40, January 2005. [ bib \| http ] Maintained at the Universitat Bayreuth, Bayreuth, Germany, the Compilation of tRNA Sequences and Sequences of tRNA Genes is accessible at the URL http://www.tRNA.uni-bayreuth.de with mirror site located at the Institute of Protein Research, Pushchino, Russia (http://alpha.protres.ru/trnadbase). The compilation is a searchable, periodically updated database of currently available tRNA sequences. The present version of the database contains a new Genomic tRNA Compilation including the sequences of tRNA genes from genomic sequences published up to July 2003. It consists of about 5800 tRNA gene sequences from 111 organisms covering archaea, bacteria, higher and lower eukarya. The former Compilation of tRNA Genes (up to the end of 1998) and the updated Compilation tRNA Sequences (561 entries) are also supported by the new software. The database can be explored by using multiple search criteria and sequence templates. The database provides a service that allows to obtain statistical information on the occurrences of certain bases at given positions of the tRNA sequences. This allows phylogenic studies and search for identity elements in respect to interactions of tRNAs with various enzymes. Keywords: Animals,Base Sequence,Databases, Nucleic Acid,Genomics,Phylogeny,RNA, Transfer,RNA, Transfer: chemistry,RNA, Transfer: classification,RNA, Transfer: genetics,Sequence Alignment
	The UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic acids research, 40(Database issue):D71-5, January 2012. [ bib \| http ] The mission of UniProt is to support biological research by providing a freely accessible, stable, comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase, with extensive cross-references and querying interfaces. UniProt is comprised of four major components, each optimized for different uses: the UniProt Archive, the UniProt Knowledgebase, the UniProt Reference Clusters and the UniProt Metagenomic and Environmental Sequence Database. A key development at UniProt is the provision of complete, reference and representative proteomes. UniProt is updated and distributed every 4 weeks and can be accessed online for searches or download at http://www.uniprot.org.

This file has been generated by bibtex2html 1.54

	James Casbon and Mansoor A S Saqi. S4: structure-based sequence alignments of SCOP superfamilies. Nucleic acids research, 33(Database issue):D219-22, January 2005. [ bib \| http ] S4 is an automatically generated database of multiple structure-based sequence alignments of protein superfamilies in the SCOP database. All structural domains that do not share more than 40% sequence identity as defined by the ASTRAL compendium of protein structures are included. The alignments are constructed using pairwise structural alignments to generate residue equivalences that are then integrated into multiple alignments using sequence alignment tools. We describe the database and give examples showing how the automatically generated S4 alignments compare favourably to hand-crafted alignments. Available at: http://compbio.mds.qmw.ac.uk/S4.html. Keywords: Algorithms,Databases, Protein,Protein Structure, Tertiary,Proteins,Proteins: chemistry,Proteins: classification,Sequence Alignment,Sequence Analysis, Protein,Software
	Peter J A Cock, Tiago Antao, Jeffrey T Chang, Brad A Chapman, Cymon J Cox, Andrew Dalke, Iddo Friedberg, Thomas Hamelryck, Frank Kauff, Bartek Wilczynski, and Michiel J L de Hoon. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics (Oxford, England), 25(11):1422-3, June 2009. [ bib \| http ] SUMMARY: The Biopython project is a mature open source international collaboration of volunteer developers, providing Python libraries for a wide range of bioinformatics problems. Biopython includes modules for reading and writing different sequence file formats and multiple sequence alignments, dealing with 3D macro molecular structures, interacting with common tools such as BLAST, ClustalW and EMBOSS, accessing key online databases, as well as providing numerical methods for statistical learning. AVAILABILITY: Biopython is freely available, with documentation and source code at (www.biopython.org) under the Biopython license. Keywords: Computational Biology,Computational Biology: methods,Databases, Factual,Internet,Programming Languages,Software
	Robert C Edgar. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic acids research, 32(5):1792-7, January 2004. [ bib \| http ] We describe MUSCLE, a new computer program for creating multiple alignments of protein sequences. Elements of the algorithm include fast distance estimation using kmer counting, progressive alignment using a new profile function we call the log-expectation score, and refinement using tree-dependent restricted partitioning. The speed and accuracy of MUSCLE are compared with T-Coffee, MAFFT and CLUSTALW on four test sets of reference alignments: BAliBASE, SABmark, SMART and a new benchmark, PREFAB. MUSCLE achieves the highest, or joint highest, rank in accuracy on each of these sets. Without refinement, MUSCLE achieves average accuracy statistically indistinguishable from T-Coffee and MAFFT, and is the fastest of the tested methods for large numbers of sequences, aligning 5000 sequences of average length 350 in 7 min on a current desktop computer. The MUSCLE program, source code and PREFAB test data are freely available at http://www.drive5.com/muscle. Keywords: Algorithms,Amino Acid Motifs,Amino Acid Sequence,Internet,Molecular Sequence Data,Reproducibility of Results,Sequence Alignment,Sequence Alignment: methods,Sequence Analysis, Protein,Sequence Analysis, Protein: methods,Software,Time Factors
	Robert C Gentleman, Vincent J Carey, Douglas M Bates, Ben Bolstad, Marcel Dettling, Sandrine Dudoit, Byron Ellis, Laurent Gautier, Yongchao Ge, Jeff Gentry, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome biology, 5(10):R80, 2004. [ bib \| http ] The Bioconductor project is an initiative for the collaborative creation of extensible software for computational biology and bioinformatics. The goals of the project include: fostering collaborative development and widespread use of innovative software, reducing barriers to entry into interdisciplinary scientific research, and promoting the achievement of remote reproducibility of research results. We describe details of our aims and methods, identify current challenges, compare Bioconductor to other open bioinformatics projects, and provide working examples.
	Belinda Giardine, Cathy Riemer, Ross C Hardison, Richard Burhans, Laura Elnitski, Prachi Shah, Yi Zhang, Daniel Blankenberg, Istvan Albert, James Taylor, Webb Miller, W James Kent, and Anton Nekrutenko. Galaxy: a platform for interactive large-scale genome analysis. Genome research, 15(10):1451-5, October 2005. [ bib \| http ] Accessing and analyzing the exponentially expanding genomic sequence and functional data pose a challenge for biomedical researchers. Here we describe an interactive system, Galaxy, that combines the power of existing genome annotation databases with a simple Web portal to enable users to search remote resources, combine data from independent queries, and visualize the results. The heart of Galaxy is a flexible history system that stores the queries from each user; performs operations such as intersections, unions, and subtractions; and links to other computational tools. Galaxy can be accessed at http://g2.bx.psu.edu. Keywords: Biological Evolution,Databases, Genetic,Genome,Internet,Promoter Regions, Genetic
	Jeremy Goecks, Anton Nekrutenko, and James Taylor. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome biology, 11(8):R86, January 2010. [ bib \| http ] Increased reliance on computational approaches in the life sciences has revealed grave concerns about how accessible and reproducible computation-reliant results truly are. Galaxy http://usegalaxy.org, an open web-based platform for genomic research, addresses these problems. Galaxy automatically tracks and manages data provenance and provides support for capturing the context and intent of computational methods. Galaxy Pages are interactive, web-based documents that provide users with a medium to communicate a complete computational analysis. Keywords: Algorithms,Animals,Computational Biology,Computational Biology: methods,Databases, Nucleic Acid,Genomics,Genomics: methods,Humans,Internet
	T Hubbard, Daniel Barker, Ewan Birney, Graham Cameron, Yuan Chen, L Clark, T Cox, J Cuff, Val Curwen, Thomas Down, et al. The Ensembl genome database project. Nucleic acids research, 30(1):38-41, 2002. [ bib \| http ] The Ensembl (http://www.ensembl.org/) database project provides a bioinformatics framework to organise biology around the sequences of large genomes. It is a comprehensive source of stable automatic annotation of the human genome sequence, with confirmed gene predictions that have been integrated with external data sources, and is available as either an interactive web site or as flat files. It is also an open source software engineering project to develop a portable system able to handle very large genomes and associated requirements from sequence analysis to data storage and visualisation. The Ensembl site is one of the leading sources of human genome sequence annotation and provided much of the analysis for publication by the international human genome project of the draft genome. The Ensembl system is being installed around the world in both companies and academic sites on machines ranging from supercomputers to laptops.
	Minoru Kanehisa, Susumu Goto, Yoko Sato, Miho Furumichi, and Mao Tanabe. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic acids research, 40(Database issue):D109-14, January 2012. [ bib \| http ] Kyoto Encyclopedia of Genes and Genomes (KEGG, http://www.genome.jp/kegg/ or http://www.kegg.jp/) is a database resource that integrates genomic, chemical and systemic functional information. In particular, gene catalogs from completely sequenced genomes are linked to higher-level systemic functions of the cell, the organism and the ecosystem. Major efforts have been undertaken to manually create a knowledge base for such systemic functions by capturing and organizing experimental knowledge in computable forms; namely, in the forms of KEGG pathway maps, BRITE functional hierarchies and KEGG modules. Continuous efforts have also been made to develop and improve the cross-species annotation procedure for linking genomes to the molecular networks through the KEGG Orthology system. Here we report KEGG Mapper, a collection of tools for KEGG PATHWAY, BRITE and MODULE mapping, enabling integration and interpretation of large-scale data sets. We also report a variant of the KEGG mapping procedure to extend the knowledge base, where different types of data and knowledge, such as disease genes and drug targets, are integrated as part of the KEGG molecular networks. Finally, we describe recent enhancements to the KEGG content, especially the incorporation of disease and drug information used in practice and in society, to support translational bioinformatics.
	M A Larkin, G Blackshields, N P Brown, R Chenna, P A McGettigan, H McWilliam, F Valentin, I M Wallace, A Wilm, R Lopez, J D Thompson, T J Gibson, and D G Higgins. Clustal W and Clustal X version 2.0. Bioinformatics (Oxford, England), 23(21):2947-8, November 2007. [ bib \| http ] SUMMARY: The Clustal W and Clustal X multiple sequence alignment programs have been completely rewritten in C++. This will facilitate the further development of the alignment algorithms in the future and has allowed proper porting of the programs to the latest versions of Linux, Macintosh and Windows operating systems. AVAILABILITY: The programs can be run on-line from the EBI web server: http://www.ebi.ac.uk/tools/clustalw2. The source code and executables for Windows, Linux and Macintosh computers are available from the EBI ftp site ftp://ftp.ebi.ac.uk/pub/software/clustalw2/ Keywords: Algorithms,Amino Acid Sequence,Cluster Analysis,Computer Graphics,Molecular Sequence Data,Programming Languages,Sequence Alignment,Sequence Alignment: methods,Sequence Analysis, Protein,Sequence Analysis, Protein: methods,Software,User-Computer Interface
	A G Murzin, S E Brenner, T Hubbard, and C Chothia. SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of molecular biology, 247(4):536-40, April 1995. [ bib \| http ] To facilitate understanding of, and access to, the information available for protein structures, we have constructed the Structural Classification of Proteins (scop) database. This database provides a detailed and comprehensive description of the structural and evolutionary relationships of the proteins of known structure. It also provides for each entry links to co-ordinates, images of the structure, interactive viewers, sequence data and literature references. Two search facilities are available. The homology search permits users to enter a sequence and obtain a list of any structures to which it has significant levels of sequence similarity. The key word search finds, for a word entered by the user, matches from both the text of the scop database and the headers of Brookhaven Protein Databank structure files. The database is freely accessible on World Wide Web (WWW) with an entry point to URL http://scop.mrc-lmb.cam.ac.uk/scop. Keywords: Amino Acid Sequence,Databases, Factual,Protein Folding,Proteins,Proteins: chemistry,Proteins: classification,Sequence Analysis
	R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2011. ISBN 3-900051-07-0. [ bib \| http ]
	Mathias Sprinzl and Konstantin S Vassilenko. Compilation of tRNA sequences and sequences of tRNA genes. Nucleic acids research, 33(Database issue):D139-40, January 2005. [ bib \| http ] Maintained at the Universität Bayreuth, Bayreuth, Germany, the Compilation of tRNA Sequences and Sequences of tRNA Genes is accessible at the URL http://www.tRNA.uni-bayreuth.de with mirror site located at the Institute of Protein Research, Pushchino, Russia (http://alpha.protres.ru/trnadbase). The compilation is a searchable, periodically updated database of currently available tRNA sequences. The present version of the database contains a new Genomic tRNA Compilation including the sequences of tRNA genes from genomic sequences published up to July 2003. It consists of about 5800 tRNA gene sequences from 111 organisms covering archaea, bacteria, higher and lower eukarya. The former Compilation of tRNA Genes (up to the end of 1998) and the updated Compilation tRNA Sequences (561 entries) are also supported by the new software. The database can be explored by using multiple search criteria and sequence templates. The database provides a service that allows to obtain statistical information on the occurrences of certain bases at given positions of the tRNA sequences. This allows phylogenic studies and search for identity elements in respect to interactions of tRNAs with various enzymes. Keywords: Animals,Base Sequence,Databases, Nucleic Acid,Genomics,Phylogeny,RNA, Transfer,RNA, Transfer: chemistry,RNA, Transfer: classification,RNA, Transfer: genetics,Sequence Alignment
	The UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic acids research, 40(Database issue):D71-5, January 2012. [ bib \| http ] The mission of UniProt is to support biological research by providing a freely accessible, stable, comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase, with extensive cross-references and querying interfaces. UniProt is comprised of four major components, each optimized for different uses: the UniProt Archive, the UniProt Knowledgebase, the UniProt Reference Clusters and the UniProt Metagenomic and Environmental Sequence Database. A key development at UniProt is the provision of complete, reference and representative proteomes. UniProt is updated and distributed every 4 weeks and can be accessed online for searches or download at http://www.uniprot.org.