Bioinformatics,
Edition 1
Managing Scientific Data
Editors:
Edited by Zoé Lacroix and Terence Critchlow
Publication Date:
18 Jul 2003
Life science data integration and interoperability is one of the most challenging problems facing bioinformatics today. In the current age of the life sciences, investigators have to interpret many types of information from a variety of sources: lab instruments, public databases, gene expression profiles, raw sequence traces, single nucleotide polymorphisms, chemical screening data, proteomic data, putative metabolic pathway models, and many others. Unfortunately, scientists are not currently able to easily identify and access this information because of the variety of semantics, interfaces, and data formats used by the underlying data sources.
Bioinformatics: Managing Scientific Data tackles this challenge head-on by discussing the current approaches and the variety of systems available to help bioinformaticians with this increasingly complex issue. The heart of the book lies in the collaborative efforts of eight distinct bioinformatics teams that describe their own approaches to data integration and interoperability. Each system receives its own chapter, in which the lead contributors provide valuable insight into the specific problems the system addresses, why the particular architecture was chosen, and the system's strengths and weaknesses. In closing, the editors provide criteria for evaluating these systems that bioinformatics professionals will find useful.
Key Features
* Provides a clear overview of the state-of-the-art in data integration and interoperability in genomics, highlighting a variety of systems and giving insight into the strengths and weaknesses of their different approaches.
* Discusses shared vocabulary, design issues, complexity of use cases, and the difficulties of transferring existing data management approaches to bioinformatics systems, which serves to connect computer and life scientists.
* Written by the primary contributors of eight reputable bioinformatics systems in academia and industry including: BioKris, TAMBIS, K2, GeneExpress, P/FDM, MBM, SDSC, SRS, and DiscoveryLink.
1 Introduction
Zoe Lacroix and Terence Critchlow
1.1 Overview
1.2 Problem and Scope
1.3 Biological Data Integration
1.4 Developing a Biological Data Integration System
1.4.1 Specifications
1.4.2 Translating Specifications into a Technical Approach
1.4.3 Development Process
1.4.4 Evaluation of the System
References
2 Challenges Faced in the Integration of Biological
Information
Su Yun Chung and John C. Wooley
2.1 The Life Science Discovery Process
2.2 An Information Integration Environment for Life Science Discovery
2.3 The Nature of Biological Data
2.3.1 Diversity
2.3.2 Variability
2.4 Data Sources in Life Science
2.4.1 Biological Databases Are Autonomous
2.4.2 Biological Databases Are Heterogeneous in Data Formats
2.4.3 Biological Data Sources Are Dynamic
2.4.4 Computational Analysis Tools Require Specific
Input/Output Formats and Broad Domain Knowledge
2.5 Challenges in Information Integration
2.5.1 Data Integration
2.5.2 Meta-Data Specification
2.5.3 Data Provenance and Data Accuracy
2.5.4 Ontology
2.5.5 Web Presentations
Conclusion
References
3 A Practitioner's Guide to Data Management and Data
Integration in Bioinformatics
Barbara A. Eckman
3.1 Introduction
3.2 Data Management in Bioinformatics
3.2.1 Data Management Basics
3.2.2 Two Popular Data Management Strategies
and Their Limitations
3.2.3 Traditional Database Management
3.3 Dimensions Describing the Space of Integration Solutions
3.3.1 A Motivating Use Case for Integration
3.3.2 Browsing vs. Querying
3.3.3 Syntactic vs. Semantic Integration
3.3.4 Warehouse vs. Federation
3.3.5 Declarative vs. Procedural Access
3.3.6 Generic vs. Hard-Coded
3.3.7 Relational vs. Non-Relational Data Model
3.4 Use Cases of Integration Solutions
3.4.1 Browsing-Driven Solutions
3.4.2 Data Warehousing Solutions
3.4.3 Federated Database Systems Approach
3.4.4 Semantic Data Integration
3.5 Strengths and Weaknesses of the Various Approaches to Integration
3.5.1 Browsing and Querying: Strengths and Weaknesses
3.5.2 Warehousing and Federation: Strengths and Weaknesses
3.5.3 Procedural Code and Declarative Query Language:
Strengths and Weaknesses
3.5.4 Generic and Hard-Coded Approaches:
Strengths and Weaknesses
3.5.5 Relational and Non-Relational Data Models: Strengths
and Weaknesses
3.5.6 Conclusion: A Hybrid Approach to Integration Is Ideal
3.6 Tough Problems in Bioinformatics Integration
3.6.1 Semantic Query Planning Over Web Data Sources
3.6.2 Schema Management
3.7 Summary
Acknowledgments
References
4 Issues to Address While Designing a Biological
Information System
Zoe Lacroix
4.1 Legacy
4.1.1 Biological Data
4.1.2 Biological Tools and Workflows
4.2 A Domain in Constant Evolution
4.2.1 Traditional Database Management and Changes
4.2.2 Data Fusion
4.2.3 Fully Structured vs. Semi-Structured
4.2.4 Scientific Object Identity
4.2.5 Concepts and Ontologies
4.3 Biological Queries
4.3.1 Searching and Mining
4.3.2 Browsing
4.3.3 Semantics of Queries
4.3.4 Tool-Driven vs. Data-Driven Integration
4.4 Query Processing
4.4.1 Biological Resources
4.4.2 Query Planning
4.4.3 Query Optimization
4.5 Visualization
4.5.1 Multimedia Data
4.5.2 Browsing Scientific Objects
4.6 Conclusion
Acknowledgments
References
5 SRS: An Integration Platform for Databanks
and Analysis Tools in Bioinformatics
Thure Etzold, Howard Harris, and Simon Beaulah
5.1 Integrating Flat File Databanks
5.1.1 The SRS Token Server
5.1.2 Subentry Libraries
5.2 Integration of XML Databases
5.2.1 What Makes XML Unique?
5.2.2 How Are XML Databanks Integrated into SRS?
5.2.3 Overview of XML Support Features
5.2.4 How Does SRS Meet the Challenges of XML?
5.3 Integrating Relational Databases
5.3.1 Whole Schema Integration
5.3.2 Capturing the Relational Schema
5.3.3 Selecting a Hub Table
5.3.4 Generation of SQL
5.3.5 Restricting Access to Parts of the Schema
5.3.6 Query Performance to Relational Databases
5.3.7 Viewing Entries from a Relational Databank
5.3.8 Summary
5.4 The SRS Query Language
5.4.1 SRS Fields
5.5 Linking Databanks
5.5.1 Constructing Links
5.5.2 The Link Operators
5.6 The Object Loader
5.6.1 Creating Complex and Nested Objects
5.6.2 Support for Loading from XML Databanks
5.6.3 Using Links to Create Composite Structures
5.6.4 Exporting Objects to XML
5.7 Scientific Analysis Tools
5.7.1 Processing of Input and Output
5.7.2 Batch Queues
5.8 Interfaces to SRS
5.8.1 The Web Interface
5.8.2 SRS Objects
5.8.3 SOAP and Web Services
5.9 Automated Server Maintenance with SRS Prisma
5.10 Conclusion
References
6 The Kleisli Query System as a Backbone for
Bioinformatics Data Integration and Analysis
Jing Chen, Su Yun Chung, and Limsoon Wong
6.1 Motivating Example
6.2 Approach
6.3 Data Model and Representation
6.4 Query Capability
6.5 Warehousing Capability
6.6 Data Sources
6.7 Optimizations
6.7.1 Monadic Optimizations
6.7.2 Context-Sensitive Optimizations
6.7.3 Relational Optimizations
6.8 User Interfaces
6.8.1 Programming Language Interface
6.8.2 Graphical Interface
6.9 Other Data Integration Technologies
6.9.1 SRS
6.9.2 DiscoveryLink
6.9.3 Object-Protocol Model (OPM)
6.10 Conclusions
References
7 Complex Query Formulation Over Diverse
Information Sources in TAMBIS
Robert Stevens, Carole Goble, Norman W. Paton,
Sean Bechhofer, Gary Ng, Patricia Baker, and Andy Brass
7.1 The Ontology
7.2 The User Interface
7.2.1 Exploring the Ontology
7.2.2 Constructing Queries
7.2.3 The Role of Reasoning in Query Formulation
7.3 The Query Processor
7.3.1 The Sources and Services Model
7.3.2 The Query Planner
7.3.3 The Wrappers
7.4 Related Work
7.4.1 Information Integration in Bioinformatics
7.4.2 Knowledge Based Information Integration
7.4.3 Biological Ontologies
7.5 Current and Future Developments in TAMBIS
7.5.1 Summary
Acknowledgments
References
8 The Information Integration System K2
Val Tannen, Susan B. Davidson, and Scott Harker
8.1 Approach
8.2 Data Model and Languages
8.3 An Example
8.4 Internal Language
8.5 Data Sources
8.6 Query Optimization
8.7 User Interfaces
8.8 Scalability
8.9 Impact
8.10 Summary
Acknowledgments
References
9 P/FDM Mediator for a Bioinformatics Database
Federation
Graham J. L. Kemp and Peter M. D. Gray
9.1 Approach
9.1.1 Alternative Architectures for Integrating Databases
9.1.2 The Functional Data Model
9.1.3 Schemas in the Federation
9.1.4 Mediator Architecture
9.1.5 Example
9.1.6 Query Capabilities
9.1.7 Data Sources
9.2 Analysis
9.2.1 Optimization
9.2.2 User Interfaces
9.2.3 Scalability
9.3 Conclusions
Acknowledgment
References
10 Integration Challenges in Gene Expression Data
Management
Victor M. Markowitz, John Campbell, I-Min A. Chen,
Anthony Kosky, Krishna Palaniappan,
and Thodoros Topaloglou
10.1 Gene Expression Data Management: Background
10.1.1 Gene Expression Data Spaces
10.1.2 Standards: Benefits and Limitations
10.2 The GeneExpress System
10.2.1 GeneExpress System Components
10.2.2 GeneExpress Deployment and Update Issues
10.3 Managing Gene Expression Data: Integration Challenges
10.3.1 Gene Expression Data: Array Versions
10.3.2 Gene Expression Data: Algorithms and Normalization
10.3.3 Gene Expression Data: Variability
10.3.4 Sample Data
10.3.5 Gene Annotations
10.4 Integrating Third-Party Gene Expression Data in GeneExpress
10.4.1 Data Exchange Formats
10.4.2 Structural Data Transformation Issues
10.4.3 Semantic Data Mapping Issues
10.4.4 Data Loading Issues
10.4.5 Update Issues
10.5 Summary
Acknowledgments
Trademarks
References
11 DiscoveryLink
Laura M. Haas, Barbara A. Eckman, Prasad Kodali,
Eileen T. Lin, Julia E. Rice, and Peter M. Schwarz
11.1 Approach
11.1.1 Architecture
11.1.2 Registration
11.2 Query Processing Overview
11.2.1 Query Optimization
11.2.2 An Example
11.2.3 Determining Costs
11.3 Ease of Use, Scalability, and Performance
11.4 Conclusions
References
12 A Model-Based Mediator System for Scientific Data
Management
Bertram Ludascher, Amarnath Gupta,
and Maryann E. Martone
12.1 Background
12.2 Scientific Data Integration Across Multiple Worlds: Examples
and Challenges from the Neurosciences
12.2.1 From Terminology and Static Knowledge
to Process Context
12.3 Model-Based Mediation
12.3.1 Model-Based Mediation: The Protagonists
12.3.2 Conceptual Models and Registration
of Sources at the Mediator
12.3.3 Interplay Between Mediator and Sources
12.4 Knowledge Representation for Model-Based Mediation
12.4.1 Domain Maps
12.4.2 Process Maps
12.5 Model-Based Mediator System and Tools
12.5.1 The KIND Mediator Prototype
12.5.2 The Cell-Centered Database and SMART Atlas:
Retrieval and Navigation Through
Multi-Scale Data
12.6 Related Work and Conclusion
12.6.1 Related Work
12.6.2 Summary: Model-Based Mediation
and Reason-Able Meta-Data
Acknowledgments
References
13 Compared Evaluation of Scientific Data
Management Systems
Zoe Lacroix and Terence Critchlow
13.1 Performance Model
13.1.1 Evaluation Matrix
13.1.2 Cost Model
13.1.3 Benchmarks
13.1.4 User Survey
13.2 Evaluation Criteria
13.2.1 The Implementation Perspective
13.2.2 The User Perspective
13.3 Tradeoffs
13.3.1 Materialized vs. Non-Materialized
13.3.2 Data Distribution and Heterogeneity
13.3.3 Semi-Structured Data vs. Fully Structured Data
13.3.4 Text Retrieval
13.3.5 Integrating Applications
13.4 Summary
References
Concluding Remarks
Summary
Looking Toward the Future
Appendix: Biological Resources
Glossary
System Information
SRS
Kleisli
TAMBIS
K2
P/FDM Mediator
GeneExpress
DiscoveryLink
KIND
Index
ISBN: 9781558608290
Page Count: 464
Retail Price: £63.99
Bergeron, Bioinformatics Computing, PH, 0131008250
Fogel/Corne, Evolutionary Computation in Bioinformatics, 1-55860-797-8
Bioinformaticians involved in data management (development, design, management, etc) at corporations and research companies. CS and life science students in bioinformatics programs.
Related Titles