IMISE-REPORTS
Edited by Professor Dr. Markus Löffler

M. Boeker, H. Herre, R. Hoehndorf, F. Loebe (Eds.)
OBML 2012 Workshop Proceedings
Dresden, September 27-28, 2012
IMISE-REPORT Nr. 1/2012
Medizinische Fakultät

Imprint
Publisher: Universität Leipzig, Medizinische Fakultät, Institut für Medizinische Informatik, Statistik und Epidemiologie (IMISE), Härtelstraße 16-18, 04107 Leipzig, Prof. Dr. Markus Löffler
Editors: Martin Boeker, Heinrich Herre, Robert Hoehndorf, Frank Loebe
Editorial office: Frank Loebe
Contact: Phone: (0341) 97-16100, Fax: (0341) 97-16109, Internet: http://www.imise.uni-leipzig.de
Editorial deadline: September 17, 2012
Printing: Content: Universitätsklinikum Leipzig AöR, Bereich 2 - Abteilung Zentrale Vervielfältigung/Formularwesen; Cover: Buch- und Offsetdruckerei Herbert Kirsten
© IMISE 2012 (report as a collected volume). The copyright of the individual articles remains with the authors. All rights reserved. Reprinting is permitted only with the express permission of the publisher or the respective authors and with citation of the source.
ISSN 1610-7233

Proceedings of the 4th WORKSHOP OF THE GI WORKGROUP "ONTOLOGIES IN BIOMEDICINE AND LIFE SCIENCES" (OBML)
Dresden, Germany, September 27-28, 2012
Group Website: https://wiki.imise.uni-leipzig.de/Gruppen/OBML

Organizers
Martin Boeker, University Medical Center Freiburg
Heinrich Herre, University of Leipzig
Robert Hoehndorf, University of Cambridge, UK
Frank Loebe (chair), University of Leipzig

Local Organizer
Michael Schroeder, Technical University Dresden

Keynote Speakers
Francisco M. Couto, University of Lisbon, Portugal
Georgios V. Gkoutos, Aberystwyth University, UK
Michael Schroeder, Technical University Dresden

Program Committee
Martin Boeker (program-chair), University Medical Center Freiburg
Heinrich Herre (program-chair), University of Leipzig
Patryk Burek, University of Leipzig
Fred Freitas, Federal University of Pernambuco, Recife, Brazil
Georgios V. Gkoutos, Aberystwyth University, UK
Giancarlo Guizzardi, Federal University of Espirito Santo, Brazil
Robert Hoehndorf, University of Cambridge, UK
Ludger Jansen, University of Rostock
Janet Kelso, Max Planck Institute for Evolutionary Anthropology, Leipzig
Toralf Kirsten, University of Leipzig
Frank Loebe, University of Leipzig
Axel Ngonga-Ngomo, University of Leipzig
Anika Oellrich, European Bioinformatics Institute, Cambridge, UK
Roberto Poli, University of Trento, Italy
Dietrich Rebholz-Schuhmann, European Bioinformatics Institute, Cambridge, UK
Peter Robinson, Charité Berlin
Daniel Schober, University Medical Center Freiburg
Paul N. Schofield, University of Cambridge, UK
Michael Schroeder, Technical University Dresden
Stefan Schulz, Medical University of Graz, Austria
Luca Toldo, Merck KGaA, Darmstadt
George Tsatsaronis, Technical University Dresden

Additional Reviewers
John Hancock, Medical Research Council, Harwell, Oxfordshire, UK
Oliver Kutz, University of Bremen
Filipe Santana da Silva, Federal University of Pernambuco, Recife, Brazil

Authors
Maria Anderberg, Karolinska Institute, Stockholm, Sweden
Clemens Beckstein, University of Jena
Martin Boeker, University Medical Center Freiburg
Bernard de Bono, European Bioinformatics Institute, Cambridge, UK
Mathias Brochhausen, University of Arkansas for Medical Sciences, Little Rock, AR, USA
Daniel L. Cook, University of Washington, Seattle, USA
Francisco M. Couto, University of Lisbon, Portugal
Samuel Croset, European Bioinformatics Institute, Cambridge, UK
Mikael Eriksson, Karolinska Institute, Stockholm, Sweden
Martin N. Fransson, Karolinska Institute, Stockholm, Sweden
Ana T. Freitas, Technical University of Lisbon, Portugal
Georg Fuellen, University of Rostock
John H. Gennari, University of Washington, Seattle, USA
Georgios V. Gkoutos, Aberystwyth University, UK
Christoph Grabmüller, European Bioinformatics Institute, Cambridge, UK
Pierre Grenon, European Bioinformatics Institute, Cambridge, UK
Niels Grewe, University of Rostock
Robert Hoehndorf, University of Cambridge, UK
William R. Hogan, University of Arkansas for Medical Sciences, Little Rock, AR, USA
Ludger Jansen, University of Rostock
Paul Kasteleyn, Leiden University, The Netherlands
Sanela Kjellqvist (form. Kurtovic), Karolinska Institute, Stockholm, Sweden
Christian Knüpfer, University of Jena
Natallia Kokash, Leiden University, The Netherlands
Andreas Kurtz, Charité Berlin
Ulf Leser, Humboldt University of Berlin
Jan-Eric Litton, Karolinska Institute, Stockholm, Sweden
Catia M. Machado, University of Lisbon, Portugal
Colin McKerlie, Hospital for Sick Children, Toronto, Canada
Roxana Merino-Martinez, Karolinska Institute, Stockholm, Sweden
Maxwell Neal, University of Washington, Seattle, USA
Loreana Norlin, Karolinska Institute, Stockholm, Sweden
Anika Oellrich, European Bioinformatics Institute, Cambridge, UK
Dome Potikanond, Leiden University, The Netherlands
Dietrich Rebholz-Schuhmann, European Bioinformatics Institute, Cambridge, UK
Johannes Röhl, University of Rostock
Daniel Schober, University Medical Center Freiburg
Paul N. Schofield, University of Cambridge, UK
Beth A. Sundberg, The Jackson Laboratory, Bar Harbor, Maine, USA
John P. Sundberg, The Jackson Laboratory, Bar Harbor, Maine, USA
Fons Verbeek, Leiden University, The Netherlands

Preliminary Program as of September 16, 2012

THURSDAY, Sep 27, 2012
12:00 - 13:00  Getting together / Registration / COFFEE
13:00 - 13:20  R. Hoehndorf: Welcome Remarks
13:20 - 14:15  M. Schroeder: Keynote: GoPubMed: Semantic Search for the Life Sciences
14:15 - 14:30  COFFEE
Session 1: Ontologies in the Clinical Environment (Chair: M. Boeker)
14:30 - 14:50  M. Brochhausen: Developing an Ontology for Sharing Biobank Data based on the BBMRI Minimum Data Set MIABIS
14:50 - 15:10  C. Machado: Enrichment Analysis Applied to Disease Prognosis
15:10 - 15:30  S. Croset: Integration of the Anatomical Therapeutic Chemical Classification System and DrugBank using OWL and Text-Mining
15:30 - 16:00  COFFEE
Session 2: Special Topic: Representations of Phenotype and Pathology (Chair: R. Hoehndorf)
16:00 - 16:20  L. Jansen: Using Ontologies to Study Cell Transitions
16:20 - 16:40  A. Oellrich: Automatically Transforming Pre‐ to Post‐Composed Phenotypes: EQ‐lising HPO and MP
16:40 - 17:00  P. Schofield: The Mouse Pathology Ontology, MPATH; Structure and Applications
17:00 - 17:20  COFFEE
17:20 - 18:15  G. Gkoutos: Keynote on Special Topic: The Importance of Physiology in Translational Research
19:30  DINNER

FRIDAY, Sep 28, 2012
09:00 - 09:10  COFFEE
Session 3: Special Topic: From Function to Physiology (Chair: H. Herre)
09:10 - 09:30  J. Röhl: Functions, Roles and Dispositions Revisited. A New Classification of Realizables
09:30 - 09:50  C. Knüpfer: Function of Bio-Models: Linking Structure to Behaviour
09:50 - 10:10  D. Cook: PhysioMaps of Physiological Processes and their Participants
10:10 - 10:30  B. de Bono: Tissue Motifs and Multi-Scale Transport Physiology
10:30 - 11:00  COFFEE
Session 4: Methods and Tools (Chair: N.N.)
11:00 - 12:00  F. Couto: Keynote: Semantic Similarity in Biomedical Ontologies: Measurements, Assessment and Applications
12:00 - 12:20  N. Grewe: Comparing Closely Related, Semantically Rich Ontologies: The GoodOD Similarity Evaluator
12:20 - 12:40  D. Schober: ZooAnimals.owl: A Didactically Sound Example‐Ontology for Teaching Description Logics in OWL 2
12:40 - 14:00  LUNCH
14:00 - 15:00  Open Discussion

Table of Contents

Keynote Abstracts
Semantic Similarity in Biomedical Ontologies: Measurement, Assessment and Applications - Francisco M. Couto
The Importance of Physiology in Translational Research - Georgios V. Gkoutos
GoPubMed: Semantic Search for the Life Sciences - Michael Schroeder

Workshop papers, according to program ordering (with paper ID and number of pages)

Ontologies in the Clinical Environment
Developing an Ontology for Sharing Biobank Data based on the BBMRI Minimum Data Set MIABIS - Mathias Brochhausen, Martin N. Fransson, Mikael Eriksson, Roxana Merino-Martinez, Loreana Norlin, Sanela Kjellqvist (form. Kurtovic), Maria Anderberg, Umit Topaloglu, William R. Hogan and Jan-Eric Litton (Paper A, 6 pages)
Enrichment Analysis Applied to Disease Prognosis - Catia M. Machado, Ana T. Freitas and Francisco M. Couto (Paper B, 5 pages)
Integration of the Anatomical Therapeutic Chemical Classification System and DrugBank using OWL and Text-Mining - Samuel Croset, Robert Hoehndorf and Dietrich Rebholz-Schuhmann (Paper C, 4 pages)

Special Topic: Representations of Phenotype and Pathology
Using Ontologies to Study Cell Transitions - Ludger Jansen, Georg Fuellen, Ulf Leser and Andreas Kurtz (Paper D, 5 pages)
Automatically Transforming Pre‐ to Post‐Composed Phenotypes: EQ‐lising HPO and MP - Anika Oellrich, Christoph Grabmüller and Dietrich Rebholz-Schuhmann (Paper E, 5 pages)
The Mouse Pathology Ontology, MPATH; Structure and Applications - Paul N. Schofield, John P. Sundberg, Beth A. Sundberg, Colin McKerlie and George V. Gkoutos (Paper F, 7 pages)

Special Topic: From Function to Physiology
Functions, Roles and Dispositions Revisited. A New Classification of Realizables - Johannes Röhl and Ludger Jansen (Paper G, 6 pages)
Function of Bio-Models: Linking Structure to Behaviour - Clemens Beckstein and Christian Knüpfer (Paper H, 5 pages)
PhysioMaps of Physiological Processes and their Participants - Daniel L. Cook, Maxwell L. Neal, Robert Hoehndorf, Georgios V. Gkoutos and John H. Gennari (Paper I, 4 pages)
Tissue Motifs and Multi-Scale Transport Physiology - Bernard de Bono, Paul Kasteleyn, Dome Potikanond, Natallia Kokash, Fons Verbeek and Pierre Grenon (Paper J, 4 pages)

Methods and Tools
Comparing Closely Related, Semantically Rich Ontologies: The GoodOD Similarity Evaluator - Niels Grewe, Daniel Schober and Martin Boeker (Paper K, 6 pages)
ZooAnimals.owl: A Didactically Sound Example‐Ontology for Teaching Description Logics in OWL 2 - Daniel Schober, Niels Grewe, Johannes Röhl and Martin Boeker (Paper L, 5 pages)

Keynote Abstracts

Francisco M. Couto, University of Lisbon, Portugal
Semantic Similarity in Biomedical Ontologies: Measurement, Assessment and Applications
The analysis of complex biomedical entities and events, such as disease and epidemiological models, is challenging due to their multiple domain features, and thus to accurately describe them we need to use concepts from multiple biomedical ontologies, such as gene mutations, protein functions, anatomical parts and phenotypes. The usefulness of ontological annotations to interlink and interpret biomedical information is widely recognized, particularly for retrieving related information.
This relatedness can be captured by semantic similarity measures that return a numerical value reflecting the closeness in meaning between two ontology concepts or two annotated entities. These measures have been successfully applied to biomedical ontologies, particularly to the Gene Ontology, for comparing proteins based on the similarity of their functions. This talk will discuss ongoing efforts to calculate semantic similarity, describe popular techniques used to evaluate their effectiveness, and present existing biomedical applications that use semantic similarity measures to improve their performance.

Georgios V. Gkoutos, Aberystwyth University, UK
The Importance of Physiology in Translational Research
The targeted mutation of animal models and the systematic study of phenotypes resulting from these mutations have led to several significant breakthroughs in recent years. Currently, associations between genotype and phenotype in animal models are being recorded systematically for multiple species. The aim is to consistently describe the phenotypes associated with mutations in every function-bearing gene, to reveal the genes' functions and the structure and dynamics of physiological pathways, and to reveal the pathobiology of disease in humans and other animals. A consistent representation of physiology is one key challenge towards achieving such a goal. The major focus of this presentation is to highlight the importance of a physiology representation in relation to recent efforts to systematically compare phenotypes across species and translate insights from animal model research into an understanding of human traits and disease.

Michael Schroeder, Technical University Dresden, Germany
GoPubMed: Semantic Search for the Life Sciences
In this talk I give an overview of recent work on ontology generation and semantic search with applications in drug discovery and trend analysis.

Developing an Ontology for Sharing Biobank Data based on the BBMRI Minimum Data Set MIABIS
Mathias Brochhausen*1, Martin N. Fransson2, Mikael Eriksson2, Roxana Merino-Martinez2, Loreana Norlin2, Sanela Kjellqvist¤2, Maria Anderberg2, Umit Topaloglu1, William R. Hogan1 and Jan-Eric Litton2
1 University of Arkansas for Medical Sciences, Little Rock, AR, USA
2 Karolinska Institutet, Stockholm, Sweden
¤ Formerly Sanela Kurtovic
* To whom correspondence should be addressed.

ABSTRACT
Sharing data about heterogeneous collections of specimens stored in biobanks is an important topic with respect to making optimal use of sparse resources. Based on a minimum data set for biobank data sharing created by the European Biobanking and Biomolecular Resources Research Infrastructure (BBMRI), we developed an ontology coded in OWL 2. This ontology provides shared semantics to guide data collection and enables ontology-assisted querying of the data. The development of this application ontology for sharing data across biobanks follows the criteria laid out by the OBO Foundry. The goal of using ontologies within BBMRI is to ensure semantic data integration and to enable reasoning over data.

1 INTRODUCTION
The move towards a universal information infrastructure for biobanking is directly connected to the issues of semantic interoperability through standardized message formats and controlled terminologies. Several efforts have been made toward the integration of biobank and research data.
Harmonization of biomedical data has been addressed in large collaborations, but most of these efforts have been carried out in a project-driven fashion, focused on the harmonization of the specific information needed by a particular project [1,2]. In contrast, one of the aims of the European Biobanking and Biomolecular Resources Research Infrastructure (BBMRI) is to provide the necessary formats to compare biobank information at different levels of detail [3]. Following the preparatory phase of BBMRI, several national BBMRI nodes were initiated with the same aim as the European BBMRI, but on a national scale. In recent years, ontologies have become increasingly important to biological and biomedical research. The widespread adoption of the Gene Ontology [4] is only one example of the usage of ontologies in the biomedical arena. Ontologies stand out from other semantic resources, such as terminologies and controlled vocabularies, by (among others) two characteristics: a logical structure to support algorithmic processing and a focus on the empirical knowledge regarding the phenomena that are the basis of the data [5]. Considering current differences in biobank concepts among different countries and the inability to compare concepts across languages, an ontology is crucial for researchers looking for semantically comparable and international data sets. Moreover, an ontology that includes formal restrictions would help to overcome semantic underspecification, even in a single-language environment. The use of semantic annotation, such as axioms coded in OWL, will also facilitate the design and implementation of an informatics model for biobank data sharing. The implementation of this model in an informatics system will lead to the integration of biobank and research data and a first step towards global knowledge discovery within the biobank domain. One of our main use cases lies outside the BBMRI effort: aggregating and querying data from several, independently operated biobanks at the University of Arkansas for Medical Sciences (UAMS) and the Arkansas Children's Hospital Research Institute (ACHRI). UAMS has a Tissue Procurement Facility and several relatively smaller individual research labs (i.e. the Myeloma Institute, the Spit for the Cure project) that collect and store specimens for research purposes. In addition, ACHRI has several independent labs similarly managing specimens, including the Center for Birth Defects Research, autism research, and the Women's Mental Health Program. Both institutions would like their collected specimens and annotated data to be used for research purposes while keeping the operations of each lab independent. Also, to facilitate finding specimens of interest, it is a requirement to integrate biobank data with electronic health record (EHR) data. In another concurrent effort, UAMS has invested in creating an Enterprise Data Warehouse (EDW) to facilitate access to and integration of clinical, basic-science, and other data for research and quality reporting. Informatics for Integrating Biology and the Bedside (i2b2) [6,7] is an open-source platform for retrieving de-identified data from the UAMS EDW.
i2b2 was designed primarily for cohort identification, allowing users to perform queries to determine the existence of a set of patients meeting certain inclusion or exclusion criteria. To ensure semantic integration of data from numerous biobanks with EHR data, an ontology is needed to which the data will be mapped in the i2b2 Ontology Cell. The initial development of OMIABIS as presented here will provide the core of this ontology. The aim of the data integration is to allow queries over all of the data within the data warehouse, regardless of the initial data structure in the data source. Because the management, the operations, and the data collected in the biobanks are distinct, it will be a challenge to manually map all of the biobanks' data into a single i2b2 instance. Currently the biobanks use caTissue [8], an open-source biospecimen management tool. Despite the use of a single software application, integration of data is not guaranteed, because each biobank creates its own specimen annotation forms with different data elements. To ensure integration, we will incorporate OMIABIS into caTissue's annotation forms for all UAMS/ACHRI biobanks and into the biobank administration data model. Then, the data in the separate caTissue instances for the biobanks can easily be extracted, transformed, and loaded (ETL) into the EDW i2b2 instance, and queried with common semantics. We present the development of an ontology based on the "Minimum Information About BIobank data Sharing" (MIABIS) for sharing key biobank attributes focused on administrative aspects of biobanks. This is a first logical step towards open and stimulating research collaboration.

2 MATERIAL
As a first step towards the harmonization of biobank data, a minimum data set was proposed during the preparatory phase of the European BBMRI in 2008-2010. The data set consisted of twenty-one attributes describing information pertaining to biobanks and related entities, i.e., individual subjects, cases and samples. The BBMRI minimum data set was further elaborated within the Swedish node BBMRI.se. To avoid any legal or ethical issues associated with the lowest level of data, i.e., individual subjects, cases and samples, this level was removed during development. The updated version is called MIABIS – Minimum Information About BIobank data Sharing – and consists of fifty-two attributes considered important for establishing a system of data discovery for biobanks and sample collections. The data set employs existing standards, e.g., the Sample PREanalytical Code (SPREC), ICD codes [http://apps.who.int/classifications/icd10/browse/2010/en] and definitions developed by the Public Population Project in Genomics (P3G) [http://www.p3g.org] and the International Society for Biological and Environmental Repositories (ISBER) [http://www.isber.org/]. MIABIS is being used in a structured Scandinavian survey to gather information about sample collections stored in biobanks in a searchable database. MIABIS was developed in the context of several use cases described by invited researchers. Two examples of use cases employed during the development of MIABIS are described below, although the latter would require the inclusion of individual-level data, similar to the use case from UAMS/ACHRI described previously:
1) Search for tissue samples from donors diagnosed with nemaline myopathy. Determine the age group. What are the sample storage conditions?
Contact the biobank for detailed information about the biopsy samples and whether myoblast cell cultures have been grown from these samples.
2) Search for sample collections having at least 10 cases with tissue from the thoracic aorta as well as blood, serum, or plasma from the same donor. Also check whether clinical data, such as physical measurements, has been registered for the donors. Contact the person responsible for the sample collection to obtain detailed information on the specific kind of thoracic aorta biopsies of interest. Also ensure that the biopsies were performed within +/- one week of the blood sampling.
For the reasons mentioned above, personal data protection considerations are not addressed here. Furthermore, an ontology that supports security and privacy of data and user authorization, authentication, and permissions is beyond the scope of this work. Our use case necessitates a semantically rich resource that allows the mapping of data from heterogeneous sources onto the ontology, which will act as a global schema.

3 METHODS
Our aim is to provide a shared semantic schema as the foundation for MIABIS to ensure semantic integration of biobank data. An additional benefit of such a schema is that it will guide data entry so as to unify and regiment language use across multiple researchers. To make the ontology easily accessible and implementable, we chose the Web Ontology Language (OWL) 2 [9] for implementation. To facilitate re-use and harmonization across ontologies, we used Basic Formal Ontology (BFO) [http://www.ifomis.org/bfo] as the upper ontology [5,10]. In addition, the entire ontology development followed the principles of best practice in ontology development as set forth by the OBO Foundry [11,12]. In doing so we took the first step towards collaboration with the Open Biological and Biomedical Ontologies group to ensure re-use and uptake of our efforts across the domain. Re-use of preexisting ontologies is key among the OBO Foundry principles. In creating Ontologized MIABIS (OMIABIS), we imported the Proper Name Ontology (PNO) [http://purl.obolibrary.org/obo/iao/pno.owl] in its entirety. PNO is based on the Information Artifact Ontology (IAO) [http://purl.obolibrary.org/obo/iao.owl]. Thus, OMIABIS is an extension of IAO. In addition, multiple entities from other ontologies, namely the Ontology for Biomedical Investigations (OBI) [http://purl.obofoundry.org/obo/obi.owl] and the Ontology of Medically Relevant Social Entities (OMRSE) [http://code.google.com/p/omrse], are imported using a tool based on the MIREOT methodology [13], which was developed in a joint endeavor between the University of Arkansas for Medical Sciences and the University of Arkansas at Little Rock. We chose to re-use the ontologies mentioned above because they are members of the OBO Foundry and, thus, are built according to the same basic principles, partly extending the same upper ontology (BFO). Our aim is to provide ontological representations that facilitate the integration of biobank data with biomedical research data. The latter are often annotated with terms from GO or OBI. Thus, choosing ontologies from the very same orthogonal ontology library (the OBO Foundry) of which the latter are members seems to be the best strategy to cater to potential users.
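To make the MIREOT idea of referencing external terms concrete, the following is a minimal sketch (not the UAMS/UALR import tool itself) of how an external class can be brought into a module with only the minimum information needed to reference it: its IRI, a label, and a single superclass link. It assumes rdflib; the specific OBO identifiers shown are illustrative assumptions and should be checked against the source ontologies.

```python
# Minimal MIREOT-style stub: reference externally defined ontology terms by IRI,
# adding only a label and one superclass link (assumed identifiers, for illustration).
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import OWL, RDF, RDFS

OBO = Namespace("http://purl.obolibrary.org/obo/")

def mireot_stub(graph, term_iri, label, parent_iri):
    """Add a minimal reference to an externally defined ontology class."""
    term = URIRef(term_iri)
    graph.add((term, RDF.type, OWL.Class))
    graph.add((term, RDFS.label, Literal(label)))
    graph.add((term, RDFS.subClassOf, URIRef(parent_iri)))
    return term

g = Graph()
g.bind("obo", OBO)
# e.g. reference OBI 'specimen' under BFO 'material entity', and IAO 'data set'
# under IAO 'information content entity' (IRIs are assumptions for this sketch).
mireot_stub(g, OBO.OBI_0100051, "specimen", OBO.BFO_0000040)
mireot_stub(g, OBO.IAO_0000100, "data set", OBO.IAO_0000030)

print(g.serialize(format="turtle"))
```

Keeping the referenced axioms this small is what lets a module stay synchronized with, and logically compatible with, the ontologies it borrows from.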
Besides the use of the ontology as a means to achieve integration of pre-existing, heterogeneous data, we foresee the possibility of using the global schema to guide the development of new biobank data resources and to ensure the validity of data entry. The first step in providing ontological representations of the biobank domain as captured by MIABIS was to categorize the MIABIS data attributes according to their basic ontological commitment, as shown in Figure 1. From this analysis it is obvious that OMIABIS needs to bring together two rather different domains: the domain of administrative information and the domain of biomedical research. This need is reflected by the imports specified in the previous paragraph. Based on the multi-domain character of the data elements in MIABIS, it is obvious that OMIABIS is not a domain ontology, but brings together pre-existing classes from domain ontologies to represent and formally define the MIABIS attributes and thereby provide semantic integration of biobank-related data.

4 RESULTS
The latest release of OMIABIS in OWL can be downloaded from http://purl.obolibrary.org/obo/omiabis.owl. For the initial release our aim is to represent the first column of Figure 1, which comprises all classes and object properties closely related to biobanks and their administration. This initial step of development includes most of the innovations with regard to applied ontology, since we foresee that for the following steps we will rely heavily on previous representations by OBO Foundry ontologies. In order to represent the biobank-related attributes of MIABIS we have to represent types of things that are not yet represented in OBO Foundry ontologies. OMIABIS includes a total of 249 classes and 64 object properties. 35 classes and object properties are newly created for the initial version of OMIABIS. A textual definition is given for all newly created classes and object properties.

Fig. 1: Categorizing MIABIS data attributes according to ontological commitments

The central class of OMIABIS is "biobank". Its textual definition is: "A biobank is a collection of samples of biological substances (e.g. tissue, blood, DNA) which are linked to data about the donors of the material. They have a dual nature as collections of samples and data." The definition is derived from the definition for human biobank by [14]. The class is formally restricted to be equivalent to1:

"object_aggregate
AND has_part SOME (object_aggregate AND (has_part ONLY (specimen AND participates_in SOME storage)))
AND has_part SOME (material information bearer AND (participates_in SOME (digital curation AND (has_specified_output SOME (data set AND (is about SOME (object_aggregate AND (has_part ONLY specimen))))))))"2

1 Classes are printed bold, object properties in italics and OPERATORS in all caps. Definitions of classes referred to here can be found in Tab. 1.

Notably, OMIABIS, like MIABIS, differentiates between a biobank and the organization running a biobank. This is important because one organization can run multiple biobanks, each with a different focus and different management. Therefore, OMIABIS also represents "biobank organization". Its textual definition is: "A biobank organization is an organization bearing legal personality that owns or administrates a biobank".
"Biobank organization" is equivalent to: "organization AND ((owns SOME biobank) OR (administrates SOME biobank)) AND (bearer_of SOME legal person role)" Referring to the class "legal person role" from OMRSE is necessary due to the fact that the definition of organization in OBI does not refer to legal personality. Any group of human beings that has some organizational rules fulfills the textual definition according to OBI. However, for our use case the issue of legal personality is crucial. Therefore, we add this aspect to the definition and description of "biobank organization". Table 1 lists all imported classes referred to by the definitions above, their textual definition and their source. The formal description of biobank organization uses two object properties which have been specifically created for OMIABIS: 1. "owns" Definition: "a owns b iff a is the bearer of the roles that concretizes the claims and obligations regarding b." Domain: Homo sapiens OR organization OR collection of humans OR aggregate of organizations Range: information content entity OR material_entity Characteristics: asymmetric 2. "administrates" Definition: "a administrates b if c owns b and some rights and obligations regarding b are transferred3 from c to a." Domain: Homo sapiens OR organization OR collection of humans OR aggregate of organizations Range: information content entity OR material_entity Characteristics: asymmetric 2 Note that this class description is based on object properties and classes from BFO, IAO and OBI. 3 The 'transfers' object property is represented in Document Acts Ontology (d-acts): http://purl.obolibrary.org/iao/d-acts.owl 4 !"#$%& !"#$%&'())*$)(&$+ 34$%.-$0+ 3&!*()$+ -(&$*.(/+.0;!*> -(&.!0+"$(*$*+ =.).&(/+%:*(&.!0+ =(&(+3$&+ /$)(/+4$*3!0+*!/$+ '$()*)+),*& ,+-(&$*.(/+$0&.&1+ 230(456(&$*.(/70&.&18+&9(&+.3+(+ -$*$!/!).%(/+3:-+!;+3$4(*(&$+ !"#$%&+230(45<"#$%&8+$0&.&.$3+ (0=+4!33$33$3+0!0>%!00$%&$=+ "!:0=(*.$3?+ ,+-(&$*.(/+$0&.&1+&9(&+9(3+&9$+ 34$%.-$0+*!/$?+ ,+-(.0&$0(0%$+4*!%$33+"1+C9.%9+ -(&$*.(/+$0&.&.$3+&9(&+(*$+0!&+ (%&.D$/1+-$&("!/.E.0)+(*$+ 4/(%$=+.0+C$//+.=$0&.;.$=+/!%(> &.!0+(0=+4!33."/1+:0=$*+%!0> &*!//$=+$0D.*!0-$0&+.0+(=>9!%+ =$D.%$3F3&*:%&:*$3+.0+!*=$*+&!+ 4*$3$*D$+(0=+4*!&$%&+&9$-+ ;*!-+=$%(1F(/&$*(&.!0+(0=+ -(.0&(.0+(D(./("./.&1+ ,0+.0;!*-(&.!0+"$(*$*+.3+(+-(&$> *.(/'$0&.&1+.0+C9.%9+(+%!0%*$&.> E(&.!0+!;+(0+.0;!*-(&.!0+%!0> &$0&+$0&.&1+.09$*$3?+ G.).&(/+%:*(&.!0+.3+&9$+4*!%$33+!;+ $3&("/.39.0)+(0=+=$D$/!4.0)+ /!0)+&$*-+*$4!3.&!*.$3+!;+=.).> &(/+(33$&3+;!*+%:**$0&+(0=+;:> &:*$+*$;$*$0%$+"1+*$3$(*%9$*3H+ 3%.$0&.3&3H+(0=+9.3&!*.(03H+(0=+ 3%9!/(*3+)$0$*(//1?+ ,+=(&(+.&$-+&9(&+.3+(0+())*$)(&$+ !;+!&9$*+=(&(+.&$-3+!;+&9$+3(-$+ &14$+&9(&+9(D$+3!-$&9.0)+.0+ %!--!0?+,D$*()$3+(0=+=.3&*.> ":&.!03+%(0+"$+=$&$*-.0$=+;!*+ =(&(+3$&3?+ ,+*!/$+"!*0$+"1+(+9:-(0+.0=.> D.=:(/+!*+"1+(+%!//$%&.!0+!;+ 9:-(03+*$)(*=$=+(3+4!3> 3$33.0)+*.)9&3+(0=+=:&.$3+$0> ;!*%$("/$+(&+/(C?+ -,./0$&& @A<+ <@B+ <@B+ B,<+ <@B+ B,<+ <6IJ7+ Table 1: Definitions and Source of classes referred to in Section 4 Figure 2 depicts the central classes for representing biobank related attributes from MIABIS. From Fig. 2 it becomes clear that the ontology allows inference of facts that are not computable based on the data attributes alone. The fact that the biobank contact person is a member of the biobank organization is one example. These inferences are based on class restrictions specified on the OWL file. 
Regarding our use case at UAMS/ACHRI, class restrictions for the classes within the domain, for instance the definition of biobank given above, are a crucial requirement. They allow automated comparison and integration of pre-existing data structures. This is a crucial part of our plan to use OMIABIS to facilitate entering biobank data into i2b2. Furthermore, once the entirety of MIABIS is represented in OMIABIS, the class restrictions will technically enable querying for the sex and age of donors, storage conditions (e.g. freezer) and the PI of the study for which the specimen was obtained.

Figure 2: Biobank, biobank organization and biobank contact person in OMIABIS

4.1 Discussion
Taking into consideration the immediately biobank-related attributes in MIABIS, we found only one problematic case for which we will need more information from the users of MIABIS: biobank type. The possible values for this attribute are Pathology, Cytology, Gynecology, Obstetrics, Transfusion, Transplant, Clinical Chemistry, IVF, Bacteriology, Virology. There are already strong indications from MIABIS users that this list is not exhaustive. The possible values for biobank type are under elaboration and will be updated as time progresses. The rationale behind these choices is to allow the submitter of data to easily select something that seems plausible to her. However, the downside of this approach is a certain difficulty for end users to find relevant biobanks and studies for their research. It is obvious that the currently proposed categories are not disjoint from each other. A specimen collection can, by virtue of the type of specimens stored, be of interest to both pathologists and virologists, or gynecologists and cytologists, and so on. In order to provide a useful ontological representation of these classes, we need users to specify which characteristics of a biobank make it useful for which specialty of medicine or which research domain. MIABIS is not concerned with data privacy protection issues due to its focus on sharing metadata about biobanks. However, once the sharing of biomaterial is considered, these issues need to be addressed. As far as ontology is concerned, this should be done in a separate ontology that can be modularized with OMIABIS to provide the necessary shared semantic backbone. For harmonizing and modularizing ontologies, clear quality criteria for the development of the ontologies in question are an asset. From the perspective of the BBMRI effort, focusing on the descriptive level of biobank information, i.e., biobank and sample collection/study data, will give two benefits: 1) it will avoid any legal/ethical issues, and 2) it will facilitate a relatively easy way of data discovery, compared to requesting data for each individual sample. Hence, MIABIS may stand a better chance of being accepted globally. Further, it is plausible that biobank information will continue to be distributed; descriptive data may be distributed in an international setting, while the sample data may continue to be stored locally at the biobanks. Thus, a system intended for data discovery of biobank information should consider the distributed information content. In this setting a semantic web approach may be preferable over a query language for structured databases. For the UAMS/ACHRI use case we foresee that a more detailed ontology that technically enables sharing individual donor and specimen data will likely be needed.
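To illustrate the semantic-web style of data discovery mentioned above (finding, for example, biobanks that hold specimens of a given type together with their storage conditions), the following sketch runs a SPARQL query with rdflib over a toy graph. The vocabulary used (ex:Biobank, ex:storesSpecimenOfType, ex:storageTemperature) consists of hypothetical stand-in terms, not released OMIABIS identifiers.

```python
# Hedged sketch of ontology-assisted querying over biobank descriptions that
# have been mapped to a shared schema (stand-in IRIs, not OMIABIS terms).
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/biobank-sketch#")

g = Graph()
g.bind("ex", EX)
# Toy descriptions that two independently operated biobanks might expose.
g.add((EX.biobank1, RDF.type, EX.Biobank))
g.add((EX.biobank1, EX.storesSpecimenOfType, Literal("tissue")))
g.add((EX.biobank1, EX.storageTemperature, Literal(-80)))
g.add((EX.biobank2, RDF.type, EX.Biobank))
g.add((EX.biobank2, EX.storesSpecimenOfType, Literal("plasma")))

query = """
PREFIX ex: <http://example.org/biobank-sketch#>
SELECT ?biobank ?temp WHERE {
    ?biobank a ex:Biobank ;
             ex:storesSpecimenOfType "tissue" ;
             ex:storageTemperature ?temp .
}
"""
for row in g.query(query):
    print(row.biobank, row.temp)  # -> biobank1 with its storage temperature
```

Because such queries run over graphs rather than fixed relational schemas, descriptive data published by distributed biobanks can be federated without first agreeing on a single database layout.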
For this effort OMIABIS provides the core of a set of modularized ontologies to govern data exchange between the multiple biobanks. The fact that MIABIS is focused on administrative data regarding biobanks makes OMIABIS a perfect candidate for a central role in this endeavor. Our next step is to cooperate with other biobank projects to consolidate and expand OMIABIS, working toward a domain ontology for biobanking. The developers will seek contact with the OBI consortium to discuss having OMIABIS as a domain-specific extension of the Ontology for Biomedical Investigations. We plan to maintain and develop OMIABIS as an open-source artifact using Subversion, tagging versions as releases whenever we reach stability.

5 CONCLUSIONS
Even though the domain of biobanking has not been a focus of ontology development and ontology-driven computing, it is obvious that in creating an ontology for data sharing in biobanking one can build on numerous pre-existing ontologies. The difficulty is that the domain of biobanking touches on multiple aspects of biomedical research, such as molecular biology, medicine, law, public health, cryobiology, and physics. Bringing these multiple fields together in one ontological representation is a huge challenge. First steps were taken in the initial development of OMIABIS, which builds upon a multilateral effort of the BBMRI consortium to achieve data integration in biobanking.

ACKNOWLEDGMENTS
The work is partially funded by the Arkansas Biosciences Institute, the major research component of the Arkansas Tobacco Settlement Proceeds Act of 2000, and by award number UL1TR000039 from the National Center for Advancing Translational Sciences (NCATS). The content is solely the responsibility of the authors and does not necessarily represent the official views of NCATS or the National Institutes of Health. We would like to thank the people involved in the European BBMRI preparatory phase, financially supported by the European Commission (grant agreement 212111), and the Swedish Research Council for granting the BBMRI.se project (grant agreement 829-2009-6285). The authors would also like to thank Josh Hanna and three anonymous reviewers for their valuable comments.

REFERENCES
[1] Litton JE, Muilu J, Björklund A, Leinonen A, Pedersen NL. Data modeling and data communication in GenomEUtwin. Twin Res. 2003 Oct;6(5):383-90.
[2] Muilu J, Peltonen L, Litton JE. The federated database--a basis for biobank-based post-genome studies, integrating phenome and genome data from 600,000 twin pairs in Europe. Eur J Hum Genet. 2007 Jul;15(7):718-23.
[3] Yuille M, van Ommen GJ, Bréchot C, Cambon-Thomsen A, Dagher G, Landegren U, et al. Biobanking for Europe. Brief Bioinform. 2008 Jan;9(1):14-24.
[4] Bodenreider O, Stevens R. Bio-Ontologies: current trends and future directions. Brief Bioinform. 2006;7(3):256-274.
[5] Smith B, Brochhausen M. Putting biomedical ontologies to work. In: Blobel B, Pharow P, Nerlich M, editors. eHealth: Combining Health Telematics, Telemedicine, Biomedical Engineering and Bioinformatics to the Edge – Global Experts Summit Textbook. Amsterdam: IOS Press; 2008:135-40.
[6] Gainer V, Hackett K, Mendis M, Kuttan R, Pan W, Phillips L, Chueh H, Murphy SN. Using the i2b2 Hive for clinical discovery: an example. AMIA Annu Symp Proc. 2007:959. PMID:18694059.
[7] Mendis M, Wattanasin N, Kuttan R, Pan W, Hackett K, Gainer V, Chueh H, Murphy SN. Integration of Hive and Cell software in the i2b2 architecture. AMIA Annu Symp Proc. 2007:1048. PMID:18694146.
[8] London JW, Chatterjee D. Using the Semantically Interoperable Biospecimen Repository Application, caTissue: End User Deployment Lessons Learned. IEEE International Conference on BioInformatics and BioEngineering (BIBE), 2010.
[9] W3C [Internet]. Cambridge (MA), Sophia Antipolis, Tokyo: W3C; c2012 [cited 2012 Mar 9]. OWL 2 Web Ontology Language Document Overview [about 8 screens]. Available from: http://www.w3.org/TR/owl2-overview.
[10] Spear AD. Ontology for the Twenty First Century: An Introduction with Recommendations [Internet]. Saarbrücken; 2006 [cited 2012 Mar 9]. Available from: http://www.ifomis.org/bfo/documents/manual.pdf.
[11] The Open Biological and Biomedical Ontologies [Internet]. Berkeley (CA): Berkeley BOP [cited 2012 Mar 9]. OBO Foundry Principles [about 1 screen]. Available from: http://obofoundry.org/crit.shtml.
[12] Smith B, Ashburner M, Rosse C, Bard J, Bug W, Ceusters W, et al. The OBO Foundry: Coordinated evolution of ontologies to support biomedical data integration. Nature Biotechnology 2007, 25(11):1251-5.
[13] Courtot M, Gibson F, Lister A, Malone J, Schober D, Brinkman R, Ruttenberg A. MIREOT: the Minimum Information to Reference an External Ontology Term. Available from Nature Precedings <http://dx.doi.org/10.1038/npre.2009.3576.1> (2009).
[14] Deutscher Ethikrat. Human biobanks for research [Internet]. Berlin: 2010 [cited 2012 Aug 02]. Available from: http://www.ethikrat.org/themen/dateien/pdf/stellungnahmehumanbiobanken-fuer-die-forschung.pdf

Enrichment analysis applied to disease prognosis
Catia M Machado1,2*, Ana T Freitas2 and Francisco M Couto1
1 LaSIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal
2 Instituto de Engenharia de Sistemas e Computadores/Instituto Superior Técnico, Lisboa, Portugal
* To whom correspondence should be addressed.

ABSTRACT
Enrichment analysis is normally used to identify relevant biological features that can be used to describe a set of genes under analysis that, for example, share a common expression profile. In this article we propose the exploitation of enrichment analysis for a different purpose: the evaluation of a disease prognosis. With this application of enrichment analysis we expect to identify clinical and biological features that best differentiate between patients that suffered a specific disease event and those that did not. The features thus identified will be used to create patient profiles, which will in turn be evaluated through similarity and supervised classification approaches to predict the occurrence of the event. This article presents the enrichment analysis methodology proposed for a prognosis study, in which we use the disease hypertrophic cardiomyopathy and its most severe manifestation, sudden cardiac death, as a case study.

1 INTRODUCTION
Enrichment analysis is normally used for the functional analysis of large lists of genes identified with high-throughput technologies such as expression microarrays. It exploits the use of statistical methods over ontological gene annotations to identify biological features that are represented in the gene set under analysis more than would be expected by chance. Such biological features are said to be enriched, or overrepresented, and are then used to formulate a biological interpretation of the gene set. The ontology most commonly used in these analyses is the Gene Ontology (Ashburner et al. 2000, Robinson and Bauer 2011, Zhang et al.
2010), although other resources such as MeSH and KEGG are also explored (Leong and Kipling 2009). Strategies based on multiple vocabularies have also been developed, namely in pharmacogenomics, including the Human Disease Ontology and the Pharmacogenomics Knowledge Base (Hoehndorf et al. 2012). LePendu et al. propose a method to generate annotations when using vocabularies other than the Gene Ontology, testing its feasibility with the Disease Ontology (LePendu et al. 2011). In terms of statistical methods, the most commonly used is Fisher's exact test (Robinson and Bauer 2011, Huang et al. 2009), with more recent implementations also using Bayesian techniques (Bauer et al. 2010). Enrichment analyses are normally divided into three categories: Singular Enrichment Analysis (SEA), Gene Set Enrichment Analysis (GSEA) and Modular Enrichment Analysis (MEA). SEA works with a user-selected gene set and iteratively tests the enrichment of each individual ontology concept in a linear mode. GSEA also evaluates the enrichment of ontology concepts individually, but considering all the genes in the experiment and not just a user-selected gene set. MEA works with a user-selected gene set, but incorporates into the analysis the relationships between concepts represented in the ontologies, thus evolving from a term-centric approach to a biological module-centric approach (Huang et al. 2009). Several tools have been developed that implement one or more of these approaches. Examples of these tools are Onto-Express (Khatri et al. 2002), GSEA (Subramanian et al. 2005), and GOToolBox (Martin et al. 2004) (a detailed list of tools was collected by Huang et al. 2009). In this work we propose to adapt enrichment analysis to develop a disease prognosis methodology, with the goal of predicting whether specific events may or may not occur in a given patient. The enrichment analysis will be applied to identify the set of clinical and genetic features that might assist us in differentiating the patients for whom the event occurred from the patients for whom it did not. The identified features will then be used to create profiles for the individual patients. In order to differentiate between the two sets of patients, the profiles will be subjected to an evaluation step, in which we will explore a similarity and a classification approach. In the similarity approach, different semantic similarity measures (Pesquita et al. 2009) and a relatedness measure (Ferreira and Couto 2011) will be tested to compare the profiles, followed by machine learning algorithms such as clustering and nearest neighbors. In the classification approach, the patient profiles will be analyzed with supervised classification algorithms such as random forests (Breiman 2001) and Bayesian networks (Berner 2007) (see Fig. 1). The datasets to be used in the implementation of this methodology were collected by biomedical experts in the context of medical practice, and are thus characterized by a small number of clinical features and a high number of missing values, among other aspects. With this work, our purpose is to evaluate whether the application of an enrichment analysis to this type of dataset can result in the extraction of relevant knowledge from controlled vocabularies to improve the quality of the dataset and, consequently, the quality of the predictions made from it. As a case study we will consider the disease hypertrophic cardiomyopathy (HCM).
This is a genetic disease that is the most frequent cause of sudden cardiac death (SCD) among apparently healthy young people and athletes (Maron et al. 2009, Alcalai et al. 2008). It is characterized by a variable clinical presentation and onset, and there are approximately 900 mutations in more than 30 genes currently known to be associated with it (Harvard Sarcomere Mutation Database). Due to these characteristics, HCM is very difficult to diagnose. The prognosis is by no means easier, since the severity of the disease varies even between direct relatives. It has been observed that the presence of a given mutation can correspond to a benign manifestation in one individual and result in SCD in another (Maron et al. 2009, Alcalai et al. 2008).

Fig. 1. Schematic representation of the prognosis methodology. The methodology is composed of two units: the first (left side) receives as input data from patients mapped to biomedical ontologies (or controlled vocabularies in general). It will apply an enrichment analysis to identify a list of ontology terms considered to be enriched, which will be used to create profiles for the patients. These profiles will then be subjected to an evaluation step (the second unit, on the right side) that will result in the evaluation of the prognosis for individual patients. For the implementation of the second unit, we will explore a similarity and a classification approach.

Due to the importance of the prognosis of HCM in terms of SCD, this will be the event analyzed in our present study. This work is currently under development, and in the rest of the article the focus will be on our proposed application of enrichment analysis to disease prognosis. In the following sections we present the dataset and the methodology. In the methodology section we begin by drawing a parallel between the application of this analysis in the context of gene expression data analysis and in the context of the prognosis methodology. Finally, we present how the enrichment analysis will be conducted with data from HCM patients, and how the patient profiles will be created from the results obtained.

2 DATASET
The data necessary for the diagnosis and the prognosis of HCM has been represented in a semantic data model, with mappings established between the concepts in the model and four controlled vocabularies: the National Cancer Institute Thesaurus (NCIt) (version 10.03) (Sioutos et al. 2007), the Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT) (version 2010_01_31) (SNOMED), the Gene Regulation Ontology (version 0.5, released on 04_20_2010) (Beisswanger et al. 2008) and the Sequence Ontology (released on 11_22_2011) (Eilbeck et al. 2005). A total of 85.8% of the clinical concepts represented in the model was mapped either to NCIt or SNOMED CT, in identical proportion (42.9%). Table 1 contains all the clinical features to be used in the present work. With the exception of two of these features, Sporadic and Hypertrophy morphology, they are represented in the semantic model and have an established mapping to NCIt or SNOMED CT.

Table 1. Clinical features considered in the enrichment analysis and their possible values.
Cardioverter defibrillator: -1; 1*
Non-obstructive HCM: -1; 1*
Obstructive HCM: -1; 1*
Resuscitated sudden death: -1; 1*
Sudden death: -1; 1*
Non-sudden death: -1; 1*
Sudden death family history: -1; 1*
Familial: -1; 1*
Sporadic: -1; 1*
Blood pressure: normal; hypertension; hypotension
Gender: male; female
Age: 1; 2; 3; 4
Hypertrophy morphology: apical; centric; concentric
* The values -1 and 1 correspond respectively to the absence or the presence of the feature in the patient.

Familial and Sporadic indicate whether the patient has either the familial (hereditary) or the sporadic form of HCM. The age values correspond to the following intervals, in years: (1) [0,20]; (2) ]20,40]; (3) ]40,60]; (4) >60. The genetic features are the mutations associated with the disease, with possible values {-1,1}, i.e. absence or presence of the mutation in the genome of the patient. The genes in which each mutation occurs are currently being mapped to the Gene Ontology. Both clinical and genetic features have been previously collected for 80 patients from Portuguese hospitals and molecular biology research laboratories, for the evaluation of associations between genetic and clinical factors. The clinical features presented in Table 1 are considered by the medical experts as the most relevant for the diagnosis and the prognosis of HCM, and were thus the only ones provided for our present study. Table 2 shows the percentage of patients that have a known value for each of the clinical features and for the total of 569 mutations tested.

Table 2. Percentage of patients that have a known value for each of the clinical features and for the total number of mutations.
Cardioverter defibrillator: 96%
Non-obstructive HCM: 36%
Obstructive HCM: 36%
Resuscitated sudden death: 96%
Sudden death: 100%
Non-sudden death: 100%
Sudden death family history: 37%
Familial: 96%
Sporadic: 96%
Blood pressure: 39%
Gender: 96%
Age: 60%
Hypertrophy morphology: 96%
Mutations: 76%

3 METHODOLOGY
The first step in any enrichment analysis is the definition of the list of entities to be analyzed. Considering the case of gene expression analysis, the complete list of genes under analysis is called the population set. As referred to in the Introduction, GSEA receives this list as input. However, both SEA and MEA require two sets of genes as input: a user-selected gene set, which is called the study set and is a subset of the population set; and the population set itself. The criterion used to select the study set can be (and normally is) the level of expression of the genes in the biological setting under analysis, meaning that the study set will be the set of genes that are considered to be over- and/or under-expressed. The evaluation of the existence of enriched ontology terms is then made for the study set with respect to the entire population set. This means that for an annotated term to be considered as enriched, its annotation rate has to be higher in the study set than in the population set. Considering the application of the enrichment analysis to the prognosis of disease-related events, the population set is the complete list of patients with the disease. Since we are interested in obtaining a list of enriched ontology terms both for the set of patients for whom the event occurred and for the set of patients for whom it did not, each set will be in turn considered the study set. Fig. 2 shows an exemplificative representation of the population and study sets in a gene expression experiment, and their counterparts in the prognosis analysis.

Fig. 2. Population set and study set in (A) a gene expression analysis and (B) the prognosis of disease-related events. In this example, the population set in A is composed of 6 genes, g1 to g6, and the study set of 2 genes, g3 and g5. In accordance, the population and study sets in B are composed of 6 and 2 patients, respectively.

3.1 Definition of patient profiles
Our aim is to define the patient profiles based on the results of individual enrichment analyses performed with different controlled vocabularies. In order to assess the feasibility of this methodology, we will begin by performing analyses with the Gene Ontology and the NCIt. Considering our case study of SCD occurrence in HCM patients, we intend to evaluate the existence of ontology terms that can assist us in separating patients with SCD from patients without SCD. When performing the analysis with the Gene Ontology, the terms whose enrichment will be evaluated depend on the mutations the patients have. Firstly, the list of mutations that the patients in the study set have (e.g. patients with SCD, with mutation value = 1) is compiled; secondly, the list of non-redundant mutated genes is retrieved from the list of mutations; finally, the list of Gene Ontology terms used to annotate the mutated genes is retrieved. The terms annotated to the patients in the rest of the population are retrieved in the same manner. The frequency of occurrence of the annotations is then calculated based on the patients, i.e., how many patients in the study set and the population set are annotated with the term. For each term, a patient can only be counted once, even if he/she has more than one mutation through which the term can be identified. When performing the analysis with NCIt, the terms whose enrichment will be evaluated depend on the values of the clinical features. The features with possible values {-1,1} will be considered if they have a value equal to 1 (the feature thus being present in the patient); the categorical features will all be considered, except when there are no known values for any of the patients in the set. The terms annotated to the features are retrieved based on the mappings already defined between them and the NCIt. The following two features exemplify the procedure for boolean and categorical variables, respectively:
• Non-sudden death: when value = 1, retrieve and use the term Non_Sudden_Cardiac_Death (and its parent terms).
• Blood pressure: when value = hypertension, retrieve and use the term Hypertension (and its parent terms).
The frequency of occurrence of the annotations is calculated as before, i.e., how many patients in the study set and the population set are annotated with the term. We will test both the SEA and MEA approaches. Since GSEA produces a list of enriched terms for the entire set of entities, it is not as interesting for our study as the other two. The lists of enriched terms that result from the analysis with each controlled vocabulary will be compiled and used as a template profile for the respective set of patients (e.g. with SCD). The individual profiles will be defined as follows: for each patient and each ontology term, it is checked whether the patient is annotated with the term; if so, a pair variable/term is created for that patient.
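The per-term counting and the construction of variable/term pairs just described can be sketched in a few lines of code. The fragment below is illustrative only and is not the authors' implementation: it assumes that patient annotations are already available as term sets, scores each term with Fisher's exact test (the statistic named in the Introduction) via scipy, counting each patient at most once per term, and then emits the pairs that make up a patient profile. All function and variable names, as well as the toy identifiers, are hypothetical.

```python
# Illustrative sketch (not the authors' tool) of per-term enrichment and
# profile construction for the patient-based study/population sets.
from scipy.stats import fisher_exact

def enriched_terms(study, population, annotations, alpha=0.05):
    """Return (term, p-value) pairs for terms over-represented in the study set.

    study, population -- sets of patient identifiers (study is a subset)
    annotations       -- dict: patient id -> set of ontology term ids
    """
    terms = set().union(*(annotations[p] for p in population))
    results = []
    for term in terms:
        in_study = sum(term in annotations[p] for p in study)        # each patient counted once
        in_rest = sum(term in annotations[p] for p in population - study)
        table = [[in_study, len(study) - in_study],
                 [in_rest, len(population - study) - in_rest]]
        _, pvalue = fisher_exact(table, alternative="greater")
        if pvalue < alpha:
            results.append((term, pvalue))
    return sorted(results, key=lambda x: x[1])

def patient_profile(patient, enriched, annotations):
    """Build the profile as variable/term pairs restricted to enriched terms."""
    return [("annotated_with", term) for term, _ in enriched
            if term in annotations[patient]]

# Toy example: 6 patients, 2 of them in the study set (e.g. with SCD).
annotations = {
    "p1": {"GO:0003015"}, "p2": {"NCIt:Hypertension"},
    "p3": {"GO:0003015", "GO:0006936"}, "p4": {"GO:0006936"},
    "p5": {"GO:0003015", "GO:0006936"}, "p6": {"NCIt:Hypertension"},
}
population = set(annotations)
study = {"p3", "p5"}
enriched = enriched_terms(study, population, annotations, alpha=1.0)
print(patient_profile("p3", enriched, annotations))
```

The simple one-sided test shown here stands in for the SEA setting; the MEA variant additionally takes the relationships between ontology terms into account, as discussed in the Introduction.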
The complete set of variable/term pairs thus obtained is the profile for that specific patient. The variable/term pairs will substitute the original variables in the second unit of the prognosis methodology (Fig. 1). In this study we include in the group of patients with SCD both patients that died due to a sudden cardiac arrest and patients that suffered at least one resuscitated sudden cardiac arrest (and who can be either alive or dead). The group of patients without SCD includes all the other patients.

4 DISCUSSION AND CONCLUSIONS
In this article we present a novel prognosis prediction methodology based on an enrichment analysis. This type of analysis is normally used in contexts such as gene expression analysis for the identification of functional annotations that might be used to explain the differences in expression. Here we propose to use enrichment analysis for the identification of ontology terms that might be used to explain the differences between the group of patients for whom a given disease event occurred and the group of patients for whom it did not occur. The ontology terms considered to be enriched will assist in the creation of profiles for individual patients. These profiles will then be used to evaluate, for new patients, whether the event might occur or not. An important aspect of the present analysis is the dataset: it contains data from patients, and was collected in the context of their medical evaluation. As such, it reflects two important aspects of the nature of clinical records: only the information deemed relevant by the medical experts is present, and not all of the information is available for all of the patients. Our interest is precisely in evaluating whether it is feasible to extract relevant knowledge from controlled vocabularies that can enrich the dataset, and thus allow its exploitation with data mining algorithms. In a first approach, we will test only two vocabularies: the Gene Ontology and the NCIt. Although this means that some of the features will not be considered due to the absence of annotations, we expect to be able to evaluate the applicability of the methodology. The data under analysis in this study has been provided by several Portuguese institutions, including hospitals and molecular biology research laboratories.

ACKNOWLEDGEMENTS
This work was supported by the FCT through the Multiannual Funding Program, the doctoral grant SFRH/BD/65257/2009 and the SOMER project (PTDC/EIA-EIA/119119/2010). The authors would like to thank Alexandra R. Fernandes, Susana Santos and Dr. Nuno Cardim for collecting and providing the dataset.

REFERENCES
Ashburner,M., Ball,C.A., Blake,J.A., Botstein,D., Butler,H., Cherry,J.M., Davis,A.P., Dolinski,K., Dwight,S.S., Eppig,J.T., Harris,M.A., Hill,D.P., Issel-Tarver,L., Kasarskis,A., Lewis,S., Matese,J.C., Richardson,J.E., Ringwald,M., Rubin,G.M. and Sherlock,G. (2000) Gene Ontology: tool for the unification of biology. Nat. Genet., 25, 25–29.
Robinson,P.N. and Bauer,S. (2011) Introduction to Bio-Ontologies. (Chapter 8) CRC Press Taylor & Francis Group.
Zhang,S., Cao,J., Kong,Y.M. and Scheuermann,R.H. (2010) GO-Bayes: Gene Ontology-based overrepresentation analysis using a Bayesian approach. Bioinformatics, 26(7), 905-911.
Leong,H.S. and Kipling,D. (2009) Text-based over-representation analysis of microarray gene lists with annotation bias. Nucleic Acids Res. 37(11), e79.
Hoehndorf,R., Dumontier,M. and Gkoutos,G.V. (2012) Identifying aberrant pathways through integrated analysis of knowledge in pharmacogenomics.
Bioinformatics, 28(16), 2169-75.
LePendu,P., Musen,M.A., Shah,N.H. (2011) Enabling enrichment analysis with the Human Disease Ontology. J. Biomed. Inform., 44(Suppl 1), S31-8.
Huang,D.W., Sherman,B.T. and Lempicki,R.A. (2009) Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res., 37(1), 1-13.
Bauer,S., Gagneur,J. and Robinson,P.N. (2010) GOing Bayesian: model-based gene set analysis of genome-scale data. Nucleic Acids Res., 38(11), 3523-3532.
Khatri,P., Draghici,S., Ostermeier,G.C. and Krawetz,S.A. (2002) Profiling gene expression using onto-express. Genomics, 79, 266–270.
Subramanian,A., Tamayo,P., Mootha,V.K., Mukherjee,S., Ebert,B.L., Gillette,M.A., Paulovich,A., Pomeroy,S.L., Golub,T.R., Lander,E.S. and Mesirov,J.P. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA, 102, 15545–15550.
Martin,D., Brun,C., Remy,E., Mouren,P., Thieffry,D. and Jacq,B. (2004) GOToolBox: functional analysis of gene datasets based on Gene Ontology. Genome Biol., 5, R101.
Pesquita,C., Faria,D., Falcão,A.O., Lord,P. and Couto,F.M. (2009) Semantic Similarity in Biomedical Ontologies. PLoS Comput. Biol., 5(7), e1000443.
Ferreira,J. and Couto,F. (2011) Generic semantic relatedness measure for biomedical ontologies. International Conference on Biomedical Ontologies (ICBO), 2011.
Breiman,L. (2001) Random forests. Mach. Learn., 45, 5–32.
Berner,E.S. (Editor) (2007) Clinical decision support systems: theory and practice. (Chapter 3) Health Informatics Series, 2nd Edition.
Maron,B.J., Maron,M.S., Wigle,E.D. and Braunwald,E. (2009) The 50-Year History, Controversy, and Clinical Implications of Left Ventricular Outflow Tract Obstruction in Hypertrophic Cardiomyopathy: from Idiopathic Hypertrophic Subaortic Stenosis to Hypertrophic Cardiomyopathy. J. Am. Coll. Cardiol., 54, 191-200.
Alcalai,R., Seidman,J.G. and Seidman,C.E. (2008) Genetic Basis of Hypertrophic Cardiomyopathy: from Bench to the Clinics. J. Cardiovasc. Electrophysiol., 19, 104-110.
Harvard Sarcomere Mutation Database: http://genepath.med.harvard.edu/~seidman/cg3/
Sioutos,N., Coronado,S., Haber,M.W., Hartel,F.W., Shaiu,W.L. and Wright,L.W. (2007) NCI Thesaurus: a Semantic Model Integrating Cancer-Related Clinical and Molecular Information. J. Biomed. Inform., 40, 30-43.
Systematized Nomenclature of Medicine-Clinical Terms (SNOMED), http://www.ihtsdo.org/snomed-ct/
Beisswanger,E., Lee,V., Kim,J., Rebholz-Schuhmann,D., Splendiani,A., Dameron,O., Schulz,S., Hahn,U. (2008) Gene Regulation Ontology (GRO): Design Principles and Use Cases. Stud. Health Technol. Inform., 136, 9–14.
Eilbeck,K., Lewis,S.E., Mungall,C.J., Yandell,M., Stein,L., Durbin,R., Ashburner,M. (2005) The Sequence Ontology: a Tool for the Unification of Genome Annotations. Genome Biol., 6, R44.
Machado,C.M., Couto,F.M., Fernandes,A.R., Santos,S. and Freitas,A.T. (2012) Toward a translational medicine approach for hypertrophic cardiomyopathy. International Conference on Information Technology in Bio- and Medical Informatics (ITBAM), 2012.
Integration of the Anatomical Therapeutic Chemical Classification System and DrugBank using OWL and text-mining

Samuel Croset 1*, Robert Hoehndorf 2, Dietrich Rebholz-Schuhmann 1
1 European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SD, UK
2 Department of Genetics, University of Cambridge, Downing Street, Cambridge, CB2 3EH, UK
*Email: croset@ebi.ac.uk

ABSTRACT
The field of drug discovery has in recent years experienced a growth in the number of computational methods used to predict new biological actions for chemical compounds. In order to evaluate predictions and to gain insights into the potential usage of drugs, the Anatomical Therapeutic Chemical Classification System (ATC) serves as an internationally accepted gold standard. However, this classification was not initially developed to accomplish this task and therefore lacks connections with other biological databases. In order to maximize its interoperability with DrugBank, a major provider of drug knowledge, we increased the number of current mappings to the ATC by over 13%. We then converted the classification into a Web Ontology Language (OWL) representation, in order to allow queries and exploration of its content via a reasoning engine. The work is accessible via a web interface (www.ebi.ac.uk/Rebholz-srv/atc) to assist future drug discovery and repurposing initiatives.

1 INTRODUCTION
Since the early seventies, the Anatomical Therapeutic Chemical Classification System (ATC) has provided a standard and international taxonomy used to classify and compare therapeutic compounds (Ivanova et al., 2010). Among others, the classification helps to generate metrics around the comparison of drug usage between countries or health care settings. The ATC features five levels of taxonomic hierarchy, from broader concepts to more specific ones. The first level indicates the anatomical location at which the drug acts. The second to fourth levels describe the pharmacological or chemical groups to which the drug belongs. The fifth level represents the chemical substance itself. As the therapeutic categories of the classified drugs are available from the ATC, this classification is also used as a gold standard for drug indications and mode of action by databases such as DrugBank (Knox et al., 2011) or PharmGKB (Hewett et al., 2002). Even though this usage of the ATC differs from the original one, it seems likely that it will increase in importance as a means to analyze structured and normalized drug pharmacological information. For instance, computational drug discovery methods and drug repurposing attempts already evaluate the predictions made against the ATC (Iorio et al., 2010; Campillos et al., 2008; Gottlieb et al., 2011; Tatonetti et al., 2012). In order to improve the classification for this task and to enable its interoperability with other types of biological data, it is necessary to map the therapeutic compounds described within the ATC to entries of drug-related databases. Among these, DrugBank provides a central hub and broad connectivity with other biological data providers such as Uniprot or Chembl (Knox et al., 2011). The chemical compounds present within DrugBank are already partially curated and linked to ATC entries, but some drugs are not annotated or they are mapped to ATC categories that have been obsoleted. In order to improve the interoperability between the ATC and DrugBank, we have used a text-mining tool to generate new mappings not previously referenced.
Moreover, the taxonomic structure of the ATC enables its representation with the Web Ontology Language (OWL), which facilitates current and future integration and allows querying of the data in an automated way. Our work extends the version already available (Hoehndorf et al., 2012) by providing a querying and browsing application as well as a thorough mapping to DrugBank. The outcome of the work is compatible with standard Semantic Web tools and libraries and should facilitate future evaluations and analyses done around drug pharmacology.

2 MATERIALS AND METHODS
The integration of DrugBank with the ATC is decomposed into three steps: conversion of the repositories into Java objects, analysis of the current interoperability between the resources, and conversion of the integrated content to an OWL representation. The source code is freely available at https://github.com/loopasam/ATCExplorer.

* To whom correspondence should be addressed.

2.1 Parsing of the original data
DrugBank has been downloaded from http://www.drugbank.ca/downloads (June 10, 2012), parsed and converted into Java objects for easier handling. The ATC has been purchased in ASCII format (January 2012) from http://www.whocc.no/atc_ddd_publications/order/ and also converted into Java objects.

2.2 Mapping of DrugBank compounds to ATC therapeutic categories
A dictionary for DrugBank compounds has been created and contains information about the drugs' synonyms. The synonyms are strings of characters or numerical codes describing the different brand names of the drug and the CAS numbers, as appearing in DrugBank. The labels of the ATC entries describing drugs (fifth level) have been compared to each entry of the dictionary, using the Java library LingPipe 4.1.0. A mapping was considered to be true when the string of the dictionary matched exactly the string of the label, without considering capital letters. In case of a true mapping, the DrugBank identifier of the synonym was mapped to the corresponding ATC therapeutic category.
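The exact-matching step can be pictured with a short sketch. The code below is a simplified plain-Java rendering of the idea (it does not use the LingPipe machinery employed in the paper, and the identifiers shown are illustrative): every DrugBank synonym, brand name and CAS number is normalised to lower case and indexed, and each fifth-level ATC label that matches an index entry exactly yields a candidate mapping.

import java.util.*;

public class AtcDrugBankMapper {

    /** normalised synonym, brand name or CAS number -> DrugBank identifier */
    private final Map<String, String> dictionary = new HashMap<>();

    public void addSynonym(String drugBankId, String synonym) {
        dictionary.put(normalise(synonym), drugBankId);
    }

    /** Returns the DrugBank identifier mapped to an ATC fifth-level label, or null if no exact match exists. */
    public String map(String atcLabel) {
        return dictionary.get(normalise(atcLabel));
    }

    /** Exact matching that ignores capital letters, as described in section 2.2. */
    private static String normalise(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        AtcDrugBankMapper mapper = new AtcDrugBankMapper();
        mapper.addSynonym("DB00001", "Lepirudin");       // example drug from the paper
        mapper.addSynonym("DB00001", "138068-37-8");     // CAS-number synonym (illustrative)
        System.out.println(mapper.map("LEPIRUDIN"));     // prints DB00001
    }
}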
2.3 OWL representation
The categories of the ATC and the DrugBank entries have been converted into OWL classes and merged together. The taxonomic structure of the classification has been preserved using SubClassOf axiom assertions between the children and the parent categories. Aside from the parenthood assertions, no other OWL constructs are necessary to capture the original ATC hierarchy. The mappings between DrugBank compounds and their respective ATC therapeutic categories have been captured with OWL SubClassOf axioms, e.g.:

atc:B01AE02 (label: Lepirudin) owl:SubClassOf drugbank:DB00001 (label: Lepirudin)

The OWL conversion and the question answering over the integrated dataset have been done via the OWL API (version 3.2.4) and the Elk reasoner (version 0.2.0 for the OWL API). The data integration and reasoning have been performed on a laptop computer featuring an Intel Core i7 CPU and 8 GB of memory. The web application is built on top of the Play! framework (version 1.2.4) and is freely available at www.ebi.ac.uk/Rebholz-srv/atc.

3 RESULTS
The ATC and DrugBank both refer to therapeutic compounds using their own identification system. However, some of these drugs are structurally identical in the two datasets, and a mapping between the two resources enables a better interoperability. Some DrugBank compounds have already been mapped to ATC categories by DrugBank's curation team, but some erroneous or missing information is still present within the repository. We have analyzed this discrepancy by text-mining and suggested a list of new mappings not previously referenced. The integrated datasets have then been converted into an OWL representation, thereby enabling semantic querying.

3.1 Interoperability between the ATC and DrugBank
We have first compared the distribution of mappings already described in DrugBank with the text-mining predictions. In the context of this analysis, we define 'mapping' as follows: a mapping is an association between a DrugBank entry and an ATC category. DrugBank compounds can be mapped to more than one ATC category, and an ATC category can encompass more than one drug. Figure 1 summarizes the distributions. Most of the mappings already present in DrugBank have been confirmed by text-mining (1572 mappings, Figure 1 – Set B), showing an agreement between the two methodologies. Some of the mappings are difficult to identify by text-mining only, as they correspond to ATC categories with generic or broad names, such as 'various' (atc:M02AX10), or present some lexical variance. Such mappings belong to Set A of Figure 1. We found 272 new mappings with text-mining (Set C) not previously indexed by DrugBank. No evaluation has been done over the text-mining predictions, but the list of new mappings has been handed over to the DrugBank curation team for inclusion in the database after internal manual verification. Some DrugBank compounds (8 – Set D) are linked to ATC categories that no longer exist. This erroneous information will be removed from DrugBank by its curation team. This straightforward text-mining approach (exact matching on words) improved the interoperability between the two datasets by increasing the number of asserted relations by 13%. The new information will be added to the main DrugBank database for the community to profit.

Figure 1: Comparison of mappings between the ATC and DrugBank compounds. Set A shows the curated mappings only found in DrugBank. Set C represents the new mappings found by text-mining. Set B is the intersection of sets A and C, namely the mappings that are already present within DrugBank and confirmed by text-mining. Set D shows the DrugBank mappings to obsolete ATC categories.

3.2 Question-answering over the integrated data
The taxonomic structure of the ATC can support a representation in OWL in order to benefit from Semantic Web tools, particularly automated reasoning engines. Reasoners can classify and help to check for consistency among the integrated domain knowledge. They are also used for question-answering over ontologies. In order to achieve such a task with our integrated dataset, we have converted the 5717 classes of the ATC, with their corresponding 1503 DrugBank mappings, into OWL classes (giving a total of 7220 classes). The OWL ATC representation contains only subclass axioms; it is therefore within the OWL 2 EL profile, permitting the use of the Elk reasoner for time-efficient question-answering. Elk classifies the ontology in 3.01 seconds and can reclassify the ontology in less than a second with our settings (data not shown). This feature enables its use as a backend engine on a web server, for answering live queries from users.
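The following sketch outlines how such an integrated ontology can be assembled and queried with the OWL API and the ELK reasoner, the libraries named above. It is only an illustration of the approach: the IRIs, the namespace and the auxiliary query class are assumptions made for the example and are not the identifiers used in the authors' ATCExplorer code.

import org.semanticweb.elk.owlapi.ElkReasonerFactory;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;
import org.semanticweb.owlapi.reasoner.*;

public class AtcOwlSketch {
    public static void main(String[] args) throws Exception {
        OWLOntologyManager man = OWLManager.createOWLOntologyManager();
        OWLDataFactory df = man.getOWLDataFactory();
        OWLOntology ont = man.createOntology(IRI.create("http://example.org/atc-drugbank"));
        String ns = "http://example.org/atc-drugbank#";   // illustrative namespace

        // One OWL class per ATC category and per DrugBank compound.
        OWLClass atcB01AE02 = df.getOWLClass(IRI.create(ns + "atc_B01AE02"));
        OWLClass atcB01AE   = df.getOWLClass(IRI.create(ns + "atc_B01AE"));
        OWLClass db00001    = df.getOWLClass(IRI.create(ns + "drugbank_DB00001"));
        OWLClass drugBankCompound = df.getOWLClass(IRI.create(ns + "DrugBankCompound"));

        // Taxonomic structure of the ATC: child SubClassOf parent.
        man.addAxiom(ont, df.getOWLSubClassOfAxiom(atcB01AE02, atcB01AE));
        // Mapping of an ATC fifth-level category to its DrugBank compound.
        man.addAxiom(ont, df.getOWLSubClassOfAxiom(atcB01AE02, db00001));
        man.addAxiom(ont, df.getOWLSubClassOfAxiom(db00001, drugBankCompound));

        // Classify with ELK; subclass-only axioms stay within the OWL 2 EL profile.
        OWLReasoner reasoner = new ElkReasonerFactory().createReasoner(ont);
        reasoner.precomputeInferences(InferenceType.CLASS_HIERARCHY);

        // Question answering: name the class expression "B01AE and DrugBankCompound"
        // with an auxiliary query class and ask for its subclasses.
        OWLClass query = df.getOWLClass(IRI.create(ns + "Query"));
        man.addAxiom(ont, df.getOWLEquivalentClassesAxiom(query,
                df.getOWLObjectIntersectionOf(atcB01AE, drugBankCompound)));
        reasoner.flush();
        for (Node<OWLClass> sub : reasoner.getSubClasses(query, true)) {
            System.out.println(sub.getRepresentativeElement());   // prints atc_B01AE02
        }
        reasoner.dispose();
    }
}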
A web form accepts OWL expressions in Manchester syntax and retrieves the sub- and superclasses of the expression. Figure 2 shows, for instance, the subclasses satisfying the OWL expression "G01AA and DrugBankCompound" as it appears on the web application.

4 DISCUSSION
The modeling of domain knowledge in OWL is constrained by the semantics and the axioms of the language. As the ATC was originally designed as a classification to capture statistics about drug metrics, the intended relations between categories and sub-categories are not formally described. However, the taxonomic structure of the ATC can be interpreted as mathematical sets and subsets, which can themselves be transcribed as OWL classes and subclass relationships. The native semantics of the relations is not yet perfectly captured by OWL, but the representation nevertheless facilitates an intuitive formal representation and serves well the purpose of exploring and querying the ATC. DrugBank OWL classes have been asserted as superclasses of their corresponding ATC classes, e.g.:

atc:B01AE02 (label: Lepirudin) owl:SubClassOf drugbank:DB00001 (label: Lepirudin)

An OWL equivalent class axiom seems at first more appropriate to capture this relationship. However, the ATC therapeutic categories capture the usage of the drug with regard to its anatomical effect on the body. Therefore, some drugs have multiple ATC codes and are present several times in the ATC in order to represent the possible different usages of the chemical. For instance, the bio-activity of Betamethasone is described by the ATC codes atc:A07EA04 (label: Betamethasone) and atc:C05AA05 (label: Betamethasone). This compound is also described in DrugBank, but with a unique accession identifier drugbank:DB00443 (label: Betamethasone), representing the compound as a generic artifact, without the usage information attached to it as in the ATC. An OWL equivalent class assertion would wrongly result in an equivalence between the ATC categories atc:A07EA04 and atc:C05AA05. The discrepancy between the different meanings of the two repositories while referring to the same chemical compounds (structure-wise) is solved with the subclass axioms: a compound as described in DrugBank is considered more generic than the same compound as described in the ATC.

Figure 2: Screenshot of the web interface used to explore and query the ATC classification.

The improved interoperability between DrugBank and the ATC paves the way for integrative analyses using the results of recent large-scale studies (Tatonetti et al., 2012; Gottlieb et al., 2011). Indeed, as ATC categories have been adopted as standards for pharmacological effect description, the query interface presented here can facilitate the comparison of predictions made from one dataset versus another. The ATC OWL classification is also useful to evaluate drug-repurposing predictions, which would be defined as a new mapping of a drug (from DrugBank) to an ATC category. For instance, a method could predict that a drug 'A' (referenced in DrugBank) would act as an ATC category 'B'. The currently known effect of the compound 'A' could be retrieved via a reasoning engine over the OWL ATC or via the interface and compared to the one predicted by the method.

5 CONCLUSION
We have improved the interoperability between two central biomedical repositories, DrugBank and the ATC.
The result of the work is likely to profit the community using these resources, as the new and corrected mappings should become available in the native database, thanks to the collaboration of the DrugBank curation team. Our OWL representation of the ATC extends the previously described one by providing integration with DrugBank as well as a visual interface to explore and query the classification. The integrated dataset will also help computational drug discovery methodologies to evaluate and validate their results against the gold standard reference that is the ATC.

REFERENCES
Campillos,M. et al. (2008) Drug target identification using side-effect similarity. Science, 321, 263–266.
Gottlieb,A. et al. (2011) PREDICT: a method for inferring novel drug indications with application to personalized medicine. Molecular Systems Biology, 7, 496.
Hewett,M. et al. (2002) PharmGKB: the Pharmacogenetics Knowledge Base. Nucleic Acids Research, 30, 163–165.
Hoehndorf,R. et al. (2012) Identifying aberrant pathways through integrated analysis of knowledge in pharmacogenomics. Bioinformatics, 1–7.
Iorio,F. et al. (2010) Discovery of drug mode of action and drug repositioning from transcriptional responses. Proceedings of the National Academy of Sciences of the United States of America, 107, 14621–14626.
Ivanova,E.P. et al. (2010) Guidelines for ATC classification and DDD assignment. Vegetatio, 70, 93–103.
Knox,C. et al. (2011) DrugBank 3.0: a comprehensive resource for "omics" research on drugs. Nucleic Acids Research, 39, D1035–D1041.
Tatonetti,N.P. et al. (2012) Data-Driven Prediction of Drug Effects and Interactions. Science Translational Medicine, 4, 125ra31.

Using Ontologies to Study Cell Transitions

Ludger Jansen1,2, Georg Fuellen3, Ulf Leser4, Andreas Kurtz5,6
1 Philosophisches Seminar, RWTH Aachen University, Germany; 2 Institute of Philosophy, University of Rostock, Germany; 3 IBIMA, University of Rostock, Germany; 4 Institute for Computer Science, Humboldt Universität zu Berlin, Germany; 5 BCRT, Charité Berlin, Germany; 6 Seoul National University, South Korea

ABSTRACT
Cell transitions, be it reprogramming of somatic cells to pluripotency or trans-differentiation between cells, are a hot topic in current biomedical research. The large corpus of recent literature in this area is underused, as results are only represented in natural language, impeding their systematic aggregation and usage. Scientific understanding of the complex molecular mechanisms underlying cell transitions could be improved by making essential pieces of knowledge available in a formal (and thus computable) manner. We describe the outline of two ontologies, for cell phenotypes and for cellular mechanisms, which enable the representation of data curated from the literature or obtained by bioinformatics analyses for building a knowledge base about phenotypes and mechanisms involved in cellular reprogramming. In particular, we discuss how comprehensive ontologies of cell phenotypes and of changes in mechanisms can be designed using the entity-quality (EQ) model. Our design allows deep insights into the relationship between the continuants (cell phenotypes) and the occurrents (cell mechanism changes) involved in cellular reprogramming. Further, our ontologies allow the application of algorithms for similarity searches in the spaces of cell phenotypes and mechanisms, and, especially, changes of mechanisms during cellular transitions.
1 BACKGROUND
The (artificial) induction of cell transitions has recently attracted a lot of attention. A cell phenotype (or cell type) can be defined by the cell's repertoire of molecules and structural components, together with the specific morphology and function they bring with them. A cell transition is a transition of a cell from one phenotype to another. For example, the phenotype of epithelial cells is distinct from the phenotype of fibroblasts. Programming of cells is the induction of a cell phenotype transition, e.g. from fibroblast to epithelial cell. Reprogramming is the artificially induced transition of a cell to a cell phenotype which it (or its predecessor) had in the past. Potency can be defined as the disposition of a cell to transition naturally into another cell phenotype; pluripotency is the ability of a cell to transition naturally into all cell (pheno)types of the body. Since Takahashi and Yamanaka described cell reprogramming of fibroblasts back to pluripotency (also known as generation of iPS, induced pluripotent stem cells) [1], hundreds of papers have dissected the reprogramming process and the cellular disposition of pluripotency at an ever-increasing resolution, reviewed in, e.g., [2] and [3]. This corpus is currently underused as there is no formal representation of the findings. There already exist several ontologies in the domain of cell biology, e.g. the cell type ontology (CL; cf. [4], [5]). [5] proposed formal definitions for CL classes, referring to properties of cells such as expressed proteins, activated biological processes, or the phenotypic characteristics associated with a cell. The Virtual Physiological Human project (www.ricardo.eu) attempts to provide interoperability between different databases and tools related to human physiology and gene expression; the associated software Phenomeblast (code.google.com/p/phenomeblast) is an ontology-based tool for aligning and comparing phenotypes across species. However, most efforts focus on anatomical features and only rarely address the cell level (cf. [6], [7], [8], [9] and [10]). What is missing is a tool to represent and to compare cellular phenotypes and their dynamics.

2 CELL MECHANISMS AND CHANGES
We distinguish between two types of processes going on in a cell: microscale mechanisms and macroscale changes thereof. Microscale mechanisms are the interactions between molecules going on in a cell at a certain time, while a macroscale change is the transition from one set of microscale mechanisms going on at one point of time to another such set at a later time. In order to transfer ontology-based annotation and search strategies from phenotypes at the anatomical level [11] to the domain of cell phenotypes and mechanism changes, we need to be able to formally describe both (a) cell phenotypes and (b) mechanism changes. Phenotypes are usually described by means of the entity-quality syntax (EQ; [13], [14]). To apply the EQ syntax to the cell level, we outlined two ontologies, an ontology of cell parts (Figure 1) and an ontology of microscale mechanisms (Figure 2), to be used in combination with a small set of standardized modifiers (as 'qualities').

To describe cell phenotypes and the transitions between them, we refer to entities belonging to distinct ontological categories [12]: (1) Independent continuants: cells and their organelles as well as molecules are spatial entities existing as spatial wholes at every time they exist. (2) Dependent continuants: any property of a cell or a molecule, be it a quality or a disposition, also exists as a whole at every time it exists. Any such property is ontologically dependent on its particular bearer: it cannot exist without it. (3) Occurrents: interactions, inhibitions and stimulations as well as transitions are temporally extended processes and do not exist as a whole at any time at which they happen. Cell phenotype data describe continuants (like cellular components and dispositions) as well as occurrents, namely the molecular interactions going on in a cell at a certain time. Cell transition data primarily describe occurrents, namely macroscale changes of microscale mechanisms. Within the EQ framework, we can describe such macroscale changes of microscale mechanisms by pairing terms for microscale mechanisms (as 'entities') with specific change modifiers (as 'qualities'); see Figure 2 for an example.
Figure 1. Outline of an ontology of cell parts and its use to describe cell phenotypes. The figure shows a structure by which cell phenotypes, here epithelial cells, mesenchymal cells and embryonic stem cells (ESC), can be formally represented, using entity terms (shown on the left-hand side) and PATO-analogous quality modifiers (shown on the right-hand side). Terms referring to cells are indicated in yellow, terms relating to structures in red, to ultrastructures in blue, and to molecules in green.

In our framework, a pluripotent cell can be characterized by the disposition of pluripotency, but also by its expression data (about genes, proteins etc.), from which relevant microscale mechanisms can be inferred. A cell transition, e.g. of a fibroblast into a pluripotent cell, can be described by comparing the expression data of both cell phenotypes, which capture macroscale changes in microscale mechanisms. The latter include the start-up of the interactions between genes/proteins relevant for the induction of pluripotency; such a start-up may happen because the cell starts to produce more instances of the molecule types participating in this type of interaction. In our framework, a pluripotent cell realizes dispositions for mechanisms relevant for pluripotency that may be described by a network of interactions. Further, a cell transition from fibroblast to pluripotent cell realizes dispositions for changes in mechanisms. After transition, the cell is characterized by the microscale mechanisms relevant for the pluripotent phenotype.

Our ontologies are designed to be combined with specific modifiers within the EQ framework. As shown on the right-hand side of Figure 1, the ontology of cell phenotypes can be used to collect annotations for cell phenotypes such as fibroblast, epithelial cell and pluripotent stem cells. We can set up annotation profiles of cells, consisting of sets of EQ pairs that describe them. For example, the profile of epithelial cells includes that the genes/proteins Occludin, JAM and Claudins as well as tight junctions (TJs) are 'present', and cell membranes are 'joined'. For this purpose, we use a number of standardized modifiers like 'present', 'absent', 'up' and 'down', which will also be integrated within an ontology like PATO [13]. The ontology of cell mechanism changes, on the other hand, is designed to be combined with modifiers like 'up' and 'down' in order to yield descriptions for macroscale changes of the microscale mechanisms going on within a cell ('up' for start-up; 'down' for shutdown). The right-hand side of Figure 2 features, e.g., the specific changes of the microscale mechanisms relevant for TJ, which are the macroscale changes associated with TJ formation. Within the framework of the EQ syntax, qualities help to describe these changes of the microscale mechanisms. The example hierarchy in Figure 2 reflects our example of the epithelial-mesenchymal transition (EMT) and its reversal (MET, observed during reprogramming). MET can be defined in terms of 'network of mechanisms relevant for epithelial cell' as the entity term and 'up' as the quality modifier; for a more complete description, the 'network of mechanisms relevant for mesenchymal cell' goes 'down' simultaneously.

Figure 2. Outline of an ontology of cell mechanisms and its use to describe cell transitions. The figure shows a structure by which mechanism changes can be formally represented, using entity terms (shown on the left-hand side) and quality modifiers (shown on the right-hand side). The colour code follows the code used in Figure 1: occurrents relevant for cell phenotypes are indicated in yellow, occurrents relevant for ultrastructures in blue, and occurrents directly involving molecules in green. 'Up' and 'down' are intended to indicate relative changes: 'Interaction Occludin-JAM Up' states that there is a development in the cell to feature more interactions of this kind, no matter how many of them there have been before. Hence there is no direct connection between types of such relational changes and their start- and end-states.
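The annotation profiles described above can be pictured as plain sets of EQ pairs. The sketch below (Java; class, field and term strings are illustrative simplifications and are not taken from the authors' ontologies) encodes the epithelial-cell profile from the text and compares two profiles by the fraction of EQ pairs they share, a naive stand-in for the ontology-based similarity measures discussed in the following section.

import java.util.*;

public class EqProfiles {

    /** An EQ pair: an entity term combined with a quality/modifier term. */
    record EQ(String entity, String quality) {}

    /** Annotation profile of epithelial cells, following the example given in the text. */
    static Set<EQ> epithelialCell() {
        return Set.of(
                new EQ("Occludin", "present"),
                new EQ("JAM", "present"),
                new EQ("Claudins", "present"),
                new EQ("tight junction", "present"),
                new EQ("cell membrane", "joined"));
    }

    /** Naive profile similarity: fraction of shared EQ pairs (Jaccard index). */
    static double similarity(Set<EQ> a, Set<EQ> b) {
        Set<EQ> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<EQ> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }
}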
3 DISCUSSION
Both ontologies allow for annotation propagation. In [11], annotations for anatomical entities are propagated up a hierarchy of is_a and part_of relations, such that a parent receives all the annotations of its children. However, given the usual all-some semantics, the mereological hierarchy cannot be used in the same way as in [11] for cell phenotypes and mechanism changes. Therefore, we decided to model mereological relations with has_part instead of part_of. We did this for the following reasons: First, in anatomical hierarchies, parts determine the wholes they belong to. E.g., a finger is always part of a hand. A molecular entity like Occludin, however, can belong to a wide range of cell phenotypes, while a certain phenotype (by definition) has to possess certain molecular entities. As here the whole determines the parts, we need to use the has_part relation. Second, cell mechanisms, as we understand them here, are occurrents, and initial temporal parts can happen without the event being completed. Again, we need to employ the has_part hierarchy, from whole processes to their necessary parts (e.g., from Network_of_mechanisms_relevant_for_TJ to the Interaction_Occludin_JAM). When employing annotation propagation, therefore, as a rule, a whole process will have a higher information content than its necessary parts.
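A minimal sketch of this propagation rule is given below (Java; the has_part edges, term names and annotation profiles are illustrative assumptions, not the authors' data). Annotations made to a whole process are propagated along has_part edges to its necessary parts, after which the information content of each term is estimated from its annotation frequency; rarely annotated wholes then come out with a higher information content than their frequently inherited parts.

import java.util.*;

public class HasPartPropagation {

    /** has_part edges: whole process -> its necessary parts. */
    static Map<String, Set<String>> hasPart = Map.of(
            "Network_of_mechanisms_relevant_for_TJ", Set.of("Interaction_Occludin_JAM"));

    /** Returns the term together with all parts reachable from it via has_part. */
    static Set<String> propagate(String term) {
        Set<String> closure = new LinkedHashSet<>();
        Deque<String> todo = new ArrayDeque<>(List.of(term));
        while (!todo.isEmpty()) {
            String current = todo.pop();
            if (closure.add(current)) {
                todo.addAll(hasPart.getOrDefault(current, Set.of()));
            }
        }
        return closure;
    }

    /** Information content of a term, estimated from its (propagated) frequency across annotation profiles. */
    static double informationContent(String term, List<Set<String>> profiles) {
        long hits = profiles.stream()
                .filter(profile -> profile.stream().anyMatch(t -> propagate(t).contains(term)))
                .count();
        return -Math.log((hits + 1.0) / (profiles.size() + 1.0));   // smoothed relative frequency
    }
}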
The ontologies outlined above enable similarity searches across cell phenotypes and mechanism changes in analogy to [11]. Such searches may compare, e.g., EMT/MET and reprogramming data. In simplified terms, an MET [15] consists in, first, the formation of adherens junctions (AJ) and, second, the formation of tight junctions (TJ). We represent the MET as the start-up of the microscale mechanisms relevant for an epithelial cell, which has as one of its parts TJ formation, which is, in turn, represented as the start-up of the mechanisms relevant for a TJ. This is the inverse of an EMT (which happens in development, metastasis and fibrosis). ExprEssence and related tools ([16], [17], [18], [19]) can be employed for generating annotations about mechanism changes relevant for a certain transition by means of high-throughput data analysis. The more mechanisms are annotated, the better we can estimate how similar biological processes are. Ultimately, any set of cell transitions can be compared (using data coded in EQ syntax) with respect to the underlying mechanisms, demonstrating the power of our approach. Our ontology design principles thus enable a kind of BLAST search in the space of annotations (for mechanisms), with similar goals such as highlighting relationships (between mechanisms, based on basic mechanisms as building blocks), and eventually estimating their evolutionary history.

While our individuation criteria for cell phenotypes are very fine-grained (even a tiny change in the molecular repertoire constitutes a change in phenotype), we can construct more coarse-grained cell types by clustering of cell phenotypes based on similarity, considering the presence or absence of (ultra-)structural components and molecular entities. In addition, we can also cluster the macroscale changes that transition cells from one phenotype to another. A cluster of cells shares aspects of components and microscale mechanisms. Generally speaking, similar phenotypes correspond to similar cells, and similar mechanism changes correspond to similar cell transitions. Thus, boundaries between clusters of cells that are 'next neighbours' (e.g. pluripotent embryonic stem cells and epiblast stem cells) as well as between cells on opposite ends of a developmental spectrum (e.g. mesenchymal cells and epithelial cells) can be defined by clustering based on expert annotations and on bioinformatics analyses of experimental data. Clustering of mechanism changes (that is, of macroscale changes in microscale mechanisms) will in turn generate clusters of similar mechanism changes with a large distance between them. The cause for this large distance then is the existence of strongly dissimilar cells.

4 CONCLUSION
We outlined how to design ontologies that make it possible to (1) formally represent cell phenotypes and the mechanism changes behind cell transitions such as (re-)programming, and to (2) develop algorithms exploiting this framework, including clustering and searching for similar cell phenotypes and mechanism changes. Both ontologies support manual curation of publication data, annotation propagation and information content measurement, as well as the inclusion of results from high-throughput data analysis. Our use of the EQ syntax allows the systematic encoding of annotation profiles of cell phenotypes and mechanism changes. The terms for both types of entities are organized in hierarchies ranging from molecular to (ultra)structural to morphological entities. Annotation profiles can then be obtained by (1) data curation from publications or by (2) high-throughput data analysis. In ontological terms, bioinformatics tools such as ExprEssence can be used as an instrument for deriving mechanistic information from high-throughput data, turning information about continuants into information about occurrents by differential analysis. The starting point for expert curation, possibly supported by text mining, must be a set of carefully selected papers. Given a rich annotated knowledge base, existing approaches for ontology-based similarity measurements [11] can be applied to the domains of cell phenotypes and cellular mechanism changes. This would yield two important functionalities: it allows clustering of cell phenotypes (and of mechanism changes) by similarity, providing important information for an operational definition of cell phenotypes, and it allows similarity search in the spaces of mechanism changes and of cell phenotypes.
To further refine and populate the ontologies, we are currently exploring the option of working together with collaborators in the DFG SPP 1356 (http://www.spp1356.de) on pluripotency and cellular reprogramming, and similar initiatives, and we are looking for funding. The size of the final artifacts is obviously a function of the time and effort invested in their development. While the number of relevant entities is limited for cell anatomy and cell types (several thousands), it is very large and virtually unlimited for molecular entities. Naturally, we cannot expect complete coverage here. To evaluate our approach, we intend to compare similarity search results based on high-throughput data analysis only to results based on employing the ontologies integrating high-throughput data, (ultra)structural data and morphological data, and further to compare both sets of results with the expectations of domain experts. To avoid a garbage-in, garbage-out scenario, the application domain must be strictly limited, e.g. to data describing reprogramming and EMT experiments, so that the input data can all be validated by domain experts.

ACKNOWLEDGEMENTS
Starting from research by GF, LJ and GF developed the ontologies, while AK and UL provided domain knowledge. LJ wrote the first version of this paper, drawing on a larger unpublished manuscript by the authors. All authors read and approved the text. DFG support to AK and UL (AK 851/3-1, LE 1428/4-1), GF (FU 583/2-1, FU 583/2-2) and LJ (JA 1904/2-1) is gratefully acknowledged.

REFERENCES
1. Takahashi, K. and S. Yamanaka, Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors. Cell, 2006. 126(4): p. 663-76.
2. Ho, R., C. Chronis, and K. Plath, Mechanistic insights into reprogramming to induced pluripotency. Journal of Cellular Physiology, 2011. 226(4): p. 868-78.
3. Jaenisch, R. and R. Young, Stem cells, the molecular circuitry of pluripotency and nuclear reprogramming. Cell, 2008. 132(4): p. 567-82.
4. Bard, J., S.Y. Rhee, and M. Ashburner, An ontology for cell types. Genome Biology, 2005. 6(2): p. R21.
5. Meehan, T.F., et al., Logical development of the cell ontology. BMC Bioinformatics, 2011. 12: p. 6.
6. Gunsalus, K.C., et al., RNAiDB and PhenoBlast: web tools for genome-wide phenotypic mapping projects. Nucleic Acids Research, 2004. 32(Database issue): p. D406-10.
7. Hoehndorf, R., et al., Relations as patterns: bridging the gap between OBO and OWL. BMC Bioinformatics, 2010. 11: p. 441.
8. Hoehndorf, R., A. Oellrich, and D. Rebholz-Schuhmann, Interoperability between phenotype and anatomy ontologies. Bioinformatics, 2010. 26(24): p. 3112-8.
9. Hoehndorf, R., P.N. Schofield, and G.V. Gkoutos, PhenomeNET: a whole-phenome approach to disease gene discovery. Nucleic Acids Research, 2011. 39(18): p. e119.
10. Hoehndorf, R., et al., Interoperability between biomedical ontologies through relation expansion, upper-level ontologies and automatic reasoning. PLoS One, 2011. 6(7): p. e22006.
11. Washington, N.L., et al., Linking human diseases to animal models using ontology-based phenotype annotation. PLoS Biology, 2009. 7(11): p. e1000247.
12. Jansen, L., Categories: The Top Level Ontology, in Applied Ontology, K. Munn and B. Smith, Editors. 2008, Ontos, Frankfurt. p. 173-196.
13. Gkoutos, G.V., et al., Using ontologies to describe mouse phenotypes. Genome Biology, 2005. 6(1): p. R8.
14. Mungall, C.J., et al., Integrating phenotype ontologies across multiple species. Genome Biology, 2010. 11(1): p. R2.
15. Thiery, J.P. and J.P. Sleeman, Complex networks orchestrate epithelial-mesenchymal transitions. Nature Reviews Molecular Cell Biology, 2006. 7(2): p. 131-42.
16. Ideker, T., et al., Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics, 2002. 18 Suppl 1: p. S233-40.
17. Guo, Z., et al., Edge-based scoring and searching method for identifying condition-responsive protein-protein interaction sub-network. Bioinformatics, 2007. 23(16): p. 2121-8.
18. Warsow, G., et al., ExprEssence – revealing the essence of differential experimental data in the context of an interaction/regulation network. BMC Systems Biology, 2010. 4: p. 164.
19. Kim, Y., et al., Principal network analysis: identification of subnetworks representing major dynamics using gene expression data. Bioinformatics, 2011. 27(3): p. 391-8.
Automatically transforming pre- to post-composed phenotypes: EQ-lising HPO and MP

Anika Oellrich1, Christoph Grabmüller1, and Dietrich Rebholz-Schuhmann1
1 European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SD, UK

ABSTRACT
Large-scale mutagenesis projects are ongoing to better understand human pathology and thereby identify possible prevention and treatment mechanisms. These projects record not only the genotype but also the phenotype of the genetically altered organisms. Consequently, phenotype association studies become more and more important with the ever increasing amount of available data. Thus far, phenotype data is stored in species-specific databases that are not interoperable due to the differences in their phenotype representations. One suggested solution to the lack of integration is the use of Entity-Quality (EQ) representations. However, this requires a transformation of the phenotype data contained in the species-specific databases into corresponding EQ statements. If phenotypes are represented with an ontology, the transformation of the data may happen based on the ontology, albeit this process may be slow if executed manually. Here, we report on our ongoing efforts to develop a method (called EQ-liser) for the automated generation of EQ representations from phenotype ontology concept labels. We implemented the suggested method in a prototype and applied it to a subset of Mammalian and Human Phenotype Ontology concepts. In the case of MP, we were able to identify the correct EQ representation in over 52% of structure and process phenotypes. However, applying the EQ-liser prototype to the Human Phenotype Ontology yields a correct EQ representation in only 13.3% of the investigated cases. With the application of the prototype to two phenotype ontologies, we were able to identify common patterns of mistakes when generating the EQ representation. Correcting those mistakes will pave the way to a species-independent solution for automatically deriving EQ representations from phenotype ontology concept labels. Furthermore, we were able to identify inconsistencies in the existing manually defined EQ representations of current phenotype ontologies.
Correcting those inconsistencies will improve the quality of the manually defined EQ representations.

1 INTRODUCTION
Improvements in sequencing technologies have led to ambitious projects aiming at the identification of a species' phenome by targeted mutation of the genome, e.g. the International Mouse Phenotyping Consortium (IMPC) (Abbott 2010, Bogue and Grubb 2004). Phenotype data resulting from those mutagenesis experiments is captured in species-specific Model Organism Databases (MODs), allowing a structured representation of the contained phenotype data in support of comparative phenomics (Leonelli and Ankeny 2012). Together with the number of available MODs (Blake et al. 2011, Drysdale and Consortium 2008, Amberger et al. 2011), the number of species-specific phenotype ontologies has increased. Applying these species-specific phenotype ontologies to represent phenotype data in MODs hinders the integration of phenotype data across MODs. In order to facilitate integration across those MODs and enable a knowledge flow, mechanisms are required that bridge the species-specific phenotype ontologies. In addition to ontology alignment algorithms, one suggested bridging mechanism has found increasing application: the Entity-Quality (EQ) representation of phenotypes (Mungall et al. 2010). Using the EQ representation to describe a phenotype means that this phenotype is decomposed into an affected entity which is further described with a quality, e.g. decreased body weight. Representing phenotypes as a composition of an entity and a quality is also called post-composition. EQ descriptions have been successfully applied in a number of studies focusing on cross-species phenotype integration (Washington et al. 2009, Chen et al. 2012, Hoehndorf et al. 2011). Even though EQ representations are only available for a subset of concepts of species-specific phenotype ontologies, those studies have already shown promising results. However, the studies could certainly benefit from more accessible data, ready to be integrated into their frameworks. Species-specific phenotype ontologies include, among others, the Mammalian Phenotype Ontology (MP) (Smith et al. 2005), the Human Phenotype Ontology (HPO) (Robinson et al. 2008) and the Worm Phenotype Ontology (WBPhenotype) (Schindelman et al. 2011). Those phenotype ontologies provide concepts ready for annotation and are therefore also referred to as pre-composed phenotype ontologies. Thus far, post-composed phenotype representations are produced manually, which presumably leads to high quality but is slow. First, the species-specific pre-composed phenotypes are created, and once a version is finalised, the corresponding EQ statements are generated. Due to the manual generation of those EQ statements, only a subset of the concepts of the pre-composed phenotype ontologies is available in EQ. Furthermore, as an ontology is a community effort, it is subject to change. Concepts evolve, become obsolete or simply change over time, and keeping the EQ representations updated is a very important requirement. Since the EQ representations are determined in a manual process, the coverage of the resources is still very limited, and mistakes introduced as part of the manual curation process restrict the beneficial outcomes. Developing an automated method capable of generating an EQ representation from a phenotype concept could help this process, ensure high quality of the EQ representations and keep up with the pace of the ontology development cycle.
In this manuscript, we report on our ongoing efforts to develop a method (called EQ-liser) for transforming pre-composed phenotype ontologies into a post-composed representation using EQ. After developing a prototype and applying it to MP and HPO concepts, we could derive a subset of areas which need improvement before a generalised transformation from pre-composed into post-composed phenotype representations is possible. Furthermore, applying our approach does not only provide a decomposition of phenotypes, it also facilitates the discovery of inconsistencies in the so far manually assigned EQ statements. Moreover, the approach also elucidates inconsistencies in the concept labels of pre-composed phenotype ontologies. During evaluation, our prototype correctly generated the EQ statement for over 52% of the test set of MP concepts and could also identify a number of errors in the existing EQ representations for both HPO and MP. We were also able to identify a number of label inconsistencies within HPO, creating obstacles for an automated generation of EQ statements. Generated results as well as the implemented source code are available on our project web page (http://code.google.com/p/eqliser/), together with more information about the project as such.

2 METHOD AND MATERIALS
Transforming pre-composed phenotype representations into post-composed ones requires the identification of the constituents in concept labels, i.e. the EQ-liser has to identify the entity and the quality. To illustrate the post-composition of the MP concept abnormal otolithic membrane (MP:0002895), the manually assigned EQ statement is provided here:

[Term]
id: MP:0002895 ! abnormal otolithic membrane
intersection_of: PATO:0000001 ! quality
intersection_of: inheres_in MA:0002842 ! otolithic membrane
intersection_of: qualifier PATO:0000460 ! abnormal

2.1 Input data
In the existing, manually derived EQ statements, the entity constituent is represented with a number of OBO Foundry ontologies (Smith et al. 2007) and the quality constituent is always represented using the Phenotypic quality And Trait Ontology (PATO) (Mungall et al. 2010, Mabee et al. 2007). The ontologies filling the entity constituent also differ with the species. Supporting all those ontologies would be out of the scope of our preliminary study. We therefore limited our approach to two species-specific ontologies, HPO and MP, and further limited those to phenotype concepts represented in EQ using the Mouse Anatomy Ontology (MA) (Hayamizu et al. 2005), the Gene Ontology (GO) (Ashburner et al. 2000), the Foundational Model of Anatomy Ontology (FMA) (Rosse and Mejino 2003) and PATO. We consider these to correspond to structural and process phenotypes. We downloaded a version of the two phenotype ontologies as .tbl files (a tabular view of an ontology's data, generated from .obo files; http://www.berkeleybop.org/ontologies/) and their corresponding EQ representations on 03.05.2012, with 9,795 HPO concepts and 9,127 MP concepts, of which 4,783 and 6,579, respectively, possess a manually assigned EQ statement. After reduction to structural and process phenotypes based on their manually assigned EQ statements, the MP concepts were reduced to 3,761 and the HPO concepts to 3,268.
2.2 Deriving PATO cross products
A subset of the PATO concepts constitute a composition of other PATO concepts. For instance, the concept decreased depth (PATO:0001472) could be represented using the PATO concepts decreased (PATO:0001997) and depth (PATO:0001595). To achieve a term-wise composition of PATO concepts, we downloaded the PATO .tbl file and applied the filtering and stemming algorithm described in section 2.3. The composition of one particular PATO concept corresponds to all PATO concepts whose terms form a subset of the stemmed words contained in the concept name. After filtering special characters and removing stop words from the concept names and synonyms, the remaining textual content was stemmed using a Porter stemmer (Porter 1980) provided by Snowball (http://snowball.tartarus.org/). The stemmer was applied to all concept names and synonyms. Stemmed concept labels and synonyms were then pairwise compared, and each concept entirely contained in another (either label or synonym) was recorded. Applying this process, we retrieved 1,453 PATO concepts (out of 2,290) with a corresponding cross product.

2.3 Overview of the EQ-liser prototype
Figure 1 shows the processing steps executed to derive the EQ statement from an MP or HPO phenotype concept. Each of the steps is explained in more detail in the following paragraphs.

Fig. 1: The individual steps executed by EQ-liser to decompose a phenotype ontology based on concept names.

The first step (see Figure 1) in processing the ontology's downloaded .tbl file was the filtering of special characters. The concept labels contained in the downloaded .tbl files of the ontologies were analysed for their orthographic correctness (Schober et al. 2009), i.e. special characters, such as "%" or "-", were excluded. Such special characters – often special punctuation – potentially cause problems when matching differently punctuated concept labels from several ontologies. Stop words, such as "in" or "the", are part of the common English language, are considered not to carry any discriminatory information and can consequently be removed before analysis to reduce noise and potential errors resulting from their inclusion.

After character filtering and stop word removal from all the concept labels and their synonyms, we used LingPipe (Carpenter 2007) to recognise entities and qualities in MP and HPO concepts. The dictionaries for LingPipe were compiled using the labels and synonyms provided by the ontology files for FMA, MA and PATO. For GO, we used an alternative approach described in (Gaudan et al. 2008), but also implemented as a LingPipe annotation server. An individual tagging server was set up for each ontology. As those servers all work in parallel, they might assign overlapping annotations, which could potentially result in too many annotations assigned by the automated method. E.g., in the case of enlarged dorsal root ganglion (MP:0008490), an MA annotation for dorsal root ganglion (MA:0000232) and a PATO annotation for dorsal (PATO:0001233) are assigned. To avoid this behaviour, we ran a filter process after assigning LingPipe annotations and removed those annotations entirely encapsulated in another. Filtering GO annotations is not yet possible due to the current implementation of this server but will be supported in later versions. The last step was the replacement of LingPipe's PATO annotations, summarising those, where possible, with the cross products of section 2.2. Consequently, in an example such as decreased palatal depth, the two LingPipe annotations would now be replaced with the single annotation decreased depth. In addition, absent (PATO:0000462) is replaced in all automated EQ statements with lacks all parts of type (PATO:0002000), which is commonly used in the manually assigned EQ descriptions.
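The overlap filter just described can be pictured with a short sketch. The code below (Java; the Annotation type and the way annotations are produced are illustrative simplifications, not the EQ-liser implementation) drops every annotation whose character span is entirely contained in the span of another annotation, which is the behaviour described for the enlarged dorsal root ganglion example.

import java.util.*;

public class AnnotationFilter {

    /** A dictionary match over a concept label: ontology term id plus character span. */
    record Annotation(String termId, int start, int end) {}

    /** Removes annotations that are entirely encapsulated in the span of another annotation. */
    static List<Annotation> dropNested(List<Annotation> annotations) {
        List<Annotation> kept = new ArrayList<>();
        for (Annotation a : annotations) {
            boolean nested = annotations.stream().anyMatch(other ->
                    other != a
                    && other.start() <= a.start() && a.end() <= other.end()
                    && (other.end() - other.start()) > (a.end() - a.start()));
            if (!nested) {
                kept.add(a);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "enlarged dorsal root ganglion": the PATO match for "dorsal" lies inside the MA match.
        List<Annotation> raw = List.of(
                new Annotation("PATO:0001233", 9, 15),   // dorsal
                new Annotation("MA:0000232", 9, 29));    // dorsal root ganglion
        System.out.println(dropNested(raw));             // keeps only MA:0000232
    }
}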
2.4 Evaluation
To evaluate our results, we introduced a two-step evaluation process. We first compared the obtained EQ statements to the available, manually assigned EQ representations of structural and process phenotypes. In a second step, we investigated a subset of 50 EQ statements of each ontology where the automated method and the manual curator do not assign any shared concepts. Common patterns causing a disagreement in the automatic EQ representation were identified and are discussed in sections 3.3 and 3.4, for MP and HPO respectively.

3 RESULTS AND DISCUSSION
The transformation of the pre-composed representation into the post-composed representation requires decomposing the concept labels, identifying entities and qualities, and correctly generating the post-composed representation. The entities have to be matched to ontological concepts, which are provided by other OBO Foundry ontologies. As a test scenario, we have tested the EQ-liser method on MP and HPO concept labels. Note that all transformations are only concerned with phenotype representations describing structures and processes.

3.1 EQ-lising the Mammalian Phenotype Ontology
When decomposing structure and process MP phenotypes based on their labels, 3,549 concept labels (out of 3,761) could be transformed. When comparing those 3,549 concepts to the manually assigned EQ statements, 23.7% received a correct post-composition assigned by EQ-liser based only on concept labels. Exploiting synonyms as well, we achieved an increase of 6.7%. If we relax the criterion of what constitutes a correct match to allow the automated method to assign more annotations than a manual curator would do, we achieve a correct EQ statement for 52.2% of MP concepts. This relaxed criterion is safe to apply, as the automatically generated EQ statements will undergo curator approval, and removing additionally assigned annotations is not a problem. Obtaining the required annotations for EQ statements from concept names in more than half of MP's structure and process phenotypes constitutes a promising start for a generalised decomposition method. Completely erroneous post-composed representations have only been generated for 5.6% of the concepts. These numbers demonstrate that the pre-composed concept labels of MP are already well structured and that the automatic transformation – with a grain of salt – does generate post-composed representations that correctly reflect the semantics of the pre-composed representation.

3.2 EQ-lising the Human Phenotype Ontology
To determine whether the transformation also performs well on an alternative pre-composed phenotype ontology, we applied EQ-liser to the HPO concept labels. Since HPO and MP are both designed for mammalian species, we expect that both ontologies share a subset of similar phenotype concepts. Again, only structural and process phenotypes have been considered, represented in post-composition with FMA, GO and PATO. We considered 3,268 pre-composed concepts, of which 2,731 obtained an automatically assigned EQ statement. From these, 231 (8.5%) are exact matches to the annotations that have been manually assigned. We can increase the yield to 249 cases (9.5%) by including synonyms. Again relaxing the matching criterion, we obtain the correct annotation in 13.3% of the cases.
However, in 25.8% of the cases, the automatically (EQ-liser) and manually (existing) assigned EQ statements do not share any annotations. As the results show, decomposing mouse phenotype concepts based on lexical features of their labels and synonyms leads to better results than decomposing human phenotype concepts.

3.3 Mismatches in EQ-lising MP
Manually investigating 50 MP concepts where automated and manual EQ statements entirely disagree shows common patterns for all three constituents: structure, process and quality. A number of mismatches were caused by assigning wrong PATO annotations due to particular extension or replacement patterns in the manually designed EQ statements which cannot yet be picked up by the automated procedure. E.g., the quality of increased mitochondrial proliferation (MP:0006038) is represented in the manually assigned EQ statements with increased rate (PATO:0000912). However, the automated method instead assigns increased (PATO:0000470) as the quality for this particular MP concept. Similarly, any concept name possessing the phrase increased activity will be annotated in the manually assigned EQ statements with increased rate (PATO:0000912), which is not yet possible automatically. Furthermore, any concept whose description possesses the phrase increased ... number will be represented in the manual EQ statements with has extra parts of type (PATO:0002001). The same examples hold true when replacing increased with decreased in the concept labels. All the examples provided here could be handled with conditional replacement rules for PATO concepts, as sketched below, and would consequently lead to a reduction of the contradictory cases and more correctly identified EQ statements.

Additional mismatches were mostly caused by not identifying, or wrongly identifying, the structural component of the phenotype. This happens when the naming of the affected anatomical structure differs between MA and MP, mostly due to singular/plural divergence. E.g., the MA annotation lumbar vertebra (MA:0000312) is not automatically assigned to the MP concept increased lumbar vertebrae number (MP:0004650) because vertebra is used in the singular in MA but in the plural in MP. Another source of mismatches in the anatomical structure is shortened expressions, e.g. MP uses coat while MA uses coat hair. The described mismatches could be addressed by adding additional terms to the dictionary underlying the LingPipe MA annotation server, or by applying stemming to both the concept labels and synonyms and the underlying annotation dictionary.

The third type of investigated mismatches concerned the process constituent of the EQ statements. Mismatches in the process constituent were caused, e.g., by synonyms not being covered in the current implementation of the GO server. An example falling into this category are concept names containing salivation, which are not recognised as saliva secretion. Other mismatches resulted from different word forms used to express the same concept, e.g. smooth muscle contractility and smooth muscle contraction. A small fraction of the process constituent mismatches are also caused by singular/plural conflicts, e.g. MP uses the plural cilia while GO uses the singular cilium. Both synonyms and singular/plural conflicts could be addressed by adding them to the dictionary underlying the current GO server or by applying stemming before entity recognition.
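The conditional replacement rules mentioned above can be as simple as a small table of label patterns mapped to the PATO term preferred by the curators. The sketch below (Java) is limited to the two patterns named in the text; the rule table, method names and default handling are illustrative assumptions rather than part of the EQ-liser prototype, and the decreased counterparts would be added analogously with the corresponding PATO terms.

import java.util.*;
import java.util.regex.*;

public class PatoReplacementRules {

    /** label pattern -> PATO term used in the manually assigned EQ statements */
    private static final Map<Pattern, String> RULES = new LinkedHashMap<>();
    static {
        // "increased activity" is curated as increased rate (PATO:0000912).
        RULES.put(Pattern.compile("\\bincreased activity\\b"), "PATO:0000912");
        // "increased ... number" is curated as has extra parts of type (PATO:0002001).
        RULES.put(Pattern.compile("\\bincreased\\b.*\\bnumber\\b"), "PATO:0002001");
        // Decreased counterparts would follow the same scheme with the corresponding PATO terms.
    }

    /** Returns the quality term for a concept label, or the automatically tagged quality if no rule matches. */
    static String qualityFor(String conceptLabel, String taggedQuality) {
        String label = conceptLabel.toLowerCase(Locale.ROOT);
        for (Map.Entry<Pattern, String> rule : RULES.entrySet()) {
            if (rule.getKey().matcher(label).find()) {
                return rule.getValue();
            }
        }
        return taggedQuality;
    }

    public static void main(String[] args) {
        // The label-based tagger would have assigned increased (PATO:0000470) here.
        System.out.println(qualityFor("increased lumbar vertebrae number", "PATO:0000470")); // PATO:0002001
    }
}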
Among those 50 concepts, we could also identify wrongly assigned manual EQ statements in two cases (corresponding to 4 % of the investigated cases). Those cases were reported to the developers of the EQ statements and have been corrected. These wrongly assigned manual definitions were mainly due to old construction patterns and newly added concepts in the constituent ontologies.

3.4 Mismatches in EQ-lising HPO

One source of mismatches in the quality-defining part of a phenotype is the difference between using a noun or an adjective to describe a certain aspect. E.g. all HPO concepts containing either abnormality or abnormalities are not automatically annotated with abnormal (PATO:0000460) due to the difference in word type used. Furthermore, all concepts containing abnormality or abnormalities in their names are manually assigned the PATO annotation quality (PATO:0000001), which cannot be derived from the concept name as such. In addition, some words inside the HPO concept names are extended in the manually assigned EQ statements. E.g. Irregular epiphysis of the middle phalanx of the 4th finger (HP:0009219) is manually annotated with irregular density (PATO:0002141). The identified mismatches can be corrected by adding special handling rules to the concept decomposition for particular HPO concept names.

Mismatches in the structural components of phenotypes were partially due to differences in the naming of the anatomical component in HPO and FMA. E.g., while FMA chose to name fingers (index finger or ring finger), HPO assigns numbers to fingers, such as 2nd finger or fourth finger. Further to that, HPO is not consistent in numbering entities, e.g. it is thumb in all concepts concerning the first finger, and it is second toe versus 2nd finger. Furthermore, HPO is not well standardised in the selection of singular versus plural nouns (phalanges versus phalanx). Another group of mismatches in anatomical structures arises from contractions in the concept labels suggested by FMA, e.g. premolar instead of premolar tooth or metatarsal instead of metatarsal bone. Most of the currently identified mismatches in the structural constituents of HPO concepts can be addressed by adding terms to the dictionary of the LingPipe FMA annotation server.

Mismatches in the process constituents were, in the chosen subset, solely due to synonyms not being supported in the current implementation of the GO server, as was partially the case for the MP mismatches. E.g. Abnormality of valine metabolism (HP:0010914) does not obtain the GO annotation valine metabolic process (GO:0006573). This type of mismatch can be corrected in future versions of the EQ-liser method by including synonyms in the GO annotation server.

One group of mismatches which did not occur in the MP concept name decomposition and only rarely occurred when applying the method to the HPO concept names is the co-existence of identical concepts in different ontologies. Even though the OBO Foundry aims at the orthogonality of its ontologies, this criterion is not fulfilled in all cases. E.g. both FMA and GO host the concept Chromosome (GO:0005694, FMA:67093), and the developer of the EQ statements is free to choose either one, which leads to inconsistencies in the resource. Another example of a concept existing twice is Anosmia (HP:0000458, PATO:0000817).
As those concepts should be removed during the process of quality assessment through the OBO Foundry, no action is required to include this aspect in the decomposition method.

For three manually analysed concepts (6 % of the investigated cases), the manually assigned EQ statements were inconsistent. These inconsistencies were reported to, and confirmed and corrected by, the HPO EQ statement developers and are available to the user community as a new version.

3.5 Towards a generalised phenotype decomposition

Even though the decomposition of HPO concepts does not yet work as well as the automatic generation of EQ statements for MP concepts, the changes required for either ontology are similar, and with their rectification most of the investigated mismatches would be addressed. Covering the correct set of annotations for 52 % of the structural and process phenotypes contained in MP is a promising start for developing an automated method capable of deriving EQ statements from pre-composed phenotype ontologies. However, given the close development of the MP and HPO EQ statements, the method has to be further tested on other pre-composed phenotype ontologies, such as WBPhenotype, for which a decomposed subset of concepts already exists. Once the method has been evaluated on another pre-composed phenotype ontology, we hypothesise that its performance will increase further and that we will be able to successfully decompose phenotype statements into their constituents for all species, as long as the constituent ontologies are available from the OBO Foundry (http://www.obofoundry.org/).

4 CONCLUSION

Applying the suggested method to the generation of EQ statements from MP concept labels of structural and process phenotypes yields a strictly correct EQ statement in 30 % of the cases. Assuming that a curator will approve the EQ statements before they are used community-wide, additionally assigned annotations can easily be removed from a correct EQ statement. Using this assumption to relax the correctness criterion, we can identify the correct subset of EQ statement constituents for over 52 % of MP's structural and process phenotypes. To achieve a similar rate for the decomposition of HPO concepts based on their labels, the identified problems have to be addressed. Their correction will enable a better identification of EQ statements from concept labels. Once the flaws have been corrected, the method can be implemented in a generalised manner to derive EQ statements from a variety of pre-composed phenotype statements, which will ease the integration of species-specific pre-composed phenotype information into a species-independent framework.

Independently of deriving decomposed phenotype expressions, the application of the method also allows for the identification of inconsistencies within the labels of the concepts. While MA and MP follow a rigorous naming scheme and hence facilitate integration, HPO and FMA diverge from each other, creating obstacles for a possible integration. Furthermore, HPO does not consistently name its own concepts, which also hinders an integration based on lexical attributes, confuses users of the ontology and prevents easy integration of human data into other frameworks based on a decomposed representation. In addition to allowing for the decomposition of concepts which have no EQ statement yet, the method has also proven useful for identifying flaws in the manually assigned EQ statements.
By applying the prototype of the method to concept labels of MP and HPO, inconsistencies were identified and corrected accordingly. This procedure improved the quality of the existing EQ statements and consequently of all methods applying the EQ statements such as PhenomeNET (Hoehndorf et al. 2011) and MouseFinder (Chen et al. 2012). 5 ACKNOWLEDGEMENTS The authors thank Georgios V. Gkoutos for his close collaboration in analysing potential errors of the EQ-liser method. He also provided valuable explanations for patterns contained in the EQ statements of the Mammalian and Human Phenotype Ontology. In addition, the authors are also grateful to Irina Colgiu for her fast and reliable implementation of the GO server, used in this study for annotation purposes. Furthermore, the authors would like to thank Maria Liakata for valuable input on the draft of this manuscript. REFERENCES Alison Abbott. Mouse project to find each gene’s role. Nature, 465 (7297):410, May 2010. Joanna Amberger, Carol Bocchini, and Ada Hamosh. A new face and new challenges for Online Mendelian Inheritance in Man (OMIM R ). Human mutation, 32(5):564–7, May 2011. M Ashburner et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25(1):25–9, May 2000. Judith A Blake et al. The Mouse Genome Database (MGD): premier model organism resource for mammalian genomics and genetics. Nucleic Acids Res, 39(Database issue):D842–8, Jan 2011. Molly A Bogue and Stephen C Grubb. The Mouse Phenome Project. Genetica, 122(1):71–4, Sep 2004. Bob Carpenter. LingPipe for 99.99% recall of gene mentions. Proceedings of the 2nd BioCreative workshop, 2007. Chao-Kung Chen et al. MouseFinder: candidate disease genes from mouse phenotype data. Human mutation, Feb 2012. Rachel Drysdale and FlyBase Consortium. FlyBase : a database for the Drosophila research community. Methods Mol Biol, 420: 45–59, Jan 2008. S Gaudan, A Jimeno Yepes, V Lee, and D Rebholz-Schuhmann. Combining Evidence, Specificity, and Proximity towards the Normalization of Gene Ontology Terms in Text. EURASIP journal on bioinformatics & systems biology, page 342746, Jan 2008. Terry F Hayamizu, Mary Mangan, John P Corradi, James A Kadin, and Martin Ringwald. The Adult Mouse Anatomical Dictionary: a tool for annotating and integrating data. Genome Biol, 6(3): R29, Jan 2005. Robert Hoehndorf, Paul N Schofield, and Georgios V Gkoutos. PhenomeNET: a whole-phenome approach to disease gene discovery. Nucleic Acids Res, 39(18):e119, Oct 2011. Sabina Leonelli and Rachel A Ankeny. Re-thinking organisms: The impact of databases on model organism biology. Stud Hist Philos Biol Biomed Sci, 43(1):29–36, Mar 2012. Paula M Mabee et al. Phenotype ontologies: the bridge between genomics and evolution. Trends Ecol Evol (Amst), 22(7):345–50, Jul 2007. Christopher J Mungall et al. Integrating phenotype ontologies across multiple species. Genome Biol, 11(1):R2, Jan 2010. M F Porter. An algorithm for suffix stripping. Program, 1980. Peter N Robinson et al. The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease. Am J Hum Genet, 83(5):610–5, Nov 2008. Cornelius Rosse and José L V Mejino. A reference ontology for biomedical informatics: the Foundational Model of Anatomy. Journal of biomedical informatics, 36(6):478–500, Dec 2003. Gary Schindelman, Jolene Fernandes, Carol Bastiani, Karen Yook, and Paul Sternberg. Worm Phenotype Ontology: integrating phenotype data within and beyond the C. elegans community. 
BMC Bioinformatics, 12(1):32, Jan 2011. Daniel Schober et al. Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics, 10:125, Jan 2009. Barry Smith et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol, 25(11):1251–5, Nov 2007. Cynthia L Smith, Carroll-Ann W Goldsmith, and Janan T Eppig. The Mammalian Phenotype Ontology as a tool for annotating, analyzing and comparing phenotypic information. Genome Biol, 6(1):R7, Jan 2005. Nicole L Washington et al. Linking human diseases to animal models using ontology-based phenotype annotation. PLoS Biol, 7(11):e1000247, Nov 2009.

The mouse pathology ontology, MPATH; structure and applications

Paul N. Schofield1,2, John P. Sundberg2, Beth A. Sundberg2, Colin McKerlie3, George V. Gkoutos4

1 Dept of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge, CB2 3EG, UK
2 The Jackson Laboratory, 600 Main Street, Bar Harbor, ME 04609-1500, USA
3 Genetics and Genomic Biology Program, Hospital for Sick Children, Toronto, Canada
4 Dept of Computer Science, University of Aberystwyth, Old College, King Street, SY23 2AX, Wales

ABSTRACT

The advent of large-scale pathology-based phenotyping of mice requires a relatively simple nomenclature and coding system that can be integrated into data collection platforms (computerized medical record keeping systems) to enable the pathologist to rapidly screen and accurately record observations. The data generated needs to be easily and consistently retrieved in a form that can be analyzed computationally. MPATH provides such a platform which, when integrated into a medical records database, enables diagnoses to be automatically entered, spelled correctly and consistently (critical for data retrieval), and coded. Built on an ontology platform, diagnoses can be investigated in populations (epidemiology or genome-wide association mapping) at the level of a very specific diagnosis or of a class of diseases. This enables investigators to interrogate large datasets at a variety of depths, use semantic analysis to identify the relations between diseases in different species and integrate pathology data with other data types, such as pharmacogenomics.

1 INTRODUCTION

Since the late eighteenth century, when achromatic lenses and reliable histological stains began to be available, investigators of anatomic pathology, and particularly, in the mid-nineteenth century, the innovators of cellular pathology such as Rudolf Virchow, developed and applied terminologies to describe their observations. These depended on the "school" to which the pathologists belonged, but more importantly on the etiologic or mechanistic paradigm in which they were working. Whilst one of the great achievements of the nineteenth century was the recognition of the universality of pathological processes and entities, that is, their occurrence in multiple species as recognisable manifestations of the same underlying processes (Sundberg & Schofield, 2009), it was a century before a broadly accepted and rationally structured pathology was developed. The development of pathology terminologies has to an extent occurred independently of disease terminologies and nosologies, partly as a result of the much longer history of classifying diseases, and partly due to the inherited preconceptions of the nature of disease in clinical medicine.
The distinction between pathological descriptions of disease, clinical descriptions of disease, disorders and predispositions is still not satisfactorily resolved, although in recent years there have been attempts to rationalise the definitions of these concepts (Scheuermann et al., 2009) and their relation to each other as part of a broadly applicable model of disease other than a "bag" of manifestations or phenotypes which are found in that class of individuals and form the basis of a diagnosis. Issues about severity, time course, organ involvement etc. are beginning to be addressed, but it is remarkable that even treating diseases as a "bag" of phenotypes has been shown to be a powerful approach in establishing the relationships between diseases and the presence of related diseases in different organisms (Chen et al., 2012; Hoehndorf et al., 2011; Oellrich et al., 2012; Washington et al., 2009). What has recently been identified as important, nevertheless, is that the tissue-specific resolution of lesion recording and the ability to record the pattern of disease within an individual have proved vital for GWAS mapping of predisposing genetic variants in inbred strains of mice, as each class of lesion can be analysed in isolation (Berndt et al., 2011; Li et al., 2012).

The discipline of pathology may be broken down into clinical and anatomic pathology; the former is concerned with clinical chemistry, hematology, clinical microbiology and emerging subspecialities such as molecular diagnostics and proteomics, the latter with the histological, histochemical or immunohistochemical observation of alterations in tissue composition or architecture. Both branches of the medical specialty, which are increasingly merging, may be viewed as aspects of phenotyping, and both provide subtypes of the clinical signs associated with ongoing disease processes, the results of developmental abnormalities, or the historical presence of disease.

2 RESULTS AND DISCUSSION

2.1 Anatomic pathology nomenclature and its applications

The universality of the repertoire of responses to underlying genetic or extrinsic insults means that the gross and histopathologically defined phenotypes are some of the most useful phenotypes for relating diseases between different species and constitute the most species-agnostic phenotype descriptors. This makes a pathologic term-based ontology a crucial tool in experimental and clinical phenotype data capture (Schofield et al., 2011). The development of systematic human pathologic nomenclatures has been driven by the efforts of the College of American Pathologists, initially with the development of the pathology-specific nomenclature SNOP over 40 years ago (Cornet & de Keizer, 2008), through to the current SNOMED-CT with cross-references to UMLS, the NCI Thesaurus and other terminologies. The ICD (World Health Organisation, 2008), now in its 11th revision and originally derived for epidemiological coding, and the associated ICD-O v3 for cancer also contain descriptions of many pathological lesions, the latter being particularly useful for neoplasia.
The other driver for pathologic terminology standardisation has been the coding of lesions from toxicopathology, and again the American Society of Toxicopathology, working with the Registry of Industrial Toxicology Animal-data (RITA) database group in Europe, has produced several internationally accepted nomenclature systems, particularly focusing on proliferative lesions. Recently the STP has undertaken a major harmonization exercise for rodent pathology, the INHAND (International Harmonization of Nomenclature and Diagnostic Criteria for Lesions in Rats and Mice) initiative (Mann et al., 2012). So far this group has reported on the hepatobiliary, respiratory, nervous and urinary systems (Frazier et al., 2011; Kaufmann et al., 2011; Renne et al., 2009; Thoolen et al., 2010). For some time the National Cancer Institute's Mouse Models of Human Cancer consortium (MMHCC) has been examining the classification of tumours in genetically engineered mice and has itself produced a consensus base terminology for neoplasias of the major organ systems, presented in a series of papers over the last decade (Marks, 2009). Despite the huge value of these resources, none is currently constructed as an ontology with meaningful relations to support inference and automated reasoning, and to that end we developed MPATH as an ontology to describe lesions that arise in laboratory mice.

2.2 A post-composition strategy for pathology coding

Traditionally pathologists have relied on a narrative form of recording their definitive diagnoses, making use of morphologic, etiologic, and disease-based terms that collectively provide a diagnosis useful for clinical patient management. This is particularly important for non-neoplastic lesions, where it can be complex to capture important subtleties of distribution, severity, microscopic sub-type and anatomical location, for example. Whilst this is the gold standard, it is not possible to compute on data recorded in this way and it is very difficult to tabulate and quantitatively analyse the collected information. There are strong arguments, mainly from experience in toxicologic pathology, that a descriptive (anatomic) rather than diagnostic coding is the most objective and useful way to code pathology-based observations. This is particularly relevant to the examination of mutant mice, where traditional etiologic or summative diagnostic terms are simply not available because of the novelty of the lesion or its presentation; this is particularly the case where mice are manipulated to model human conditions that have not been previously seen, for example lung or mammary tumours (Berndt et al., 2011; Derksen et al., 2006; Meuwissen & Berns, 2005) which have not previously been reported to occur spontaneously in mice. In many cases, a disease diagnosis implies a particular pathogenesis or etiology based on the spontaneous disease, which is not appropriate for disease caused by a genetic challenge, and sometimes by a combination of genetic and external challenges. This latter issue is of particular concern to practicing pathologists, and in the development of MPATH we have been urged to include some diagnostic terms as well as descriptive anatomic ones. The MPATH ontology was constructed ab initio by a group of clinical and veterinary pathologists in 2000 and has since been revised and augmented by an evolving group of US and European pathologists on a regular basis.
It is clear from more than a decade of experience that expert input and manual curation are essential to generate an accurate and functional resource. One strategy for building the ontology has been to integrate it into large-scale phenotyping and diagnostic programs so that the pathologists use it on a daily basis and have fields to add missing terms or synonyms that they are more familiar with, thereby constantly increasing its coverage and utilitarian value.

MPATH's top-level distinction is between pathophysiology (pathological processes) and anatomic pathology (pathological entities). One issue which we met with frequently was the normal practice of referring to the observation of a physical lesion by using the process term, for example "necrosis" or "sclerosis". Thus the noun describing the real-world entity observed is homonymous with the inferred process. This problem has been addressed through the textual and logical definitions of terms but is a recurrent source of confusion in formal treatments of pathology nomenclature. Pathologists using MPATH almost exclusively use the anatomic pathology segment of the ontology, with the exception of describing inflammation or other general processes, where the process is described using qualifiers such as "acute" which are logically process-specific, consistent with the use of PATO qualifiers (see below).

The upper levels of MPATH's anatomic pathology branch include six broad domains familiar to all traditions of pathology training and comprehensively covering all known lesions: cell and tissue damage, circulatory disorder, developmental and structural abnormality, growth and differentiation defect, healing and repair structure, and neoplasm; these are as orthogonal as is feasible given the complexities of pathobiology. The upper levels of MPATH's pathophysiology branch denote pathological processes that underlie lesions and include six broad domains: cell and tissue damage process, defective growth and differentiation process, developmental process abnormalities, healing and repair process, immunopathology and neoplasia. All pathological processes and entities can be placed within these upper-level domains, which will be familiar to all pathologists and are common to all amniotes.

MPATH contains 880 core pathology terms in an almost exclusively is_a hierarchy nine layers deep. Currently, almost 90 % of the terms have textual definitions. Each class is in the mouse pathology namespace and is uniquely identified by a URI of the form http://purl.obolibrary.org/OBO/MPATH_n. The main ontology is available in both the OBO Flatfile Format and the Web Ontology Language (OWL). MPATH is housed in a subversion repository and is made available via the OBO registry, BioPortal and the project's website http://mpath.googlecode.com/. MPATH contains relationships and other logical axioms to other ontologies such as the Gene Ontology (GO) (Ashburner et al., 2000), the Cell Type ontology (CL) (Bard et al., 2005) and the Phenotype and Trait Ontology (PATO) (Gkoutos et al., 2004). For example, the MPATH term transitional cell metaplasia (MPATH:172) represents a metaplastic response of the transitional epithelium, for example in the bladder to give squamous metaplasia and glandular metaplasia.
To allow computational access to these relations, we use the derives-from relation and relate metaplasia (MPATH:549) (an MPATH term that denotes an abnormal transformation of a differentiated adult cell or tissue of one kind into a differentiated tissue of another kind) with the CL term transitional epithelial cell (CL:0000244).

Many tissue responses are common to multiple anatomical sites, and as far as possible the verbosity of specifying a particular response in multiple tissues has been avoided, the additional topographical or anatomical information coming from an anatomy ontology, generally the MA (Hayamizu et al., 2005) or EMAP (Richardson et al., 2010) ontologies for the mouse. However, there is often an intrinsic anatomical element embedded in the term, or traditional pathology includes information about the cell type or tissue of origin. This is most frequent with the neoplasias, and we felt that such terms were best included in their familiar form. Most descriptions are then post-composed from a combination of an MPATH term and an anatomical (MA) or cell type (CL) (Bard et al., 2005) component.

In addition to the core terms it is important to describe organ-specific topography, distribution, microscopic character, duration/chronicity, and severity. These qualifiers or modifiers are generally applicable across a wide range of organs and lesions and so need to be coded separately from the core terms to allow post-composition as required. The pattern we have adopted is very close to that recommended by the INHAND proposals and also includes "compound" terms which lie beneath a definitive diagnosis or disease level of description but bundle defined sets of descriptive terms, for example "nephropathy", "alopecia" and "glomerulonephritis", which are in common use and well understood. These qualifiers have been incorporated into PATO, and some examples are given in Table 1. The strategy for composing pathology descriptions in MPATH is summarised in Figure 1a and illustrated with an example in Figure 1b.

Figure 1. Post-composition strategy: elements of the compound description are specified on the left-hand side of the figure and specific examples are given for three observations which, taken together, are indicative of foreign body pneumonia.

2.3 Implementation of MPATH coding strategy

The strategy adopted was originally designed to describe histopathology images for the Pathbase mouse pathology database (Schofield et al., 2010), but lends itself readily to a wide range of coding applications. The MPATH strategy has been adopted by two major high-throughput studies. A combination of MPATH and PATO is being used for the capture of pathology data from the genome-wide mutant mouse phenotyping project KOMP2, run as part of the International Mouse Phenotyping Consortium (Brown & Moore, 2012), where the MPATH approach is being used in the primary phenotyping pipeline by the Toronto Centre for Phenogenomics and other centres carrying out histopathology. MPATH has also been adopted for the MoDIS database (Sundberg et al., 2008) to capture and analyse pathology data from a massive aging study which has systematically phenotyped the 31 most important inbred mouse strains in current use. Complete necropsies of mice were carried out at 12 and 20 months of age (cross-sectional study) and on moribund mice throughout the life span (longitudinal study). Nearly 2000 mice were necropsied, generating more than 50,000 slides (Sundberg et al., 2011).
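As a concrete illustration of the post-composition pattern described in section 2.2 and Figure 1, the following minimal sketch shows how a single observation could be recorded as an MPATH core term combined with an anatomical (MA) location and PATO-style qualifiers for severity, duration and distribution. The record layout, field names and the anatomy/qualifier choices are illustrative assumptions, not the Pathbase or KOMP2 data format; only the MPATH term label and identifier are taken from the example discussed above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PathologyObservation:
    """A post-composed pathology record: an MPATH core term plus an
    anatomical location and PATO-style qualifiers (hypothetical layout)."""
    mpath_term: str                                       # core lesion/process term (MPATH)
    anatomy: str                                          # anatomical location (MA term label)
    qualifiers: List[str] = field(default_factory=list)  # severity/duration/distribution

# Example loosely based on the metaplasia term discussed in the text;
# anatomy and qualifiers are chosen only for demonstration.
obs = PathologyObservation(
    mpath_term="metaplasia (MPATH:549)",
    anatomy="urinary bladder",
    qualifiers=["focal", "chronic"],
)

print(obs)
```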
Lesion incidence and severity data for all organs are now being applied in highly successful GWAS studies of age-associated disease (Sundberg et al., 2011). MPATH has proved to be additionally useful in dealing with the recoding of legacy data from non-standard nomenclatures, permitting integration of otherwise siloed data. Examples are the ERA database (Tapio et al., 2008) and the Northwestern University Janus radiobiology database (http://janus.northwestern.edu/janus2/), which has coded 50,000 individual mouse records to MPATH in order to link the two datasets.

2.4 MPATH as a core ontology for PATO-based logical definitions

The PATO framework was built with the intention of providing an integration platform for phenotype data between species and between data types (Gkoutos et al., 2004). According to the PATO framework, phenotype data can be described by utilising species-specific ontologies (such as the various anatomy ontologies) or species-agnostic ontologies such as GO, with the various qualities provided by the PATO ontology, in order to describe affected entities in a phenotype manifestation. PATO can be used for annotation either directly, in a so-called post-composed (post-coordinated) manner, or for providing logical definitions (equivalence axioms) to ontologies containing a set of pre-composed (pre-coordinated) phenotype terms (Gkoutos et al., 2004; Gkoutos et al., 2009a; Gkoutos et al., 2009b; Mungall et al., 2009). For further discussion see (Schofield et al., 2012).

Rather than using a pre-composed phenotype ontology such as MPO (Smith & Eppig, 2009) or HPO (Robinson et al., 2008), phenotypes may be described using the Entity-Quality (EQ) formalism. In the EQ method, a phenotype is characterized by an affected Entity and a Quality (from PATO) that specifies how the entity is affected. The affected entity can either be a biological function or process, such as specified in GO, or an anatomical entity. The phylogenetic conservation, at least within the amniotes, of most histopathologic lesions or processes makes MPATH an important core ontology for writing logical definitions, and we have used it extensively in defining classes in the major pre-composed phenotype ontologies; MPATH is also an important component ontology of our recently developed semantic approaches to comparative phenomics, PhenomeNET and MouseFinder (Chen et al., 2012; Hoehndorf et al., 2011).

Composition of logical definitions is a time-consuming task for which there are currently several approaches to automation using class label segmentation, entity recognition and lexical matching to core ontologies. This approach can be useful for suggesting definitions where the class label is a composite of, for example, anatomy and process (MA+GO). Automated decomposition of unilexical terms such as are found in the neoplasias is much more difficult, though approaches that text-mine definitions from other ontologies such as NCIt (Sioutos et al., 2007) for lexically matching labels may be useful to expert curators in establishing simpler definitions for these classes.

Table 1. Examples of qualifiers now incorporated into PATO

Qualifier | PATO | Class name | Definition
Severity | 0000461 | normal | no lesions
Severity | 0000394, 0000395, 0000465, 0000396 | mild, moderate, marked, severe | Lesion dependent; often size, number and characteristics.
Duration | 0002387 | per-acute | extremely acute and aggressive
Duration | 0000389 | acute | beginning abruptly with marked intensity
Duration | 0002091 | subacute | between acute and chronic
Duration | 0001863 | chronic | slow progress and long continuance
Duration | 0002387 | chronic-active | coexistence of chronic process and superimposed acute process
Distribution | 0000627 | focal | single well delineated lesion
Distribution | 0002388 | focally extensive | single lesion with expansion into surrounding tissue
Distribution | 0001791 | multifocal | multiple lesions
Distribution | 0002389 | multifocal to coalescing | multiple lesions, some interconnecting with each other
Distribution | 0000330 | random | no appreciable pattern
Distribution | 0001566 | diffuse | not circumscribed or limited
Distribution | 0000635 | generalized | affecting all regions without specificity of distribution
Distribution | 0000634 | unilateral | confined to one side only
Distribution | 0000618 | bilateral | involving both sides
Distribution | 0002389 | segmental | relating to a segment

2.5 Future directions

Whilst MPATH was originally designed to support rodent, and particularly mouse, pathology, the extensive overlap with human clinical pathology means that most of the terms may be used in a human context and linked to the Foundational Model of Anatomy (FMA) (Rosse & Mejino, 2003) as the anatomy ontology. Extending MPATH to become a mammalian pathology ontology encompassing human pathology is a major undertaking, but we have established that the current structure and upper-level classes would readily support the inclusion of human terminology. Initially we will import terms for neoplasias from the CINEAS codes (Central Information System for Hereditary Diseases and Synonyms; http://www.cineas.org/; Prof Rolf Sijmons, pers. comm.). SNOMED-CT, UMLS and ICD-O v3 will be mined for terms not currently in MPATH which relate to anatomic pathology. Terms already covered by existing ontologies such as the Disease Ontology (DO) (Schriml et al., 2012) may be referenced using MIREOT (Courtot et al., 2011). DO classifies diseases largely by anatomical site and not by disease process or class, and overlaps only slightly with MPATH, as it is for the main part concerned with summative diagnostic entities. For example, there is no "inflammation" superclass in DO for the tissue-specific inflammatory conditions described. Use of MPATH to construct logical definitions for DO classes would potentially add a further dimension to the richness and applicability of DO.

The power of the description of pathological lesions to discriminate between diseases, and therefore between models of human disease, is substantial. We recently estimated the information content (IC) of pre-composed MP ontology terms used to code phenotypes in the EUMODIC mouse phenotyping pipeline (Morgan et al., 2012), which included or excluded anatomic pathology descriptions, using their logical definitions. Pathology-related phenotypes were shown to have a significantly greater discriminatory power than other in vivo assays, strongly supporting the use of these assays in the development of mouse models of human diseases (Schofield et al., 2011). Further development and application of MPATH will inevitably depend on community engagement and we encourage anyone with an interest to provide feedback.

ACKNOWLEDGEMENTS

The authors would like to thank those who have contributed to the development and application of MPATH over the years. This work was funded by the European Commission, contract QLRI-1999-00320, the Ellison Medical Foundation, and the National Institutes of Health (AG25707 for the Shock Aging Center, CA89713 and AR056635 to JPS, and HG004838-04 to PNS).

REFERENCES
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M. & Sherlock, G. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25, 25-9. Bard, J., Rhee, S.Y. & Ashburner, M. (2005). An ontology for cell types. Genome Biol, 6, R21. Berndt, A., Cario, C.L., Silva, K.A., Kennedy, V.E., Harrison, D.E., Paigen, B. & Sundberg, J.P. (2011). Identification of fat4 and tsc22d1 as novel candidate genes for spontaneous pulmonary adenomas. Cancer Res, 71, 5779-91. Brown, S.D. & Moore, M.W. (2012). Towards an encyclopaedia of mammalian gene function: the International Mouse Phenotyping Consortium. Dis Model Mech, 5, 289-92. Chen, C.K., Mungall, C.J., Gkoutos, G.V., Doelken, S.C., Kohler, S., Ruef, B.J., Smith, C., Westerfield, M., Robinson, P.N., Lewis, S.E., Schofield, P.N. & Smedley, D. (2012). MouseFinder: Candidate disease genes from mouse phenotype data. Hum Mutat, 33, 858-66. Cornet, R. & de Keizer, N. (2008). Forty years of SNOMED: a literature review. BMC Med Inform Decis Mak, 8 Suppl 1, S2. Courtot, M., Frank, G., Allyson, L.L., James, M., Daniel, S., Ryan, R.B. & Alan, R. (2011). MIREOT: The minimum information to reference an external ontology term. Appl. Ontol., 6, 23-33. Derksen, P.W., Liu, X., Saridin, F., van der Gulden, H., Zevenhoven, J., Evers, B., van Beijnum, J.R., Griffioen, A.W., Vink, J., Krimpenfort, P., Peterse, J.L., Cardiff, R.D., Berns, A. & Jonkers, J. (2006). Somatic inactivation of E-cadherin and p53 in mice leads to metastatic lobular mammary carcinoma through induction of anoikis resistance and angiogenesis. Cancer Cell, 10, 437-49. Frazier, K.S., Seely, J.C., Hard, G.C., Betton, G., Burnett, R., Nakatsuji, S., Nishikawa, A., Durchfeld-Meyer, B. & Bube, A. (2011). Proliferative and nonproliferative lesions of the rat and mouse urinary system. Toxicol Pathol, 40, 14S-86S. Gkoutos, G.V., Green, E.C.J., Mallon, A.-M., Hancock, J.M. & Davidson, D. (2004). Building mouse phenotype ontologies. Pac. Symp. Biocomputing, 9, 178-189. Gkoutos, G.V., Mungall, C., Dolken, S., Ashburner, M., Lewis, S., Hancock, J., Schofield, P., Kohler, S. & Robinson, P.N. (2009a). Entity/quality-based logical definitions for the human skeletal phenome using PATO. Conf Proc IEEE Eng Med Biol Soc, 1, 7069-72. Gkoutos, G.V., Mungall, C., Dolken, S., Ashburner, M., Lewis, S., Hancock, J., Schofield, P., Kohler, S. & Robinson, P.N. (2009b). Entity/quality-based logical definitions for the Paper F 5! P.N. Schofield et al. human skeletal phenome using PATO. Conf Proc IEEE Eng Med Biol Soc, 2009, 7069-72. Hayamizu, T.F., Mangan, M., Corradi, J.P., Kadin, J.A. & Ringwald, M. (2005). The adult mouse anatomical dictionary: a tool for annotating and integrating data. Genome Biol, 6, R29. Hoehndorf, R., Schofield, P.N. & Gkoutos, G.V. (2011). PhenomeNET: a whole-phenome approach to disease gene discovery. Nucleic Acids Res, 39, e119. Kaufmann, W., Bolon, B., Bradley, A., Butt, M., Czasch, S., Garman, R.H., George, C., Groters, S., Krinke, G., Little, P., McKay, J., Narama, I., Rao, D., Shibutani, M. & Sills, R. (2011). Proliferative and nonproliferative lesions of the rat and mouse central and peripheral nervous systems. Toxicol Pathol, 40, 87S-157S. Li, Q., Berndt, A., Guo, H., Sundberg, J.P. & Uitto, J. (2012). 
A Novel Animal Model for Pseudoxanthoma Elasticum: The KK/HlJ Mouse. Am J Pathol. Mann, P.C., Vahle, J., Keenan, C.M., Baker, J.F., Bradley, A.E., Goodman, D.G., Harada, T., Herbert, R., Kaufmann, W., Kellner, R., Nolte, T., Rittinghausen, S. & Tanaka, T. (2012). International harmonization of toxicologic pathology nomenclature: an overview and review of basic principles. Toxicol Pathol, 40, 7S-13S. Marks, C. (2009). Mouse Models of Human Cancers Consortium (MMHCC) from the NCI. Dis Model Mech, 2, 111. Meuwissen, R. & Berns, A. (2005). Mouse models for human lung cancer. Genes Dev, 19, 643-64. Morgan, H., Beck, T., Blake, A., Gates, H., Adams, N., Debouzy, G., Leblanc, S., Lengger, C., Maier, H., Melvin, D., Meziane, H., Richardson, D., Wells, S., White, J., Wood, J., de Angelis, M.H., Brown, S.D., Hancock, J.M. & Mallon, A.M. (2012). EuroPhenome: a repository for high-throughput mouse phenotyping data. Nucleic Acids Res, 38, D577-85. Mungall, C.J., Gkoutos, G.V., Smith, C.L., Haendel, M.A., Lewis, S.E. & Ashburner, M. (2009). Integrating phenotype ontologies across multiple species. Genome Biol, 11, R2. Oellrich, A., Hoehndorf, R., Gkoutos, G.V. & RebholzSchuhmann, D. (2012). Improving disease gene prioritization by comparing the semantic similarity of phenotypes in mice with those of human diseases. PLoS One, 7, e38937. Renne, R., Brix, A., Harkema, J., Herbert, R., Kittel, B., Lewis, D., March, T., Nagano, K., Pino, M., Rittinghausen, S., Rosenbruch, M., Tellier, P. & Wohrmann, T. (2009). Proliferative and nonproliferative lesions of the rat and mouse respiratory tract. Toxicol Pathol, 37, 5S-73S. Richardson, L., Venkataraman, S., Stevenson, P., Yang, Y., Burton, N., Rao, J., Fisher, M., Baldock, R.A., Davidson, D.R. & Christiansen, J.H. (2010). EMAGE mouse embryo spatial gene expression database: 2010 update. Nucleic Acids Res, 38, D703-9. 6! Robinson, P.N., Kohler, S., Bauer, S., Seelow, D., Horn, D. & Mundlos, S. (2008). The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease. Am J Hum Genet, 83, 610-5. Rosse, C. & Mejino, J.L., Jr. (2003). A reference ontology for biomedical informatics: the Foundational Model of Anatomy. J Biomed Inform, 36, 478-500. Scheuermann, R.H., Ceusters, W. & Smith, B. (2009). Toward an ontological treatment of disease and diagnosis. Summit on Translat Bioinforma, 2009, 116-20. Schofield, P.N., Gruenberger, M. & Sundberg, J.P. (2010). Pathbase and the MPATH Ontology: Community Resources for Mouse Histopathology. Vet Pathol, 47, 1016-20. Schofield, P.N., Sundberg, J.P., Hoehndorf, R. & Gkoutos, G.V. (2012). New approaches to the representation and analysis of phenotype knowledge in human diseases and their animal models. Brief Funct Genomics, 10, 258-65. Schofield, P.N., Vogel, P., Gkoutos, G.V. & Sundberg, J.P. (2011). Exploring the elephant: histopathology in highthroughput phenotyping of mutant mice. Dis Model Mech, 5, 19-25. Schriml, L.M., Arze, C., Nadendla, S., Chang, Y.W., Mazaitis, M., Felix, V., Feng, G. & Kibbe, W.A. (2012). Disease Ontology: a backbone for disease semantic integration. Nucleic Acids Res, 40, D940-6. Sioutos N, Coronado Sd, Haber MW, Hartel FW, Shaiu W-L, Wright LW (2007). NCI Thesaurus: A semantic model integrating cancer-related clinical and molecular information. J Biomed Inform, 40,30-43. Smith, C.L. & Eppig, J.T. (2009). The mammalian phenotype ontology: enabling robust annotation and comparative analysis. Wiley Interdiscip Rev Syst Biol Med, 1, 390-9. 
Sundberg, J., Berndt, A., Sundberg, B., Silva, K.A., Kennedy, V., Bronson, R., Yuan, R., Paigen, B., Harrison, D. & Schofield, P.N. (2011). The mouse as a model for understanding chronic diseases of aging: the histopathologic basis of aging in inbred mice. Pathobiology of Aging & Age-related Diseases, 1, 71719. Sundberg, J.P. & Schofield, P.N. (2009). One medicine, one pathology, and the one health concept. J Am Vet Med Assoc, 234, 1530-1. Sundberg, J.P., Sundberg, B.A. & Schofield, P. (2008). Integrating mouse anatomy and pathology ontologies into a phenotyping database: tools for data capture and training. Mamm Genome, 19, 413-9. Tapio, S., Schofield, P.N., Adelmann, C., Atkinson, M.J., Bard, J.L., Bijwaard, H., Birschwilks, M., Dubus, P., Fiette, L., Gerber, G., Gruenberger, M., Quintanilla-Martinez, L., Rozell, B., Saigusa, S., Warren, M., Watson, C.R. & Grosche, B. (2008). Progress in updating the European Radiobiology Archives. Int J Radiat Biol, 84, 930-6. Thoolen, B., Maronpot, R.R., Harada, T., Nyska, A., Rousseaux, C., Nolte, T., Malarkey, D.E., Kaufmann, W., Kuttler, K., Deschl, U., Nakae, D., Gregson, R., Vinlove, M.P., Brix, A.E., Singh, B., Belpoggi, F. & Ward, J.M. (2010). Proliferative and nonproliferative lesions of the rat and mouse hepatobiliary system. Toxicol Pathol, 38, 5S-81S. Washington, N.L., Haendel, M.A., Mungall, C.J., Ashburner, M., Westerfield, M. & Lewis, S.E. (2009). Linking human diseases to animal models using ontology-based phenotype annotation. PLoS Biol, 7, e1000247. World Health Organisation. (2008). International Statistical Classification of Diseases and Health Related Problems (The) ICD-10. WHO: Geneva.

Functions, roles and dispositions revisited. A new classification of realizables

Johannes Röhl, Ludger Jansen
Institute of Philosophy, University of Rostock, Germany

ABSTRACT

The concept of a function is central both to biology and to technology. But there is an intricate debate about how functions, as well as related entities like dispositions and roles, are to be represented in top-level ontologies and how they are to be related. We review important philosophical accounts and ontological models for functions and roles and discuss three models for the relation of functions and dispositions. We conclude that, mainly because of the need to account for malfunctioning, functions should not be treated as a subtype of dispositions, but as their sibling category.

1 INTRODUCTION

The concept of a function is central to biology as well as to psychology, technology and engineering. However, realizable entities like functions, dispositions and roles are notoriously difficult to understand and there is no consensus on how to model them within a top-level ontology. Among other things, this is witnessed by the Basic Formal Ontology (BFO; http://www.ifomis.org/bfo). BFO versions up to 1.1.1 contained these three categories as immediate children of the category Realizable, which were jointly exhaustive and pairwise disjoint. However, in the transition to the new version BFO 2 it is planned to position Function as a subtype of Disposition (Table 1).

Table 1: Definitions of the children of realizables in BFO 1.1.1 [1] and BFO 2 [2] ('Graz release')

Disposition
Definition in BFO 1.1.1: A realizable entity that essentially causes a specific process or transformation in the object in which it inheres, under specific circumstances and in conjunction with the laws of nature. A general formula for dispositions is: X (object) has the disposition D to (transform, initiate a process) R under conditions C.
Definition in BFO 2: b is a disposition means: b is a realizable entity & b's bearer is some material entity & b is such that if it ceases to exist, then its bearer is physically changed, & b's realization occurs when and because this bearer is in some special physical circumstances, & this realization occurs in virtue of the bearer's physical make-up.

Function
Definition in BFO 1.1.1: A realizable entity the manifestation of which is an essentially end-directed activity of a continuant entity in virtue of that continuant entity being a specific kind of entity in the kind or kinds of contexts that it is made for.
Definition in BFO 2: A function is a disposition that exists in virtue of the bearer's physical make-up and this physical make-up is something the bearer possesses because it came into being, either through evolution (in the case of natural biological entities) or through intentional design (in the case of artifacts), in order to realize processes of a certain sort.

Role
Definition in BFO 1.1.1: A realizable entity the manifestation of which brings about some result or end that is not essential to a continuant in virtue of the kind of thing that it is but that can be served or participated in by that kind of continuant in some kinds of natural, social or institutional contexts.
Definition in BFO 2: b is a role means: b is a realizable entity & b exists because there is some single bearer that is in some special physical, social, or institutional set of circumstances in which this bearer does not have to be & b is not such that, if it ceases to exist, then the physical make-up of the bearer is thereby changed.

[1] http://jowl.ontologyonline.org/bfo.html (last access 12.09.2012)
[2] http://bfo.googlecode.com/svn/releases/2012-07-20-graz/owl-group/bfo.owl (last access 12.09.2012)

In order to bring new light into this problem, this paper discusses the relation of functions to plans (§ 2) and the relation of functions to dispositions (§ 3).

2 FUNCTIONS AND PLANS

2.1 Design functions of artifacts

In ancient Latin, the word "functio" was used to describe the duties of certain positions: it was the functio of, say, a quaestor to raise taxes. This was what being a quaestor was about. In general, not only official positions, but all things created by men could be ascribed the use for which they were created as their function or, more precisely, their design function. A screwdriver can be said to have the design function to drive screws because it is produced with the plan to be used for this purpose. A design function, then, is not a property that inheres in the functional artifact, but is the content of an ascription by an agent or a group of agents involving a plan about the future use of this artifact (or of artifacts of this type). They are made for a certain purpose, their function. Let us call this the planning account of design function. On this account, the truth-maker of a function ascription is a plan. From the point of view of the planning account, we can say the following about (design) functions:
• Design functions are grounded in designers' function ascriptions.
• Designers can decide on functions independently of the physical structure of artifacts.
• Thus, independently from its physical structure and its dispositions, an artifact can have any function.
• It will, though, not be able to realize its function unless it possesses a corresponding disposition to do so.

2.2 Use functions of artifacts

The planning account can easily be modified in order to deal with use functions of artifacts. Use functions are directed at those activities that users actually use things for. (Cf. (Mizoguchi et al. 2012, 110) for the distinction of design function and use function and (Preston 2003) for problems of this distinction.) If I use my screwdriver to open my paint cans, it has the use function to open paint cans. It has not been produced for this purpose, hence the use function can differ from its design function, though it might be just the same. Moreover, one and the same thing can have many different use functions on different occasions. This account can be extended to biomedical entities. If I use digitalis to kill my wife, I have a certain action plan that involves the participation of both a dose of digitalis and my wife, with a certain intended outcome.

2.3 The problem of biological functions

In pre-Darwinian biology, organisms and their parts were described as if they, too, were something created, either, allegorically speaking, by a personified Nature, or by God as a creator. In the latter case, ascribing functions to biological entities could be conceived of as reading the mind of God before the act of creation and as a reconstruction of the reasoning underlying his creation. From this point of view, the planning account can be extended to biological functions: the truth-maker of the ascription of a biological function is, in this framework, God's plan for his creation. In the former case we have something like an as-if parlance, which can be found, e.g., in Aristotle: although Aristotle rejected the idea that the universe or life had a beginning in time, he often says that Nature has well organized its creatures. We suggest reading this as a fictionalist account of biological function, meaning: were this plant or animal brought about by Mother Nature (a very intelligent designer), she would have done so for good reason. Hence, the planning account can be upheld with a small modification: the truth-maker of the ascription of a biological function no longer is an actual plan, but a plan within the fiction of Mother Nature designing its creatures. Both accounts show that pre-Darwinian biology had no problem with applying a variant of the planning account to biological functions. While biological entities are still treated as functional wholes, the planning account is no longer viable for the post-Darwinian biology of today. But then the problem arises: What is the ground for a biological function ascription? And this comes down to: What is a biological function?

2.4 Philosophical accounts of biological functions

The concept of a function has been controversially discussed in the philosophy of biology and elsewhere, and several accounts have been proposed to illuminate the nature of biological functions (cf. (Ariew et al. 2002) and (Krohs/Kroes 2009) for recent contributions to this debate, cf. also (Johansson et al. 2005) and (Burek et al. 2006) for an elaborate formal ontology of functions). Among the different approaches to functions the most straightforward is the causal role analysis: that X has function F simply means that X has the disposition to causally contribute to some output O of a complex system S ((Cummins 1975) as characterized by (Boorse 2002); see there for further references). A well-known problem is that this account is extremely broad and admits many unintuitive functions: it implies, e.g., that clouds should be ascribed the function to produce rain, because they undoubtedly have a central causal role in the production of rain (more examples and further criticism in (Boorse 2002)). To avoid this broadness, further conditions have to be added in order to narrow down the possible functions of a thing. Intuitively, functions are connected either with some intention, as in artifactual functions, or with a (not necessarily intended or conscious) goal, as in biological functions. A formal characterization of such a goal-contribution approach has been suggested by (Röhl 2012). However, many philosophers of science see even such "deflationary" teleological accounts as untenable in post-Darwinian biology. As an alternative, etiological accounts of biological functions have been suggested (Wright 1973). Instead of looking forward to a goal, a function is taken to be dependent on the history of a type of biological entity exemplifying the function, i.e. the development by evolutionary selection that accounts for its existence in the first place. However, according to the etiological account, in the first generation of a biological type there cannot be any functions for definitional reasons, although the actual structure of the organs would be "functional", which is very counterintuitive. Moreover, a certain body part may acquire new uses and functions during the evolutionary history of a species, while the early history and hence the reason for its existence remain the same (Boorse 2002: 66).

In face of these difficulties, Boorse's "general goal-contribution" approach can be considered the "minimal core" of the concept of a function, with system S, system part X and goal G: "X performs function Z in the G-ing of S at t if and only if at t, the Z-ing of X is a causal contribution to G" (Boorse 2002: 70). This is still weak, because functions could be performed only once and accidentally and fulfill this definition. Boorse relies on distinctions like the normal function of a type as opposed to accidental functions or deviations of single tokens to avoid those counterintuitive accidental functions. The concept of a systemic function as suggested by Mizoguchi et al. (2012: 109) is very similar to this account. With the introduction of a "systemic context" C in addition to a system S and a system part A they define: "C is a systemic context for S and according to C, A is a component of a subsystem of S, the goal of this subsystem is to realize the goal of C, and some behaviors of A play the (functional) role determined by C." This can be applied both to biological and to technical functions. A systemic context C for the human liver (= A) would be the human digestive system (goal: digestion of food and extraction of nutrients), and within the subsystem of fat digestion the function of the liver is the production of bile. The point is that the systemic function is context-dependent in a very specific way, i.e. via a system its bearer is part of. Hence the function of a thing can change depending on the system it is a component of.

2.5 Functions in BFO

In the older versions of BFO, functions, dispositions and roles are sibling subclasses of the class "realizable dependent continuant" (Arp/Smith 2008). This common superclass implies that all three are:
• continuants, i.e. they are wholly present at any time of their existence;
• dependent (like qualities) on the independent continuant (some material thing or system) that is their bearer;
• realizable, which means that they are essentially connected to certain processes, their realizations; and
• such that their bearers participate in the realization processes.
These are the processes they are roles, dispositions or functions for. Note that realizables do not have to be (always or ever) realized (Röhl/Jansen 2011), as, e.g., in the case of a safety mechanism the function of which will only be realized if certain conditions obtain (and they may never obtain).

For the specific differences between functions, roles and dispositions Arp and Smith draw on elements of the debates sketched above (Table 2). The realizations of functions and dispositions take place "in virtue of the bearer's physical make-up", whereas the realizations of roles are not grounded in the physical structure of the bearer but dependent on circumstances. Functions are distinguished from dispositions by the additional condition that the function bearer possesses the physical structure that grounds the function because of how it came to be there in the first place: in the case of artifactual functions by the intentional design and production, in the case of biological functions by a history of evolutionary selection. So BFO-roles are closer to optional, accidental "use functions", whereas BFO-functions and BFO-dispositions are determined by their causally relevant internal structure and thus close to the goal-contribution account, as it is said that the realization of a (biological) function "helps to realize the characteristic physiology and life pattern for an organism of the relevant type" (Arp/Smith 2008: 2). The specific difference of dispositions and functions is again the historical (evolutionary) or the intentional (design) component, respectively. Similarly, these intentional and historical criteria are used in BFO 2 as the specific difference of functions as opposed to non-functional dispositions.

Table 2: Function, role, and disposition in BFO

 | Disposition | Artifact function | Biological function | Role
Grounding | internal | internal | internal | external
Modal status | mixed | essential? | essential? | accidental
Relevance of history | no | yes | yes | yes
Dependent on intentions | no | yes | no | yes

3 FUNCTIONS AND DISPOSITIONS

3.1 Dispositions

A common philosophical position takes dispositions as a type of property (Ellis/Lierse 1994). A disposition is a causal property that is linked to a realization, i.e. to a specific behaviour or process which the individual that bears the disposition will show under certain circumstances or as a response to a certain trigger. Something is water-soluble if it dissolves when put in water. In this fashion dispositions establish a link between (independent) continuants (stable things) and occurrents (processes), and the fundamental connection is the following: continuant S has disposition D for realization P and, in case P occurs, S, the bearer of the disposition, is a participant of this process P. Dispositions are often treated as a special kind of dependent continuants that are linked to a process of realization by a respective formal relationship to a realization process (Arp/Smith 2008), (Roehl/Jansen 2011), (Jansen 2007).
Note that the terminology is often confusing, as dispositions and functions are sometimes named after their realization processes. In the Gene Ontology (http://www.geneontology.org), e.g., many subclasses of the Molecular Function branch are described by the term "activity" (as in "bioactive lipid receptor activity"), which, paradoxically, does not denote the actual acting (process). In this paper, function, disposition and role will be held strictly distinct from the processes that are their realizations.

3.2 Are functions dispositions?

The central difference between dispositions on the one hand, and functions and roles on the other, lies in their context-dependence. Continuants may lose or acquire dispositions, but not without fundamental changes within the bearer. In contrast, many functions can be performed by different types of bearers, and an object may have different functions in different contexts without any change in itself. Chopsticks, for example, have the function to support eating. Similar sticks found in the wood do not have any such function, though they may have the very same physical structure and hence the same dispositions. Dispositions, that is, are purely internally grounded, while the function of the chopsticks is a historical property due to the way this artifact has been produced. On the other end of the spectrum, social functions and roles are context-dependent or "externally grounded" in the respective context. Biological functions, like those of organs, enzymes etc., are somewhere in between, with an entity usually fulfilling several functions in a certain range of contexts. They are objective systemic functions in the sense mentioned above and not merely ascribed by an agent; their context-dependence is fixed by the functional hierarchy of the respective physiological system. An organ like the liver has many functions, such as production of bile, glycogen storage, cholesterol synthesis etc. But all of these are fixed by the respective physiological systems the liver and its products are functionally involved in. They are not as arbitrary or flexible as the screwdriver that could serve the use functions (i.e. roles) of a can opener or a weapon.

Malfunctioning is (i) having a function but (ii) not living up to it. In a case of malfunctioning, the realization of the function either does not happen at all or happens only in an insufficient way. Technically speaking, the output of the thing or system is not in the standard or target range (Del Frate 2012). If the realization can be measured quantitatively, we can distinguish between hypofunction and hyperfunction, that is, staying below a lower threshold or exceeding an upper limit for the output parameter of the function. This occurs frequently in biology and medicine, and hypo- or hyperfunctions are often disorders, e.g., hypotension and hypertension, disorders of the cardiovascular system with regard to the parameter blood pressure. This definition of malfunction presupposes the very possibility of a function being present without the corresponding disposition. Hence they cannot be identical. Malfunctioning is therefore a clear indicator of the normative dimension of functions. More generally, we can make value judgements about artifacts or body parts with respect to their function: Something may be a bad saw or a good heart. This normative aspect of functions is in general not shared by dispositions. Dispositions can be blocked or incompletely realized, but their bearers are not evaluated in a normative fashion. Therefore it does not seem appropriate to classify functions as dispositions.

E.g., a lung with a carcinoma may still have the function to serve as an oxygen provider for the body, but the function may no longer be realized because the corresponding disposition (to be able to serve as an oxygen provider for the body) is no longer present. According to Vermaas/Houkes (2003), any theory of function has to give an account of the normative aspect of functions and of the possibility of malfunctions. Now there seem to be at least two distinct ways to interpret this case:

(1) The token lung has lost the disposition and, because functions (a) are dispositions or (b) are dependent on them, it has also lost the function. Malfunction for a token then means that this token has simply lost the function, rather than that it still has the function without being able to realize it. But we can speak of malfunction rather than non-function, because the type lung typically has both the disposition and the corresponding function to provide oxygen (Boorse 2002: 89; McLaughlin 2009). As McLaughlin points out, being a token of a type involves an evaluative dimension. Other possible sources of the normativity of functions are a means-end relationship or a hierarchical part-whole relation (McLaughlin 2009). These aspects come together in a systemic goal-contribution account, because here the functioning and the malfunction of a part in the functional hierarchy is evaluated with respect to its working as a means to the end/goal of the whole.

(2) The function is ontologically independent of the disposition. The function of a lung as a normative ascription is still there, but because the corresponding disposition is not, there is malfunction. The task of medicine would then be to restore the disposition so that the organ would be (fully) functional again.

While there is no clear-cut argument to decide between these two options, option (2) is better able to account for healing processes. For according to (2), a healing process consists in restoring a disposition where there is a function without its corresponding disposition. According to (1), however, there cannot be functions without a corresponding disposition (e.g., because functions are these dispositions). Thus a healing process according to (1) consists in restoring the very function.

3.3 Do functions depend on dispositions?

For all of these reasons, we should assume that functions are not identical to dispositions. Nevertheless, even if they are distinct entities, functions could ontologically depend on dispositions. On the planning account, functions of artifacts are clearly independent of the dispositions of their bearers, as, due to the fallibility of human designers, the one could occur without the other. From the point of view of the theistic extension of the planning account to biological functions, however, the existence of a function implies the existence of the corresponding disposition in typical cases, given the usual assumptions of God's omniscience and benevolence. But even if a biological function is typically accompanied by a disposition, this concurrence is not universal, as proved by malfunctioning.

The dispositions as part of the internal structure of a thing determine whether it can fulfill the respective function in a given context. Johansson (2004) calls this the "substratum" of a function. While the function itself is independent of its substratum, its realization depends on its existence. This dependence can be a generic one, because sometimes different dispositions or structures can ground the same function: E.g., the cooling function of a cooler can be implemented in different technical setups (Johansson 2004: 66).

As we know biological functions only through their actual realizations, we would have no reason to ascribe them unless instances of a certain kind typically displayed that behaviour and, a fortiori, possessed a corresponding substratum disposition. How would we know the biological function of, say, a heart, if hearts did not typically have the disposition to pump blood and typically realized this function? So there should be some connection between the function and the disposition of the organ. On the other hand, many diseases like heart insufficiency are characterized by the very contrast between functions and the lack of corresponding dispositions, and so is malfunctioning in general. Malfunctioning artifacts or diseased organs are characterised by the loss of the disposition to fulfill their function. We conclude that the corresponding disposition is only necessary for the realization of a function, not for the function itself. Because in biological (and many artifactual) cases we can evaluate the performance of token functions with respect to what is a normal realization for the function type, and because the normal realization is dependent on the corresponding disposition, we have a correspondence of function and disposition at the type level or for prototypical tokens. But this is to be distinguished from a token-level dependence of the function on the corresponding disposition. If we want to accommodate malfunctioning, we have to reject the latter.

4 CONCLUSION

We can summarize the discussion by suggesting a new classification of realizables. It concurs with BFO 1.1.1 in treating functions as siblings of dispositions rather than as special dispositions. It makes use of two independent criteria: essence optionality and structure optionality. A realizable can be optional given a certain physical structure of its bearer. All realizables that are externally grounded, i.e. grounded in some context, are optional in this sense, e.g. all roles. In contrast, dispositions are internally grounded, based on the bearer's physical structure, and therefore not optional given the bearer's physical structure. Because a bearer can gain and lose dispositions, some dispositions are also optional. But what is at stake here is optionality given the essence of the bearer. If a disposition is optional in this sense, a bearer can lose it without ceasing to be. However, not all dispositions are optional given the essence of the bearer. Some dispositions, like the disposition of a proton to attract electrons, are essential: Losing this disposition would imply that the proton ceases to be a proton, i.e. that it ceases to exist. Functions are essential in this sense: Given the essence of being a heart, it is not optional to have the function to pump blood. And given the essence of being a screwdriver, it is not optional to have the function to manipulate screws. We thus end up with a cross-classification of realizables, presented in Table 3.
Table 3: A new cross-classification of realizables

                                                 Internally grounded               Externally grounded
                                                 (= non-optional given the         (= optional given the
                                                 physical structure)               physical structure)
Essential (= non-optional given the essence)     Essential disposition             Function
Accidental (= optional given the essence)        Accidental disposition            Role

Functions are externally grounded. We argued that there are good arguments not to treat functions as dispositions, nor to make functions dependent on dispositions. This distinction is our central disagreement with the BFO 2 suggestion discussed above. We also define roles in a rather narrow way, different from Mizoguchi et al. (2012) and other papers (for a very wide notion of role cf. Loebe 2007): On our account, roles are never essential for their bearer. The way of speaking that would assign an essential "breather" or "eater" role to a human being is not to be taken ontologically seriously. Breathing and eating are processes, not functions or roles. Humans have to participate in breathing and eating processes on a regular basis, but it is not their role to breathe. Therefore, in our classification scheme, what are called "use functions" in the literature (cf. 2.2 above) are roles, in agreement with the BFO conception of roles.

All in all, we discussed three different models for the ontological analysis of functions:
• The planning account was able to deal with design and use functions of artifacts, but not with biological functions.
• Equating functions with dispositions leads to problems with malfunctions.
• Only treating functions as a sibling category of dispositions was able to circumvent these problems.
On this latter account, functions are not only disjoint from dispositions, they are also ontologically independent of dispositions. Functions are, however, normally and mostly accompanied by corresponding dispositions. This is the reason why it is so difficult to distinguish between these categories. Malfunctioning, however, requires them to be distinct categories: It happens in case the corresponding disposition is lacking.

ACKNOWLEDGMENTS
This work was supported by DFG grant JA 1904/2-1 within the project GoodOD. Many thanks to Andrew Spear, who provided us with a recent draft of the follow-up version of (Spear 2006), and to three anonymous referees for critical and helpful comments.

REFERENCES
Ariew, A./Cummins, R./Perlman, M. (eds.) (2002): Functions. New Essays in the Philosophy of Psychology and Biology, Oxford.
Arp, R./Smith, B. (2008): Function, Role, and Disposition in Basic Formal Ontology, Proceedings of the Bio-Ontologies Workshop (ISMB 2008), 45–48.
Boorse, C. (2002): A Rebuttal on Functions, in Ariew et al. (2002), 63–112.
Burek, P./Hoehndorf, R./Loebe, F./Visagie, J./Herre, H./Kelso, J. (2006): A top-level ontology of functions and its application in the open biomedical ontologies, Bioinformatics 22(14), e66–e73.
Cummins, R. (1975): Functional analysis, Journal of Philosophy 72, 741–765.
Del Frate, L. (2012): Preliminaries to a Formal Ontology of Failure in Engineering Artifacts, in Donnelly, M./Guizzardi, G. (eds.): Formal Ontology in Information Systems (FOIS 2012), IOS Press, 107–130.
Ellis, B./Lierse, C. (1994): Dispositional Essentialism, Australasian Journal of Philosophy 72, 27–45.
Jansen, L. (2007): Tendencies and other Realizables in Medical Information Sciences, The Monist 90(4), 534–555.
Johansson, I. (2004): Ontological Investigations, Heusenstamm.
Johansson, I./Smith, B./Munn, K./Tsikolia, N./Elsner, K./Ernst, D./Siebert, D. (2005): Functional Anatomy: A Taxonomic Proposal, Acta Biotheoretica 53, 153–166.
Krohs, U./Kroes, P. (eds.) (2009): Functions in Biological and Artificial Worlds. Comparative Philosophical Perspectives, Cambridge, Mass.
Loebe, F. (2007): Abstract vs. social roles. Towards a general theoretical account of roles, Applied Ontology 2(2), 127–158.
McLaughlin, P. (2009): Functions and Norms, in Krohs/Kroes (eds.) 2009, 93–102.
Mizoguchi, R./Kitamura, Y./Borgo, S. (2012): Towards a Unified Definition of Function, in Donnelly, M./Guizzardi, G. (eds.): Formal Ontology in Information Systems (FOIS 2012), IOS Press, 103–116.
Preston, B. (2003): Of Marigold Beer – a reply to Vermaas and Houkes, British Journal for the Philosophy of Science 54, 601–612.
Röhl, J. (2012): Mechanisms in biomedical ontology, Journal of Biomedical Semantics 3(Suppl 2), S9.
Röhl, J./Jansen, L. (2011): Representing dispositions, Journal of Biomedical Semantics 2(Suppl 4), S4.
Smith, B. et al. (2005): Relations in Biomedical Ontologies, Genome Biology 6, R46.
Spear, A. D. (2006): Ontology for the 21st Century. An Introduction with Recommendations (BFO Manual), http://www.ifomis.org/bfo/documents/manual.pdf.
Vermaas, P. E. (2009): On Unification: Taking Technical Functions as Objective (and Biological Functions as Subjective), in Krohs/Kroes (eds.) 2009, 70–87.
Vermaas, P./Houkes, W. (2003): Ascribing Functions to Technical Artefacts: A Challenge to Etiological Accounts of Functions, The British Journal for the Philosophy of Science 54, 261–289.
Wright, L. (1973): Functions, The Philosophical Review 82, 139–168.

Function of Bio-Models: Linking Structure to Behaviour
Clemens Beckstein, Christian Knüpfer
Artificial Intelligence Group, University of Jena, Germany

ABSTRACT
Computational models in Systems Biology are used in simulation experiments for addressing biological questions. Generally, these questions are causal questions asking for mechanistic explanations of biological phenomena. An epistemological analysis of the role and use of models in Systems Biology is an important prerequisite for computer support for answering these questions. The notion of biological function can also be applied to computational models in Systems Biology. Models play a specific role (teleological function) in the scientific process of finding explanations for dynamical phenomena. In order to fulfil this role a model has to be used in simulation experiments (pragmatical function). A simulation experiment always refers to a specific situation and a state of the model and the modelled system (conditional function). We claim that the function of computational models refers to both the simulation experiment executed by software (intrinsic function) and the biological experiment which produces the phenomena under investigation (extrinsic function). In this paper we describe the different functional aspects of computational models in Systems Biology. This description is conceptually embedded in our "meaning facets" framework, which systematises the interpretation of models into structural, functional and behavioural facets. Briefly, the function links the structure and the behaviour of a model. A thorough analysis of the function of bio-models is therefore an important first step towards a comprehensive formal representation of the knowledge involved in the modelling process. Any progress in this area will in turn improve computer-supported modelling.
In this paper we use our conceptual framework to review formal accounts for functional aspects of models, such as checklists, ontologies, and formal languages and outline a strategy for developing an ontology for describing the intention of bio-models. 1 INTRODUCTION Models in Systems Biology have two essential features: They are dynamic and computational. “Dynamic” refers to the fact that the models are used in simulation experiments in order to generate temporal behaviours. “Computational” means that the simulation experiments are executed by software. One conclusion from the latter feature is that the models have to be encoded in some computerunderstandable format. For short, we call a computational dynamic model of a biological systems a “bio-model”. Generally, a mathematical model establishes a relation between a system under observation (what Rosen calls the “natural system”) and a formal system. Rosen calls this the “Modeling Relation” [1]. In order to be useful the modelling relation has to be an isomorphism: the structure of the formal system can be mapped onto the structure of the natural system. The notion of “function” as used in physiology incorporates two aspects of a biological entity: (1) The function states a role of an entity played as a component of an encompassing process. The biological function is therefore tied to a specific process. (2) The function characterises the behaviour which the entity has to exhibit for fulfilling its role. Biological function links system structure (the entities and relation) to behaviour. The most famous example is the function of genes which links the genotype to the phenotype. The notion of function can also be applied to bio-models: The function of a bio-model describes its role in the scientific process of finding mechanistic explanations for biological phenomena. Beside these teleological aspects of a model’s function there are pragmatical aspects: The function of a bio-model also describes the use of the model in simulation experiments for generating behaviours. Thus, function links a model (structure) to its behaviours. To summarise, the function of bio-models describes why and how to use models in simulation experiments. In this paper we investigate the function of bio-models (Sec. 3) and whether and how the function can be formalised (Sec. 4). We claim that the function of a bio-model links its structure to its behaviour. Before we describe the function of a bio-model in detail we will introduce the “meaning facets” (Sec. 2) which provide a framework to systematically describe the knowledge involved in creating and using bio-models. 2 MEANING FACETS OF BIO-MODELS In our analysis of the modelling process in Systems Biology we make two important observations about bio-models [2]. First, a computational model has a dual interpretation: In order to be used in computer simulations the encoded model has to be intrinsically interpreted with respect to the encoding format used. This can be done without referring to the modelled natural system. Furthermore, a model has to be related to the natural system (cf. Rosen’s modelling relation [1] mentioned above), i.e. it has to be extrinsically interpreted. Second, dynamic models are considered on three levels: Models are systems of components and relations (model structure). Models are used in simulation experiments for answering biological questions (model function). Models exhibit temporal changes (model behaviour). 
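As a rough illustration of how these two observations combine, the following Python sketch simply enumerates the resulting two-by-three grid of interpretations and levels; the enum names and glosses are our own paraphrases, introduced only for exposition.

    from enum import Enum
    from itertools import product

    # Illustrative enumeration (our naming, for exposition only):
    # two interpretations x three levels of a bio-model.

    class Interpretation(Enum):
        INTRINSIC = "with respect to the encoding format / formal system"
        EXTRINSIC = "with respect to the modelled natural system"

    class Level(Enum):
        STRUCTURE = "components and relations"
        FUNCTION = "use in simulation experiments"
        BEHAVIOUR = "temporal changes"

    facets = list(product(Interpretation, Level))
    assert len(facets) == 6
    for interpretation, level in facets:
        print(f"{interpretation.name.lower()} {level.name.lower()}: {level.value} ({interpretation.value})")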
The three levels of bio-models are inspired by functional modelling in engineering (see, e.g., [3]). A complete description of a bio-model has to encompass all six “meaning facets” [2], i.e. the intrinsic and extrinsic sides on each of the three levels. If we look at the scientific process of modelling a biological system in order to explain data observed in experiments we can clearly identify the three levels and the dual interpretation (see Fig. 1). There are more details about the six meaning facets [4]. However, in this paper we will focus on the functional facets of bio-models. 3 FUNCTION OF BIO-MODELS A bio-model has an intrinsic and an extrinsic function (Sec. 2) which can be understood in teleological and in pragmatical terms (Sec. 1). There is a third perspective on the function of bio-models, which we call “conditional”: The role played by an entity depends on the Paper H 1 Function of Bio-Models: Linking Structure to Behaviour computer model system results simulation modelling intrinsic model competence intention performance extrinsic dynamics explanation intrinsic intrinsic extrinsic extrinsic experiment target system data reality Figure 1. Structure (blue/left), function (yellow/middle), and behaviour (green/right) of a bio-model. The model relates the (intrinsic) computer representation with the (extrinsic) biological reality. (1) Structure: The biological target system is transferred into a model which can in turn be intrinsically interpreted as a formal system. This establishes a modelling relation between the two systems. If there is a valid mapping between the components of the target system and the formal system, we call the model a competence model. (2) Function: The intention of the model is its use in simulation experiments for explaining biological phenomena observed in biological experiments. (3) Behaviour: The simulation experiments produces results which can be interpreted as the dynamics of the model. This dynamics can be related to the interpreted data of the biological experiments. If the behaviour of the model is similar to the behaviour of the biological system, we call the model a performance model with respect to the corresponding biological phenomena. Explanation is using a competence model in an simulation experiment which makes it a performance model with respect to the biological phenomena to explain. context; in different situations or under different conditions an entity may have different roles. 3.1 Intended Use (Teleological Function) Bio-Models are used in simulation experiments in order to answer questions about the biological system under investigation. The questions may regard the explanation of observed behaviours or the prediction of possible behaviours. What is an accepted explanation or prediction depends on the scientific field and community [5]. Furthermore, specific assumptions restrict the permitted answers, e.g. by stating that a specific reaction is very fast. The extrinsic teleological function of bio-models refers to the questions addressed by the model and the assumptions restricting the answers. The intended use of the model in simulation experiments has to reflect the questions. Depending on the kind of questions different types of simulation experiments may be appropriate. Often, different simulation experiments have to be combined in order to yield the desired outcome. Constraints which are in line with the assumptions are imposed on the simulation experiments. 
Such constraints may include value restrictions, ratios between values, and conservation rules. The intrinsic teleological function of a bio-model describes its intended use and the imposed constraints.

3.2 Model Instantiation (Conditional Function)
In general, the addressed questions refer to certain boundary conditions and to a specific initial state of the experimentally observed biological system. The boundary conditions determine the environment of the biological system (e.g. temperature, pH, nutrition) and may be reflected by corresponding kinetic data. Often, plausible ranges are given for some of the conditional values instead of single values. The extrinsic conditional function of a bio-model is expressed in terms of boundary conditions and initial states.
A bio-model contains state variables and formal parameters. In order to be used in simulation experiments, the model must be fully instantiated, i.e. concrete values must be assigned to all parameters. Furthermore, the initial values for all state variables have to be chosen. The intrinsic conditional function of a bio-model makes the model ready to be used in simulation experiments by means of parameter instantiation and state initialisation.

3.3 Experimental Setup (Pragmatical Function)
As mentioned above, bio-models explain or predict the behaviour of the modelled biological system. This teleological function requires a complementary description of the experimental settings which lead to the observed behaviour and allow for the verification of the predicted behaviour. Usually the experimental data is transformed into the final observations by means of result calculations. The extrinsic pragmatical function of a bio-model describes the experimental settings and result calculations related to the dynamical phenomena under investigation.
Bio-models are used in simulation experiments. The setup of the simulation experiments precisely describes the procedure applied to the instantiated model. This involves the simulation algorithm used and specific settings for this algorithm. In addition, the exact steps, their order and any applied perturbations have to be specified. Post-processing of the raw data from the simulation experiments generates the desired outcome. The intrinsic pragmatical function of a bio-model describes the setup of the simulation experiments applied to the model structure and the post-processing which finally produces the model behaviour.

Table 1. Formal approaches for functional aspects of bio-models. See main text for details.

Functional aspect                  Checklists   Languages   Ontologies
intended use          intrinsic    MIASE        SED-ML      see Sec. 4.4
                      extrinsic    unknown      unknown     see Sec. 4.4
model instantiation   intrinsic    MIASE        SED-ML      useless
                      extrinsic    MIBBI        SABIO-RK    XCO
experimental setup    intrinsic    MIASE        SED-ML      KiSAO
                      extrinsic    MIBBI        FuGE        MMO

4 FORMAL REPRESENTATION OF FUNCTION
In this section we will briefly review existing approaches for formalising the different functional aspects of bio-models (see Sec. 3). Tab. 1 provides an overview of the formal approaches. The classification of the formal approaches into checklists, languages and ontologies is motivated by [6]. There are some gaps in Tab. 1. We are not aware of any checklists or languages for the extrinsic teleological function of bio-models. In Sec. 4.4 we therefore propose an ontology for teleological functions which could be a starting point for such efforts.
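Taken together, Sections 3.1–3.3 suggest that the intrinsic function of a bio-model could in principle travel with the model as a small structured record. The following Python sketch is only meant to make the three intrinsic aspects concrete; all field names and values are invented for illustration and follow neither MIASE nor SED-ML.

    # Illustrative only: a hand-rolled record of the three intrinsic functional
    # aspects of a bio-model. Field names are invented, not standardised.

    bio_model_function = {
        "intended_use": {                   # teleological function
            "question": "Does the pathway show sustained oscillations?",
            "simulation_type": "time course",
            "constraints": ["total enzyme concentration is conserved"],
        },
        "instantiation": {                  # conditional function
            "parameters": {"k1": 0.3, "k2": 1.2},       # concrete values for all parameters
            "initial_state": {"S": 10.0, "P": 0.0},     # initial values for all state variables
        },
        "experimental_setup": {             # pragmatical function
            "algorithm": "deterministic ODE integration",
            "steps": ["simulate 0-100 s", "perturb k1 at t = 50 s"],
            "post_processing": ["extract oscillation period from the S time course"],
        },
    }

    for aspect, description in bio_model_function.items():
        print(aspect, "->", sorted(description))

The following sections review the checklists, languages and ontologies that formalise parts of such a description in a standardised way.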
Ontologies for intrinsic model instantiation seem not to be very useful: There is not much conceptual knowledge involved in assigning values to parameters and variables. This is not the case for the (extrinsic) boundary conditions. The Experimental Conditions Ontology (XCO) [7] provides a rich vocabulary of experimental conditions for phenotype experiments. The Measurement method ontology (MMO) [7] can be used for specifying the measurement method in extrinsic experimental setup descriptions. Because XCO and MMO are slightly out of our scope we will not provide further details. All other formal approaches displayed in Tab. 1 are discussed in the following sections. 4.1 MIASE: A Checklist for Simulation Experiments To start the formalisation of a specific kind of scientific data, like bio-models, experiment protocols, or experimental results, the responsible community first of all has to be agree about the information needed. So-called “Minimum Information Checklists” state what information has as least to be described. Checklists are semi-formal in that they structure the information. However, the information is still formulated in natural language. The most important checklist for bio-models is the Minimal Information Requested In the Annotation of Models (MIRIAM, [8]). MIRIAM describes what information must be provided for exchanging models. Although it mainly refers to model structure MIRIAM also requests that a model should be able to be simulated and to reproduce relevant published results. The concrete intrinsic functional description of the simulation experiments producing these results is the main focus of the Minimum Information About a Simulation Experiment (MIASE, [9]). MIASE requests information about model instantiation (conditional function), the exact experimental setup, and the necessary post-processing (pragmatical function). It also partly concerns the intended use (teleological function) to the effect that the type of simulation experiment has to be described. For the description of the extrinsic function there are a lot of specific checklists listed in the MIBBI portal (Minimum Information for Biological and Biomedical Investigations, [10]). The listed checklists concern information about the boundary conditions (conditional function) and the experimental settings (pragmatical function) for specific types of biological experiments. 4.2 SED-ML: A Language for Simulation Experiments In most cases the checklists can be translated to a elementary data model which can then be extended to a formal language. Such a formal language for describing simulation experiments is SED-ML (Simulation Experiment Description Markup Language, [9]) which is able to specify the type of simulation (intrinsic teleological function), the model instantiation and initial values (intrinsic conditional function), and the setup and post-processing of the simulation experiments (intrinsic pragmatical function). However, SED-ML can not fully describe the intended use and the imposed constraints. The Systems Biology Markup Language (SBML, [11]) used for the encoding of the model structure is also able to determine parameter values and initial values. But, in order to be able to reuse models in different simulation experiments we suggest to clearly separate descriptions of models (in SBML) from descriptions of their use (in SED-ML). There are languages for describing extrinsic experimental conditions and specific biological experiments. 
SABIO-RK (System for the Analysis of Biochemical Pathways – Reaction Kinetics, [12]), for example., describes experimental and environmental conditions for measurements of kinetic data. The Functional Genomics Experiment Object Model (FuGE, [13]) describes experiments in functional genomics. We will not go into detail here. 4.3 KiSAO: An Ontology for Simulation Algorithms Ontologies formalise conceptual knowledge. They can provide vocabularies for formal languages. The Kinetic Simulation Algorithm Ontology (KiSAO, [14]) is employed within SED-ML to precisely specify the algorithms used for the simulation experiment. Thus, KiSAO contributes to the description of the intrinsic pragmatical function of bio-models. 4.4 An Ontology for Intention of Bio-Models How could an ontology for the teleological function of bio-models look like? The questions about the biological system under investigation which are addressed by the model are formulated in natural language. There is a wide diversity of such questions. However, if we focus on the actual tasks tackled by simulating the model, we are able to identify patterns in the questions. Common tasks include: Paper H 3 Function of Bio-Models: Linking Structure to Behaviour (1) the approximation of observed behaviour, (2) the investigation of the variability in behaviour, (3) the demonstration of the ability for specific kinds of behaviour, and (4) the examination of the influence of parameters to the behaviour. Each task requires a different corresponding simulation type. We could accordingly classify the intended uses of bio-models: (1) time series (eventually including parameter fitting), (2) bifurcation analysis, (3) stability analysis, and (4) parameter scan. This list of tasks and corresponding intended uses is far from complete. However, it outlines the strategy for developing an ontology for the intention of bio-models. 5 RELATED WORK The dual interpretation of bio-models is rooted in the “knowledge representation hypothesis” from Artificial Intelligence: “Any mechanically embodied intelligent process will be comprised of structural ingredients that a) we as external observers naturally take to represent a propositional account of the knowledge that the overall process exhibits, and b) independent of such external semantical attribution, play a formal but causal and essential role in engendering the behavior that manifests that knowledge.” – [15, p.15] Simon generalises this duality to all kinds of artifacts which serve as interfaces between an inner and an outer environment [16]. Rosen’s modelling relation [1] is a congruence between an extrinsic/outer natural system and an intrinsic/inner formal system established by a model. There are other conceptual frameworks for modelling and simulation in general [17; 18], and in particular for bio-modelling [6]. However, our meaning facets are more rigid and provide much more details. Zeigler’s notion of an “experimental frame” [18] resembles the model instantiation (conditional function) and the experimental setup (pragmatical function) presented in this paper. Furthermore, the experimental frame “is the operational formulation of the objectives that motivate a modeling and simulation project” [18, p.27], i.e. it also describes the teleological function of a model. The field of functional modelling relates structure, behaviour and function of engineering artifacts. [3] reviews the different approaches to formalise function and its relations to structure and behaviour. 
Two different notions of function are employed in [3]: On the one hand, function is mediating between structure and behaviour and determines the “structural behaviours”, i.e. all possible behaviours the model is able to show. This is more or less what we call conditional and pragmatical function. On the other hand, function refers to the intention of the modeller and restricts the possible behaviours to the “expected behaviours”. This notion of function as purpose corresponds to our teleological function. In short, functional modelling addresses the “the questions of what the device and its components do or what the purpose of the device and its components are” [3, p.149]. Joining these two sides function becomes “the bridge between human intention and physical behavior of artifacts” [19, p.271]. The distinction between structural and expected behaviours originates from [20]. We transfer the notion of function in biology to modelling in systems biology. There are some strong parallels between function in biology and function of bio-models. The idea that function 4 links between structure and behaviour is deeply rooted in molecular biology, as, e.g., stated in [21, p.0712]: “If one such behavior seems useful (to the organism), it becomes a candidate for explaining why the network itself was selected, i.e., it is seen as a potential purpose for the network.” However, we will not discuss the notion of function in biology further here. [22] compares the notion of function in biology and technology and examines disanalogies and parallels. There are some formal approaches to biological function. Ontologies like EcoCyc [23] and the Gene Ontology [24] list molecular functions played by biological entities. [25] presents an ontology of biological functions which formalises three functional aspects: the so-called “function structure”, the realisation and the has-function relation, which could be related to our teleological, pragmatical and conditional function, respectively. 6 CONCLUSION We have applied the notion of function to computational models in Systems Biology. Function is the link between the model structure and the model behaviour. The intrinsic function of bio-models describes three aspects of the model’s use: Why should the model be used in simulation experiments (teleological function)? Which model instance should be used in simulation experiments (conditional function)? How should the model be used in simulation experiments (pragmatical function)? The extrinsic function of a bio-model refers to the biological questions which are addressed by the model. The systematisation of the functional aspects of bio-models was used to review corresponding formal accounts. Some functional aspects are well covered by checklists, languages and associated ontologies. However, there is no ontology for the biological questions and the intended use of bio-models. We outlined a strategy for the development of such an ontology. Our epistemological analysis of functional aspects of computational models in Systems Biology and their use in simulation experiments provides an important prerequisite for formalising the involved knowledge. Ultimately, this will improve any computersupported research method for answering biological questions by means of bio-models. REFERENCES [1]Robert Rosen. Anticipatory systems. Pergamon Press, Oxford, UK, 1985. [2]Christian Knüpfer, Clemens Beckstein, and Peter Dittrich. Towards a semantic description of bio-models: Meaning facets – a case study. 
In Sophia Ananiadou and Juliane Fluck, editors, Proceedings of the Second International Symposium on Semantic Mining in Biomedicine (SMBM 2006), Jena, April 9-12, 2006, CEUR-WS, pages 97–100, Aachen, 2006. RWTH University. [3]Mustafa Suphi Erden, Hitoshi Komoto, Thom van Beek, Valentina D’Amelio, Erika Echavarria, and Tetsuo Tomiyama. A review of function modeling: Approaches and applications. AI EDAM, 22(02):147–169, 2008. [4]Christian Knüpfer, Clemens Beckstein, and Peter Dittrich. How to formalise the meaning of a bio-model: A case study. In BioSysBio 2007: Systems Biology, Bioinformatics, Synthetic Biology, Meeting abstracts. 11.-13. January Manchester, UK, volume 1, Suppl 1 of BMC Systems Biology, page 28, 2007. Paper H [5]Peter Machamer, Lindley Darden, and Carl F. Craver. Thinking about mechanisms. Philosophy of Science, 67(1):pp. 1–25, 2000. [6]Vijayalakshmi Chelliah, Lukas Endler, Nick Juty, Camille Laibe, Chen Li, Nicolas Rodriguez, and Nicolas Le Novère. Data integration and semantic enrichment of systems biology models and simulations. Data Integration in the Life Sciences, pages 5–15, 2009. [7]Mary Shimoyama, Rajni Nigam, Leslie Sanders McIntosh, Rakesh Nagarajan, Treva Rice, D. C. Rao, and Melinda R. Dwinell. Three ontologies to define phenotype measurement data. Front Genet, 3:87–87, 2012. [8]Nicolas Le Novère, Andrew Finney, Michael Hucka, Upinder S Bhalla, Fabien Campagne, Julio Collado-Vides, Edmund J Crampin, Matt Halstead, Edda Klipp, Pedro Mendes, Poul Nielsen, Herbert Sauro, Bruce Shapiro, Jacky L Snoep, Hugh D Spence, and Barry L Wanner. Minimum information requested in the annotation of biochemical models (MIRIAM). Nat Biotech, 23(12):1509–1515, December 2005. [9]Dagmar Köhn and Nicolas Le Novère. SED-ML – an XML format for the implementation of the MIASE guidelines. In Computational Methods in Systems Biology. Proceedings of the 6th International Conference CMSB 2008, Rostock, Germany, October 12-15, 2008., Lecture Notes in Computer Science, pages 176–190, Berlin, Heidelberg, 2008. Springer. [10]Chris F. Taylor, Dawn Field, Susanna-Assunta Sansone, Jan Aerts, Rolf Apweiler, Michael Ashburner, Catherine A. Ball, Pierre-Alain Binz, Molly Bogue, Tim Booth, Alvis Brazma, Ryan R. Brinkman, Adam Michael Clark, Eric W. Deutsch, Oliver Fiehn, Jennifer Fostel, Peter Ghazal, Frank Gibson, Tanya Gray, Graeme Grimes, John M. Hancock, Nigel W. Hardy, Henning Hermjakob, Randall K. Julian, Matthew Kane, Carsten Kettner, Christopher Kinsinger, Eugene Kolker, Martin Kuiper, Nicolas Le Novère, Jim Leebens-Mack, Suzanna E. Lewis, Phillip Lord, Ann-Marie Mallon, Nishanth Marthandan, Hiroshi Masuya, Ruth McNally, Alexander Mehrle, Norman Morrison, Sandra Orchard, John Quackenbush, James M. Reecy, Donald G. Robertson, Philippe Rocca-Serra, Henry Rodriguez, Heiko Rosenfelder, Javier Santoyo-Lopez, Richard H. Scheuermann, Daniel Schober, Barry Smith, Jason Snape, Christian J. Stoeckert, Keith Tipton, Peter Sterk, Andreas Untergasser, Jo Vandesompele, and Stefan Wiemann. Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nat Biotech, 26(8):889–896, August 2008. [11]Michael Hucka, Andrew Finney, Herbert M. Sauro, Hamid Bolouri, John C. Doyle, Hiroaki Kitano, Adam P. Arkin, Benjamin J. Bornstein, Dennis Bray, Athel Cornish-Bowden, Autumn A. Cuellar, Serge Dronov, Ernst Dieter Gilles, Martin Ginkel, Victoria Gor, Igor I. Goryanin, Warren J. Hedley, T. Charles Hodgman, Jan-Hendrik S. Hofmeyr, Peter J. Hunter, Nick S. Juty, Jay L. 
Kasberger, Andreas Kremling, Ursula Kummer, Nicolas Le Novère, Leslie M. Loew, Daniel Lucio, Pedro Mendes, Eric Minch, Eric D. Mjolsness, Yoichi Nakayama, Melanie R. Nelson, Poul F. Nielsen, Takeshi Sakurada, James C. Schaff, Bruce E. Shapiro, Thomas Simon Shimizu, Hugh D. Spence, Jörg Stelling, Koichi Takahashi, Masaru Tomita, John M. Wagner, and Jian Wang. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics, 19(4):524–531, Mar 2003. [12]Ulrike Wittig, Renate Kania, Martin Golebiewski, Maja Rey, Lei Shi, Lenneke Jong, Enkhjargal Algaa, Andreas Weidemann, Heidrun Sauer-Danzwith, Saqib Mir, Olga Krebs, Meik Bittkowski, Elina Wetsch, Isabel Rojas, and Wolfgang Müller 0001. SABIO-RK – database for biochemical reaction kinetics. Nucleic Acids Res, 40(Database issue):790–796, January 2012. [13]Khalid Belhajjame, Andrew R. Jones, and Norman W. Paton. A toolkit for capturing and sharing FuGE experiments. Bioinformatics, 24(22):2647–2649, November 2008. [14]Mélanie Courtot, Nick Juty, Christian Knüpfer, Dagmar Waltemath, Anna Zhukova, Andreas Dräger, Michel Dumontier, Andrew Finney, Martin Golebiewski, Janna Hastings, Stefan Hoops, Sarah Keating, Douglas B Kell, Samuel Kerrien, James Lawson, Allyson Lister, James Lu, Rainer Machne, Pedro Mendes, Matthew Pocock, Nicolas Rodriguez, Alice Villeger, Darren J Wilkinson, Sarala Wimalaratne, Camille Laibe, Michael Hucka, and Nicolas Le Novère. Controlled vocabularies and semantics in systems biology. Molecular Systems Biology, 7, oct 2011. [15]Brian C. Smith. Reflection and Semantics in a Procedural Language. PhD thesis, Massachusetts Institute of Technology, 1982. [16]Herbert A. Simon. The Sciences of the Artificial. MIT Press, Cambridge, MA, 1969. [17]George J. Klir. Architecture of Systems Problem Solving. Plenum Press, New York, 1985. [18]Bernard P. Zeigler, Herbert Praehofer, and Tag G. Kim. Theory of Modeling and Simulation. Academic Press, 2 edition, 2000. [19]Yasushi Umeda and Tetsuo Tomiyama. FBS modeling: modeling scheme of function for conceptual design. In Working Papers of the 9th Int. Workshop on Qualitative Reasoning About Physical Systems, Amsterdam, pages 271–278, 1995. [20]John S. Gero. Design prototypes: a knowledge representation schema for design. AI Mag., 11(4):26–36, 1990. [21]Arthur D Lander. A calculus of purpose. PLoS Biol, 2(6):e164, 06 2004. [22]Ulrich Krohs and Peter Kroes. Philosophical perspectives on organismic and artifactual functions. In Ulrich Krohs and Peter Kroes, editors, Functions in Biological and Artificial Worlds: Comparative Philosophical Perspectives, Vienna Series in Theoretical Biology. MIT Press, 2009. [23]Peter D. Karp. An ontology for biological function based on molecular interactions. Bioinformatics, 16(3):269–285, 2000. [24]Michael Ashburner, Catherine A. Ball, Judith A. Blake, David Botstein, Heather Butler, J. Michael Cherry, Allan P. Davis, Kara Dolinski, Selina S. Dwight, Janan T. Eppig, Midori A. Harris, David P. Hill, Laurie Issel-Tarver, Andrew Kasarskis, Suzanna Lewis, John C. Matese, Joel E. Richardson, Martin Ringwald, Gerald M. Rubin, and Gavin Sherlock. Gene Ontology: tool for the unification of biology. Nature Genetics, 25:25–29, 2000. The Gene Ontology Consortium. [25]Patryk Burek, Robert Hoehndorf, Frank Loebe, Johann Visagie, Heinrich Herre, and Janet Kelso. A top-level ontology of functions and its application in the open biomedical ontologies. Bioinformatics, 22(14):e66–e73, 2006. 
PhysioMaps of Physiological Processes and their Participants
*Daniel L. Cook 1, Maxwell L. Neal 1, Robert Hoehndorf 2, Georgios V. Gkoutos 3 and John H. Gennari 1
1 Biomedical & Health Informatics, Univ. of Washington, USA, 2 Univ. of Cambridge, UK, 3 Bioinformatics, Aberystwyth University, UK
* To whom correspondence should be addressed.

ABSTRACT
The biological meaning of physics-based simulations of biological processes is usually implicit, and often inaccessible to biological researchers, or indeed to any researchers beyond the authors of a particular model. To display and interrogate the physiological content of such models, we have developed formal semantic models of biological process networks that we call PhysioMaps. PhysioMaps are semi-automatically derived by our SemGen software by parsing simulation model code, visualized using our Chalkboard software, and qualitatively interrogated for cause-effect relations using the Path Tracing feature of Chalkboard. PhysioMaps are compact, semantic models of biological process networks through which the functional content of biosimulation models and other biomedical resources can be displayed, archived, queried, and integrated.

INTRODUCTION
Physics-based mathematical models encode and simulate networks of physiological processes in order to test hypotheses and quantitatively predict system behavior. However, the physical principles and mathematical language of models often render these models opaque to non-mathematical biologists. If such models were re-cast to be easily accessible to biologists, we claim that the scientific endeavor could be significantly accelerated. To bridge this gap, we have developed and implemented "PhysioMaps" that are both (a) a visualization of the biology implicit in mathematical biosimulation models, and (b) a computable view on the multi-scale, multi-domain "physiome" (Bassingthwaighte, 2000; Hunter et al., 2003). Here we briefly demonstrate our approach for deriving PhysioMaps from available biosimulation models, for creating graphical displays of the processes and process participants that are implicit in the model code, and for querying the PhysioMaps to display functional links between participants.
Within the domain of chemical reaction pathways ("systems biology"), the Systems Biology Markup Language (SBML, 2012) and the Systems Biology Graphical Notation (SBGN, 2012) can represent chemical participants in reaction processes. However, extensions to such notations are required to represent participants and processes within and across multiple scales and domains that are of critical biomedical interest, such as cardiovascular function, membrane electrophysiology, neuroendocrine control systems, fluid and electrolyte balance, and respiratory function. The success of SBML, SBGN, and their supporting tools builds on a semantics of biochemical reaction pathways, as do other resources such as EcoCyc (EcoCyc), KEGG (KEGG, 2005), and Reactome (Reactome, 2005). The PhysioMap framework generalizes beyond the molecular scale to capture knowledge of multi-scale/multi-domain biological processes based on the semantics of the Ontology of Physics for Biology (OPB, Cook et al., 2008; Cook et al., 2011). Here we define the semantic foundations of PhysioMaps, demonstrate computational derivation of PhysioMaps from available biosimulation model code, and show how PhysioMaps can visualize and interrogate complex physiological systems.
We discuss a PhysioMap workflow designed to tighten the loop between physiological modelers and experimentalists for solving outstanding biomedical problems. SEMANTICS OF OPB AND PHYSIOMAPS PhysioMaps are cyclic graphs in which edges are operators that represent biological processes and nodes represent the biological physical entities that are participants in those processes. As a computational representation of a biological process network, a PhysioMap is analogous to a biochemical reaction network, in that it can be semantically composed, decomposed, and integrated with other PhysioMaps to other physiological systems. From one PhysioMap, we could merge it with others, extract portions (graph subsets), or perform intersections with other maps, to create new PhysioMaps that might describe different systems The OBP is an ontology (Cook et al., 2011) that represents entities and relations of classical physics as applied in engineering systems dynamics (Karnopp et al., 1990; Borst et al., 1997) and biological network thermodynamics (Oster, 1973; Person, 1975). OPB leverages formal analogies between physical properties and their quantitative dependencies that apply across all scales and biophysical domains including chemical reactions, fluid flow, diffusion, electrophysiology, etc. For examples, just as chemical reaction fluxes depend on differences in the concentration (or, more accurately, chemical potential), fluid flows are driven by Paper I 1 D.L.Cook et al. Figure 2. Prototype work-flow for producing, displaying, and querying PhysioMaps. Figure 1. Diagram showing separate concerns of the OPB ontological schemata for: (1) PhysioMaps, (2) SemSim models, and (3) biosimulation model code. differences in fluid pressure, ion fluxes by electrochemical gradients, and so forth. As in various upper-level ontologies (e.g., BFO) OPB physical entities (“continuants”) are participants (has_participant relation) in processes (“occurrents”) that can be mapped to entities in available biomedical ontologies (e.g., FMA (FMA, 2011), CL (CL, 2012), GO (GO, 2012), ChEBI (ChEBI, 2006)). OPB is orthogonal to other ontologies because entities and processes are defined in thermodynamic terms i.e., they are “dynamical” in accord with usage in systems dynamics. Thus, for example, OPB:Dynamical entity is defined, in part, as “the bearer of a portion of thermodynamic energy” and OPB:Dynamical process as “the flow of thermodynamic energy between participating dynamical entities”. As roadmaps for this paper, Figure 1 shows how OPB classes and relations provide a class structure for both PhysioMaps and SemSim (semantic simulation) models which each have instances that are derived from simulation model code. Figure 2 shows a workflow by which simulation model code is parsed and annotated into a SemSim model, and then abstracted as a PhysioMap to be visualized and interrogated in our Chalkboard software (Cook et al., 2007). In the next two sections we present in greater detail these two steps in the workflow. GENERATE PHYSIOMAPS FROM MODELS PhysioMaps are semi-automatically derived from biosimulation model code using SemGen (Neal et al., 1998), an application for annotating, decomposing, merging, and encoding biosimulation models (Beard et al., 2012) in any of several languages (CellML (CellML, 2005), SBML (SBML, 2 2012), JSimMML (JSim, 2006), currently). 
In a two-step process, SemGen reads and parses model code and then, under user guidance, maps each model variable ( “Paorta”, “concGlucose”, for examples) to instances of OPB:Dynamical property subclasses (OPB:Fluid pressure, OPB:Chemical concentration, respectively for the examples). Depending on the physical domain for the dynamical property class (e.g., OPB:Fluid kinetic domain or OPB:Chemical kinetic domain), SemGen accesses (via web services) physical entity reference classes from the FMA, CL, or ChEBI (or other ontologies) and constructs a composite annotation (Gennari et al., 2009; Gouts et al., 2009) for each and by which the biological meaning of the mathematical variable is logically declared. In our example, the variable “PLV” could be annotated as OPB:Fluid pressure property_of FMA:Blood in left ventricle. Having annotated each model variable, SemGen then maps their mathematical dependencies according to whether variables are “left-hand sides” (lhs) of equations or are on the “right-hand side” (rhs) of each equation. From these mappings, SemGen constructs and displays these mathematical dependencies by which each lhs-variable mathematically determined by each rhs-variable. Of particular concern here are the lhs-variables that have been annotated as subclasses of OPB:Flow rate property. For example, a typical cardiovascular dynamic model will have several equations that compute a fluid dynamic version of Ohm’s law for fluid flow such as for blood flow from the lumen of the left ventricle (LV), through the aortic valve (with fluid resistance, RAV), and into the aorta (A). According to Ohm’s law such flows are driven by differences in pressure (PLV – PA) each of which is annotated by SemGen to be an instance of OPB:Fluid pressure which is the fluid kinetic subclass of OPB:Force property. Thus: Paper I FLV,A = (PLV – PA) / RAV Eq. 1 PhysioMaps of Physiological Processes and their Participants dictated by physical laws. The physical entities and processes modeled by the code are entirely implicit in the code and are explicated only, if at all, by superimposed annotations. SemSim models abstract model variables and equations as instances of OPB:Dynamical properties and of OPB:Physical property dependency and formally link the property instances, via the composite annotation mechanism, to formal representations of process participants, but not to processes themselves. A PhysioMap represents, specifically, processes and participants to the exclusion of mathematical detail of how processes occur (i.e., according to a specific formulation of Ohm’s law used by a modeler). To recover such details of source code, PhysioMaps can be linked to the originating SemSim model, and hence to the originating biosimulation code. Figure 3. A prototype Chalkboard display of the PhysioMap of a cardiovascular-flow module extracted by SemGen from a JSim-coded simulation model {Kerckhoffs, et al., 2007}. DISPLAY AND INTERROGATE PHYSIOMAPS The definition of OPB:Dynamical process is “...the flow or the control of flow of thermodynamic energy within or between participating dynamical entities”. Thus, we are able to interpret a flow variable as an attribute of each participating entity (e.g., fluid flow rate out of left ventricular) and identically as an attribute of the process by which blood flows from one entity to the other (e.g., fluid flow rate through aortic valve). Given this identity, SemGen makes two inferences. 
First, for each variable that is a SemSim instance of OPB:Flow rate property, it infers the existence of an instance of the SemSim process class corresponding to OPB:Energy flow process. Then, for each SemSim instance of OPB:Force property (that represents a force variable on the right-hand-side of an equation), SemGen infers the existence of a participating physical entity according to the composite annotations of the force variable. SemGen then creates a file which lists all participating entities and all processes linked to each of its participants. Although encompassed by OPB and SemSim, our prototype demonstration here does not yet include examples of modulating participants such as the participation of the aortic valve as a flow path to which the resistance variable (RAV of type OPB:Fluid resistance) applies. Creating a PhysioMap from mathematical equations results in a major reduction in complexity. PhysioMaps retain the language-independent biological meaning of model code while excluding mathematical forms, parametric values, and program control code, that are generally opaque to nonmathematical experimental biologists or, even, users of alternative modeling platforms. This is apparent in Figure 1 that shows the separation of concerns for model code, SemSim, and PhysioMap representations. Model code consists solely of (1) model variables, that represent the values of the physical properties of process participants, and (2) model equations that encode the mathematical dependencies Once created, we export PhysioMaps from SemGen into Chalkboard to graphically displays the PhysioMap and allows users to explore cause/effect relations amongst processes. Chalkboard (Cook et al., 2007) is an editor for the BioD biological description language (Cook et al., 2001)that incorporates an extensive graphical vocabulary for biological physical entities and processes. BioD differs only in detail from the SBGN Process Description language. In our prototype demonstrations here, we have reprogrammed Chalkboard to read PhysioMap files and to create process icons (rectangles) linked by arrows to participating entity icons (circles). Figure 3 is a PhysioMap of a cardiovascular model extracted using SemGen from a more extensive model (Kerckhoffs et al., 2007) in which a rectangle represent fluid-flow process (governed by a dependency as in Eq. 1) for flow between portions of blood located in a circuit of blood vessels and heart chambers. Automatically-derived PhysioMaps (as in Fig. 3) present two common user-interface challenges. First, graphical and topological layouts, although semi-automated, require user adjustments to be intuitively satisfying. Second, icon labels (e.g., “Blood in left ventricle”), that are automatically derived and condensed from the formal SemSim composite annotations, may be verbose. Although not illustrated here, PhysioMaps in Chalkboard can be interrogated using Chalkboard’s Path Tracing feature which makes qualitative outcome predictions based on the mathematical dependencies derived in the originating SemSim model. Path Tracing is a user-interface tool that replicates the kind of “thought experiments” physiologists informally reason about the functional implications of their hypotheses. Thus, experimental perturbations (e.g., an increment in the amount of aortic blood) is mentally propagated through a functional network as increments and decrements in the amounts or flow-rates of connected participants and processes. 
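The qualitative "thought experiment" just described can be pictured with a toy network in which, as in a PhysioMap, nodes stand for physical entities and edges for flow processes. The Python sketch below is not SemGen or Chalkboard code; it merely illustrates how an increment imposed on one entity could be propagated downstream as up/down predictions, as a much-simplified, hypothetical stand-in for Path Tracing.

    from collections import deque

    # Toy network, for illustration only: entities (nodes) connected by named
    # flow processes (edges). ("Aorta", "flow through systemic arteries",
    # "Systemic capillaries") says the process moves material from the first
    # entity to the second.
    processes = [
        ("Left ventricle", "flow through aortic valve", "Aorta"),
        ("Aorta", "flow through systemic arteries", "Systemic capillaries"),
        ("Systemic capillaries", "venous return", "Right atrium"),
    ]

    def trace(start, processes):
        """Qualitatively propagate an increment at `start` through downstream processes."""
        responses = {}                       # process or entity -> predicted response
        queue = deque([start])
        seen = {start}
        while queue:
            entity = queue.popleft()
            for source, process, target in processes:
                if source == entity:
                    responses[process] = "up"   # more in the source drives more flow
                    responses[target] = "up"    # which increments the target amount
                    if target not in seen:
                        seen.add(target)
                        queue.append(target)
        return responses

    print(trace("Aorta", processes))
    # -> {'flow through systemic arteries': 'up', 'Systemic capillaries': 'up',
    #     'venous return': 'up', 'Right atrium': 'up'}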
As implemented in Chalkboard, Path Tracing can trace “A-to-B” pathways in complex networks, detect Paper I 3 D.L.Cook et al. positive- and negative-feedback loops, and display the qualitative (up or down) responses of all affected processes and participants in the net. We have developed these tools in the context of largescale collaborative projects like the Virtual Physiological Human (VPH, 2010) and the Virtual Physiological Rat (VPR, 2012) with the aim to “tighten the loop” between modelers and experimentalists. PhysioMaps, as derived from model code, provide experimentalist with a familiar descriptive view of the functional content of models while retaining formal traceable links back to the originating code. Combined with SemGen’s capabilities to decompose, recompose, and encode models, we anticipate that these methods will facilitate the generation and testing of biomedical hypotheses. SUMMARY AND CONCLUSIONS We have introduced the idea of a PhysioMap as a high-level view of a physiological system and demonstrated extensions to our SemGen software which derive PhysioMaps from biosimulation model code using the semantics of classical physics as represented in the OPB. Thus, in addition to SemGen’s capabilities for rapid prototyping of models by decomposing, composing, and re-encoding legacy models, SemGen can infer the biophysical processes and their participants from the mathematical dependencies coded in biosimulation models. In this first report, we have focused on flow processes of the OPB:Energy flow process class but we aim to identify modulation and control processes as subclasses of OPB:Constitutive coupling process (as might be annotated as kinds of GO:Biological regulation). Our claim is that we have established a computational method by which physiological process knowledge, otherwise implicit in the code of multiscale biosimulation models, can be mined and formally expressed using available biomedical ontologies. We have leveraged the biophysical semantics of the OPB to infer the existence of such physical processes along with their participating entities in biosimulation models created by a community of physiologists and biophysicists to address outstanding problems in multiscale biomedicine. ACKNOWLEDGEMENTS This work was partially funded by the VPH Network of excellence, EC FP7 project #248502, and by the American Heart Association. REFERENCES Bassingthwaighte (2000). Strategies for the physiome project. Ann Biomed Eng 28(8): 1043-1058. Beard, et al. (2012). Multiscale Modeling and Data Integration in the Virtual Physiological Rat Project. Annals of Biomedical Engineering (in press). 4 Borst, et al. (1997). Engineering ontologies. Int. J. Human– Computer Studies 46: 365-406. CellML (CellML language, as of 2012) http://www.cellml.org ChEBI (Chemical Entities of Biological Interest, as of 2012) http://www.ebi.ac.uk/chebi/ CL (Cell Ontology, as of 2012) http://www.obofoundry.org/cgibin/detail.cgi?id=cell Cook, et al. (2011). Physical Properties of Biological Entities: An Introduction to the Ontology of Physics for Biology. PLoS ONE 6(12): e28708. Cook, et al. (2001). A basis for a visual language for describing, archiving and analyzing functional models of complex biological systems. Genome Biol 2(4): RESEARCH0012. Cook, et al. (2007). Chalkboard: Ontology-Based Pathway Modeling And Qualitative Inference Of Disease Mechanisms. Pac Symp Biocomput 12: 16-27. Cook, et al. (2008). Bridging biological ontologies and biosimulation: the Ontology of Physics for Biology. 
AMIA Annu Symp Proc: 136-140. EcoCyc (Encyclopedia of Escherichia coli, as of 2012) http://ecocyc.org/ FMA (Foundational Model of Anatomy, as of 2012) http://sig.biostr.washington.edu/projects/fm/ Gennari, et al. (2009). Multiple ontologies in action: Composite annotations for biosimulation models. J Biomed Inform 44(1): 146-154. Gkoutos, et al. (2009). Entity/quality-based logical definitions for the human skeletal phenome using PATO. Conf Proc IEEE Eng Med Biol Soc 2009: 7069-7072. GO (Gene Ontology, as of 2012) http://www.geneontology.org/ Hunter, et al. (2003). Integration from proteins to organs: the Physiome Project. Nat Rev Mol Cell Biol 4(3): 237-243. JSim (JSim Home Page at NSR, as) http://nsr.bioeng.washington.edu/ Karnopp, et al. (1990). System dynamics: a unified approach. New York, Wiley. KEGG (Kyoto Encyclopedia of Genes and Genomes, as of 2012) http://www.genome.jp/kegg/ Kerckhoffs, et al. (2007). Coupling of a 3D finite element model of cardiac ventricular mechanics to lumped systems models of the systemic and pulmonic circulation. Ann Biomed Eng 35(1): 118. Neal, et al. (1998). The digital anatomist structural abstraction: a scheme for the spatial description of anatomical entities. Proc AMIA Symp: 423-427. OPB (Ontology of Physics for Biology, as of 2012) http://bioportal.bioontology.org/ontologies/44872 Oster, et al. (1973). Network thermodynamics: dynamic modelling of biophysical systems. Q Rev Biophys 6(1): 1-134. Perelson (1975). Network thermodynamics. An overview. Biophys J 15(7): 667-685.Reactome (Reactome, as of 2012) http://www.reactome.org SBGN (Systems Biology Graphical Notation, as of 2012) http://www.sbgn.org/ SBML (Systems Biology Markup Language, as of 2012) http://sbml.org/ VPH (Virtual Physiological Human, as of 2012) http://www.vphnoe.eu/ VPR (Virtual Physiological Rat, as of 2012) http://virtualrat.org/ Paper I Tissue Motifs and Multi-scale Transport Physiology *de Bono, B. a,c,d, Kasteleyn, P.b, Potikanond, D., Kokash, N.b, Verbeek, F.b and Grenon, P a. [a] European Bioinformatics Institute, Cambridge, UK; [b] LIACS, Leiden University, Leiden, the Netherlands; [c] CHIME Institute, UCL, UK [d] Auckland Bioengineering Institute, New Zealand * ABSTRACT Motivation: This work describes the underlying notion and formal representation of tissue motifs in terms of simple combinations of cell types that are co-located in the same tissue. The composition of tissue motifs is characterised on the basis that particular forms of biological interaction can take place between the cells in the motif. One such biological interaction involves the diffusion of gene products that are secreted into the surrounding tissue fluid to reach binding receptors attached to nearby plasma membranes. The development of a public knowledgebase of motifs constrained for diffusive interactions would provide a key resource for the interpretation of gene expression data in the context of inter-cellular communication. To that end, in this work we also present a software tool prototype that supports the supervised capture and annotation of histology image data. The aim of this tool is to support the crowdsourcing of tissue motif knowledge acquisition by the biomedical community. 1 INTRODUCTION This work is concerned with the discussion of a particular element in an overall system of ontology for transport physiology in multicellular organisms. The particular focus of this paper is to support the functional representation of Tissue Motifs. 
In this paper, a motif is a recurring structural pattern in biology that may span multiple scales of size. An important feature of motifs of biological structure is that many of them have a well-established functional significance. For instance, the association between (i) recurring linear motifs in primary amino acid or DNA sequence and (ii) functional properties associated with those biological structures has been extensively documented (e.g. [1, 2]). At a higher (i.e. tertiary) protein architecture level, linear motifs of domain superfamily combinations that correlate with distinct functional categories are also known to be evolutionarily conserved [3]. The fact that, despite the extensive combinatorial space available for all possible linear combinations of molecular structure, only a very small fraction of motifs is consistently preserved over evolution indicates that such conserved patterns convey some functional advantage to the bearer of the motif (e.g. by conferring a stable, low-energy conformation of tertiary folding and quaternary binding).

* To whom correspondence should be addressed: bdb@ebi.ac.uk

At a higher size scale of biological structure, basic tissue organization consists of different proportions of cell types, as well as the extracellular matrix they secrete. Based on the number of classes found in the CellType ontology [4], the size of the cell type repertoire is roughly 1,500 (this number is comparable to the number of superfamilies that constitute domain motifs at the polypeptide scale [5]). The cellular architecture of tissues is understood to be under a number of distinct functional constraints. One of the most fundamental of these constraints is the key biophysical requirement that cells have to be within diffusion distance of at least one capillary to ensure appropriate rates of (i) delivery of supplies (e.g. oxygen, glucose) and (ii) elimination of waste (e.g. carbon dioxide, urea). The density of capillary arborization is, in practice, commensurate with the level of metabolic activity typical of that tissue. Given that the distance over which diffusion occurs is approximately 100µm, it is not surprising that most mammalian cells are found within 50µm of a capillary [6]. The same distance constraint applies to paracrine communication between cells, in which molecules secreted by one cell have to diffuse to reach the plasma membrane surface of another cell. The combination of these two distance constraints provides the biophysical basis for the structural definition of a functional tissue unit consisting of cells that (i) are metabolically dependent on the same capillary and (ii) are in paracrine communication with one another. This tissue unit has a cylindrical shape that shares its long axis with the feeding capillary, with a radius of 50µm and a height of 100µm. The unordered set of cell types that are found in this tissue unit provides the elements that constitute a Tissue Motif, a motif that is based on the process of diffusive communication between cells. This approach to developing Tissue Motifs also has to account for two key properties of capillary networks, namely that: 1) any particular cell may be within diffusive range of more than one capillary, and 2) given that a capillary is about 500µm in length, it is possible that the constitution of the Tissue Motif may alter along the course of the capillary from the arteriolar to the venular end.
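As a purely illustrative sketch (not part of the tools described in this paper), the cylindrical tissue-unit constraint can be read geometrically: a cell belongs to the unit of a capillary segment if it lies within 50µm of the capillary axis along a 100µm stretch, and the Tissue Motif is then the unordered set of cell types found there. The Java sketch below makes these simplifying assumptions explicit; all class and method names are invented for illustration.

import java.util.*;

/** Illustrative sketch only: a simplified geometric reading of the tissue-unit
 *  constraints described above (radius 50 µm, height 100 µm around a capillary). */
public class TissueUnitSketch {

    record Cell(String cellType, double x, double y, double z) {}

    static final double RADIUS_UM = 50.0;   // max diffusion distance from the capillary axis
    static final double HEIGHT_UM = 100.0;  // length of the capillary segment considered

    /** True if the cell lies within the cylindrical tissue unit around a capillary
     *  segment running along the z-axis from z0 to z0 + HEIGHT_UM (simplifying assumption). */
    static boolean inTissueUnit(Cell c, double z0) {
        double radial = Math.hypot(c.x(), c.y());
        return radial <= RADIUS_UM && c.z() >= z0 && c.z() <= z0 + HEIGHT_UM;
    }

    /** The Tissue Motif, read as the unordered set of cell types found in the unit. */
    static Set<String> tissueMotif(List<Cell> cells, double z0) {
        Set<String> motif = new TreeSet<>();
        for (Cell c : cells) {
            if (inTissueUnit(c, z0)) motif.add(c.cellType());
        }
        return motif;
    }

    public static void main(String[] args) {
        List<Cell> cells = List.of(
            new Cell("hepatocyte", 20, 10, 30),
            new Cell("Kupffer cell", 40, 20, 80),
            new Cell("hepatocyte", 200, 0, 50));    // too far from the capillary axis
        System.out.println(tissueMotif(cells, 0));  // [Kupffer cell, hepatocyte]
    }
}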
In order to further develop and apply Tissue Motif knowledge in support of our study of multi-scale transport physiology, this paper presents the requirements and progress achieved in: 1) formalizing knowledge representation about Tissue Motifs (Section 2), and 2) acquiring Tissue Motif knowledge through the analysis of histology images (Section 3). The final part of the paper (Section 4) outlines the key motivation for building a crowdsourced knowledge repository of tissue motifs in support of data analysis in molecular biology. 2 REPRESENTING KNOWLEDGE ABOUT TISSUE MOTIFS We adopt a formal ontological framework, BFO [7], to provide a formal treatment of Tissue Motifs. BFO has been chosen for its simplicity and clear-cut distinctions. Furthermore, as that framework has already been applied in related areas of the biomedical domain, this choice facilitates some degree of integration of the present treatment with related of forthcoming formalisations ab initio. In BFO parlance, the world is made of two main kinds of things: objects, such as material objects and processes that involve these objects. We find this high-level dichotomy adequate for dealing with Tissue Motifs and their role in physiology processes. According to this view, Tissue Motifs are on the side of objects insofar as these motifs are patterns of structural organisation of possibly complex objects (i.e. tissues). But Tissue Motifs are of course not these objects: they are not tissues; they are repeating patterns of tissue structure. In BFO, entities such as patterns fall under a category of so-called Generically Dependent Continuant (which means that Tissue Motifs need some other entity in order to exist). We will adopt this view and thus separate (i) motifs as entities in their own right from (ii) the entities (i.e. tissues) in which they recur as patterns. While these considerations solve the question of the ontological status of Tissue Motifs, they do little to provide the formal means for describing and registering the characteristics of Tissue Motifs in general and, in particular, for registering the differential characteristics between distinct motifs. Certainly, as generically dependent entities, Tissue Motifs can be characterised as the motifs of some tissue. This however does little more than secure a form of bookkeeping and, while it is fundamental for some purpose to identify and collect the association between tissues and their motifs, more detail is needed. One reason why such associations are important is that Tissue Motifs give a key to the classifica- 2 tion of tissues. Furthermore, once the description of Tissue Motifs includes enough of the physical characteristics from which to derive characterisations of the physiological processes they allow richer characterisations of tissues be achieved, including characterisations of physiological processes now occurring at the tissue level of granularity. In this abstract, we sort the elements of a forthcoming tissue knowledge framework according to the representation they support, namely: 1) the characterisation of Tissue Motifs through (i) the type of relationships in which they enter with other entities such as tissues and material parts of these tissues (e.g. 
cells and fluid compartments), as well as (ii) the way they are configured in virtue of presenting a given motif; 2) the elicitation of selected aspects of Tissue Motifs allowing for deriving the characteristics of the physiological processes they enable (spatial relationships and distances, in particular), as well as the various types of processes in question (e.g. processes of flow, stress transfer, electrical transmission, etc.); 3) the description of emerging biological properties and functions that tissues have in virtue of presenting given motifs or their combinations.

An interesting and also challenging aspect of such knowledge representation is that it brings together, through the central treatment of Tissue Motifs, treatments that are traditionally confined to separate areas of the biomedical domain but that lack the required articulation to support a multi-scale ontology of transport physiology. Given Tissue Motifs and their formal account, it is possible to articulate the description of transport phenomena at scales that range from the molecular to the organ level. Tissue Motifs, therefore, provide a key bridge for the representation of transport physiology, which can now be traversed as a network of connected and interdependent knowledge representations.

3 ACQUIRING KNOWLEDGE ABOUT TISSUE MOTIFS

The extraction of Tissue Motifs from both 2D and 3D histology images ensures that the generated knowledge is linked to independent and verifiable pictorial evidence about the tissue architecture in which a motif is found. In this section, we briefly describe the methodology, implemented as a software tool, that supports the collaborative generation of Tissue Motif knowledge through a communal annotation effort.

The annotation of histology image data first requires that the images be segmented. Segmentation circumscribes regional elements that can be discerned in the image and subsequently semantically annotated. To that end, we developed a software tool that supports the application of semantic annotation to histology image segments. Initially, only manually circumscribed segmentation is permitted, with the aim of also supporting the application of automated segmentation methodologies in the future. Currently, the tool allows the annotation of image segments with terms from either of two ontologies, namely the Foundational Model of Anatomy (FMA) [8] and the Cell Type ontology [4]. The Graphical User Interface (GUI) functionality in the software permits users to drag segments from the image onto tiles in ApiNATOMY [9] treemap graphical depictions of the two ontologies (Figure 1). Future plans for the tool are to also integrate RICORDO [10] web services in support of storage and querying of the generated metadata.

The software tool itself consists of a Java Web Start application, based on a well-established atlasing tool [11]. The GUI displays two panels: (a) One panel shows image sections and their corresponding segment information, along with a slider that scrolls through images of a particular object at various levels of resolution (left side of Figure 1). (b) The other panel is used to display the treemap layout of each relevant ontology via a series of tabbed views (right side of Figure 1). Importantly, the application provides a user interface interaction that allows image segment identifier symbols in the image panel to be ‘dragged-and-dropped’ onto the relevant tile in the treemap panel, by way of creating an annotation binding.
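A minimal Java sketch of the kind of record such an annotation binding might produce is given below; the field names and term IRIs are illustrative assumptions and do not reflect the tool's actual data schema.

import java.net.URI;
import java.util.*;

/** Sketch only: one possible shape for the annotation bindings created by
 *  dragging an image segment onto an ontology tile. Field names and the term
 *  IRIs shown are illustrative placeholders, not the tool's actual schema. */
public class AnnotationBindingSketch {

    /** Links a segment of a given image to a term from the FMA or the Cell Type ontology. */
    record AnnotationBinding(String imageId, String segmentId, URI termIri, String termLabel) {}

    public static void main(String[] args) {
        List<AnnotationBinding> bindings = List.of(
            new AnnotationBinding("liver-slide-07", "seg-12",
                URI.create("http://purl.obolibrary.org/obo/CL_0000182"), "hepatocyte"),
            new AnnotationBinding("liver-slide-07", "seg-13",
                URI.create("http://purl.obolibrary.org/obo/CL_0000091"), "Kupffer cell"));

        // Group the annotated cell types per image: a first step towards a Tissue Motif.
        Map<String, Set<String>> typesPerImage = new HashMap<>();
        for (AnnotationBinding b : bindings) {
            typesPerImage.computeIfAbsent(b.imageId(), k -> new TreeSet<>()).add(b.termLabel());
        }
        System.out.println(typesPerImage); // {liver-slide-07=[Kupffer cell, hepatocyte]}
    }
}

Grouping such bindings by image already yields the unordered set of annotated cell types from which a Tissue Motif could later be read off.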
So far, the tool allows users to visually identify capillaries and their surrounding tissue units as a means to manually segment and annotate those cells that make up the Tissue Motif. The level of inferencing that the annotation tool currently allows is limited to assigning cell-type ontology terms (associated with relevant image segments) to be contained_in the FMA anatomical entity term with which the image as a whole is annotated. Future work will also support mereotopological inferencing on the basis of geometrical containment relations between segments in the same image.

4 CONCLUSION

Having access to tissue-level networks of cell types that are able to communicate interchangeably via diffusive processes is a key first step towards integrating cell-specific gene expression data with experimentally determined protein-protein interaction networks. In particular, such integration would assist with providing a more cellular context to the interpretation of Genome-wide Association study findings, through the identification of altered cellular communication routes as a result of genetic mutation. While both the representation of Tissue Motifs and the tools to acquire such knowledge are at a relatively early stage of development, the potential for the acquisition approach discussed above is very promising. Our aim is to build the requisite infrastructure for an open public knowledgebase of Tissue Motifs that bridges anatomy and cell-type ontologies in a systematic and comprehensive manner. Such a goal can only be reached if crowdsourcing of both images as well as curation effort can be harnessed to achieve the required coverage of a wide range of mammalian tissue.

REFERENCES
1. Hodgman, T.C., The elucidation of protein function by sequence motif analysis. Comput Appl Biosci, 1989. 5(1): p. 1-13.
2. Ricke, D.O., et al., Nonrandom patterns of simple and cryptic triplet repeats in coding and noncoding sequences. Genomics, 1995. 26(3): p. 510-20.
3. Vogel, C., et al., Supra-domains: evolutionary units larger than single protein domains. J Mol Biol, 2004. 336(3): p. 809-23.
4. Bard, J., S.Y. Rhee, and M. Ashburner, An ontology for cell types. Genome Biol, 2005. 6(2): p. R21.
5. Andreeva, A., et al., Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res, 2008. 36(Database issue): p. D419-25.
6. Renkin, E.M. and C. Crone, Microcirculation and Capillary Exchange, in Comprehensive Human Physiology: From Cellular Mechanism to Integration, R. Greger and U. Windhorst, Editors. 1996, Springer: New York. p. 1965.
7. Grenon, P., B. Smith, and L. Goldberg, Biodynamic Ontology: Applying BFO in the Biomedical Domain, in Ontologies in Medicine. 2004, IOS Press: Amsterdam. p. 20-38.
8. Rosse, C. and J.L. Mejino, Jr., A reference ontology for biomedical informatics: the Foundational Model of Anatomy. J Biomed Inform, 2003. 36(6): p. 478-500.
9. de Bono, B., P. Grenon, and S.J. Sammut, ApiNATOMY: a novel toolkit for visualizing multiscale anatomy schematics with phenotype-related information. Hum Mutat, 2012. 33(5): p. 837-48.
10. de Bono, B., et al., The RICORDO approach to semantic interoperability for biomedical data and models: strategy, standards and solutions. BMC Res Notes, 2011. 4: p. 313.
11. Potikanond, D. and F.J. Verbeek, Visualization and analysis of 3D gene expression patterns in zebrafish using web services, in Visualization and Data Analysis, C.W. Pak, et al., Editors. 2012, SPIE: Bellingham, WA.

Figure 1.
GUI screenshot for the histology image annotation tool. 4 Paper J Comparing closely related, semantically rich ontologies: The GoodOD Similarity Evaluator Niels Grewe1∗, Daniel Schober2 and Martin Boeker2 1 2 University of Rostock, Rostock, Germany University Medical Center Freiburg, Freiburg, Germany ABSTRACT Objective To provide an integrated cross-platform ontology evaluation tool based on normalisation techniques and ontology similarity measures. Background Ontology similarity measures are extensively used in ontology matching applications but can also be applied to ontology evaluation scenarios (e.g. in ontology learning) when a ‘gold standard’ ontology is available against which similarity can be computed. Unfortunately, while there are software packages for similarity measurement available that are well suited for more terminologically oriented uses, there is no ready to use solution that copes well with the requirements of more formal ontologies, namely the reliance on top-level ontologies and the presence of semantically rich axiomatic class definitions. Methods We reviewed and applied several similarity measures for the appraisal of data collected in an ontology teaching experiment. We also optimised and applied ontology normalisation techniques to pre-process ontology artefacts in order to produce more consistent results. Results We implemented an advanced normalisation procedure to improve the usefulness of structural similarity measures in the presence of rich class definitions and provide a highly configurable, ready to use tool for performing comparisons of individual ontologies or groups of ontologies. Conclusion Similarity measurements as established in the ontology alignment communities can also serve specific use-cases in ontology evaluation, but their application to semantically richer ontologies, as exemplified by many biomedical ontologies, requires special considerations to be taken into account. We therefore believe that providing an easily accessible tool for performing similarity measurement under these conditions is of considerable value to the biomedical ontology engineering community. 1 INTRODUCTION Quality assurance and ontology evaluation have become important topics in recent years. Successful applications of ontologies in real world usage scenarios can only be expected if they are found to be pragmatically and representationally adequate. Providing quantitative measures for such adequacy is a considerably hard to reach desideratum. One use case where quantifiable data about representational adequacy can be gathered is the comparison of (possibly automatically derived) ontologies or ontology variants against a pre-established ‘gold standard’ model, providing an assessment of similarity for the compared ontologies. This question is also of high relevance for ontology alignment. ∗ to whom correspondence should be addressed In this paper, we present a tool that supports gold standard similarity measures, while taking into account the specific needs of formal ontologies that make use of a large subset of the expressivity provided by OWL 2. We will describe the challenges to similarity measurement arising from the pervasive use of toplevel ontologies in many formal ontologies, such as OBO Foundry compliant [11] biomedical ontologies, and the extensive usage of complex axiomatic characterisation of classes in an ontology, both of which introduce a considerable skew in similarity measures, thus necessitating the development of mitigation strategies. 
The original use case of our tool has been the assessment of ontologies produced by students as part of a randomized controlled study on an ontology development curriculum [3] against gold standard models of the same modelling task created by experts according to established best practices.

2 PROBLEM STATEMENT

2.1 Use case
The intended use case of our tool was the evaluation of the effect of ontology teaching on the quality of the artefacts produced. In a randomized controlled trial setting, we had two groups of twelve students each, subjected them to different instructional regimes, and asked them to solve different small modelling tasks, such as partially modelling the anatomy of the stomach. The students were asked to solve these tasks using the ‘lite’ fragment of the BioTop upper domain ontology [2] and an additionally provided number of predefined class IDs. For each of these exercises, we also provided a predefined expert model to be used in the evaluation. The project’s working hypothesis was that effective training would empower the students to create more correct models and would lead to increased similarity with the gold standard and also increased consistency among the student artefacts. It was therefore necessary to measure, among other things, the similarity of the student models to the expert model and the internal similarity among the student ontologies in different groups. While we will not present the results of this study here, we will describe the software component we designed for performing the similarity analysis.

2.2 Requirements
Based on our use case, we derived a number of requirements for the software component and the similarity measurement strategy to be applied:
1. Since our curriculum relied heavily on the use of more expressive OWL 2 constructs, we required those to be taken into account when computing similarity.
2. Mere syntactic variants should not be regarded as differences in a relevant sense.
3. Since most class identifiers were specified beforehand, we did not require sophisticated lexical matching, and especially needed to avoid lexical similarities obscuring semantic differences. At the same time, we should be able to account for unexpected new classes added by students.
4. Since use of top-level ontologies was pervasive both in the curriculum and the experimental setup, differences in the local domain models should neither be obscured by the top-level, nor should the effect of the top-level be ignored completely, since the domain model might make use of the top-level to express crucial semantic constraints.

Figure 1. Characteristic extracts of the class C from o_g and o_t

3 BACKGROUND

3.1 Similarity Measures
A similarity measure, in the sense that is relevant here, is a quantitative estimate of the likeness between two classes in different ontologies or, by some kind of aggregation, of those ontologies as a whole or of specific parts of them. Formally, it is a function that maps pairs of entities (classes or ontologies) into the interval [0, 1], where 1 indicates sameness or equivalence and 0 indicates that the entities are orthogonal to one another. An excellent review of different metrics can be found in [8]. Often, these measures are collectively described as specifying ‘semantic similarity’, but it is useful to make some distinctions among them. We will review some of the more well-known measures in light of our requirements.
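As a minimal illustration of this common functional shape, the following Java sketch shows a function from pairs of entities into [0, 1]; the Jaccard overlap of class-name sets used here is only a stand-in, not one of the measures reviewed below, and all identifiers are illustrative.

import java.util.*;
import java.util.function.BiFunction;

/** Sketch of the common shape of the similarity measures discussed in this section:
 *  a function from pairs of entities into [0, 1]. The Jaccard overlap is only a stand-in. */
public class SimilarityShapeSketch {

    /** Jaccard overlap of two sets of class names: 1 = identical, 0 = disjoint. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        BiFunction<Set<String>, Set<String>, Double> sim = SimilarityShapeSketch::jaccard;
        Set<String> gold    = Set.of("Stomach", "OrganPart", "Mucosa");
        Set<String> student = Set.of("Stomach", "OrganPart", "Serosa");
        System.out.println(sim.apply(gold, student)); // 0.5 (2 shared out of 4 distinct classes)
    }
}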
3.1.1 Lexical Measures Purely lexical similarity measures can be used to specifically compute the similarity between resource identifiers (IRIs), IRI fragments, class labels or other annotations, for example using the Levenshtein edit distance, which can be easily aggregated to cover the whole ontology. Alternatively, the lexical information (class labels, IRIs etc.) in an ontology can be mapped to dimensions to form a vector space model, in which similarity can be computed as the cosine of angle between the two vectors [8]. Since they rely only on lexical information and take neither the taxonomic structure of the ontology nor the explicit semantics of the class definitions into account, we judged those methods to be ill suited for our approach, even though they partially fulfill requirement 3 in as far as they allow identifying the most likely pairs of user-created classes. 3.1.2 Structural Measures Structural similarity measures basically treat ontologies as graphs, thereby explicitly taking into account their taxonomic and relational structure and allowing ontology similarity measurement to benefit from the vast literature in graph similarity research. Apart from generic measures, such as those based on graph edit distance [16] or maximal common subgraphs [4], there are several structural metric specifically designed for comparing ontologies: Triple Based Entity Similarity Triple based entity similarity tries to account for the similarity of two ontologies in terms of the similarity of the RDF triples they are composed of [5]. This works by initial ‘seeding’ the similarity computation with lexical similarities between two nodes in the RDF graph and iteratively refining this measure by taking into account the overall similarity of the ontology computed this way until the similarity difference between two iterations drops below a certain threshold. Obviously, the results of this method depend on the choice of the lexical similarity measure (e.g. Levenshtein or Jaro-Winkler distance) and the aggregation method. The choices of which are also described in [5, 8]. Since the translation of OWL 2 into RDF is conservative [14], all the explicitly asserted semantics of the ontology are also represented in the graph and contribute to the overall similarity under the triple based scheme, though some information that is implicit in class definitions may not be considered. For this reason, we judge it at least partially fulfill requirement 1. Additionally, even though this method makes use of lexical similarity to seed the computation, we still consider it to fulfill requirement 3 as well, because the lexical similarity is only used to produce a probable initial mapping that is later refined using the structure. It is somewhat susceptible to also capturing syntactic variants (requirement 2), though, and does not make special provisions for requirement 4. OLA Similarity Another promising structural similarity measure that is similar to the triple based method is the OLA (OWL-Lite Alignment) measure by Euzenat and Valtchev [7], which operates by converting the ontology into a labelled graph that explicitly represents most features of the earlier OWL-lite language in terms of nodes and edges. The similarity is then computed by a recursive function that operates on the graph and decomposes complex structures in order to derive an overall similarity. 
While being a good fit requirements-wise, OLA was not applicable to our use case because our ontology development tasks were formulated and solved in OWL 2.

Semantic Cotopy and Common Semantic Cotopy
The semantic cotopy (SC) and common semantic cotopy (CSC) measures were defined by Dellschaft and Staab for the evaluation of ontology learning work against a gold standard model [6]. Both compute the so-called ‘taxonomic’ precision and recall, i.e. how well the computed taxonomy covers the reference taxonomy. This works by extracting, for a given pair of classes, a characteristic extract from both ontologies that encompasses all subclasses and superclasses of the target class (cf. figure 1). For example, in figure 1, the characteristic extracts of the class C from the ontologies o_g and o_t would be

ce_sc(C, o_g) = {⊤, A, C, D, E}
ce_sc(C, o_t) = {⊤, B, C, E}

The taxonomic precision of o_t can then be computed as follows (where, in this case, C1 and C2 refer to the same class name):

tp(C1, C2, o_g, o_t) = |ce_sc(C1, o_g) ∩ ce_sc(C2, o_t)| / |ce_sc(C1, o_g)|

Precision and its derivative measures such as recall and F-measure can be used to assess how well both ontologies match. As opposed to semantic cotopy, common semantic cotopy is a modification that takes into account the fact that some classes from one ontology might not have a counterpart in the other and picks the best estimate in this case. With regard to requirement 1, (common) semantic cotopy will only capture those aspects of class definitions which have an effect on the taxonomy, and CSC does fare well with regard to requirement 3 because it can identify structurally similar classes if new classes were introduced. It is also easy to adapt it to only consider classes from top-level ontologies if these appear in the characteristic extract of a class from the local model (requirement 4).

3.1.3 Semantical Measures
Apart from lexical and structural measures, there have been measures proposed that evaluate the similarity of ontologies in terms of their semantics [13, 1, 9]. These measures strive to compare the models induced by the ontology, e.g. by determining which consequences are entailed by both ontologies in question. Such assessments are difficult to conduct, because the number of consequences of an ontology is in principle infinite. For example, ‘A subClassOf: B’ also entails ‘A subClassOf: (B and B)’, ‘A subClassOf: (B and B and B)’, and so on, which is not particularly interesting. Hence a limited subset needs to be chosen. Naturally, this kind of measure is promising for our use case, but since it has only recently found limited implementations [10], we have concentrated on well-understood structural measures when developing our comparison tool.

3.2 Normalisation
Naïve lexical and structural measures are problematic when they are put to the task of specifying the similarity of more complex ontologies: The goal of ontologies is to provide a model that accurately represents some domain of reality. They thus thrive on explicit semantics, and it is debatable whether purely syntactical comparisons are sufficient here, since they also capture superficial syntactical differences between ontologies. In this light and in view of our requirements, we chose triple based entity similarity and common semantic cotopy as the most fitting candidates for the implementation of our software component.
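The following Java sketch (not the GoodOD implementation) simply replays the taxonomic precision computation from section 3.1.2 on the characteristic extracts of figure 1, together with the recall and F-measure values derived from it; the class names are the toy labels from the figure.

import java.util.*;

/** Illustrative replay (not the GoodOD implementation) of the taxonomic precision
 *  computation from section 3.1.2, using the characteristic extracts of figure 1. */
public class TaxonomicPrecisionSketch {

    /** tp = |ce_g ∩ ce_t| / |ce_g|, mirroring the formula given above. */
    static double taxonomicPrecision(Set<String> ceGold, Set<String> ceTest) {
        Set<String> shared = new HashSet<>(ceGold);
        shared.retainAll(ceTest);
        return (double) shared.size() / ceGold.size();
    }

    public static void main(String[] args) {
        // Characteristic extracts of class C from o_g and o_t (figure 1); "Top" stands for ⊤.
        Set<String> ceG = Set.of("Top", "A", "C", "D", "E");
        Set<String> ceT = Set.of("Top", "B", "C", "E");

        double tp = taxonomicPrecision(ceG, ceT);   // |{Top, C, E}| / 5 = 0.60
        double tr = taxonomicPrecision(ceT, ceG);   // recall, by swapping the roles: 0.75
        double f  = 2 * tp * tr / (tp + tr);        // harmonic mean (F-measure)
        System.out.printf("tp=%.2f tr=%.2f F=%.2f%n", tp, tr, f);
    }
}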
Additionally, we were very interested in further reducing the sensitivity of the measures to syntactic variations (requirement 2). To this end, Vrandečić and Sure [17, 18] have devised a normalisation algorithm that transforms ontologies into a more predictable form (while preserving the semantics), thus enriching the utility of structural similarity measures. This concept of normalisation has to be distinguished from the ‘untangling’ of poly-hierarchies, for which the term ‘normalisation’ is also used and which is mainly interested in presenting ontologies in a way that is more manageable for the ontology engineer [15]. Contrary to this, normalisation according to Vrandečić explicitly creates poly-hierarchies and proceeds as follows: 1. Create names for all anonymous complex class expressions 2. Create names for all anonymous individuals 3. Reason over the ontology in order to materialise all missing subsumption links 4. Clean redundant and circular subsumptions 5. Assert all individual and property instances to be of the most specific type 6. Normalise property usage, i.e. avoid inverses Step 1 modifies the class hierarchy so that all subClassOf: axioms are asserted exclusively between named classes. So whenever a complex class expression appears on the left or right hand side of a subClassOf: axiom, a new named class is introduced and defined by an equivalentTo: axiom to be equivalent to the original class expression. Step 2 similarly assigns explicit names to anonymous individuals in the ABox, which might be shown to be identical to other individuals by reasoning. Steps 3–5 are the ones crucial to representing the semantics of the graph, and thus rely on a description logics reasoner to be performed. Consequently, the reasoner is used in step 3 to create all subsumption links that are entailed by the ontology but had not been explicitly asserted. Since this produces a large number of redundant subClassOf: axioms, step 4 removes all subsumption links that hold by transitivity alone and replaces those which form a circle with equivalence axioms. Step 5 uses the inferred information on the level of individuals and assigns the inferred types to both property and individual instances, while step 6 merely picks a canonical name for all properties used (e.g. by replacing inverseOf property expressions with normal names). All these procedures preserve the semantics of the ontology while at the same time making sure that much implicit information is explicitly represented in the ontology graph, thus making it more amenable to structural similarity analysis. 4 METHODS 4.1 Improvements to the Normalisation Algorithm Since the explicit semantics of axiomatic class definitions were important for our evaluation, we regarded it as crucial to implement a variant of the aformentioned ontology normalisation, though we made some modifications in our implementation. On the one hand, we restricted the scope of the normalisation to suit our use case. Since our modelling experiments did not involve instance bases, we did not implement the steps 2 and 5. The experimental setup already included a stringent regimen on the use of properties (especially the students were required to use preexisting object properties), so step 6 was also expendable as well. We also discovered that the explicit cycle detection for the taxonomy in step 4 was not necessary, since the reasoner did already infer subsumption in both directions for every pair of classes that were part of a cycle. 
These could be replaced locally by equivalentTo: axioms. On the other hand, the suggestions of Vrandečić and Sure proved to be not sufficient in some cases. Among other things, we could not distinguish between cases where different class expressions were embedded in, for example, the range of an existential restriction (for example ‘A subClassOf: has locus some (B and C)’ as opposed to ‘A subClassOf: has locus some (C and D)’), for which the original normalisation proposal would only generate a named class that would very often be inferred to occupy a structurally similar place in the class hierarchy. We thus augmented the normalisation algorithm with a procedure that takes inspiration from OLA (cf. section 3.1.2), which uses a recursive algorithm to determine similarity, and from [13], who extract ‘signatures’ from class definitions in order to compare their similarity, where a signature is defined as the set of primitive concepts and restrictions used in the axioms that define the class. Our procedure works by explicitly generating named classes for the components of the class definition, so that they can be explicitly represented in the ontology graph. We derive the set of classes to be generated by computing what we call the decomposition set of the class definition. We defined the decomposition set as follows (the presentation follows OWL 2 functional syntax):

Definition 1 (Decomposition Set): Let C be a class expression; then the decomposition set DS(C) of C consists of C and the members of the following sets of classes:
• If C is an atomic class: {}
• If C is an ObjectIntersectionOf D1 ... Dn expression: the DSs of all conjuncts D1 ... Dn
• If C is an ObjectUnionOf D1 ... Dn expression: the DSs of all disjuncts D1 ... Dn
• If C is an ObjectComplementOf D expression: for every member M of DS(D), ObjectComplementOf M is a member of DS(C).
• If C is an ObjectOneOf E expression, where E is a set of individuals: the set of all ObjectOneOf expressions that can be constructed with members of the power set of E.
• If C is of the form K P D, where K is one of {ObjectSomeValuesFrom, ObjectAllValuesFrom, ObjectMinCardinality n, ObjectMaxCardinality n, ObjectExactCardinality n} and P is an object property expression: all class expressions that arise from combining K P with the members of DS(D).
• If C is an ObjectHasValue P I, ObjectHasSelf P, or data property restriction: {}

The connections between the original classes and the classes added this way are easily inferred by a reasoner. For example, the original expression ‘A subClassOf: has locus some (B and C)’ would decompose into the following set of additional axioms:

D1 equivalentTo: has locus some B
D2 equivalentTo: has locus some C
A subClassOf: D1
A subClassOf: D2

This kind of decomposition is of course far from complete. It does, for example, not take into account the property hierarchy of the ontology. One solution would be to also materialise the subsumption hierarchy of object property restrictions (e.g. if has part is a subproperty of has locus, ‘has part some B’ subsumes ‘has locus some B’). We decided not to incorporate this kind of decomposition rule because it seemed to unfairly penalise the absence of property restrictions for many similarity measures.

4.2 Implementation

4.2.1 Architecture
Our evaluation tool is implemented as a Java application with a command-line interface. It relies on the OWLAPI [12] for processing the OWL 2 ontologies and on the HermiT reasoner for reasoning over the ontologies, though we also make use of the JENA and OntoSim libraries (cf. 4.2.3). We provide rich facilities for customisation of the comparison process either through command-line switches or through configuration files in a property list format (plain text or XML), allowing users to quickly tailor comparisons to their needs. The main workflow that is constructed based on the configuration consists of normalisation and subsequent comparison of two individual ontologies or groups of ontologies. For performance reasons, the normalisation process was embedded into a caching mechanism, which avoids resource-intensive, redundant normalisation passes. The normalisation of individual ontologies, as well as the comparison of pairs of ontologies, is implemented in a multi-threaded fashion, so that modern multi-core hardware can be used efficiently. As output, our tool provides a command-line based summary of the comparisons conducted and also produces comma-separated value tables with the individual results.

Figure 2. Architecture of the GoodOD Evaluator

4.2.2 Normalisation Modules
Different aspects of normalisation are implemented by separate classes, which can be dynamically composed by instances of the NormalizerChain class to form the core normalisation component of the workflow. This allows for easy customisation of the normalisation process and facilitates extension. In detail, we provide classes for minor tasks such as rewriting IRIs or rerouting imports for the ontology. Different steps of the Vrandečić normalisation algorithm (creating names for class expressions, subsumption materialisation) are also implemented separately and can be plugged into the workflow as needed, as can the class responsible for implementing the decomposition set procedure described above. Since normaliser classes are loaded dynamically based on a configuration file, users can easily implement their own normalisation modules by having their classes conform to the Normalizer interface.

4.2.3 Comparison Modules
For comparison, we in part rely on the OntoSim library [8], which implements many similarity measures, often in a very API-agnostic way. This allowed us to easily plug into its functionality to provide lexical similarity measurement based on a cosine vector model. We also integrated triple based entity similarity using OntoSim, which was only possible by serialising our OWLAPI data structures in memory and importing them

Normalisation     unrestricted F-Measure (n = 312)     restricted F-Measure (n = 312)
                  Mean        SD                       Mean        SD
None              0.2552      0.2874                   0.2552      0.2874
With Imports      0.9385      0.0342                   0.7580      0.1230
Vrandečić         0.8641      0.0505                   0.7509      0.1066
GoodOD            0.8353      0.0616                   0.7220      0.1626

Table 1. Mean taxonomic F-measure for comparisons of student-generated ontologies against expert models, using various normalisation procedures.
Unfortunately, a general limitation of OntoSim is that it only takes into account classes defined in the ontologies for which the similarity is being computed, neglecting the impact of top-level ontology classes which were frequently used in our ontologies. For that reason, we also implemented both semantic cotopy and common semantic cotopy based taxonomic precision and recall to supplement OntoSim. This implementation can be configured to either take into account classes from imported ontologies or to ignore them. Also, including the entire imported top-level ontology was often not desirable since it was shared by all ontologies in our data set, so we added facilities for comparing only a limited set of pre-defined classes. As with the normalisation modules, comparison modules are resolved at runtime based on the configuration, so users can easily supplement the available similarity measures by writing classes implementing the Comparator interface. 5 RESULTS AND DISCUSSION We are making available the GoodOD Similarity Evaluator under the GNU General Public License (GPL) v3, from our website at http://purl.org/goodod/evaluator; the source code can be downloaded from https://github.com/goodod/ evaluator. In general, we found that we were able to cope even with large numbers of comparisons in reasonable amounts of time on commodity hardware. In our tests, a machine with two 1.6GHz x86-64 CPU cores and 4GB RAM computed 1369 similarity values for different pairs from a set of 325 ontologies using a complex normalisation chain in about 100 minutes wall time. These ontologies contained between 60 and 70 classes (including toplevel), with about the same number of classes being generated during normalisation. Since we have not yet exploited some of the more obvious avenues of optimisation (for example, not only different ontologies could be compared in parallel, but also individual classes), we are confident that the tool can be useful for larger ontologies as well. On the whole, the most resource intensive part of the whole process seems to be reasoning over ontologies that are enriched with a large number of additional class expressions during the normalisation process, which is a common problem for all applications that rely on reasoning over highly formal ontologies. We have not timed the individual steps of the computation, though. To assess the effect of the normalisation procedures, we performed an analysis of 13 × 24 comparisons of student ontologies against expert models from our dataset (table 1). The comparisons were performed both without restrictions on the scope of classes being compared (i.e. including the imports closure) and restricted to just the classes defined in the original ontology using common semantic cotopy (CSC) under a variety of normalisation procedures. The results show that not considering the imports closure of the ontology does leads to similarity measures that are hardly credible, which is probably due to the limited number of classes (usually 10– 15) in the artefacts considered, so that every variation has a large impact on the overall result. Also, computing similarity over the entire imports closure in an unrestricted way obscures all meaningful differences because the shared top-level ontology dwarfs the ontology fragments containing the actual domain models by an order of magnitude. 
Considering only the restricted results, the Vrandečić normalisation procedure – while having no marked effect on the mean similarity – slightly reduces the standard deviation (SD). This might, for example, be attributed to the fact that it eliminates differences arising where some ontologies were explicitly asserting taxonomic links, while others were leaving them implicit. Our augmented (‘GoodOD’) version that performs decomposition of class definitions reduces the observed similarity while at the same time increasing the standard deviation. This means that the differences between the ontologies in the field are slightly emphasised, but not in an implausible way. We attribute this effect to the fact that the decomposition set procedure brings to light subsumption relationships that were only implicit in the original taxonomy and thus would not be taken into account when computing similarity based on CSC. A principled evaluation of the merits of these measures is difficult, but they generally seem to be in line with the manual appraisal of a sample of the ontologies. Nonetheless, based on this results we conclude that our initial requirements were fulfilled by the evaluation software by a combination of the following factors: • By generating new classes using the decomposition set procedure, we could account for more expressive features of OWL 2 even though we only used structural similarity measures. (requirement 1) • By implementing the Vrandečić normalisation algorithm, we successfully suppressed a great deal of skew created by semantically equivalent syntactic differences. (requirement 2) • By using a structural similarity measure (such as triple based entity similarity or common semantic cotopy), we could successfully ignore lexical information as far as possible. (requirement 3). • By restricting the aggregation of individual similarity values to those obtained from classes in the local domain models, while still considering top-level classes as part of the characteristic extract in the CSC algorihtm, we faithfully represented the effect of the top-level ontology without obscuring the differences we were trying to assess. (requirement 4) We have used our tool for a detailed evaluation of our ontology teaching experiments [3], which generally produced meaningful results. But although our measurement facilities were initially tailored for this use-case, they should be transferable to other, more conservative, uses (for example in ontology learning) and are provided in a highly configurable, ready to use way. Paper K 5 Grewe, N., Schober D., and Boeker, M. 6 CONCLUSION We do not believe that measuring similarity is in general a useful tool for solving quality problems in biomedical ontologies because it assumes the existence of a known good model of the domain covered by the ontology. The problem would hence just be shifted to verifying the quality of that model. Nonetheless, we believe that the availability of ready-to-use software for performing such measurements can be beneficial at least in some cases. It has already proven to be very useful for examining the status of a student cohort in an ontology teaching setting, where students usually work on small, clearly delineated problems for which a canonical solution is already available. Another obvious application is quality assessment on ontologies generated by machine learning, since it can make a big difference for the outcome of an evaluation whether the effect of class definitions has been taken into account or not. 
Our work on the evaluation tool also shows that generic graphbased similarity measures have tremendous limitations when applied to ‘heavy duty’ ontologies that make extensive use of the more expressive subsets of OWL 2 (i.e. they only capture information that is explicitly represented in the graph structure). And while we have shown that there are workarounds for some of these issues, we believe that the entire enterprise will tremendously benefit from further research in and the implementation of bona-fide semantical similarity measures. ACKNOWLEDGMENTS This work is supported by the German Science Foundation (DFG) as part of the research project JA 1904/2-1, SCHU 2515/1-1 GoodOD (Good Ontology Design). REFERENCES [1]Rudi Araújo and Helena Sofia Pinto. ‘Towards Semantics-based Ontology Similarity’. In: Proceedings of the 2nd International Workshop on Ontology Matching (OM-2007) Collocated with the 6th International Semantic Web Conference (ISWC-2007) and the 2nd Asian Semantic Web Conference (ASWC-2007). Ed. by Pavel Shvaiko et al. 2007. URL: http://ceur-ws.org/ Vol-304/paper4.pdf. [2]Elena Beisswanger et al. ‘Biotop: An Upper Domain Ontology for the Life Sciences’. In: Applied Ontology 3.4 (2008) pp. 205–212. DOI: 10.3233/AO-2008-0057. [3]Martin Boeker et al. ‘Teaching Good Biomedical Ontology Design’. In: Proceedings of the 3rd International Conference on Biomedical Ontology (ICBO) ed. by Ronald Cornet and Robert Stevens. 2012. URL: http://ceur-ws.org/Vol-897/ sessionJ-paper25.pdf. [4]Horst Bunke and Kim Shearer. ‘A graph distance metric based on the maximal common subgraph’. In: Pattern Recognition Letters 19.3–4 (1998) pp. 255–259. DOI: 10.1016/S01678655(97)00179-7. 6 [5]Jérôme David and Jérôme Euzenat. ‘Comparison between ontology distances (preliminary results)’. In: Proceedings of the 7th International Semantic Web Conference (ISWC) ed. by Amit P. Sheth et al. 2008, pp. 245–260. [6]Klaas Dellschaft and Steffen Staab. ‘On How to Perform a Gold Standard Based Evaluation of Ontology Learning’. In: Proceedings of the 5th International Semantic Web Conference (ISWC) ed. by Isabel Cruz et al. 2006, pp. 228–241. [7]Jérôme Euzenat and Petko Valtchev. ‘Similarity-based ontology alignment in OWL-lite’. In: Proceedings of the 16th European Conference on Artificial Intelligence (ECAI) ed. by Ramon López De Mántaras and Lorenza Saitta. 2004, pp. 333–337. [8]Jérôme Euzenat et al. D3.3.4: Ontology distances for contextualisation. Technical Report. NeOn Consortium, 2009. [9]Jérôme Euzenat. ‘Semantic Precision and Recall for Ontology Alignment Evaluation’. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI) ed. by Manuela Veloso. 2007, pp. 348–353. [10]Daniel Fleischhacker and Heiner Stuckenschmidt. ‘A Practical Implementation of Semantic Precision and Recall’. In: Proceedings of the Sixth International Conference on Complex, Intelligent, and Software Intensive Systems (CISIS) ed. by Leonard Barolli et al. 2010, pp. 986–991. DOI: 10.1109/ CISIS.2010.97. [11]OBO Foundry. OBO Foundry Principles. URL: http://www. obofoundry.org/crit.shtml (visited on 03/08/2012) [12]Matthew Horridge and Sean Bechhofer. ‘The OWL API: A Java API for OWL Ontologies’. In: Semantic Web Journal 2.1 (2011) pp. 11–21. [13]Bo Hu et al. ‘Semantic metrics’. In: Proceedings of the 15th International Conference on Knowledge Engineering and Knowledge Management (EKAW) ed. by Steffen Staab and Vojtech Svátek. 2006, pp. 166–181. [14]Peter F. Patel-Schneider and Boris Motik, eds. 
OWL 2 Web Ontology Language Mapping to RDF Graphs. 2009. URL: http: //www.w3.org/TR/2009/REC-owl2-mapping-tordf-20091027/ (visited on 06/09/2012) [15]Alan L. Rector. ‘Normalisation of ontology implementations: Towards modularity, re-use, and maintainability’. In: Proceedings Workshop on Ontologies for Multiagent Systems (OMAS) in conjunction with European Knowledge Acquisition Workshop. 2002. [16]Linda G. Shapiro and Robert M. Haralick. ‘Structural Descriptions and Inexact Matching’. In: IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-3.5 (1981) pp. 504–519. DOI: 10.1109/TPAMI.1981.4767144. [17]Denny Vrandečić and York Sure. ‘How to Design Better Ontology Metrics’. In: Proceedings of the 4th European Semantic Web Conference (ESWC) ed. by Enrico Franconi Michael Kifer and Wolfgang May. 2007, pp. 311–325. [18]Denny Vrandecic. ‘Ontology Evaluation’. PhD thesis. Karlsruhe: KIT, Fakultät für Wirtschaftswissenschaften, 2010. Paper K ZooAnimals.owl: A didactically sound example-ontology for teaching description logics in OWL 2 Daniel Schober1*, Niels Grewe2, Johannes Röhl2, Martin Boeker1 1 Institute of Medical Biometry and Medical Informatics (IMBI), University Medical Center, Freiburg, Germany University of Rostock, Rostock, Germany 2 ABSTRACT Motivation: Over the years several OWL ontologies have been published to serve as training examples in teaching the principles of description logics and ontology development to a wide audience. Problem: As some of these resources make no commitment to the biomedical target domain, they increase the learning threshold by adding the burden of an unfamiliar topic area in addition to the technicalities and the intended domain independent description logics principles. Solution: We argue that the outstanding role of formal ontologies in the life sciences is important enough to warrant a robust teaching ontology that is optimized for the biomedical ontology novice intending to learn the basics of description logics in OWL 2. Result: We present a list of requirements for user compliant teaching ontologies. We also exemplify these in the ZooAnimalOntology justifying design decisions with intuitive teaching examples and exercises, carried out and tested in a didactical teaching effort, the GoodOD summer school. 1 INTRODUCTION Drawbacks of existing teaching ontologies Over the years several OWL DL ontologies have been published to serve as training examples in teaching the principles of description logics. However, most existing resources display one or more of the following drawbacks1: · They are often insufficiently aligned to the targeted recipients in the biomedical domain. The modelling domains do not match user requirements particularly well, as biomedical researchers are not necessarily experts in Wines, Pizzas nor Marsupials. Hence learning description logic (DL) principles is distracted by learning domain knowledge in remote fields, sometimes – although being entertaining – misleading the novice to model unicorns, where not needed. · Existing teaching artefacts are also built according to diverse, sometimes conflicting ontological commitment and lack a clearly delineated usage scope that would justify the exact expressivity they use. In the Protégé tutorials for example the expressivity seems to be guided * solely by the intent to explain all the Protégé tools’ capabilities. · No explicit commitment to certain modelling policies. Most teaching ontologies do not clarify to what kinds of entities they commit, i.e. 
they fail to state their criteria for discriminating classes and individuals, or fail to discuss whether information content entities are admissible in the ontology, etc.
· They do not make use of established top or upper level ontologies (TLOs), but invent their own upper level and their own relations, again increasing the burden of orientation within the artefact.
· They use overly granular TLOs with hard-to-grasp class names such as independent continuant and occurrent.
· They are insufficiently restricted, specified and justified with respect to a certain modelling flavour and OWL expressivity. This would ideally be done according to a targeted application use case.
· They lack instance data that could serve as examples.
· They are oversimplified: they are not realistic in terms of the granularity and complexity of their axiomatic descriptions. Simplicity can mislead modellers into judging ontology engineering to be trivial.
To provide intuitive teaching examples within the GoodOD project [2], a content domain had to be selected that alleviates the drawbacks listed above. Building on our earlier analysis of ontology simplification [3], we here define a requirement list for such a didactic artefact. An OWL ontology was implemented as a practically usable example along these requirements.

MATERIALS AND METHODS
The domain to be represented in our didactic ontology was chosen according to our target user group: life science students with some IT background. The resource was first developed as one comprehensive artefact containing all the teaching examples in a single file importing BioTop [4], which is compliant with the established TLOs BFO and DOLCE. BioTop, however, is too complex and distracting for the purpose of a teaching ontology; important classes like 'Human', for example, reside ten hierarchy levels below the root, which makes orientation difficult for the novice. We therefore simplified the standard BioTop into a BioTop-light version and re-implemented only the needed parts of it under the ZooAnimal namespace. Classes and relations that were not used were deleted, e.g. TaxonValueRegion and CanonicityValueRegion. ImmaterialThreeDimensionalPhysicalEntity was deleted, but its more intuitive subclass Place was retained under the root. The class Living was moved from BiologicalLife to Action, and BiologicalLife was deleted. Inverse relations were only added where they served examples. Domain and range constraints as well as relation properties such as transitivity were re-added manually. Some names were simplified, e.g. from hasProperPhysicalPart to hasProperPart. We derived ubiquitously known animal individual names from a suitable wiki page [5] (e.g. Keiko the Orca), but also generated typical anonymized instance names, e.g. Cow #213, and even photos of individuals. Class labelling and metadata completeness were checked with the OntoCheck tool [6]. To keep the working examples simple for the curriculum exercises, single modules were refactored out and presented to the students as small self-standing OWL files.

2 RESULTS
Informative entity naming: We have elaborated on intuitive and explicit class labelling best practices extensively in another paper [8].
Cognitive ergonomics aligned with the mesocosm: Most modelling cases in the Zoo ontology are taken from the traceable mesocosm [9]. The scale of all participating individuals falls within a human being's 'Lebenswelt'. These can further be separated along the dimensions of the major top level categories, e.g.
for material objects it is the meso-scale size/length, for processes the meso-scale duration/time. 'Meso-world facts' are likely to be close to the way humans think: most of this world is directly and immediately accessible to our senses, and our brains trained and co-evolved their cognitive models [10] in coherence with this directly perceivable meso-world during childhood and socialization [11]. Many facts in such a domain are therefore likely to be aligned with categorizations that already exist in our heads. This enables modellers to memorize and verify class definitions and drawn inferences quickly, without being distracted from the didactic principles to be conveyed. Our animal domain is hence cognitively intuitive and ergonomic [12] in comparison to domains from the micro level such as genetics or metabolomics.

Requirements for domain-aligned teaching ontologies
We here outline some requirements for good teaching ontologies.
Artefact size: The ontology should be as small as possible to foster fast memorization and orientation.
Domain alignment: We chose to model the area of zoo animal taxonomy at the easy level of what is written, for example, on the signposts a zoo displays for each of its animals, listing characteristic features, peculiarities, geographic origin and some contextual knowledge. This is likely to be common-sense knowledge for the life science domain, something any biomedical scientist can relate to immediately, i.e. understand right away, without the time-consuming need to refresh their knowledge from external sources. A lack of domain knowledge therefore does not distract from the didactics of implementation and modelling tasks.
Delineated background ontology commitment: No matter whether one is a realist or a pragmatist, some words on how this conflict is handled should be given when describing the intended teaching artefact. We subscribe to an intermediate position of pragmatic realism and have no problem representing Unicorns as an InformationEntity.
Coverage and granularity: The domain should be rich enough to cover all classic modelling challenges within the scope of our curriculum. The granularity should fit the compositional approach taken in DL-based languages.
Complexity: The ontology should not contain too many overly complicated, long or nested restrictions. The simpler the axiomatizations, the easier they are grasped by the novice, which fosters ontology understanding [7].

The ZooAnimal.owl teaching ontology
The ZooAnimal.owl ontology, of OWL 2 expressivity SHO, currently consists of 123 classes, the root classes being Action, Disposition, InformationObject, MaterialObject, Place, Quality and ValueRegion. 11 individuals, 159 subclass axioms and 42 equivalent class axioms were asserted (with a hidden GCI count of 36). Of the 11 object properties (all from BioTop-lite), two are subproperties and three are transitive. All object properties except one, the hasPart superproperty, have domain and range specified. The ontology contains no number or value restrictions, as these would burden tableau-based reasoning and fit better into an advanced curriculum. The OWL file is available on our web pages: http://www.imbi.uni-freiburg.de/ontology/ZooAnimals. The actual didactic exercises of the curriculum will be described in another paper.
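To make this structure concrete, the following fragment sketches, in OWL 2 Manchester syntax, what a minimal skeleton of this kind can look like. It is an illustration rather than a verbatim excerpt from ZooAnimals.owl: the ontology IRI simply reuses the download location given above, the domain and range chosen for hasProperPart are assumptions, and the lines starting with '#' are explanatory comments.

    Prefix: : <http://www.imbi.uni-freiburg.de/ontology/ZooAnimals#>
    Ontology: <http://www.imbi.uni-freiburg.de/ontology/ZooAnimals>

    # The seven root classes of the teaching ontology
    Class: Action
    Class: Disposition
    Class: InformationObject
    Class: MaterialObject
    Class: Place
    Class: Quality
    Class: ValueRegion

    # hasPart is the only object property deliberately left without domain and range
    ObjectProperty: hasPart

    # simplified from BioTop's hasProperPhysicalPart; transitive,
    # with illustrative domain and range constraints
    ObjectProperty: hasProperPart
        SubPropertyOf: hasPart
        Characteristics: Transitive
        Domain: MaterialObject
        Range: MaterialObject

    ObjectProperty: hasGranularPart
        SubPropertyOf: hasPart

Such a stub can be opened directly in Protégé 4 and then extended with the domain classes and examples discussed below.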
Coverage of topics and examples
The ontology contains all standard expressions most likely to occur in our domain. We here list examples to show how they are expressed correctly in OWL 2 in our teaching ontology.

DL language constructs:
Set-theoretic basics and taxonomy: Venn diagrams were used to introduce sets of individuals and to derive classes from common properties. The meaning of the SubClassOf relation in building the taxonomic class hierarchy, the only relation between classes, was explained here.
Logical operators: We introduced intersections, unions and complements, e.g.
KoalaBear SubClassOf (Mammal and Herbivore)
BirdThatCanNotFly EquivalentTo Bird and (not (bearerOf some FlyingDisposition))
Relations and their properties: Domain and range, as well as transitivity, were explained using partonomies, e.g. VertebralColumn hasProperPart some Vertebra and Vertebrata hasProperPart some VertebralColumn, so that Wolf SubClassOf Vertebrata → Wolf hasProperPart some Vertebra by transitivity. Inverses such as hasProperPart and properPartOf were explained.
Defined classes, axiomatisation: Canonical insects have 6 legs, spiders have 8 legs. As insects can survive with fewer than 6 legs, this example can also serve to discuss the need to introduce more granular but unintuitive modelling constructs such as dispositions.
Property hierarchies: E.g. hasPart with its subproperties hasGranularPart and hasProperPart.
Restriction types: EquivalentTo vs. SubClassOf. E.g. Bird SubClassOf hasPart some Beak, but the reverse statement, that everything that has some beak as part is a bird, is falsified by the Platypus, a mammal with a beak.
Existential vs. universal quantifiers: E.g. a KoalaBear that eats only EucalyptusPlant and nothing else. Usually the universal quantifier is accompanied by a closure axiom: in this case we also need to assert an existential quantification.
Reasoning for inference and consistency checks: E.g.
Bird EquivalentTo Vertebrate and hasProperPart some BirdWing
Penguin SubClassOf Vertebrate and hasProperPart some BirdWing
→ Inference: Penguin SubClassOf Bird.
Dolphin SubClassOf Fish (with Fish and Mammal disjoint) → the reasoner detects Dolphin as unsatisfiable, as it also carries mammal characteristics.

Modelling best practices and patterns:
Partitions: We introduced partitioning nodes for different classification schemes, i.e. AnimalByAnatomyPartition, AnimalByLocomotionPartition, AnimalByHabitatPartition and AnimalByNutritionPartition. E.g. we added an "and (bearerOf some FlyingDisposition)" restriction to FlyingFish to let a reasoner classify it also under FlyingAnimal within the AnimalByLocomotionPartition.
Phylogenetics (derivesFrom, hasLocus): A common phylogenetic origin is a good approximation for similarities, as the homologous development of a bat wing and a bird wing shows; the reverse, however, is not true, as the analogous development of the hydrodynamic shape of shark (fish) and dolphin (mammal) illustrates. We also defined restrictions for the African and the Indian elephant to reflect their different countries of origin, e.g.
IndianElefant EquivalentTo derivesFrom some (Elephant and (participatesIn some (Living and (hasLocus value India))))
Qualities: We added a colour quality to the green grasshopper:
GreenGrashopper SubClassOf Hexapoda and bearerOf some (Color and (qualityLocated some Green))
Ontology design patterns (ODPs): E.g. entailments for an exception pattern that classifies Penguin SubClassOf AtypicalBird:
Bird EquivalentTo AtypicalBird or TypicalBird
Penguin SubClassOf Bird
Penguin SubClassOf not (bearerOf some FlyingDisposition)
TypicalBird EquivalentTo Bird and (bearerOf some FlyingDisposition)
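For the curriculum, the exception-pattern axioms listed above can be handed to students as a small self-standing Manchester syntax module. The following sketch assumes the namespace of the skeleton fragment given earlier; the bare Class: and ObjectProperty: frames only declare the entities so that the module parses on its own, and the closing comment states the entailment that a DL reasoner such as HermiT computes from it.

    # Declarations so that this module is self-contained
    ObjectProperty: bearerOf
    Class: FlyingDisposition
    Class: Bird
    Class: TypicalBird
    Class: AtypicalBird
    Class: Penguin

    # The exception pattern
    Class: Bird
        EquivalentTo: AtypicalBird or TypicalBird
    Class: TypicalBird
        EquivalentTo: Bird and (bearerOf some FlyingDisposition)
    Class: Penguin
        SubClassOf: Bird,
            not (bearerOf some FlyingDisposition)

    # Entailment computed by a DL reasoner: Penguin SubClassOf AtypicalBird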
Atypicality: E.g. introducing a class Beak (SubClassOf BodyPart) and including 'hasProperPart some Beak' in the definition of Bird, then introducing the atypical Platypus SubClassOf (Mammal and hasProperPart some Beak). Regarding typical classes, e.g. 'Bird', and their exceptions, e.g. 'Ostrich' SubClassOf 'BirdThatCannotFly', one has to admit that 'typicality' is hard to model in OWL and needs some form of non-monotonic reasoning.
Class border problems: Is Archaeopteryx a bird and a reptile? Does it exist at all, being extinct?
Identity: This issue can only be discussed by means of workarounds, as OWL cannot handle temporal dynamics and reclassification, i.e. cases where an instance of Caterpillar becomes an instance of Butterfly after metamorphosis (individual stages of the same insect). External rules or the Ontology Pre-Processing Language OPPL (http://oppl2.sourceforge.net/) might be introduced here.
Collections and grains: The StandardFood for a Horse (UngulateFood) consists of Water, Hay, Cereal and HappyHorsePowder, which in turn consists of Mineral and Vitamin. The cereal composition varies according to the type of horse: Horses get Oat cereals; Zebras get a mix of Oat and Wheat.
ImmaterialObjects and boundaries: A Cage consists of a CageFrame and an InteriorOfCage, which borders on the Cage. An InhabitedCage can be defined as a cage whose interior is occupied by some animal. The habitat of a Jaguar can be described via the place a JaguarPopulation lives in.
InformationObjects: Animals are identified via a unique code on an implanted RFID chip. A FeedingPlan, capturing food, time and frequency for different animals, is represented such that it is classifiable as an InformationObject.
Closure pattern: A KoalaPopulation consists of at least one (some) KoalaBear in the range of the hasGranularPart relation, and all further fillers in this range can only be of the type KoalaBear (only, closure):
KoalaPopulation: (hasGranularPart some KoalaBear) and (hasGranularPart only KoalaBear)
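The closure pattern, too, can be packaged as a small exercise module. The sketch below (again assuming the namespace of the skeleton fragment above) also adds the koala diet example from the quantifier section; the property name eats is a readability simplification introduced only for this sketch, and whether such axioms are better stated as SubClassOf or EquivalentTo is itself a worthwhile discussion point in the course.

    # Declarations so that this module is self-contained
    ObjectProperty: hasGranularPart
    ObjectProperty: eats
    Class: KoalaBear
    Class: EucalyptusPlant
    Class: KoalaPopulation

    # Closure on the population: at least one koala, and nothing but koalas
    Class: KoalaPopulation
        SubClassOf:
            (hasGranularPart some KoalaBear)
            and (hasGranularPart only KoalaBear)

    # Closure on the diet: 'only' alone would also be satisfied by a koala
    # that eats nothing, hence the additional existential restriction
    Class: KoalaBear
        SubClassOf:
            (eats some EucalyptusPlant)
            and (eats only EucalyptusPlant)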
3 DISCUSSION
Comparison with existing artefacts: The ZooAnimal ontology differs from classic biological taxonomy by a massive reduction of classes to a minimal set that is easy to learn and grasp, the differentiae being clear and intuitive since only animals everybody knows are included. Another factor increasing ontological ergonomics is the use of colloquial rather than Latin naming. The ontology's capability of automatically inferring multiple parenthood renders it superior to standard taxonomy, where Carnivore is an order under the class Mammal, neglecting the fact that there are carnivorous animals, such as the alligator, that are not mammals.
Embedding into an ontology engineering tutorial: Some practical exercises were carried out before introducing ontology engineering in Protégé 4 (P4). One was to group cards of individuals according to common properties (set theory). Here the usual traps could be provided (a Dolphin is not a Fish, a Kiwi is a Bird without wings), and the limitation of physical representations without the possibility of multiple parenthood (Flipper as Mammal and WaterAnimal) was illustrated practically. A next exercise was to assign class names to the generated sets of individuals. These practical examples were later implemented in Protégé using the Zebra Fish Anatomy ontology (ZAO). In another practical exercise a semantic network from the ZAO was presented as a graph of classes connected by unlabelled relations, which had to be labelled with object properties from the ZAO. Not all of the curriculum topics and examples have been implemented in ZAO yet, but these can successively be transferred from the available exercise OWL module files. Some more difficult problems have not yet been modelled in our artefact but might be modelled later for an advanced course.
Limitations: Top-level-specific granularity assumptions, such as the use of dispositions, and the addition of closure axioms can render axiomatic definitions quite complex and unsuited for the novice:
HerbivoreAnimal EquivalentTo AnimalByNutritionPartition and (bearerOf some (Disposition and (hasRealization only (Eating and ((hasPatient some Plant) and (hasPatient only Plant))))))
(The hasPatient relation refers to any participant in a process rather than to a hospital patient.) Such precise, but complex and long axiomatizations were only presented at the end of the curriculum and in the self-standing complete ontology; in the earlier teaching modules they were simplified. In addition, tool support for ontology understanding might be investigated, as discussed by Ernst et al. [13]. Another example of didactically distracting complexity is the labelling and use of artificially sounding partition classes, e.g.
MarineAnimal EquivalentTo AnimalByHabitatPartition and (agentIn some (Living and (hasLocus some MarineHabitat)))
This expression could possibly be simplified to
MarineAnimal EquivalentTo Animal and (livesIn some Sea),
using the partition's superclass and more intuitive labels and relations.

4 CONCLUSION
We have presented characteristics of DL ontologies that are to serve as teaching resources and introduced a first draft of a didactically sound teaching ontology. Although our resource is aligned to the particular needs of the novice in biomedical ontology engineering and intended for teaching within the OWL 2 description logics regime, we believe it can be used to teach a novice from any application domain background. Adhering strictly to the mesocosm modelling level and to a domain almost everybody is more or less familiar with makes it almost universally applicable. Practical use of this teaching resource in a two-week curriculum has shown promising results. It was difficult to balance the need for formal rigour dictated by the chosen semantics and expressivity against the requirements of simplicity, ergonomics and intuitiveness. One solution was the separation of the teaching resource into small modules and fragments that are aligned to a given stage of the curriculum. This way the teaching resource can grow more complex in a recursive and incremental manner, allowing the students to follow the curriculum more easily.

ACKNOWLEDGEMENTS
This work is supported by the German Research Foundation (DFG) as part of the research project JA 1904/2-1, SCHU 2515/1-1 GoodOD (Good Ontology Design).

REFERENCES
[1] Boeker M, et al.: Teaching Good Biomedical Ontology Design. In: Proceedings of the 3rd International Conference on Biomedical Ontology (ICBO), Graz, 2012.
[2] The GoodOD Project, http://www.iph.uni-rostock.de/GoodOntology-Design.902.0.html, last accessed 20.07.2012.
[3] Schober D, Boeker M: Ontology Simplification: New Buzzword or Real Need? OBML 2010, Mannheim. In: Herre H, Hoehndorf R, Kelso J, Schulz S (eds.): IMISE-Report 2010, Institut für Medizinische Informatik, Statistik und Epidemiologie (IMISE), Markus Löffler, 2010, M1-5.
Available at: http://www.onto-med.de/obml/ws2010/obml2010report.pdf.
[4] Beisswanger E, et al.: BioTop: An Upper Domain Ontology for the Life Sciences - A Description of its Current Structure, Contents, and Interfaces to OBO Ontologies. Applied Ontology 3(4), 2008, pp. 205-212.
[5] List of known animals, http://de.wikipedia.org/wiki/Liste_bekannter_Tiere, last accessed 20.07.2012.
[6] Schober D, Tudose I, Svátek V, Boeker M: OntoCheck: Verifying Ontology Naming Conventions and Metadata Completeness in Protégé 4. Journal of Biomedical Semantics (JBMS), invited submission, OBML 2011, in press, August 2012, http://www.jbiomedsem.com/.
[8] Schober D, Smith B, Lewis S, et al. (2009): Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics 10(1).
[9] Gerhard Vollmer: Evolutionäre Erkenntnistheorie. Hirzel, Stuttgart, 1st ed. 1975, 2nd ed. 1980, p. 161.
[10] Evolutionary Epistemology, http://en.wikipedia.org/wiki/Evolutionary_epistemology, last accessed 20.07.2012.
[11] Konrad Lorenz: Die Rückseite des Spiegels. Versuch einer Naturgeschichte des menschlichen Erkennens. 1973.
[12] Cognitive Ergonomics, http://en.wikipedia.org/wiki/Cognitive_ergonomics, last accessed 20.07.2012.
[13] Ernst NA, Storey M-A, Allen P: Cognitive support for ontology modeling. Int. J. Hum.-Comput. Stud. 62(5):553-577, 2005.

Previously published in the IMISE-REPORTS series:

2002
1/2002 Barbara Heller, Markus Löffler: Telematics and Computer-Based Quality Management in a Communication Network for Malignant Lymphoma
2/2002 Barbara Heller, Katrin Kühn, Kristin Lippoldt: Report OntoBuilder
3/2002 Barbara Heller, Katrin Kühn, Kristin Lippoldt: Handbuch OntoBuilder
4/2002 Barbara Heller, Katrin Kühn, Kristin Lippoldt: Leitfaden für die Eingabe von Begriffen in den OntoBuilder
5/2002 Mitarbeiter des IMISE: Skriptenheft für Medizinstudenten - Medizinische Biometrie, Medizinische Statistik und Informatik (Kursus zum Ökologischen Stoffgebiet)

2003
1/2003 Birgit Brigl, Thomas Wendt, Alfred Winter: Ein UML-basiertes Meta-Modell zur Beschreibung von Krankenhausinformationssystemen
2/2003 Thomas Wendt: Modellierung von Architekturstilen mit dem 3LGM²
3/2003 Birgit Brigl, Thomas Wendt, Alfred Winter: Requirements on tools for modeling hospital information systems
4/2003 Madlen Dörschmann: Evaluation der Fehlerhäufigkeit im Rahmen einer Klinischen Studie
5/2003 Mohammad Zaino: Statistische Analyse zur Aufdeckung von neurotoxischen Störungen infolge langjähriger beruflicher Schadstoffexposition
Mitarbeiter des IMISE: Skriptenheft zum SPSS-Kurs - Kurs zur Auswertung medizinischer Daten unter Verwendung des Statistikprogramms SPSS

2004
1/2004, 2/2004 Renate Abelius, Barbara Heller, Luisa Mantovani, Frank Meineke, Roman Mishchenko, Jan Ramsch: Standardisierung von Studienkurzprotokollen - Qualitätsgesicherte rechnerbasierte Erfassung, Verarbeitung und Speicherung
3/2004 Jan Ramsch, Renate Abelius, Barbara Heller, Luisa Mantovani, Frank Meineke, Roman Mishchenko: Therapieschemata - Qualitätsgesicherte vereinheitlichte rechnerbasierte Erfassung, Verarbeitung und Speicherung
4/2004 Jan Ramsch: Variabilität beim Einsatz von onkologischen Therapieschemata - Erkennung von Ausnahmen und resultierenden Therapieänderungen
5/2004 André Wunderlich (Diss.):
Prognostische Faktoren für chemotherapieinduzierte Toxizität in der Behandlung von Malignomen speziell bei aggressiven Non-Hodgkin-Lymphomen
6/2004 Mitarbeiter des IMISE: Skriptenheft für Medizinstudenten - Methodensammlung zur Auswertung klinischer und epidemiologischer Daten
7/2004 Grit Meyer (Diss.): Charakterisierung der zellkinetischen Wirkungen bei exogener Applikation von Erythropoetin auf die Erythropoese des Menschen mit Hilfe eines mathematischen Kompartimentmodells

2005
1/2005 Ingo Röder (Diss.): Dynamic Modeling of Hematopoietic Stem Cell Organization - Design and Validation of the New Concept of Within-Tissue Plasticity
2/2005 Katrin Braesel (Dipl.): Modellierung klonaler Kompetitionsprozesse hämatopoetischer Stammzellen mit Hilfe von Computersimulationen
3/2005 Dr. Barbara Heller (Habil.): Knowledge-Based Systems and Ontologies in Medicine

2006
1/2006 Alexander Strübing, Ulrike Müller: Evaluation des 3LGM² Baukastens - Studienplan - Ergebnisse - Auswertung
2/2006 Marc Junger (Diss.): Benutzermodellierung bei der Qualitätssicherung im onkologischen Studienmanagement
3/2006 Thomas Wendt (Diss.): Modellierung und Bewertung von Integration in Krankenhausinformationssystemen

2007
1/2007 Markus Kreuz (Dipl.): Entwicklung und Implementierung eines Auswertungswerkzeuges für Matrix-CGH-Daten
2/2007 Mitarbeiter des IMISE: Skriptenheft für Studenten - Methodensammlung zur Auswertung klinischer und epidemiologischer Daten
3/2007 Frank Meineke (Diss.): Räumliche Modellierung und Simulation der Organisations- und Wachstumsprozesse biologischer Zellverbände am Beispiel der Dünndarmkrypte der Maus

2008
1/2008 Daniel Müller-Briel (Dipl.): Standardisierung klinischer Studienprotokolle unter Berücksichtigung der Therapieplanung

2010
1/2010 A. Winter, L. Ißler, F. Jahn, A. Strübing, T. Wendt: Das Drei-Ebenen-Metamodell für die Modellierung und Beschreibung von Informationssystemen (3LGM² V3)
2/2010 H. Herre, R. Hoehndorf, J. Kelso, S. Schulz: OBML 2010 Workshop Proceedings, Mannheim, September 9-10, 2010

2011
1/2011 H. Herre, R. Hoehndorf, F. Loebe: OBML 2011 Workshop Proceedings, Berlin, October 6-7, 2011