IMISE-REPORTS
Edited by Professor Dr. Markus Löffler
M. Boeker, H. Herre, R. Hoehndorf, F. Loebe (Eds.)
OBML 2012
Workshop Proceedings
Dresden, September 27-28, 2012
IMISE-REPORT Nr. 1/2012
Medizinische Fakultät
Imprint
Publisher: Universität Leipzig
Medizinische Fakultät (Faculty of Medicine)
Institut für Medizinische Informatik, Statistik und Epidemiologie (IMISE)
Härtelstraße 16-18, 04107 Leipzig
Prof. Dr. Markus Löffler
Editors: Martin Boeker, Heinrich Herre, Robert Hoehndorf, Frank Loebe
Editorial office: Frank Loebe
Contact: Phone: (0341) 97-16100, Fax: (0341) 97-16109
Internet: http://www.imise.uni-leipzig.de
Editorial deadline: September 17, 2012
Printing: Content: Universitätsklinikum Leipzig AöR, Bereich 2 - Abteilung Zentrale Vervielfältigung/Formularwesen
Cover: Buch- und Offsetdruckerei Herbert Kirsten
IMISE 2012 (report as edited volume). The copyright of the individual articles remains with the authors.
All rights reserved. Reprinting is permitted only with the express consent of the publisher
or the respective authors, and with attribution of the source.
ISSN 1610-7233
Proceedings of the
4th WORKSHOP OF THE
GI WORKGROUP
“ONTOLOGIES IN BIOMEDICINE
AND LIFE SCIENCES”
(OBML)
Dresden, Germany
September 27-28, 2012
Group Website:
https://wiki.imise.uni-leipzig.de/Gruppen/OBML
Organizers
Martin Boeker, University Medical Center Freiburg
Heinrich Herre, University of Leipzig
Robert Hoehndorf, University of Cambridge, UK
Frank Loebe (chair), University of Leipzig
Local Organizer
Michael Schroeder, Technical University Dresden
Keynote Speakers
Francisco M. Couto, University of Lisbon, Portugal
Georgios V. Gkoutos, Aberystwyth University, UK
Michael Schroeder, Technical University Dresden
Program Committee
Martin Boeker (program chair), University Medical Center Freiburg
Heinrich Herre (program chair), University of Leipzig
Patryk Burek, University of Leipzig
Fred Freitas, Federal University of Pernambuco, Recife, Brazil
Georgios V. Gkoutos, Aberystwyth University, UK
Giancarlo Guizzardi, Federal University of Espirito Santo, Brazil
Robert Hoehndorf, University of Cambridge, UK
Ludger Jansen, University of Rostock
Janet Kelso, Max Planck Institute for Evolutionary Anthropology, Leipzig
Toralf Kirsten, University of Leipzig
Frank Loebe, University of Leipzig
Axel Ngonga-Ngomo, University of Leipzig
Anika Oellrich, European Bioinformatics Institute, Cambridge, UK
Roberto Poli, University of Trento, Italy
Dietrich Rebholz-Schuhmann, European Bioinformatics Institute, Cambridge, UK
Peter Robinson, Charité Berlin
Daniel Schober, University Medical Center Freiburg
Paul N. Schofield, University of Cambridge, UK
Michael Schroeder, Technical University Dresden
Stefan Schulz, Medical University of Graz, Austria
Luca Toldo, Merck KGaA, Darmstadt
George Tsatsaronis, Technical University Dresden
Additional Reviewers
John Hancock, Medical Research Council, Harwell, Oxfordshire, UK
Oliver Kutz, University of Bremen
Filipe Santana da Silva, Federal University of Pernambuco, Recife, Brazil
Authors
Maria Anderberg, Karolinska Institute, Stockholm, Sweden
Clemens Beckstein, University of Jena
Martin Boeker, University Medical Center Freiburg
Bernard de Bono, European Bioinformatics Institute, Cambridge, UK
Mathias Brochhausen, University of Arkansas for Medical Sciences, Little Rock, AR, USA
Daniel L. Cook, University of Washington, Seattle, USA
Francisco M. Couto, University of Lisbon, Portugal
Samuel Croset, European Bioinformatics Institute, Cambridge, UK
Mikael Eriksson, Karolinska Institute, Stockholm, Sweden
Martin N. Fransson, Karolinska Institute, Stockholm, Sweden
Ana T. Freitas, Technical University of Lisbon, Portugal
Georg Fuellen, University of Rostock
John H. Gennari, University of Washington, Seattle, USA
Georgios V. Gkoutos, Aberystwyth University, UK
Christoph Grabmüller, European Bioinformatics Institute, Cambridge, UK
Pierre Grenon, European Bioinformatics Institute, Cambridge, UK
Niels Grewe, University of Rostock
Robert Hoehndorf, University of Cambridge, UK
William R. Hogan, University of Arkansas for Medical Sciences, Little Rock, AR, USA
Ludger Jansen, University of Rostock
Paul Kasteleyn, Leiden University, The Netherlands
Sanela Kjellqvist (form. Kurtovic), Karolinska Institute, Stockholm, Sweden
Christian Knüpfer, University of Jena
Natallia Kokash, Leiden University, The Netherlands
Andreas Kurtz, Charité Berlin
Ulf Leser, Humboldt University of Berlin
Jan-Eric Litton, Karolinska Institute, Stockholm, Sweden
Catia M. Machado, University of Lisbon, Portugal
Colin McKerlie, Hospital for Sick Children, Toronto, Canada
Roxana Merino-Martinez, Karolinska Institute, Stockholm, Sweden
Maxwell Neal, University of Washington, Seattle, USA
Loreana Norlin, Karolinska Institute, Stockholm, Sweden
Anika Oellrich, European Bioinformatics Institute, Cambridge, UK
Dome Potikanond, Leiden University, The Netherlands
Dietrich Rebholz-Schuhmann, European Bioinformatics Institute, Cambridge, UK
Johannes Röhl, University of Rostock
Daniel Schober, University Medical Center Freiburg
Paul N. Schofield, University of Cambridge, UK
Beth A. Sundberg, The Jackson Laboratory, Bar Harbor, Maine, USA
John P. Sundberg, The Jackson Laboratory, Bar Harbor, Maine, USA
Fons Verbeek, Leiden University, The Netherlands
Preliminary Program
as of September 16, 2012
THURSDAY, Sep 27, 2012
12:00-13:00  Getting together / Registration / Coffee
13:00-13:20  Welcome Remarks (R. Hoehndorf)
13:20-14:15  Keynote: GoPubMed: Semantic Search for the Life Sciences (M. Schroeder)
14:15-14:30  Coffee
Session 1: Ontologies in the Clinical Environment (Chair: M. Boeker)
14:30-14:50  M. Brochhausen: Developing an Ontology for Sharing Biobank Data based on the BBMRI Minimum Data Set MIABIS
14:50-15:10  C. Machado: Enrichment Analysis Applied to Disease Prognosis
15:10-15:30  S. Croset: Integration of the Anatomical Therapeutic Chemical Classification System and DrugBank using OWL and Text-Mining
15:30-16:00  Coffee
Session 2: Special Topic: Representations of Phenotype and Pathology (Chair: R. Hoehndorf)
16:00-16:20  L. Jansen: Using Ontologies to Study Cell Transitions
16:20-16:40  A. Oellrich: Automatically Transforming Pre- to Post-Composed Phenotypes: EQ-lising HPO and MP
16:40-17:00  P. Schofield: The Mouse Pathology Ontology, MPATH; Structure and Applications
17:00-17:20  Coffee
17:20-18:15  Keynote on Special Topic: The Importance of Physiology in Translational Research (G. Gkoutos)
19:30  Dinner
FRIDAY, Sep 28, 2012
09:00-09:10  Coffee
Session 3: Special Topic: From Function to Physiology (Chair: H. Herre)
09:10-09:30  J. Röhl: Functions, Roles and Dispositions Revisited. A New Classification of Realizables
09:30-09:50  C. Knüpfer: Function of Bio-Models: Linking Structure to Behaviour
09:50-10:10  D. Cook: PhysioMaps of Physiological Processes and their Participants
10:10-10:30  B. de Bono: Tissue Motifs and Multi-Scale Transport Physiology
10:30-11:00  Coffee
Session 4: Methods and Tools (Chair: N.N.)
11:00-12:00  Keynote: Semantic Similarity in Biomedical Ontologies: Measurement, Assessment and Applications (F. Couto)
12:00-12:20  N. Grewe: Comparing Closely Related, Semantically Rich Ontologies: The GoodOD Similarity Evaluator
12:20-12:40  D. Schober: ZooAnimals.owl: A Didactically Sound Example-Ontology for Teaching Description Logics in OWL 2
12:40-14:00  Lunch
14:00-15:00  Open Discussion
Table of Contents

Keynote Abstracts
Semantic Similarity in Biomedical Ontologies: Measurement, Assessment and Applications
  Francisco M. Couto  (page viii)
The Importance of Physiology in Translational Research
  Georgios V. Gkoutos  (page viii)
GoPubMed: Semantic Search for the Life Sciences
  Michael Schroeder  (page viii)

Workshop Papers (according to program ordering; Paper ID, number of pages)

Ontologies in the Clinical Environment
Developing an Ontology for Sharing Biobank Data based on the BBMRI Minimum Data Set MIABIS
  Mathias Brochhausen, Martin N. Fransson, Mikael Eriksson, Roxana Merino-Martinez, Loreana Norlin, Sanela Kjellqvist (form. Kurtovic), Maria Anderberg, Umit Topaloglu, William R. Hogan and Jan-Eric Litton  (Paper A, 6 pages)
Enrichment Analysis Applied to Disease Prognosis
  Catia M. Machado, Ana T. Freitas and Francisco M. Couto  (Paper B, 5 pages)
Integration of the Anatomical Therapeutic Chemical Classification System and DrugBank using OWL and Text-Mining
  Samuel Croset, Robert Hoehndorf and Dietrich Rebholz-Schuhmann  (Paper C, 4 pages)

Special Topic: Representations of Phenotype and Pathology
Using Ontologies to Study Cell Transitions
  Ludger Jansen, Georg Fuellen, Ulf Leser and Andreas Kurtz  (Paper D, 5 pages)
Automatically Transforming Pre- to Post-Composed Phenotypes: EQ-lising HPO and MP
  Anika Oellrich, Christoph Grabmüller and Dietrich Rebholz-Schuhmann  (Paper E, 5 pages)
The Mouse Pathology Ontology, MPATH; Structure and Applications
  Paul N. Schofield, John P. Sundberg, Beth A. Sundberg, Colin McKerlie and Georgios V. Gkoutos  (Paper F, 7 pages)

Special Topic: From Function to Physiology
Functions, Roles and Dispositions Revisited. A New Classification of Realizables
  Johannes Röhl and Ludger Jansen  (Paper G, 6 pages)
Function of Bio-Models: Linking Structure to Behaviour
  Clemens Beckstein and Christian Knüpfer  (Paper H, 5 pages)
PhysioMaps of Physiological Processes and their Participants
  Daniel L. Cook, Maxwell L. Neal, Robert Hoehndorf, Georgios V. Gkoutos and John H. Gennari  (Paper I, 4 pages)
Tissue Motifs and Multi-Scale Transport Physiology
  Bernard de Bono, Paul Kasteleyn, Dome Potikanond, Natallia Kokash, Fons Verbeek and Pierre Grenon  (Paper J, 4 pages)

Methods and Tools
Comparing Closely Related, Semantically Rich Ontologies: The GoodOD Similarity Evaluator
  Niels Grewe, Daniel Schober and Martin Boeker  (Paper K, 6 pages)
ZooAnimals.owl: A Didactically Sound Example-Ontology for Teaching Description Logics in OWL 2
  Daniel Schober, Niels Grewe, Johannes Röhl and Martin Boeker  (Paper L, 5 pages)
Keynote Abstracts
Francisco M. Couto, University of Lisbon, Portugal
Semantic Similarity in Biomedical Ontologies: Measurement, Assessment and Applications
The analysis of complex biomedical entities and events, such as disease and epidemiological models,
is challenging due to their multiple domain features, and thus to accurately describe them we need
to use concepts from multiple biomedical ontologies, such as gene mutations, protein functions,
anatomical parts and phenotypes. The usefulness of ontological annotations to interlink and
interpret biomedical information is widely recognized, particularly for retrieving related information.
This relatedness can be captured by semantic similarity measures that return a numerical value
reflecting the closeness in meaning between two ontology concepts or two annotated entities. These
measures have been successfully applied to biomedical ontologies, particularly to the Gene Ontology,
for comparing proteins based on the similarity of their functions. This talk will discuss ongoing efforts to calculate semantic similarity, describe popular techniques used to assess their effectiveness, and present existing biomedical applications that use semantic similarity measures to improve their performance.
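The information-content family of measures referred to in this abstract can be illustrated with a minimal sketch. The toy hierarchy, annotation corpus and entity names below are invented for illustration only; the sketch shows the shape of a Resnik-style measure, not any specific tool from the talk.

```python
from math import log2

# Toy is-a hierarchy (child -> parents), standing in for a biomedical ontology.
TOY_ONTOLOGY = {
    "catalytic activity": ["molecular function"],
    "kinase activity": ["catalytic activity"],
    "phosphatase activity": ["catalytic activity"],
    "binding": ["molecular function"],
    "molecular function": [],
}

# Hypothetical annotation corpus: each entity is annotated with ontology terms.
ANNOTATIONS = {
    "protA": {"kinase activity"},
    "protB": {"phosphatase activity"},
    "protC": {"binding"},
    "protD": {"kinase activity"},
}

def ancestors(term):
    """Return the term plus all its ancestors (reflexive transitive closure)."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in TOY_ONTOLOGY[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def information_content(term):
    """IC(t) = -log2 p(t), where p(t) is the fraction of entities
    annotated with t or any descendant of t."""
    hits = sum(1 for terms in ANNOTATIONS.values()
               if any(term in ancestors(t) for t in terms))
    return -log2(hits / len(ANNOTATIONS))

def resnik(t1, t2):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(information_content(t) for t in common)
```

Terms sharing only the root have similarity 0, while siblings under a rarer ancestor score higher, which is the intuition behind comparing proteins by the similarity of their functional annotations.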
Georgios V. Gkoutos, Aberystwyth University, UK
The Importance of Physiology in Translational Research
The targeted mutation of animal models and the systematic study of phenotypes resulting from
these mutations have resulted in several significant breakthroughs over the recent years. Currently,
associations between genotype and phenotype in animal models are being recorded systematically
for multiple species. The aim is to consistently describe the phenotypes associated with mutations in
every function-bearing gene, reveal the genes' functions, the structure and dynamics of physiological pathways, as well as the pathobiology of disease in humans and other animals. A consistent
representation of physiology is one key challenge towards achieving such a goal. The major focus of
this presentation is to highlight the importance of a physiology representation in relation to recent
efforts to systematically compare phenotypes across species and translate insights from animal
model research into an understanding of human traits and disease.
Michael Schroeder, Technical University Dresden, Germany
GoPubMed: Semantic Search for the Life Sciences
In this talk I give an overview of recent work on ontology generation and semantic search, with applications in drug discovery and trend analysis.
Developing an Ontology for Sharing Biobank Data based on the
BBMRI Minimum Data Set MIABIS
Mathias Brochhausen*1, Martin N. Fransson2, Mikael Eriksson2, Roxana Merino-Martinez2, Loreana Norlin2, Sanela Kjellqvist¤2, Maria Anderberg2, Umit Topaloglu1, William R. Hogan1 and Jan-Eric Litton2
1 University of Arkansas for Medical Sciences, Little Rock, AR, USA
2 Karolinska Institutet, Stockholm, Sweden
¤ Formerly Sanela Kurtovic
ABSTRACT
Sharing data about heterogeneous collections of specimens
stored in biobanks is an important topic with respect to making optimal use of sparse resources. Based on a minimum
data set for biobank data sharing created by the European
Biobanking and Biomolecular Resources Research Infrastructure (BBMRI), we developed an ontology coded in
OWL2. This ontology provides shared semantics to guide
data collection and enables ontology-assisted querying of
the data. The development of this application ontology for
sharing data across biobanks follows the criteria laid out by
the OBO Foundry. The goal of using ontologies within
BBMRI is to ensure semantic data integration and to enable
reasoning over data.
1 INTRODUCTION
The move towards a universal information infrastructure for
biobanking is directly connected to the issues of semantic
interoperability through standardized message formats and
controlled terminologies. Several efforts have been made toward the integration of biobank and research data. Harmonization of biomedical data has been addressed in large collaborations, but most of these efforts have been carried out in a project-driven fashion, focused on the harmonization of the specific information needed by a particular project [1,2]. In contrast, one of the aims of the European Biobanking and Biomolecular Resources Research Infrastructure (BBMRI) is to provide the necessary formats to compare biobank information at different levels of detail [3]. Following the preparatory phase of BBMRI, several national BBMRI nodes were initiated with the same aim as the European BBMRI, but on a national scale.
In recent years, ontologies have become increasingly important to biological and biomedical research. The widespread adoption of Gene Ontology [4] is only one example
*
To whom correspondence should be addressed.
of usage of ontologies in the biomedical arena. Ontologies stand out from other semantic resources, such as terminologies and controlled vocabularies, by, among other characteristics, a logical structure that supports algorithmic processing and a focus on the empirical knowledge regarding the phenomena that are the basis of the data [5]. Considering current differences in biobank concepts among different countries and the inability to compare concepts across languages, an ontology is crucial for researchers looking for semantically comparable, international data sets. Moreover, an ontology that includes formal restrictions would help to overcome semantic underspecification, even in a single-language environment. The use of semantic annotation, such as axioms coded in OWL, will also facilitate the design and implementation of an informatics model for biobank data sharing. The implementation of this model in an informatics system will lead to the integration of biobank and research data and constitute a first step towards global knowledge discovery within the biobank domain.
One of our main use cases lies outside the BBMRI effort: aggregating and querying data from several independently operated biobanks at the University of Arkansas for Medical Sciences (UAMS) and the Arkansas Children’s Hospital Research Institute (ACHRI). UAMS has a Tissue Procurement Facility and several smaller individual research labs (e.g., the Myeloma Institute and the Spit for the Cure project) that collect and store specimens for research purposes. In addition, ACHRI has several independent labs similarly managing specimens, including the Center for Birth Defects Research, autism research, and the Women's Mental Health Program. Both institutions would like their collected specimens and annotated data to be used for research purposes while keeping the operations of each lab independent. Also, to facilitate finding specimens of interest, it is a requirement to integrate biobank data with electronic health record (EHR) data. In another concurrent effort,
UAMS has invested in creating an Enterprise Data Warehouse (EDW) to facilitate access to and integration of
Paper A
1
M. Brochhausen et al.
clinical, basic-science, and other data for research and quality
reporting. Informatics for Integrating Biology and the Bedside (i2b2) [6,7] is an open-source platform for retrieving
de-identified data from the UAMS EDW. i2b2 was designed
primarily for cohort identification, allowing users to perform queries to determine the existence of a set of patients
meeting certain inclusion or exclusion criteria. To ensure
semantic integration of data from numerous biobanks with
EHR data, an ontology is needed to which the data will be
mapped in the i2b2 Ontology Cell. The initial development
of OMIABIS as presented here will provide the core of this
ontology. The aim of the data integration is to allow queries over all data within the data warehouse, regardless of the initial data structure in the data source.
Because the management, the operations, and the data collected in the biobanks are distinct, it would be a challenge to manually map all of the biobanks’ data into a single i2b2 instance. Currently the biobanks use caTissue [8], an open-source biospecimen management tool. Despite the use of a single software application, integration of data is not guaranteed, because each biobank creates its own specimen annotation forms with different data elements. To ensure integration, we will incorporate OMIABIS into caTissue’s annotation forms for all UAMS/ACHRI biobanks and into the biobank administration data model. Then, the data in the separate caTissue instances for the biobanks can easily be extracted, transformed and loaded (ETL) into the EDW i2b2 instance, and queried with common semantics.
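The mapping step underlying such an ETL pipeline can be sketched as follows. The form field names and target terms below are hypothetical stand-ins, not actual caTissue fields or OMIABIS identifiers; the sketch only illustrates how records from differently structured annotation forms become queryable with common semantics.

```python
# Hypothetical field names from two caTissue annotation forms; the shared
# target terms are illustrative stand-ins, not actual OMIABIS identifiers.
FIELD_MAP = {
    "tissue_procurement": {"spec_type": "specimen type", "store_temp": "storage condition"},
    "myeloma_lab": {"sampleKind": "specimen type", "freezerTemp": "storage condition"},
}

def to_common_schema(biobank, record):
    """Rewrite one source record into the shared vocabulary so that
    records from different biobanks can be queried together."""
    mapping = FIELD_MAP[biobank]
    return {mapping[k]: v for k, v in record.items() if k in mapping}

rows = [
    to_common_schema("tissue_procurement", {"spec_type": "serum", "store_temp": "-80C"}),
    to_common_schema("myeloma_lab", {"sampleKind": "serum", "freezerTemp": "-20C", "localId": 7}),
]
# Both rows now answer the same query, e.g. all serum specimens:
serum = [r for r in rows if r["specimen type"] == "serum"]
```

After this normalization, a single query formulated against the shared terms retrieves matching specimens from both sources, which is the point of loading the data into one i2b2 instance with common semantics.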
We present the development of an ontology based on the
"Minimum Information About BIobank data Sharing"
(MIABIS) for sharing key biobank attributes focused on
administrative aspects of biobanks. This is a first logical
step towards open and stimulating research collaboration.
2 MATERIAL
As a first step towards harmonization of biobank data, a
minimum data set was proposed during the preparatory
phase of the European BBMRI in 2008-2010. The data set
consisted of twenty-one attributes describing information
pertaining to biobanks and related entities, i.e., individual
subjects, cases and samples. The BBMRI minimum data set
was further elaborated within the Swedish node BBMRI.se.
To avoid any legal or ethical issues associated with the lowest level of data, i.e., individual subjects, cases and samples,
this level was removed during development.
The updated version is called MIABIS (Minimum Information About BIobank data Sharing) and consists of fifty-two attributes considered important for establishing a system of data discovery for biobanks and sample collections.
The data set employs existing standards, e.g., the Sample PREanalytical Code (SPREC), ICD codes [http://apps.who.int/classifications/icd10/browse/2010/en] and definitions developed by the Public Population Project in Genomics (P3G) [http://www.p3g.org] and the International Society for Biological and Environmental Repositories (ISBER) [http://www.isber.org/]. MIABIS is being used
in a structured Scandinavian survey to gather information
about sample collections stored in biobanks in a searchable
database.
MIABIS was developed in the context of several use cases described by invited researchers. Two use cases employed during the development of MIABIS are described below, although the latter would require the inclusion of individual-level data, similar to the UAMS/ACHRI use case described previously:
1) Search for tissue samples from donors diagnosed with nemaline myopathy. Determine the age group. What are the sample storage conditions? Contact the biobank for detailed information about the biopsy samples and whether myoblast cell cultures have been grown from these samples.
2) Search for sample collections having at least 10 cases with tissue from the thoracic aorta as well as blood, serum, or plasma from the same donor. Also check whether clinical data, such as physical measurements, has been registered for the donors. Contact the person responsible for the sample collection to obtain detailed information on the specific kind of thoracic aorta biopsies of interest. Also ensure that the biopsies were performed within one week of the blood sampling.
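The second use case can be paraphrased as a query over collection records. The record structure and attribute names below are invented for illustration and are not MIABIS attributes; the sketch merely shows how the stated criteria (at least 10 qualifying cases, biopsy within one week of blood sampling) translate into checks.

```python
from datetime import date

# Invented toy records; attribute names are illustrative, not MIABIS attributes.
COLLECTIONS = [
    {"name": "CardioBank",
     "cases": [
         {"tissues": {"thoracic aorta", "blood"},
          "aorta_biopsy": date(2012, 3, 1), "blood_sample": date(2012, 3, 5),
          "clinical_data": True},
     ] * 12},
    {"name": "SkinBank",
     "cases": [{"tissues": {"skin"}, "clinical_data": False}] * 30},
]

def matches_use_case_2(collection, min_cases=10):
    """At least `min_cases` cases with thoracic aorta tissue plus blood,
    serum or plasma from the same donor."""
    good = [c for c in collection["cases"]
            if "thoracic aorta" in c["tissues"]
            and c["tissues"] & {"blood", "serum", "plasma"}]
    return len(good) >= min_cases

def biopsy_within_week(case):
    """Check the 'within one week of the blood sampling' constraint."""
    return abs((case["aorta_biopsy"] - case["blood_sample"]).days) <= 7

hits = [c["name"] for c in COLLECTIONS if matches_use_case_2(c)]
```

In a real discovery system the same criteria would be posed against ontology-annotated data rather than hand-built dictionaries, which is exactly what the shared semantics of OMIABIS is meant to enable.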
For the reasons mentioned above, personal data protection considerations are not addressed here. Furthermore, an ontology that supports security and privacy of data and user authorization, authentication, and permissions is beyond the scope of this work.
Our use case necessitates a semantically rich resource that allows the mapping of data from heterogeneous sources onto an ontology that acts as a global schema.
3 METHODS
Our aim is to provide a shared semantic schema as the foundation for MIABIS, to ensure semantic integration of biobank data. An additional benefit is that such a schema will guide data entry, unifying and regimenting language use across multiple researchers. To make the ontology easily accessible and implementable, we chose the Web Ontology Language (OWL) 2 [9] for implementation. To facilitate re-use and harmonization across ontologies, we used Basic Formal Ontology (BFO) [http://www.ifomis.org/bfo] as the upper ontology [5,10]. In addition, the entire ontology development followed the principles of best practice in ontology development as set forth by the OBO Foundry [11,12]. In doing so we took a first step towards collaboration with the Open Biological and Biomedical Ontologies group to ensure re-use and uptake of our efforts across the domain.
Re-use of preexisting ontologies is key among the OBO Foundry principles. In creating Ontologized MIABIS (OMIABIS), we imported the Proper Name Ontology (PNO) [http://purl.obolibrary.org/obo/iao/pno.owl] in its entirety. PNO is based on the Information Artifact Ontology (IAO) [http://purl.obolibrary.org/obo/iao.owl]. Thus, OMIABIS is an extension of IAO. In addition, multiple entities from other ontologies, namely the Ontology for Biomedical Investigations (OBI) [http://purl.obofoundry.org/obo/obi.owl] and the Ontology of Medically Relevant Social Entities (OMRSE) [http://code.google.com/p/omrse], are imported using a tool based on the MIREOT methodology [13], which was developed in a joint endeavor between the University of Arkansas for Medical Sciences and the University of Arkansas at Little Rock. We chose to re-use the ontologies mentioned above because they are members of the OBO Foundry and, thus, are built according to the same basic principles, partly extending the same upper ontology (BFO).
Our aim is to provide ontological representations that facilitate the integration of biobank data with biomedical research data. The latter is often annotated with terms from GO or OBI. Thus, choosing ontologies from the very same orthogonal ontology library (OBO Foundry) of which the latter are members seems to be the best strategy to cater to potential users.
Besides the use of the ontology as a means to achieve integration of pre-existing, heterogeneous data, we foresee the possibility of using the global schema to guide the development of new biobank data resources and to ensure the validity of data entry.
The first step in providing ontological representations of the
biobank domain as captured by MIABIS was to categorize
the MIABIS data attributes according to their basic ontological commitment as shown in Figure 1. From this analysis it
is obvious that OMIABIS needs to bring together two rather
different domains: the domain of administrative information
and the domain of biomedical research. This need is reflected by the imports specified in the previous paragraph.
Based on the multi-domain character of the data elements in
MIABIS it is obvious that OMIABIS is not a domain ontology, but brings together pre-existing classes from domain
ontologies to represent and formally define the MIABIS
attributes to provide semantic integration of biobank related
data.
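The MIREOT-style selective import mentioned above can be sketched roughly as follows, assuming a toy term-to-superclass table; real MIREOT operates on term IRIs in the source ontology rather than the invented labels used here.

```python
# Tiny stand-in for an external ontology: term -> direct superclass.
# Labels are invented for illustration; MIREOT itself works on term IRIs.
EXTERNAL = {
    "specimen": "material entity",
    "material entity": "entity",
    "organization": "material entity",
    "entity": None,
}

def mireot_extract(requested):
    """Import only the requested terms plus their superclass chain up to
    the root, rather than importing the whole source ontology."""
    module = {}
    for term in requested:
        while term is not None and term not in module:
            module[term] = EXTERNAL[term]
            term = EXTERNAL[term]
    return module

module = mireot_extract({"specimen"})
# module contains 'specimen' and its path to the root, but not 'organization'
```

The design choice is the one MIREOT makes explicit: a small extracted module keeps the importing ontology lightweight while preserving the position of each borrowed term in its source hierarchy.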
4 RESULTS
The latest release of OMIABIS in OWL can be downloaded from http://purl.obolibrary.org/obo/omiabis.owl. For the initial release our aim is to represent the first column of Figure 1, which comprises all classes and object properties closely related to biobanks and their administration. This initial step of development includes most of the innovations with regard to applied ontology, since we foresee that the following steps will rely heavily on previous representations by OBO Foundry ontologies. In order to represent the biobank-related attributes of MIABIS, we have to represent types of things that are not yet represented in OBO Foundry ontologies. OMIABIS includes a total of 249 classes and 64 object properties, of which 35 classes and object properties were newly created for the initial version of OMIABIS. A textual definition is given for all newly created classes and object properties.
Fig. 1: Categorizing MIABIS data attributes according to ontological commitments
The central class of OMIABIS is "biobank". Its textual definition is: "A biobank is a collection of samples of biological
substances (e.g. tissue, blood, DNA) which are linked to
data about the donors of the material. They have a dual nature as collections of samples and data." The definition is
derived from the definition for human biobank by [14]. The
class is formally restricted to be equivalent to¹:
"object_aggregate
 AND has_part SOME
   (object_aggregate
    AND (has_part ONLY (specimen AND participates_in SOME storage)))
 AND has_part SOME (material information bearer
   AND (participates_in SOME
     (digital curation
      AND (has_specified_output SOME
        (data set
         AND (is about SOME
           (object_aggregate
            AND (has_part ONLY specimen))))))))"²
¹ Classes are printed bold, object properties in italics and OPERATORS in all caps. Definitions of classes referred to here can be found in Tab. 1.
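As a rough procedural paraphrase of this equivalence axiom (with an invented record structure standing in for instance data; this is a sketch of the axiom's two conjuncts, not an OWL reasoner):

```python
def is_biobank(candidate):
    """Procedural paraphrase of the OWL equivalence: an aggregate with
    (1) a part that aggregates only stored specimens, and
    (2) a material information bearer whose digital curation outputs a
    data set about an aggregate of specimens."""
    specimen_part = any(
        part.get("kind") == "aggregate"
        and part.get("members")
        and all(m == "stored specimen" for m in part["members"])
        for part in candidate.get("parts", []))
    data_part = any(
        part.get("kind") == "information bearer"
        and part.get("curation_output_about") == "specimen aggregate"
        for part in candidate.get("parts", []))
    return specimen_part and data_part

# Toy instance satisfying both conjuncts of the equivalence:
sample_bank = {"parts": [
    {"kind": "aggregate", "members": ["stored specimen", "stored specimen"]},
    {"kind": "information bearer", "curation_output_about": "specimen aggregate"},
]}
```

The paraphrase makes the "dual nature" of the textual definition visible: a candidate lacking either the specimen aggregate or the curated data set about it does not qualify.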
Notably, OMIABIS, like MIABIS, differentiates between a biobank and the organization running a biobank. This is important because one organization can run multiple biobanks, each with a different focus and different management. Therefore, OMIABIS also represents "biobank organization". Its textual definition is: "A biobank organization is an organization bearing legal personality that owns or administrates a biobank". "Biobank organization" is equivalent to:
"organization
AND ((owns SOME biobank)
OR (administrates SOME biobank))
AND (bearer_of SOME legal person role)"
Referring to the class "legal person role" from OMRSE is necessary because the definition of organization in OBI does not refer to legal personality. Any group of human beings that has some organizational rules fulfills the textual definition according to OBI. However, for our use case the issue of legal personality is crucial. Therefore, we add this aspect to the definition and description of "biobank organization".
Table 1 lists all imported classes referred to by the definitions above, their textual definition and their source.
The formal description of biobank organization uses two object properties which have been specifically created for OMIABIS:
1. "owns"
Definition: "a owns b iff a is the bearer of the roles that concretize the claims and obligations regarding b."
Domain: Homo sapiens
OR organization
OR collection of humans
OR aggregate of organizations
Range: information content entity
OR material_entity
Characteristics: asymmetric
2. "administrates"
Definition: "a administrates b if c owns b and some
rights and obligations regarding b are transferred3 from c to
a."
Domain: Homo sapiens
OR organization
OR collection of humans
OR aggregate of organizations
Range: information content entity
OR material_entity
Characteristics: asymmetric
2 Note that this class description is based on object properties and classes from BFO, IAO and OBI.
3 The 'transfers' object property is represented in the Document Acts Ontology (d-acts): http://purl.obolibrary.org/iao/d-acts.owl
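The asymmetric characteristic declared for both properties can be illustrated with a simple consistency check; the organization and biobank names below are invented for illustration:

```python
# Hypothetical property assertions; each pair is (subject, object).
owns = {("Biobank Org A", "Biobank 1"), ("Biobank Org B", "Biobank 2")}

def violates_asymmetry(pairs):
    """An asymmetric property forbids owns(a, b) together with owns(b, a).
    An OWL reasoner would flag such data; here it is a plain check."""
    return [(a, b) for (a, b) in pairs if (b, a) in pairs]

consistent = violates_asymmetry(owns)       # [] for the assertions above
bad = owns | {("Biobank 1", "Biobank Org A")}
conflicts = violates_asymmetry(bad)         # reports both offending pairs
```

Declaring "owns" and "administrates" asymmetric thus lets a reasoner reject data in which a biobank is simultaneously recorded as owning its own owner.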
Class: object_aggregate
Definition: A material entity (snap:MaterialEntity) that is a mereological sum of separate object (snap:Object) entities and possesses non-connected boundaries.
Source: BFO

Class: specimen
Definition: A material entity that has the specimen role.
Source: OBI

Class: storage
Definition: A maintenance process by which material entities that are not actively metabolizing are placed in well identified locations and possibly under controlled environment in ad-hoc devices/structures in order to preserve and protect them from decay/alteration and maintain availability.
Source: OBI

Class: material information bearer
Definition: An information bearer is a material entity in which a concretization of an information content entity inheres.
Source: IAO

Class: digital curation
Definition: Digital curation is the process of establishing and developing long term repositories of digital assets for current and future reference by researchers, scientists, and historians, and scholars generally.
Source: OBI

Class: data set
Definition: A data item that is an aggregate of other data items of the same type that have something in common. Averages and distributions can be determined for data sets.
Source: IAO

Class: legal person role
Definition: A role borne by a human individual or by a collection of humans regarded as possessing rights and duties enforceable at law.
Source: OMRSE
Table 1: Definitions and Source of classes referred to in Section 4
Figure 2 depicts the central classes for representing biobank-related attributes from MIABIS. From Fig. 2 it becomes clear that the ontology allows the inference of facts that are not computable from the data attributes alone. The fact that the biobank contact person is a member of the biobank organization is one example. These inferences are based on class restrictions specified in the OWL file.
Regarding our use case at UAMS/ACHRI, class restrictions
for the classes within the domain, for instance the definition
of biobank given above, are a crucial requirement. They
allow automated comparison and integration of pre-existing
data structures. This is a crucial part of our plan to use
OMIABIS to facilitate entering biobank data into i2b2. Furthermore, once the entirety of MIABIS is represented in OMIABIS, the class restrictions will technically enable querying for the sex and age of donors, storage conditions (e.g., freezer) and the PI of the study for which the specimen was obtained.
Figure 2: Biobank, biobank organization and biobank contact person in
OMIABIS
4.1 Discussion
Taking into consideration the immediately biobank-related
attributes in MIABIS we only found one problematic case
for which we will need more information from the users of
MIABIS: biobank type. The possible values for this attribute are Pathology, Cytology, Gynecology, Obstetrics, Transfusion, Transplant, Clinical Chemistry, IVF, Bacteriology,
Virology. There are already strong indications from
MIABIS users that this list is not exhaustive. The possible
values for biobank type are under elaboration and will be
updated as time progresses. The rationale behind these choices is to allow the submitter of data to easily select something that seems plausible to them. However, the downside of this approach is a certain difficulty for end users in finding relevant biobanks and studies for their research. It is obvious that the currently proposed categories above are not
disjoint from each other. A specimen collection by virtue of
the type of specimens stored can be of interest to both
pathologists and virologists, or gynecologists and cytologists, and so on. In order to provide useful ontological representation of these classes we need users to specify which
characteristics of a biobank make it useful for which specialty of medicine or which research domain.
MIABIS is not concerned with data privacy protection issues due to its focus on sharing metadata about biobanks.
However, once sharing of biomaterial is considered these
issues need to be addressed. As far as ontology is concerned, this should be done in a separate ontology that can be modularized with OMIABIS to provide the necessary shared semantic backbone. For harmonizing and modularizing
ontologies, clear quality criteria for the development of the
ontologies in question are an asset.
From the perspective of the BBMRI effort, focusing on the descriptive level of biobank information, i.e., biobank and sample collection/study data, will yield two benefits: (1) it will avoid legal/ethical issues, and (2) it will make data discovery relatively easy compared to requesting data for each individual sample. Hence, MIABIS may
stand a better chance of being accepted globally. Further, it
is plausible that biobank information will continue to be
distributed; descriptive data may be distributed in an international setting, while the sample data may continue to be
stored locally at the biobanks. Thus, a system intended for
data discovery of biobank information should consider the
distributed information content. In this setting a semantic
web approach may be preferable over a query language for
structured databases.
For the UAMS/ACHRI use case we foresee that a more detailed ontology that technically enables sharing individual
donor and specimen data will likely be needed. For this effort OMIABIS provides the core of a set of modularized
ontologies to govern data exchange between the multiple
biobanks. The fact that MIABIS is focused on administrative data regarding biobanks makes OMIABIS a perfect
candidate for a central role in this endeavor.
Our next step is to cooperate with other biobank projects to
consolidate and expand OMIABIS working toward a domain ontology for biobanking. The developers will seek
contact with the OBI consortium to discuss having OMIABIS as a domain-specific extension of the Ontology for Biomedical Investigations. We plan to maintain and develop OMIABIS as an open-source artifact using Subversion and to tag versions as releases whenever we reach stability.
5 CONCLUSIONS
Even though the domain of biobanking has not been a focus of ontology development and ontology-driven computing, it is obvious that in creating an ontology for data sharing in
biobanking, one can build on numerous pre-existing ontologies. The difficulty is that the domain of biobanking touches
on multiple aspects of biomedical research, such as molecular biology, medicine, law, public health, cryobiology, and
physics. Bringing these multiple fields together in one ontological representation is a huge challenge. First steps were
taken in the initial development of OMIABIS, which builds
upon a multilateral effort to achieve data integration in biobanking from the BBMRI consortium.
ACKNOWLEDGMENTS
The work is partially funded by the Arkansas Biosciences
Institute, the major research component of the Arkansas
Tobacco Settlement Proceeds Act of 2000 and by award
number UL1TR000039 from the National Center for Advancing Translational Sciences (NCATS). The content is
solely the responsibility of the authors and does not necessarily represent the official views of NCATS or the National
Institutes of Health. We would like to thank the people involved in the European BBMRI preparatory phase, financially supported by the European Commission (grant
agreement 212111) and the Swedish Research Council for
granting the BBMRI.se project (grant agreement 829-20096285). The authors would also like to thank Josh Hanna and
three anonymous reviewers for their valuable comments.
REFERENCES
[1] Litton JE, Muilu J, Björklund A, Leinonen A, Pedersen NL. Data modeling and data communication in GenomEUtwin. Twin Res. 2003 Oct;6(5):383-90.
[2] Muilu J, Peltonen L, Litton JE. The federated database - a basis for biobank-based post-genome studies, integrating phenome and genome data from 600,000 twin pairs in Europe. Eur J Hum Genet. 2007 Jul;15(7):718-23.
[3] Yuille M, van Ommen GJ, Bréchot C, Cambon-Thomsen A, Dagher G, Landegren U, et al. Biobanking for Europe. Brief Bioinform. 2008 Jan;9(1):14-24.
[4] Bodenreider O, Stevens R. Bio-ontologies: current trends and future directions. Brief Bioinform. 2006;7(3):256-274.
[5] Smith B, Brochhausen M. Putting biomedical ontologies to work. In: Blobel B, Pharow P, Nerlich M, editors. eHealth: Combining Health Telematics, Telemedicine, Biomedical Engineering and Bioinformatics to the Edge - Global Experts Summit Textbook. Amsterdam: IOS Press; 2008. p. 135-40.
[6] Gainer V, Hackett K, Mendis M, Kuttan R, Pan W, Phillips L, Chueh H, Murphy SN. Using the i2b2 Hive for clinical discovery: an example. AMIA Annu Symp Proc. 2007:959. PMID: 18694059.
[7] Mendis M, Wattanasin N, Kuttan R, Pan W, Hackett K, Gainer V, Chueh H, Murphy SN. Integration of Hive and Cell software in the i2b2 architecture. AMIA Annu Symp Proc. 2007:1048. PMID: 18694146.
[8] London JW, Chatterjee D. Using the semantically interoperable biospecimen repository application, caTissue: end user deployment lessons learned. IEEE International Conference on BioInformatics and BioEngineering (BIBE), 2010.
[9] W3C [Internet]. Cambridge (MA), Sophia Antipolis, Tokyo: W3C; c2012 [cited 2012 Mar 9]. OWL 2 Web Ontology Language Document Overview [about 8 screens]. Available from: http://www.w3.org/TR/owl2-overview.
[10] Spear AD. Ontology for the Twenty First Century: An Introduction with Recommendations [Internet]. Saarbrücken; 2006 [cited 2012 Mar 9]. Available from: http://www.ifomis.org/bfo/documents/manual.pdf.
[11] The Open Biological and Biomedical Ontologies [Internet]. Berkeley (CA): Berkeley BOP [cited 2012 Mar 9]. OBO Foundry Principles [about 1 screen]. Available from: http://obofoundry.org/crit.shtml
[12] Smith B, Ashburner M, Rosse C, Bard J, Bug W, Ceusters W, et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol. 2007;25(11):1251-5.
[13] Courtot M, Gibson F, Lister A, Malone J, Schober D, Brinkman R, Ruttenberg A. MIREOT: the Minimum Information to Reference an External Ontology Term. Available from Nature Precedings: http://dx.doi.org/10.1038/npre.2009.3576.1 (2009).
[14] Deutscher Ethikrat. Human biobanks for research [Internet]. Berlin: 2010 [cited 2012 Aug 02]. Available from: http://www.ethikrat.org/themen/dateien/pdf/stellungnahme-humanbiobanken-fuer-die-forschung.pdf
Enrichment analysis applied to disease prognosis
Catia M Machado1,2*, Ana T Freitas2 and Francisco M Couto1
1 LaSIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal
2 Instituto de Engenharia de Sistemas e Computadores/Instituto Superior Técnico, Lisboa, Portugal
ABSTRACT
Enrichment analysis is normally used to identify relevant
biological features that can be used to describe a set of
genes under analysis that, for example, share a common
expression profile.
In this article we propose the exploitation of enrichment
analysis for a different purpose: the evaluation of a disease
prognosis. With this application of enrichment analysis we
expect to identify the clinical and biological features that best differentiate patients who suffered a specific disease event from those who did not. The features thus identified will be used to create patient profiles, which will in turn
be evaluated through similarity and supervised classification
approaches to predict the occurrence of the event.
This article presents the enrichment analysis methodology
proposed for a prognosis study, in which we use the disease
hypertrophic cardiomyopathy and its most severe manifestation, sudden cardiac death, as a case study.
1 INTRODUCTION
Enrichment analysis is normally used for the functional analysis of large lists of genes identified with high-throughput technologies such as expression microarrays. It
exploits the use of statistical methods over ontological gene
annotations to identify biological features that are represented in the gene set under analysis more than would be expected by chance. Such biological features are said to be
enriched, or overrepresented, and are then used to formulate
a biological interpretation of the gene set.
The ontology most commonly used in these analyses is the
Gene Ontology (Ashburner et al. 2000, Robinson and Bauer
2011, Zhang et al. 2010), although other resources such as
MeSH and KEGG are also explored (Leong and Kipling
2009). Strategies based on multiple vocabularies have also
been developed, namely in pharmacogenomics, including
the Human Disease Ontology and the Pharmacogenomics
Knowledge Base (Hoehndorf et al. 2012). LePendu et al. propose a method to generate annotations when using vocabularies other than the Gene Ontology, testing its feasibility with the Disease Ontology (LePendu et al. 2011).
In terms of statistical methods, the most commonly used is Fisher's exact test (Robinson and Bauer 2011, Huang et
al. 2009), with more recent implementations also using
Bayesian techniques (Bauer et al. 2010).
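The one-sided Fisher's exact test used in enrichment analysis reduces to the upper tail of the hypergeometric distribution, which can be computed directly from binomial coefficients. A minimal sketch (the numbers in the example are toy values, not from any dataset in this paper):

```python
from math import comb

def enrichment_pvalue(N, K, n, k):
    """One-sided Fisher's exact test (hypergeometric upper tail): the
    probability of observing >= k annotated entities in a study set of size n,
    drawn from a population of N entities of which K carry the annotation."""
    denom = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / denom

# Toy example: population of 6 genes, 3 annotated with some term;
# a study set of 2 genes, both annotated.
p = enrichment_pvalue(N=6, K=3, n=2, k=2)
print(round(p, 3))  # C(3,2)*C(3,0)/C(6,2) = 3/15 = 0.2
```

The smaller the p-value, the less likely the observed annotation rate in the study set is under random sampling, i.e. the stronger the evidence that the term is enriched.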
Enrichment analyses are normally divided into three categories: Singular Enrichment Analysis (SEA), Gene Set Enrichment Analysis (GSEA) and Modular Enrichment Analysis (MEA). SEA works with a user-selected gene set and
iteratively tests the enrichment of each individual ontology
concept in a linear mode. GSEA also evaluates the enrichment of ontology concepts individually, but considering all
the genes in the experiment and not just a user-selected gene
set. MEA works with a user-selected gene set, but incorporates into the analysis the relationships between concepts represented in the ontologies, thus evolving from a term-centric approach to a biological module-centric approach
(Huang et al. 2009).
Several tools have been developed that implement one or
more of these approaches. Examples of these tools are Ontoexpress (Khatri et al. 2002), GSEA (Subramanian et al.
2005), and GOToolBox (Martin et al. 2004) (a detailed list
of tools was collected by Huang et al. 2009).
In this work we propose to adapt the enrichment analysis to
develop a disease prognosis methodology, with the goal of
predicting if specific events may or may not occur in a given
patient. The enrichment analysis will be applied to identify
the set of clinical and genetic features that might assist us in
the differentiation of the patients for whom the event occurred from the patients for whom it did not. The identified
features will then be used to create profiles for the individual patients. In order to differentiate between the two sets of
patients, the profiles will be subjected to an evaluation step,
in which we will explore a similarity and a classification
approach. In the similarity approach, different semantic similarity measures (Pesquita et al. 2009) and a relatedness
measure (Ferreira and Couto 2011) will be tested to compare the profiles, followed by machine learning algorithms
such as clustering and nearest neighbors. In the classification approach, the patient profiles will be analyzed with
supervised classification algorithms such as random forests
(Breiman 2001) and Bayesian networks (Berner 2007) (see
Fig. 1).
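The similarity route can be sketched with a deliberately simple stand-in for the semantic similarity measures cited above: plain set overlap (Jaccard) between profiles, followed by a 1-nearest-neighbor prediction. All profile contents below are illustrative placeholders, not results from the study.

```python
# Sketch of the similarity approach; Jaccard overlap is a simple stand-in for
# the semantic similarity measures of Pesquita et al. Profiles are sets of
# variable/term pairs; all names and annotations here are made up.

def jaccard(profile_a, profile_b):
    a, b = set(profile_a), set(profile_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def nearest_neighbor_prognosis(new_profile, labeled_profiles):
    """labeled_profiles: list of (profile, event_occurred) pairs.
    Predict the label of the most similar known patient (1-NN)."""
    best = max(labeled_profiles, key=lambda item: jaccard(new_profile, item[0]))
    return best[1]

known = [({("sudden_death", "Non_Sudden_Cardiac_Death")}, False),
         ({("mutation", "GO:0060047"), ("blood_pressure", "Hypertension")}, True)]
print(nearest_neighbor_prognosis({("blood_pressure", "Hypertension")}, known))
# True -- the new patient overlaps most with the event-positive profile
```

In the actual methodology, semantic similarity measures would replace the Jaccard overlap, and clustering or k-NN would replace the single-neighbor lookup.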
* To whom correspondence should be addressed.
Paper B
C.M.Machado et al.
The datasets to be used in the implementation of this methodology were collected by biomedical experts in the context
of medical practice, and are thus characterized by a small
number of clinical features and a high number of missing
values, among other aspects. With this work, our purpose is
to evaluate if the application of an enrichment analysis to
this type of dataset can result in the extraction of relevant
knowledge from controlled vocabularies to improve the
quality of the dataset and, consequently, the quality of the
predictions made from it.
As a case study we will consider the disease hypertrophic
cardiomyopathy (HCM). This is a genetic disease that is the
most frequent cause of sudden cardiac death (SCD) among
apparently healthy young people and athletes (Maron et al.
2009, Alcalai et al. 2008). It is characterized by a variable
clinical presentation and onset, and there are approximately
900 mutations in more than 30 genes currently known to be
associated with it (Harvard Sarcomere Mutation Database).
Due to these characteristics, HCM is very difficult to diagnose. The prognosis is by no means easier, since the severity
of the disease varies even between direct relatives. It has
been observed that the presence of a given mutation can
correspond to a benign manifestation in one individual and
result in SCD in another (Maron et al. 2009, Alcalai et al.
2008).
Due to the importance of the prognosis of HCM in terms of SCD, this will be the event analyzed in our present study.
This work is currently under development, and in the rest of the article the focus will be on our proposed application of enrichment analysis to disease prognosis. In the following sections we present the dataset and the methodology. In the methodology section we begin by drawing a parallel between the application of this analysis in the context of gene expression data analysis and in the context of the prognosis methodology. Finally, we present how the enrichment analysis will be conducted with data from HCM patients, and how the patient profiles will be created from the results obtained.

Fig. 1. Schematic representation of the prognosis methodology (data from patients and ontologies → Enrichment Analysis → Profiles → Similarity Approach / Classification Approach → Prognosis). The methodology is composed of two units: the first (left side) receives as input data from patients mapped to biomedical ontologies (or controlled vocabularies in general). It will apply an enrichment analysis to identify a list of ontology terms considered to be enriched, which will be used to create profiles for the patients. These profiles will then be subjected to an evaluation step (the second unit, on the right side) that will result in the evaluation of the prognosis for individual patients. For the implementation of the second unit, we will explore a similarity and a classification approach.

2 DATASET
The data necessary for the diagnosis and the prognosis of HCM has been represented in a semantic data model, with mappings established between the concepts in the model and four controlled vocabularies: the National Cancer Institute Thesaurus (NCIt) (version 10.03) (Sioutos et al. 2007), the Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT) (version 2010_01_31) (SNOMED), the Gene Regulation Ontology (version 0.5, released on 04_20_2010) (Beisswanger et al. 2008) and the Sequence Ontology (released on 11_22_2011) (Eilbeck et al. 2005). A total of 85.8% of the clinical concepts represented in the model were mapped either to NCIt or to SNOMED-CT, in identical proportion (42.9%).
Table 1 contains all the clinical features to be used in the present work. With the exception of two of these features, Sporadic and Hypertrophy morphology, they are represented in the semantic model and have an established mapping with NCIt or SNOMED-CT.

Table 1. Clinical features considered in the enrichment analysis and their possible values.

Feature | Possible values
Cardioverter defibrillator | -1; 1*
Non-obstructive HCM | -1; 1*
Obstructive HCM | -1; 1*
Resuscitated sudden death | -1; 1*
Sudden death | -1; 1*
Non-sudden death | -1; 1*
Sudden death family history | -1; 1*
Familial | -1; 1*
Sporadic | -1; 1*
Blood pressure | normal; hypertension; hypotension
Gender | male; female
Age | 1; 2; 3; 4
Hypertrophy morphology | apical; centric; concentric

* The values -1 and 1 correspond respectively to the absence or the presence of the feature in the patient.
Familial and sporadic indicate whether the patient has the familial (hereditary) or the sporadic form of HCM.
The age values correspond to the following intervals, in years: (1) [0,20]; (2) ]20,40]; (3) ]40,60]; (4) >60.
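The age discretization given in the table footnote can be written directly as a small function (a sketch; the interval codes and left-open bounds follow the footnote above):

```python
# Age intervals from the table footnote: (1) [0,20]; (2) ]20,40];
# (3) ]40,60]; (4) >60 -- left-open, right-closed intervals.

def age_category(age_years):
    if age_years <= 20:
        return 1
    if age_years <= 40:
        return 2
    if age_years <= 60:
        return 3
    return 4

print([age_category(a) for a in (20, 21, 60, 61)])  # [1, 2, 3, 4]
```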
The genetic features are the mutations associated with the
disease, with possible values {-1,1}, i.e. absence or presence
of the mutation in the genome of the patient. The genes in
which each mutation occurs are currently being mapped to
the Gene Ontology.
Both clinical and genetic features have been previously collected for 80 patients from Portuguese hospitals and molecular biology research laboratories, for the evaluation of
associations between genetic and clinical factors. The clinical features presented in Table 1 are considered by the medical experts as the most relevant for the diagnosis and the prognosis of HCM, and were thus the only ones provided for our present study. Table 2 shows the percentage of patients that have a known value for each of the clinical features and for the total of 569 mutations tested.
Table 2. Percentage of patients that have a known value for each of the clinical features and for the total number of mutations.

Feature | Patients (%)
Cardioverter defibrillator | 96
Non-obstructive HCM | 36
Obstructive HCM | 36
Resuscitated sudden death | 96
Sudden death | 100
Non-sudden death | 100
Sudden death family history | 37
Familial | 96
Sporadic | 96
Blood pressure | 39
Gender | 96
Age | 60
Hypertrophy morphology | 96
Mutations | 76

3 METHODOLOGY
The first step in any enrichment analysis is the definition of the list of entities to be analyzed.
Considering the case of gene expression analysis, the complete list of genes under analysis is called the population set. As mentioned in the Introduction, GSEA receives this list as input. However, both SEA and MEA require two sets of genes as input: a user-selected gene set, which is called the study set and is a subset of the population set; and the population set itself. The criterion used to select the study set can be (and normally is) the level of expression of the genes in the biological setting under analysis, meaning that the study set will be the set of genes that are considered to be over- and/or under-expressed. The evaluation of the existence of enriched ontology terms is then made for the study set with respect to the entire population set. This means that for an annotated term to be considered enriched, its annotation rate has to be higher in the study set than in the population set.
Considering the application of the enrichment analysis to the prognosis of disease-related events, the population set is the complete list of patients with the disease. Since we are interested in obtaining a list of enriched ontology terms for the set of patients for whom the event occurred and also for the set of patients for whom it did not, each set will in turn be considered the study set. Fig. 2 shows an illustrative representation of the population and study sets in a gene expression experiment, and their counterparts in the prognosis analysis.

Fig. 2. Population set and study set in (A) a gene expression analysis and (B) the prognosis of disease-related events. In this example, the population set in A is composed of 6 genes, g1 to g6, and the study set of 2 genes, g3 and g5. Correspondingly, the population and study sets in B are composed of 6 and 2 patients, respectively.

3.1 Definition of patient profiles
Our aim is to define the patient profiles based on the result
of individual enrichment analyses performed with different
controlled vocabularies.
In order to assess the feasibility of this methodology, we
will begin by performing analyses with the Gene Ontology
and the NCIt.
Considering our case study of SCD occurrence in HCM
patients, we intend to evaluate the existence of ontology
terms that can assist us in separating patients with SCD
from patients without SCD.
When performing the analysis with the Gene Ontology, the
terms whose enrichment will be evaluated depend on the
mutations the patients have. Firstly, the list of mutations that
all the patients in the study set have (e.g. patients with SCD,
with mutation value =1) is compiled; secondly, the list of
non-redundant mutated genes is retrieved from the list of
mutations; finally, the list of Gene Ontology terms used to
annotate the mutated genes is retrieved. The terms annotated
to the patients in the rest of the population are retrieved in
the same manner. The frequency of occurrence of the annotations is then calculated based on the patients, i.e., how
many patients in the study set and the population set are
annotated with the term. For each term, a patient can only be
counted once, even if he/she has more than one mutation
through which the term can be identified.
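The Gene Ontology step described above can be sketched as follows. All identifiers are made-up placeholders (the mutation names, the gene symbols and the "GO:A"/"GO:B" terms are not from the study); the point is the mutation → gene → term chain and the rule that each patient is counted at most once per term.

```python
# Sketch of the GO annotation-counting step, with placeholder identifiers.
# Mutations map to genes, genes to GO terms; a patient contributes at most
# one count per term, however many of their mutations lead to it.

mutation_to_gene = {"m1": "MYH7", "m2": "MYH7", "m3": "MYBPC3"}
gene_to_go_terms = {"MYH7": {"GO:A", "GO:B"}, "MYBPC3": {"GO:B"}}

def patient_term_counts(patients_mutations):
    """patients_mutations: {patient: set of mutations with value 1}.
    Returns {GO term: number of patients annotated with it}."""
    counts = {}
    for patient, mutations in patients_mutations.items():
        terms = set()                         # the patient's non-redundant terms
        for m in mutations:
            terms |= gene_to_go_terms[mutation_to_gene[m]]
        for t in terms:                       # count each patient once per term
            counts[t] = counts.get(t, 0) + 1
    return counts

study_set = {"p1": {"m1", "m2"}, "p2": {"m3"}}   # e.g. the patients with SCD
print(sorted(patient_term_counts(study_set).items()))
# [('GO:A', 1), ('GO:B', 2)] -- p1's two mutations hit the same gene
```

The same function applied to the rest of the population yields the counts needed to test each term's enrichment in the study set.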
When performing the analysis with NCIt, the terms whose enrichment will be evaluated depend on the values of the clinical features. The features with possible values {-1,1} will be considered if they have a value equal to 1 (thus being present in the patient); the categorical features will all be considered, except when there are no known values for any of the patients in the set. The terms annotated to the
features are retrieved based on the mappings already defined
between them and the NCIt. The following two features
exemplify the procedure for boolean and categorical variables, respectively:
• Non sudden death: when value =1, retrieve and use the
term Non_Sudden_Cardiac_Death (and its parent
terms).
• Blood pressure: when value =hypertension, retrieve and
use the term Hypertension (and its parent terms).
The frequency of occurrence of the annotations is calculated
as before, i.e., how many patients in the study set and the
population set are annotated with the term.
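The feature-to-term retrieval for NCIt can be sketched as a lookup table keyed by feature and value. The term Non_Sudden_Cardiac_Death and Hypertension come from the bullet examples above; the parent terms "Death" and "Blood_Pressure_Finding" are illustrative placeholders, not the actual NCIt hierarchy.

```python
# Sketch of the NCIt retrieval step. The parent terms "Death" and
# "Blood_Pressure_Finding" are placeholders, not the real NCIt parents.

feature_to_terms = {
    ("non_sudden_death", 1): ["Non_Sudden_Cardiac_Death", "Death"],
    ("blood_pressure", "hypertension"): ["Hypertension", "Blood_Pressure_Finding"],
}

def terms_for_patient(features):
    """features: {feature name: value}. Returns the NCIt terms (with their
    parents) retrieved for boolean features equal to 1 and for known
    categorical values; unknown feature/value pairs contribute nothing."""
    terms = []
    for name, value in features.items():
        terms.extend(feature_to_terms.get((name, value), []))
    return terms

print(terms_for_patient({"non_sudden_death": 1, "blood_pressure": "hypertension"}))
# ['Non_Sudden_Cardiac_Death', 'Death', 'Hypertension', 'Blood_Pressure_Finding']
```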
We will test both SEA and MEA approaches. Since GSEA
produces a list of enriched terms for the entire set of entities,
it is not as interesting for our study as the other two.
The lists of enriched terms that result from the analysis with
each controlled vocabulary will be compiled and used as a
template-profile for the respective set of patients (e.g. with
SCD). The individual profiles will be defined as follows: for each patient and each ontology term, it is checked whether the patient is annotated with the term; if so, a pair variable/term is created for that patient. The complete set of pairs
variable/term thus obtained is the profile for that specific
patient.
The pairs variable/term will substitute the original variables
in the second unit of the prognosis methodology (Fig. 1).
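The profile construction just described can be sketched as a filter of the template against each patient's annotations. The template and annotation names below are illustrative placeholders:

```python
# Sketch of profile construction: keep the variable/term pairs from the
# enriched template that the patient actually carries. Names are placeholders.

enriched_template = [("blood_pressure", "Hypertension"),
                     ("mutation", "GO:B")]

def build_profile(patient_annotations, template=enriched_template):
    """patient_annotations: {variable: set of terms annotated to the patient}.
    Returns the patient's profile as a list of variable/term pairs."""
    return [(var, term) for var, term in template
            if term in patient_annotations.get(var, set())]

profile = build_profile({"blood_pressure": {"Hypertension"}, "mutation": {"GO:A"}})
print(profile)  # [('blood_pressure', 'Hypertension')]
```

These pair lists are exactly the representation consumed by the similarity and classification approaches of the second unit.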
In this study we include in the group of patients with SCD
both patients who died due to a sudden cardiac arrest and patients who suffered at least one resuscitated sudden cardiac arrest (and who may be either alive or dead). The group of patients without SCD includes all the other patients.
4 DISCUSSION AND CONCLUSIONS
In this article we present a novel prognosis prediction methodology based on an enrichment analysis. This type of analysis is normally used in contexts such as gene expression
analysis for the identification of functional annotations that
might be used to explain the differences in expression. Here
we propose to use enrichment analysis for the identification
of ontology terms that might be used to explain the differences between the group of patients for whom a given disease event occurred and the group of patients for whom it
did not occur. The ontology terms considered to be enriched
will assist in the creation of profiles for individual patients.
These profiles will then be used to evaluate, for new patients, whether the event is likely to occur.
An important aspect of the present analysis is the dataset: it
contains data from patients, and was collected in the context
of their medical evaluation. As such, it reflects two important aspects of the nature of clinical records: only the
information deemed relevant by the medical experts is present; not all of the information is available for all of the patients.
Our interest is precisely in evaluating if it is feasible to extract relevant knowledge from controlled vocabularies that
can enrich the dataset, and thus allow its exploitation with
data mining algorithms.
In a first approach, we will test only two vocabularies: the Gene Ontology and the NCIt. Although this means that some of the features will not be considered due to the absence of annotations, we expect to be able to evaluate the applicability of the methodology.
The data under analysis in this study has been provided by
several Portuguese institutions, including hospitals and molecular biology research laboratories.
ACKNOWLEDGEMENTS
This work was supported by the FCT through the Multiannual Funding Program, the doctoral grant SFRH/BD/65257/2009 and the SOMER project (PTDC/EIA-EIA/119119/2010).
The authors would like to thank Alexandra R. Fernandes, Susana Santos and Dr. Nuno Cardim for collecting and providing the dataset.
REFERENCES
Ashburner,M., Ball,C.A., Blake,J.A., Botstein,D., Butler,H., Cherry,J.M., Davis,A.P., Dolinski,K., Dwight,S.S., Eppig,J.T., Harris,M.A., Hill,D.P., Issel-Tarver,L., Kasarskis,A., Lewis,S.,
Matese,J.C., Richardson,J.E., Ringwald,M., Rubin,G.M. and
Sherlock,G. (2000) Gene Ontology: tool for the unification of
biology. Nat. Genet., 25, 25–29.
Robinson,P.N. and Bauer,S. (2011) Introduction to Bio-Ontologies. (Chapter 8) CRC Press, Taylor & Francis Group.
Zhang,S., Cao,J., Kong,Y.M. and Scheuermann,R.H. (2010) GOBayes: Gene Ontology-based overrepresentation analysis using
a Bayesian approach. Bioinformatics, 26(7), 905-911.
Leong,H.S. and Kipling,D. (2009) Text-based over-representation
analysis of microarray gene lists with annotation bias. Nucleic
Acids Res. 37(11), e79.
Hoehndorf,R., Dumontier,M., Gkoutos,G.V. (2012) Identifying
aberrant pathways through integrated analysis of knowledge in
pharmacogenomics. Bioinformatics, 28(16), 2169-75.
LePendu,P., Musen,M.A., Shah, N.H. (2011) Enabling enrichment
analysis with the Human Disease Ontology. J. Biomed. Inform.
44(Suppl 1), S31-8.
Huang,D.W., Sherman,B.T. and Lempicki,R.A. (2009) Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37(1), 1-13.
Bauer,S., Gagneur,J. and Robinson,P.N. (2010) GOing Bayesian:
model-based gene set analysis of genome-scale data. Nucleic
Acids Res. 38(11), 3523-3532.
Paper B
Enrichment analysis applied to disease prognosis
Khatri,P., Draghici,S., Ostermeier,G.C. and Krawetz,S.A. (2002)
Profiling gene expression using onto-express. Genomics 79,
266–270.
Subramanian,A., Tamayo,P., Mootha,V.K., Mukherjee,S.,
Ebert,B.L., Gillette,M.A.,
Paulovich,A., Pomeroy,S.L.,
Golub,T.R., Lander,E.S. and Mesirov,J.P. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci.
USA 102, 15545–15550.
Martin,D., Brun,C., Remy,E., Mouren,P., Thieffry,D. and Jacq,B.
(2004) GOToolBox: functional analysis of gene datasets based
on Gene Ontology. Genome Biol. 5, R101.
Pesquita,C., Faria,D., Falcão,A.O., Lord,P. and Couto,F.M. (2009)
Semantic Similarity in Biomedical Ontologies. PLoS Comput.
Biol. 5(7): e1000443.
Ferreira,J. and Couto,F. (2011) Generic semantic relatedness
measure for biomedical ontologies. International Conference on
Biomedical Ontologies (ICBO), 2011.
Breiman,L. (2001). Random forests. Mach. Learn. 45, 5–32.
Berner,E.S. (Editor) (2007) Clinical decision support systems:
theory and practice. (Chapter 3) Health Informatics Series, 2nd
Edition.
Maron,B.J., Maron,M.S., Wigle,E.D. and Braunwald,E. (2009)
The 50-Year History, Controversy, and Clinical Implications of
Left Ventricular Outflow Tract Obstruction in Hypertrophic
Cardiomyopathy: from Idiopathic Hypertrophic Subaortic Stenosis to Hypertrophic Cardiomyopathy. J. Am. Coll. Cardiol.
54, 191-200.
Alcalai,R., Seidman,J.G. and Seidman,C.E. (2008) Genetic Basis
of Hypertrophic Cardiomyopathy: from Bench to the Clinics. J.
Cardiovasc. Electrophysiol. 19, 104-110.
Harvard Sarcomere Mutation Database: http://genepath.med.harvard.edu/~seidman/cg3/
Sioutos,N., Coronado,S., Haber,M.W., Hartel,F.W., Shaiu,W.L.
and Wright,L.W. (2007) NCI Thesaurus: a Semantic Model Integrating Cancer-Related Clinical and Molecular Information. J.
Biomed. Inform. 40, 30-43.
Systematized Nomenclature of Medicine-Clinical Terms
(SNOMED), http://www.ihtsdo.org/snomed-ct/
Beisswanger,E., Lee,V., Kim,J., Rebholz-Schuhmann,D., Splendiani,A., Dameron,O., Schulz,S. and Hahn,U. (2008) Gene Regulation Ontology (GRO): Design Principles and Use Cases. Stud. Health Technol. Inform. 136, 9–14.
Eilbeck,K., Lewis,S.E., Mungall,C.J., Yandell,M., Stein,L., Durbin,R., Ashburner,M. (2005) The Sequence Ontology: a Tool
for the Unification of Genome Annotations. Genome Biol. 6,
R44
Machado,C.M., Couto,F.M., Fernandes,A.R., Santos,S. and
Freitas,A.T. (2012) Toward a translational medicine approach
for hypertrophic cardiomyopathy. International Conference on
Information Technology in Bio- and Medical Informatics
(ITBAM), 2012.
Integration of the Anatomical Therapeutic Chemical Classification
System and DrugBank using OWL and text-mining
Samuel Croset 1*, Robert Hoehndorf 2, Dietrich Rebholz-Schuhmann 1
1 European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SD, UK
2 Department of Genetics, University of Cambridge, Downing Street, Cambridge, CB2 3EH, UK
*Email: croset@ebi.ac.uk
ABSTRACT
The field of drug discovery has in recent years experienced a growth in the number of computational methods used to predict new biological actions for chemical compounds. In order to evaluate predictions and to gain insights into the potential usage of drugs, the Anatomical Therapeutic Chemical Classification System (ATC) serves as an internationally accepted gold standard. However, this classification was not initially developed for this task and therefore lacks connections with other biological databases. In order to maximize its interoperability with DrugBank, a major provider of drug knowledge, we increased the number of current mappings to the ATC by over 13%. We then converted the classification into a Web Ontology Language (OWL) representation, in order to allow queries and exploration of its content via a reasoning engine. The work is accessible via a web interface (www.ebi.ac.uk/Rebholz-srv/atc) to assist future drug discovery and repurposing initiatives.
1 INTRODUCTION
Since the early seventies, the Anatomical Therapeutic
Chemical Classification System (ATC) has provided a
standard and international taxonomy used to classify and
compare therapeutic compounds (Ivanova et al., 2010). Among others, the classification helps to generate metrics for comparing drug usage between countries or health care settings. The ATC features five levels of taxonomic hierarchy, from broader concepts to more specific ones. The first level indicates the anatomical location at which the drug acts. The second to fourth levels describe the pharmacological or chemical groups to which the drug belongs. The fifth level represents the chemical substance itself. As the therapeutic categories of the classified drugs are available from the ATC, this classification is
also used as a gold standard for drug indications and mode
of action by databases such as DrugBank (Knox et al., 2011)
or PharmGKB (Hewett et al., 2002). Even though this usage
of the ATC differs from the original one, it seems likely that
it will increase in importance as a means to analyze structured and normalized drug pharmacological information.
For instance, computational drug discovery methods and
drug repurposing attempts already evaluate the predictions
made against the ATC (Iorio et al., 2010; Campillos et al.,
2008; Gottlieb et al., 2011; Tatonetti et al., 2012). In order to
improve the classification for this task and to enable its interoperability with other types of biological data, it is necessary to map the therapeutic compounds described within the
ATC to entries of drug-related databases. Among these,
DrugBank provides a central hub and broad connectivity
with other biological data providers such as Uniprot or
Chembl (Knox et al., 2011). The chemical compounds present within DrugBank are already partially curated and
linked to ATC entries, but some drugs are not annotated or
they are mapped to ATC categories that have since become obsolete. In order to improve the interoperability between the
ATC and DrugBank, we have used a text-mining tool to
generate new mappings not previously referenced. Moreover, the taxonomic structure of the ATC enables its representation with the Web Ontology Language (OWL), which
facilitates current and future integration and allows querying
of the data in an automated way. Our work extends the version already available (Hoehndorf et al., 2012) by providing
a querying and browsing application as well as a thorough
mapping to DrugBank. The outcome of the work is compatible with standard Semantic Web tools and libraries and
should facilitate future evaluations and analyses done
around drug pharmacology.
2 MATERIALS AND METHODS
The integration of DrugBank with the ATC is decomposed into three steps: conversion of the repositories into Java objects, analysis of the current interoperability between the resources, and conversion of the integrated content into an OWL representation. The source code is freely available at
https://github.com/loopasam/ATCExplorer.
Paper C
S.Croset et al.
2.1 Parsing of the original data
DrugBank has been downloaded from http://www.drugbank.ca/downloads (June 10, 2012), parsed and converted into Java objects for easier handling. The ATC has been purchased in ASCII format (January 2012) from http://www.whocc.no/atc_ddd_publications/order/ and also converted into Java objects.
2.2 Mapping of DrugBank compounds to ATC therapeutic categories
A dictionary of DrugBank compounds has been created; it contains information about the drugs' synonyms. The synonyms are strings of characters or numerical codes describing the different brand names of the drug, as well as the CAS numbers, as they appear in DrugBank. The labels of the ATC entries describing drugs (fifth level) have been compared to each entry of the dictionary using the Java library LingPipe 4.1.0. A mapping was considered to be true when the dictionary string matched the label string exactly, ignoring letter case. For each true mapping, the DrugBank identifier of the synonym was mapped to the corresponding ATC therapeutic category.
2.3 OWL representation
The categories of the ATC and the DrugBank entries have been converted into OWL classes and merged. The taxonomic structure of the classification has been preserved using SubClassOf axioms between child and parent categories. Aside from these parenthood assertions, no other OWL constructs are necessary to capture the original ATC hierarchy. The mappings from DrugBank compounds to their respective ATC therapeutic categories have also been captured with OWL SubClassOf axioms, e.g.:
atc:B01AE02 (label: Lepirudin)
owl:SubClassOf
drugbank:DB00001 (label: Lepirudin)
The OWL conversion and the question-answering over the integrated dataset have been done via the OWL API (version 3.2.4) and the Elk reasoner (version 0.2.0 for the OWL API). The data integration and reasoning have been performed on a laptop computer featuring an Intel Core i7 CPU and 8 GB of memory. The web application is built on top of the Play! framework (version 1.2.4) and is freely available at www.ebi.ac.uk/Rebholz-srv/atc.
3 RESULTS
The ATC and DrugBank both refer to therapeutic compounds using their own identification systems. However, some of these drugs are structurally identical in the two datasets, and a mapping between the two resources enables better interoperability. Some DrugBank compounds have already been mapped to ATC categories by DrugBank's curation team, but some erroneous or missing information is still present within the repository. We have analyzed this discrepancy by text-mining and suggested a list of new mappings not previously referenced. The integrated datasets have then been converted into an OWL representation, thereby enabling semantic querying.
3.1 Interoperability between the ATC and DrugBank
Figure 1: Comparison of mappings between the ATC and DrugBank compounds. Set A shows the curated mappings only found in DrugBank. Set C represents the new mappings found by text-mining. Set B is the intersection of sets A and C, namely the mappings already present within DrugBank and confirmed by text-mining. Set D shows the DrugBank mappings to obsolete ATC categories.
We have first compared the distribution of mappings already described in DrugBank with the text-mining predictions. In the context of this analysis, we define 'mapping' as follows: a mapping is an association between a DrugBank entry and an ATC category. DrugBank compounds can be mapped to more than one ATC category, and an ATC category can encompass more than one drug. Figure 1 summarizes the distributions. Most of the mappings already present in DrugBank have been confirmed by
text-mining (1572 mappings, Figure 1 – Set B), showing an
agreement between the two methodologies. Some of the mappings are difficult to identify by text-mining alone, as they correspond to ATC categories with generic or broad names, such as 'various' (atc:M02AX10), or present some lexical variation. Such mappings belong to Set A of Figure 1. We found 272 new mappings by text-mining (Set C) not previously indexed by DrugBank. No
evaluation has been done over the text-mining predictions,
but the list of new mappings has been handed over to the
DrugBank curation team for inclusion in the database after
internal manual verification. Some DrugBank compounds (8 – Set D) are linked to ATC categories that no longer exist. This erroneous information will be removed from DrugBank by its curation team. This straightforward text-mining approach (exact matching on words) improved the interoperability between the two datasets by increasing the number of asserted relations by 13%. The new information will be added to the main DrugBank database, to the benefit of the community.
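As a rough illustration of the matching step described in Section 2.2 (which was implemented in Java with LingPipe), the following Python sketch performs the same exact, case-insensitive lookup; all drug entries and labels below are invented examples, not actual DrugBank or ATC content.

```python
# Sketch of the Section 2.2 mapping step: a DrugBank synonym dictionary is
# matched case-insensitively and exactly against fifth-level ATC labels.
# The entries below are illustrative, not real DrugBank/ATC records.

def build_dictionary(drugs):
    """Map lower-cased synonyms (brand names, CAS numbers) to DrugBank ids."""
    index = {}
    for drugbank_id, synonyms in drugs.items():
        for synonym in synonyms:
            index[synonym.lower()] = drugbank_id
    return index

def map_to_atc(atc_labels, index):
    """Return {atc_code: drugbank_id} for ATC labels with an exact match."""
    mappings = {}
    for atc_code, label in atc_labels.items():
        drugbank_id = index.get(label.lower())
        if drugbank_id is not None:
            mappings[atc_code] = drugbank_id
    return mappings

drugs = {
    "DB00001": ["Lepirudin", "138068-37-8"],  # synonym list incl. a CAS number
    "DB00443": ["Betamethasone"],
}
atc_labels = {
    "B01AE02": "lepirudin",       # matches despite different capitalization
    "A07EA04": "Betamethasone",
    "M02AX10": "Various",         # generic label with no DrugBank counterpart
}

print(map_to_atc(atc_labels, build_dictionary(drugs)))
```

Generic category names such as 'various' stay unmapped under this scheme, which is exactly the Set A behaviour discussed above.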
3.2 Question-answering over the integrated data
The taxonomic structure of the ATC supports a representation in OWL, which allows it to benefit from Semantic Web tools, particularly automated reasoning engines. Reasoners
can classify and help to check for consistency among the
integrated domain knowledge. They are also used for question-answering over ontologies. In order to achieve such a
task with our integrated dataset, we have converted the 5717 classes of the ATC, together with their 1503 corresponding DrugBank mappings, into OWL classes (giving a total of 7220 classes). The OWL ATC representation contains only subclass axioms; it therefore falls within the OWL 2 EL profile, permitting the use of the Elk reasoner for time-efficient
question-answering. Elk classifies the ontology in 3.01 seconds and can reclassify the ontology in less than a second
with our settings (data not shown). This feature enables its
use as a backend engine on a web server, for answering live
queries from users. A web form accepts OWL expressions
in Manchester syntax and retrieves the sub and super classes
of the expression definitions. Figure 2 shows for instance
the subclasses satisfying the OWL expression “G01AA
and DrugBankCompound” as it appears on the web
application.
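Because the ontology contains only SubClassOf axioms, answering an expression such as "G01AA and DrugBankCompound" amounts to intersecting transitive subclass sets. The Python below is a minimal emulation of that behaviour with invented class names; it is not the actual OWL API/Elk pipeline.

```python
# Toy emulation of subclass-only question-answering: the subclasses
# satisfying "G01AA and DrugBankCompound" are the intersection of the
# transitive subclasses of both operands. Class names are illustrative.

from collections import defaultdict

subclass_of = [  # (child, parent) SubClassOf axioms, following the paper's
    ("G01AA01", "G01AA"),             # modeling: ATC class is a subclass of
    ("G01AA02", "G01AA"),             # its mapped DrugBank class
    ("G01AA01", "DB_nystatin"),
    ("DB_nystatin", "DrugBankCompound"),
    ("J02AA01", "DB_nystatin"),       # hypothetical second usage of the drug
]

def subclasses(root):
    """All direct and indirect subclasses of `root` (transitive closure)."""
    children = defaultdict(set)
    for child, parent in subclass_of:
        children[parent].add(child)
    result, todo = set(), [root]
    while todo:
        node = todo.pop()
        for child in children[node]:
            if child not in result:
                result.add(child)
                todo.append(child)
    return result

def query_and(*roots):
    """Subclasses satisfying the intersection of all class expressions."""
    return set.intersection(*(subclasses(r) for r in roots))

print(query_and("G01AA", "DrugBankCompound"))
```

A real EL reasoner computes the same kind of closure, but incrementally and over the full 7220-class ontology.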
4 DISCUSSION
The modeling of domain knowledge in OWL is constrained
by the semantics and the axioms of the language. As the
ATC was originally designed as a classification to capture
statistics about drug metrics, the intended relations between
categories and sub-categories are not formally described.
However, the taxonomic structure of the ATC can be interpreted as mathematical sets and subsets, which can themselves be transcribed as OWL classes and subclass relationships. The native semantics of the relations is not perfectly captured by OWL, but the representation nevertheless remains intuitive and serves well the purpose of exploring and querying the ATC.
DrugBank OWL classes have been asserted as super classes
of their corresponding ATC classes, e.g.:
atc:B01AE02 (label: Lepirudin)
owl:SubClassOf
drugbank:DB00001 (label: Lepirudin)
Figure 2: Screenshot of the web interface used to explore and
query the ATC classification.
An OWL equivalent class axiom seems at first more
appropriate to capture this relationship. However, the ATC
therapeutic categories capture the usage of the drug with regard to its anatomical effect on the body. Therefore, some
drugs have multiple ATC codes and are present several
times in the ATC in order to represent the possible different
usages of the chemical. For instance, the bio-activity of Betamethasone is described by the ATC codes
atc:A07EA04 (label: Betamethasone) and
atc:C05AA05 (label: Betamethasone). This
compound is also described in DrugBank, but with a unique
accession identifier drugbank:DB00443 (label:
Betamethasone), representing the compound as a generic artifact, without the usage information attached to it as
in the ATC. An OWL equivalent class assertion
would wrongly result in an equivalence between the ATC
categories atc:A07EA04 and atc:C05AA05. The discrepancy between the different meanings in the two repositories, while they refer to the same chemical compounds (structure-wise), is resolved by the subclass axioms. A compound as described in DrugBank is considered more generic than the same compound as described in the ATC.
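The argument against equivalence axioms can be made concrete with a small sketch: declaring both ATC usage categories equivalent to the single DrugBank class forces them into one equivalence class. The Python below applies a naive closure over toy axioms; it illustrates the unwanted inference, and is not an actual OWL reasoner.

```python
# Why SubClassOf rather than EquivalentClasses: under equivalence, the two
# ATC usage categories of Betamethasone would be inferred equivalent to each
# other via the single DrugBank class. Identifiers follow the example above.

def equivalence_closure(pairs):
    """Naively merge classes declared equivalent into groups."""
    groups = []
    for a, b in pairs:
        merged = {a, b}
        rest = []
        for g in groups:
            if g & merged:
                merged |= g
            else:
                rest.append(g)
        groups = rest + [merged]
    return groups

# Hypothetical modeling choice: EquivalentClasses between ATC and DrugBank.
equivalent = [("atc:A07EA04", "drugbank:DB00443"),
              ("atc:C05AA05", "drugbank:DB00443")]
groups = equivalence_closure(equivalent)

# Both ATC usages collapse into one equivalence class: the wrong inference.
collapsed = any({"atc:A07EA04", "atc:C05AA05"} <= g for g in groups)
print(collapsed)
```

With SubClassOf axioms instead, both ATC classes merely share a superclass, and no equivalence between the two usages is ever derivable.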
The improved interoperability between DrugBank and the
ATC paves the way for integrative analysis using the results
of recent large-scale studies (Tatonetti et al., 2012; Gottlieb
et al., 2011). Indeed, as ATC categories have been adopted
as standards for pharmacological effect description, the query interface presented here can facilitate the comparison of
predictions made from one dataset versus another. The ATC
OWL classification is also useful to evaluate drug-repurposing predictions, which would be defined as a new
mapping of a drug (from DrugBank) to an ATC category.
For instance, a method could predict that a drug ‘A’ (referenced in DrugBank) would act as an ATC category ‘B’. The
currently known effect of the compound 'A' could be retrieved via a reasoning engine over the OWL ATC or via the
interface and compared to the one predicted by the method.
5 CONCLUSION
We have improved the interoperability between two central
biomedical repositories, DrugBank and the ATC. The result of the work is likely to benefit the community using these resources, as the new and corrected mappings should be
available in the native database, thanks to the collaboration
of the DrugBank curation team. Our OWL representation of
the ATC extends the previously described one by providing
integration with DrugBank as well as a visual interface to
explore and query the classification. The integrated dataset
will also help computational drug discovery methodologies
to evaluate and validate their results against the gold standard reference that is the ATC.
REFERENCES
Campillos,M. et al. (2008) Drug target identification using side-effect similarity. Science, 321, 263–266.
Gottlieb,A. et al. (2011) PREDICT: a method for inferring novel
drug indications with application to personalized medicine.
Molecular Systems Biology, 7, 496.
Hewett,M. et al. (2002) PharmGKB: the Pharmacogenetics
Knowledge Base. Nucleic Acids Research, 30, 163–165.
Hoehndorf,R. et al. (2012) Identifying aberrant pathways through
integrated analysis of knowledge in pharmacogenomics. Bioinformatics, 1–7.
Iorio,F. et al. (2010) Discovery of drug mode of action and drug
repositioning from transcriptional responses. Proceedings of
the National Academy of Sciences of the United States of
America, 107, 14621–14626.
Ivanova,E.P. et al. (2010) Guidelines for ATC classification and
DDD assignment. Vegetatio, 70, 93–103.
Knox,C. et al. (2011) DrugBank 3.0: a comprehensive resource for
“omics” research on drugs. Nucleic Acids Research, 39,
D1035–D1041.
Tatonetti,N.P. et al. (2012) Data-Driven Prediction of Drug Effects
and Interactions. Science Translational Medicine, 4,
125ra31–125ra31.
Using Ontologies to Study Cell Transitions
Ludger Jansen1 2, Georg Fuellen3, Ulf Leser4, Andreas Kurtz5 6
1 Philosophisches Seminar, RWTH Aachen University, Germany; 2 Institute of Philosophy, University of Rostock, Germany; 3 IBIMA, University of Rostock, Germany; 4 Institute for Computer Science, Humboldt-Universität zu Berlin, Germany; 5 BCRT, Charité Berlin, Germany; 6 Seoul National University, South Korea
ABSTRACT
Cell transitions, be they reprogramming of somatic cells to pluripotency or trans-differentiation between cells, are a hot topic in current biomedical research. The large corpus of recent literature in this area is underused, as results are only represented in natural language, impeding their systematic aggregation and usage. Scientific understanding of the complex molecular mechanisms underlying cell transitions could be improved by making essential pieces of knowledge available in a formal (and thus computable) manner. We describe the outline of two ontologies, for cell phenotypes and for cellular mechanisms, which enable the representation of data curated from the literature or obtained by bioinformatics analyses, for building a knowledge base about the phenotypes and mechanisms involved in cellular reprogramming. In particular, we discuss how comprehensive ontologies of cell phenotypes and of changes in mechanisms can be designed using the entity-quality (EQ) model. Our design allows deep insights into the relationship between the continuants (cell phenotypes) and the occurrents (cell mechanism changes) involved in cellular reprogramming. Further, our ontologies allow the application of algorithms for similarity searches in the spaces of cell phenotypes and mechanisms, and, especially, of changes of mechanisms during cellular transitions.
1 BACKGROUND
The (artificial) induction of cell transitions has recently attracted a lot of attention. A cell phenotype (or cell type) can be defined by the cell's repertoire of molecules and structural components, together with the specific morphology and function they bring with them. A cell transition is a transition of a cell from one phenotype to another. For example, the phenotype of epithelial cells is distinct from the phenotype of fibroblasts. Programming of cells is the induction of a cell phenotype transition, e.g. from fibroblast to epithelial cell. Reprogramming is the artificially induced transition of a cell to a cell phenotype which it (or its predecessor) had in the past. Potency can be defined as the disposition of a cell to transition naturally into another cell phenotype; pluripotency is the ability of a cell to transition naturally into all cell (pheno)types of the body. Since Takahashi and Yamanaka described cell reprogramming of fibroblasts back to pluripotency (also known as generation of iPS, induced pluripotent stem cells) [1], hundreds of papers have dissected the reprogramming process and the cellular disposition of pluripotency at an ever-increasing resolution, reviewed in, e.g., [2] and [3]. This corpus is currently underused, as there is no formal representation of the findings.
Several ontologies already exist in the domain of cell biology, e.g. the cell type ontology (CL; cf. [4], [5]). [5] proposed formal definitions for CL classes, referring to properties of cells such as expressed proteins, activated biological processes, or the phenotypic characteristics associated with a cell. The Virtual Physiological Human project (www.ricardo.eu) attempts to provide interoperability between different databases and tools related to human physiology and gene expression; the associated software Phenomeblast (code.google.com/p/phenomeblast) is an ontology-based tool for aligning and comparing phenotypes across species. However, most efforts focus on anatomical features and only rarely address the cell level (cf. [6], [7], [8], [9] and [10]). What is missing is a tool to represent and to compare cellular phenotypes and their dynamics.
2 CELL MECHANISMS AND CHANGES
We distinguish between two types of processes going on in a cell: microscale mechanisms and macroscale changes thereof. Microscale mechanisms are the interactions between molecules going on in a cell at a certain time, while a macroscale change is the transition from one set of microscale mechanisms going on at one point of time to another such set at a later time. In order to transfer ontology-based annotation and search strategies from phenotypes at the anatomical level [11] to the domain of cell phenotypes and mechanism changes, we need to be able to formally describe both (a) cell phenotypes and (b) mechanism changes. Phenotypes are usually described by means of the entity-quality syntax (EQ; [13], [14]). To apply the EQ syntax to the cell level, we outline two ontologies, an ontology of cell parts (Figure 1) and an ontology of microscale mechanisms (Figure 2), to be used in combination with a small set of standardized modifiers (as 'qualities').
Paper D
L. Jansen et al.
Figure 1. Outline of an ontology of cell parts and its use to describe cell phenotypes. The figure shows a structure by which
cell phenotypes, here epithelial cells, mesenchymal cells and embryonic stem cells (ESC), can be formally represented, using entity terms (shown on the left hand side) and PATO-analogous quality modifiers (shown on the right hand side). Terms
referring to cells are indicated in yellow, terms relating to structures in red, to ultrastructures in blue, and to molecules in
green.
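The entity-quality pairs illustrated in Figure 1 can be paraphrased in code as sets of (entity, quality) annotations. The following Python sketch uses the epithelial-cell example from the text (Occludin, JAM, Claudins and tight junctions 'present', membranes 'joined'); the mesenchymal profile and the comparison functions are our own illustrative assumptions, not curated data.

```python
# Sketch of EQ-style annotation profiles for cell phenotypes, following the
# Figure 1 example. An annotation is an (entity, quality) pair; a phenotype
# profile is a set of such pairs. Values are illustrative assumptions.

epithelial = {
    ("Occludin", "present"),
    ("JAM", "present"),
    ("Claudins", "present"),
    ("tight junction", "present"),
    ("cell membrane", "joined"),
}

mesenchymal = {
    ("Occludin", "absent"),        # assumed for the sketch
    ("tight junction", "absent"),
    ("cell membrane", "separate"),
}

def shared_entities(profile_a, profile_b):
    """Entities annotated in both profiles, regardless of quality."""
    return {e for e, _ in profile_a} & {e for e, _ in profile_b}

def differing(profile_a, profile_b):
    """Entities whose quality differs between the two profiles."""
    a, b = dict(profile_a), dict(profile_b)
    return {e for e in shared_entities(profile_a, profile_b) if a[e] != b[e]}

print(sorted(differing(epithelial, mesenchymal)))
```

The entities with differing qualities are exactly those one would expect to change during an epithelial-mesenchymal transition.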
To describe cell phenotypes and the transitions between them, we refer to entities belonging to distinct ontological categories [12]:
(1) Independent continuants: Cells and their organelles as well as molecules are spatial entities existing as spatial wholes at every time they exist.
(2) Dependent continuants: Any property of a cell or a molecule, be it a quality or a disposition, also exists as a whole at every time it exists. Any such property is ontologically dependent on its particular bearer: it cannot exist without it.
(3) Occurrents: Interactions, inhibitions and stimulations as well as transitions are temporally extended processes and do not exist as a whole at any time at which they happen.
Cell phenotype data describe both continuants (like cellular components and dispositions) and occurrents, namely the molecular interactions going on in a cell at a certain time. Cell transition data primarily describe occurrents, namely macroscale changes of microscale mechanisms. Within the EQ framework, we can describe such macroscale changes of microscale mechanisms by pairing terms for microscale mechanisms (as 'entities') with specific change modifiers (as 'qualities'); see Figure 2 for an example. In our framework, a pluripotent cell can be characterized by the disposition of pluripotency, but also by its expression data (about genes, proteins etc.), from which relevant microscale mechanisms can be inferred. A cell transition, e.g., of a fibroblast into a pluripotent cell, can be described by comparing the expression data of both cell phenotypes, which capture macroscale changes in microscale mechanisms. The latter include the start-up of the interactions between genes/proteins relevant for the induction of pluripotency; such a start-up may happen because the cell starts to produce more instances of the molecule types participating in this type of interaction. In our framework, a pluripotent cell realizes dispositions for mechanisms relevant for pluripotency that may be described by a network of interactions. Further, a cell transition from fibroblast to pluripotent cell realizes dispositions for changes in mechanisms. After the transition, the cell is characterized by the microscale mechanisms relevant for the pluripotent phenotype.
Our ontologies are designed to be combined with specific modifiers within the EQ framework. As shown on the right-hand side of Figure 1, the ontology of cell phenotypes can be used to collect annotations for cell phenotypes such as fibroblast, epithelial cell and pluripotent stem cells. We can set up annotation profiles of cells, consisting of sets of EQ pairs that describe them. For example, the profile of epithelial cells includes that the genes/proteins Occludin, JAM and Claudins as well as tight junctions (TJs) are 'present', and cell membranes are 'joined'. For this purpose, we use a number of standardized modifiers like 'present', 'absent', 'up' and 'down', which will also be integrated within an ontology like PATO [13].
Figure 2. Outline of an ontology of cell mechanisms and its use to describe cell transitions. The figure shows a structure by which mechanism changes can be formally represented, using entity terms (shown on the left hand side) and quality modifiers (shown on the right hand side). The colour code follows that used in Figure 1: occurrents relevant for cell phenotypes are indicated in yellow, occurrents relevant for ultrastructures in blue, and occurrents directly involving molecules in green. 'Up' and 'down' are intended to indicate relative changes: 'Interaction Occludin-JAM Up' states that there is a development in the cell to feature more interactions of this kind, no matter how many of them there have been before. Hence there is no direct connection between types of such relational changes and their start- and end-states.
The ontology of cell mechanism changes, on the other hand, is designed to be combined with modifiers like 'up' and 'down' in order to yield descriptions for macroscale changes of the microscale mechanisms going on within a cell ('up' for start-up; 'down' for shutdown). The right-hand side of Figure 2 features, e.g., the specific changes of the microscale mechanisms relevant for TJ, which are the macroscale changes associated with TJ formation. Within the framework of the EQ syntax, qualities help to describe these changes of the microscale mechanisms. The example hierarchy in Figure 2 reflects our example of the epithelial-mesenchymal transition (EMT) and its reversal (MET, observed during reprogramming). MET can be defined in terms of 'network of mechanisms relevant for epithelial cell' as the entity term and 'up' as the quality modifier; for a more complete description, the 'network of mechanisms relevant for mesenchymal cell' goes 'down' simultaneously.
3 DISCUSSION
Both ontologies allow for annotation propagation. In [11], annotations for anatomical entities are propagated up a hierarchy of is_a and part_of relations, such that a parent receives all the annotations of its children. However, given the usual all-some semantics, the mereological hierarchy cannot be used in the same way as in [11] for cell phenotypes and mechanism changes. Therefore, we decided to model mereological relations with has_part instead of part_of. We did this for the following reasons: First, in anatomical hierarchies, parts determine the wholes they belong to; e.g., a finger is always part of a hand. A molecular entity like Occludin, however, can belong to a wide range of cell phenotypes, while a certain phenotype (by definition) has to possess certain molecular entities. As here the whole determines the parts, we need to use the has_part relation. Second, cell mechanisms, as we understand them here, are occurrents, and initial temporal parts can happen without the event being completed. Again, we need to employ the has_part hierarchy, from whole processes to their necessary parts (e.g., from Network_of_mechanisms_relevant_for_TJ to the Interaction_Occludin_JAM). When employing annotation propagation, therefore, as a rule, a whole process will have a higher information content than its necessary parts.
The ontologies outlined above enable similarity searches across cell phenotypes and mechanism changes in analogy to [11]. Such searches may compare, e.g., EMT/MET and reprogramming data. In simplified terms, an MET [15] consists in, first, the formation of adherens junctions (AJ) and, second, the formation of tight junctions (TJ). We represent the MET as the start-up of the microscale mechanisms relevant for an epithelial cell, which has as one of its parts TJ formation, which is, in turn, represented as the start-up of the mechanisms relevant for a TJ. This is the inverse of an EMT (which happens in development, metastasis and fibrosis). ExprEssence and related tools ([16], [17], [18], [19]) can be employed for generating annotations about mechanism changes relevant for a certain transition by means of high-throughput data analysis. The more mechanisms are annotated, the better we can estimate how similar biological processes are. Ultimately, any set of cell transitions can be compared (using data coded in EQ syntax) with respect to the underlying mechanisms, demonstrating the power of our approach. Our ontology design principles thus enable a kind of BLAST search in the space of annotations (for mechanisms), with similar goals, such as highlighting relationships (between mechanisms, based on basic mechanisms as building blocks), and eventually estimating their evolutionary history.
While our individuation criteria for cell phenotypes are very fine-grained (even a tiny change in the molecular repertoire constitutes a change in phenotype), we can construct more coarse-grained cell types by clustering cell phenotypes based on similarity, considering the presence or absence of (ultra-)structural components and molecular entities. In addition, we can also cluster the macroscale changes that transition cells from one phenotype to another. A cluster of cells shares aspects of components and microscale mechanisms. Generally speaking, similar phenotypes correspond to similar cells, and similar mechanism changes correspond to similar cell transitions. Thus, boundaries between clusters of cells that are 'next neighbours' (e.g. pluripotent embryonic stem cells and epiblast stem cells) as well as between cells on opposite ends of a developmental spectrum (e.g. mesenchymal cells and epithelial cells) can be defined by clustering based on expert annotations and on bioinformatics analyses of experimental data. Clustering of mechanism changes (that is, of macroscale changes in microscale mechanisms) will in turn generate clusters of similar mechanism changes with a large distance between them. The cause for this large distance then is the existence of strongly dissimilar cells.
4 CONCLUSION
We outlined how to design ontologies that enable us to (1) formally represent cell phenotypes and the mechanism changes behind cell transitions such as (re-)programming, and to (2) develop algorithms exploiting this framework,
including clustering and searching for similar cell phenotypes and mechanism changes. Both ontologies support
manual curation of publication data, annotation propagation and information content measurement, as well as the
inclusion of results from high-throughput data analysis.
Our use of EQ-syntax allows the systematic encoding of
annotation profiles of cell phenotypes and mechanism
changes. The terms for both types of entities are organized in hierarchies ranging from molecular to (ultra)structural to morphological entities. Annotation profiles can then be obtained using (1) data curation from
publications or by (2) high-throughput data analysis. In
ontological terms, bioinformatics tools such as ExprEssence can be used as an instrument for deriving mechanistic information from high-throughput data, turning information about continuants into information about occurrents by differential analysis. The starting point for expert
curation, possibly supported by text mining, must be a set
of carefully selected papers.
Given a rich annotated knowledge base, existing approaches for ontology-based similarity measurements [11]
can be applied to the domains of cell phenotypes and cellular mechanism changes. This would yield two important
functionalities: It allows clustering of cell phenotypes
(and of mechanism changes) by similarity, providing important information for an operational definition of cell
phenotypes, and it allows similarity search in the spaces
of mechanism changes and of cell phenotypes.
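The annotation propagation and similarity search summarized here can be sketched in a few lines: annotations are propagated from the necessary parts of a process up the has_part hierarchy, and a simple Jaccard measure stands in for the ontology-based similarity measures of [11]. The hierarchy below and the extra Claudin interaction are illustrative assumptions, not curated content.

```python
# Sketch of annotation propagation along a has_part hierarchy: a whole
# process receives the annotations of its necessary parts. Names echo the
# TJ example from the text; the hierarchy itself is an assumption.

has_part = {
    "Network_of_mechanisms_relevant_for_TJ": [
        "Interaction_Occludin_JAM",
        "Interaction_Claudin_Claudin",   # assumed additional part
    ],
}

annotations = {  # (entity, quality) EQ pairs asserted on each mechanism
    "Interaction_Occludin_JAM": {("Interaction_Occludin_JAM", "up")},
    "Interaction_Claudin_Claudin": {("Interaction_Claudin_Claudin", "up")},
}

def propagated(term):
    """Annotations of `term` plus those propagated up from its parts."""
    result = set(annotations.get(term, set()))
    for part in has_part.get(term, []):
        result |= propagated(part)
    return result

def jaccard(a, b):
    """Simple stand-in similarity between two propagated annotation sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

print(propagated("Network_of_mechanisms_relevant_for_TJ"))
```

On this toy hierarchy, the whole network accumulates the annotations of both interactions, so it is strictly more informative than either part, matching the information-content argument in the discussion.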
To further refine and populate the ontologies, we are currently exploring the option of working together with collaborators in the DFG SPP 1356 (http://www.spp1356.de) on pluripotency and cellular reprogramming, and similar initiatives, and we are looking for funding. The size of the final artifacts is obviously a function of the time and effort invested in their development. While the number of relevant entities is limited for cell anatomy and cell types (several thousands), it is very large and virtually unlimited for molecular entities. Naturally, we cannot expect complete coverage here.
To evaluate our approach, we intend to compare similarity search results based on high-throughput data analysis only to results based on employing the ontologies integrating high-throughput data, (ultra)structural data and morphological data, and further to compare both sets of results with the expectations of domain experts. To avoid a garbage-in, garbage-out scenario, the application domain must be strictly limited, e.g. to data describing reprogramming and EMT experiments, so that the input data can all be validated by domain experts.
ACKNOWLEDGEMENTS
Starting from research by GF, LJ and GF developed the ontologies, while AK and UL provided domain knowledge. LJ wrote the first version of this paper, drawing on a larger unpublished manuscript by the authors. All authors read and approved the text. DFG support to AK and UL (AK 851/3-1, LE 1428/4-1), GF (FU 583/2-1, FU 583/2-2) and LJ (JA 1904/2-1) is gratefully acknowledged.
REFERENCES
1. Takahashi, K. and S. Yamanaka, Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors. Cell, 2006. 126(4): p. 663-76.
2. Ho, R., C. Chronis, and K. Plath, Mechanistic insights into reprogramming to induced pluripotency. Journal of Cellular Physiology, 2011. 226(4): p. 868-78.
3. Jaenisch, R. and R. Young, Stem cells, the molecular circuitry of pluripotency and nuclear reprogramming. Cell, 2008. 132(4): p. 567-82.
4. Bard, J., S.Y. Rhee, and M. Ashburner, An ontology for cell types. Genome Biology, 2005. 6(2): p. R21.
5. Meehan, T.F., et al., Logical development of the cell ontology. BMC Bioinformatics, 2011. 12: p. 6.
6. Gunsalus, K.C., et al., RNAiDB and PhenoBlast: web tools for genome-wide phenotypic mapping projects. Nucleic Acids Research, 2004. 32(Database issue): p. D406-10.
7. Hoehndorf, R., et al., Relations as patterns: bridging the gap between OBO and OWL. BMC Bioinformatics, 2010. 11: p. 441.
8. Hoehndorf, R., A. Oellrich, and D. Rebholz-Schuhmann, Interoperability between phenotype and anatomy ontologies. Bioinformatics, 2010. 26(24): p. 3112-8.
9. Hoehndorf, R., P.N. Schofield, and G.V. Gkoutos, PhenomeNET: a whole-phenome approach to disease gene discovery. Nucleic Acids Research, 2011. 39(18): p. e119.
10. Hoehndorf, R., et al., Interoperability between biomedical ontologies through relation expansion, upper-level ontologies and automatic reasoning. PLoS One, 2011. 6(7): p. e22006.
11. Washington, N.L., et al., Linking human diseases to animal models using ontology-based phenotype annotation. PLoS Biology, 2009. 7(11): p. e1000247.
12. Jansen, L., Categories: The Top Level Ontology, in Applied Ontology, K. Munn and B. Smith, Editors. 2008, Ontos, Frankfurt. p. 173-196.
13. Gkoutos, G.V., et al., Using ontologies to describe mouse phenotypes. Genome Biology, 2005. 6(1): p. R8.
14. Mungall, C.J., et al., Integrating phenotype ontologies across multiple species. Genome Biology, 2010. 11(1): p. R2.
15. Thiery, J.P. and J.P. Sleeman, Complex networks orchestrate epithelial-mesenchymal transitions. Nature Reviews Molecular Cell Biology, 2006. 7(2): p. 131-42.
16. Ideker, T., et al., Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics, 2002. 18 Suppl 1: p. S233-40.
17. Guo, Z., et al., Edge-based scoring and searching method for identifying condition-responsive protein-protein interaction sub-network. Bioinformatics, 2007. 23(16): p. 2121-8.
18. Warsow, G., et al., ExprEssence: revealing the essence of differential experimental data in the context of an interaction/regulation network. BMC Systems Biology, 2010. 4: p. 164.
19. Kim, Y., et al., Principal network analysis: identification of subnetworks representing major dynamics using gene expression data. Bioinformatics, 2011. 27(3): p. 391-8.
Automatically transforming pre- to post-composed phenotypes:
EQ-lising HPO and MP
Anika Oellrich1 , Christoph Grabmüller1 , and Dietrich Rebholz-Schuhmann1
1 European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SD, UK
ABSTRACT
Large-scale mutagenesis projects are ongoing to better understand human pathology and thereby identify possible prevention and treatment mechanisms. These projects record not only the genotype but also the phenotype of genetically altered organisms. Consequently, phenotype association studies become more and more important with the ever increasing amount of available data. Thus far, phenotype data is stored in species-specific databases that are not interoperable due to the differences in their phenotype representations. One suggested solution to the lack of
integration is the use of Entity-Quality (EQ) representations. However, this
requires a transformation of the phenotype data contained in the species-specific databases into corresponding EQ statements. If phenotypes are
represented with an ontology, the transformation of data may happen based
on the ontology, albeit this process may be slow if executed manually.
Here, we report on our ongoing efforts to develop a method (called EQ-liser) for the automated generation of EQ representations from phenotype ontology concept labels. We implemented the suggested method in a
prototype and applied it to a subset of Mammalian and Human Phenotype
Ontology concepts. In the case of MP, we were able to identify the correct EQ
representation in over 52% of structure and process phenotypes. However,
applying the EQ-liser prototype to the Human Phenotype Ontology yields a
correct EQ representation in only 13.3% of the investigated cases.
With the application of the prototype to two phenotype ontologies, we
were able to identify common patterns of mistakes when generating the EQ
representation. Correcting those mistakes will pave the way to a species-independent solution to automatically derive EQ representations from phenotype ontology concept labels. Furthermore, we were able to identify
inconsistencies in the existing manually defined EQ representations of
current phenotype ontologies. Correcting those inconsistencies will improve
the quality of the manually defined EQ representations.
1 INTRODUCTION
Improvements in sequencing technologies have led to ambitious projects aiming at the identification of a species' phenome by targeted mutation of the genome, e.g. the International Mouse Phenotyping Consortium (IMPC) (Abbott 2010, Bogue and Grubb 2004).
Phenotype data resulting from those mutagenesis experiments is
captured in species-specific Model Organism Databases (MODs),
allowing a structured representation of the contained phenotype data
in support of comparative phenomics (Leonelli and Ankeny 2012).
Together with the number of available MODs (Blake et al. 2011,
Drysdale and Consortium 2008, Amberger et al. 2011), the number
of species-specific phenotype ontologies has increased. Applying these species-specific phenotype ontologies to represent phenotype data in MODs hinders the integration of phenotype data across MODs. In order to facilitate integration across those MODs and enable a flow of knowledge, mechanisms are required that bridge the species-specific phenotype ontologies.
In addition to ontology alignment algorithms, one suggested bridging mechanism has found increasing application: the Entity-Quality (EQ) representation of phenotypes (Mungall et al. 2010).
Using the EQ representation to describe a phenotype means that this
phenotype is decomposed into an affected entity which is further
described with a quality, e.g. decreased body weight. Representing
phenotypes as a composition of an entity and quality is also
called post-composition. EQ descriptions have been successfully
applied in a number of studies, focusing on cross-species phenotype
integration (Washington et al. 2009, Chen et al. 2012, Hoehndorf
et al. 2011). Even though EQ representations are only available for a subset of the concepts of species-specific phenotype ontologies, those studies have already shown promising results. However, the studies could certainly benefit from more accessible data, ready to be integrated into their frameworks.
Species-specific phenotype ontologies include, among others, the
Mammalian Phenotype Ontology (MP) (Smith et al. 2005), the
Human Phenotype Ontology (HPO) (Robinson et al. 2008) and the
Worm Phenotype Ontology (WBPhenotype) (Schindelman et al.
2011). Those phenotype ontologies provide the concepts ready
for annotation and are therefore also referred to as pre-composed
phenotype ontologies.
Thus far, post-composed phenotype representations are produced manually, which presumably leads to high quality but is slow. First, the species-specific pre-composed phenotypes are created and, once a version is finalised, the corresponding EQ statements are generated. Because those EQ statements are generated manually, only a subset of the concepts of the pre-composed phenotype ontologies is available in EQ.
Furthermore, as an ontology is a community effort, it is subject to change. Concepts evolve, become obsolete or simply change over time, and keeping the EQ representations up to date is an important requirement. Since the EQ representations are determined in a manual process, the coverage of the resources is still very limited, and mistakes introduced as part of the manual curation process restrict the beneficial outcomes. Developing an automated method capable of generating an EQ representation from a phenotype concept could help this process, ensure a high quality of the EQ representations and keep up with the pace of the ontology development cycle.
In this manuscript, we report on our ongoing efforts to develop a method (called EQ-liser) for transforming pre-composed phenotype ontologies into a post-composed representation using EQ. After developing a prototype and applying it to MP and HPO concepts, we could derive a set of areas which need improvement before a generalised transformation from the pre-composed into the post-composed phenotype representation is possible. Furthermore, applying our approach not only provides a decomposition of phenotypes, it also facilitates the discovery of inconsistencies in the EQ statements manually assigned so far. Moreover, the approach also elucidates
inconsistencies in the concept labels of pre-composed phenotype
ontologies.
During evaluation, our prototype correctly generated the EQ
statement for over 52% of the test set of MP concepts and could also
identify a number of errors in the existing EQ representations for
both HPO and MP. We were also able to identify a number of label
inconsistencies within HPO, creating obstacles in an automated
generation of EQ statements. Generated results as well as the
implemented source code are available on our project web page¹, together with more information about the project itself.
¹ http://code.google.com/p/eqliser/

2 METHOD AND MATERIALS
Transforming pre-composed phenotype representations into post-composed ones requires the identification of the constituents in the concept labels, i.e. the EQ-liser has to identify the entity and the quality. To illustrate the post-composition of the MP concept abnormal otolithic membrane (MP:0002895), the manually assigned EQ statement is provided here:

[Term]
id: MP:0002895 ! abnormal otolithic membrane
intersection_of: PATO:0000001 ! quality
intersection_of: inheres_in MA:0002842 ! otolithic membrane
intersection_of: qualifier PATO:0000460 ! abnormal

2.1 Input data
In the existing, manually derived EQ statements, the entity constituent is represented with a number of OBO Foundry ontologies (Smith et al. 2007) and the quality constituent is always represented using the Phenotypic quality And Trait Ontology (PATO) (Mungall et al. 2010, Mabee et al. 2007). The ontologies filling the entity constituent also differ with the species. Supporting all of those ontologies would be out of the scope of our preliminary study. We therefore limited our approach to two species-specific ontologies, HPO and MP, and further restricted these to phenotype concepts whose EQ representation uses the Mouse Anatomy Ontology (MA) (Hayamizu et al. 2005), the Gene Ontology (GO) (Ashburner et al. 2000), the Foundational Model of Anatomy Ontology (FMA) (Rosse and Mejino 2003) and PATO. We consider these to correspond to structural and process phenotypes. On 3 May 2012, we downloaded a version of the two phenotype ontologies as .tbl files (tabular views of an ontology's data, generated from .obo files; http://www.berkeleybop.org/ontologies/) together with their corresponding EQ representations, comprising 9,795 HPO concepts and 9,127 MP concepts, of which 4,783 and 6,579, respectively, possess a manually assigned EQ statement. After the reduction, based on their manually assigned EQ statements, to structural and process phenotypes, the MP concepts were reduced to 3,761 and the HPO concepts to 3,268.

2.2 Deriving PATO cross products
A subset of the PATO concepts constitute a composition of other PATO concepts. For instance, the concept decreased depth (PATO:0001472) can be represented using the PATO concepts decreased (PATO:0001997) and depth (PATO:0001595). To achieve a term-wise composition of PATO concepts, we downloaded the PATO .tbl file and applied the filtering and stemming algorithm described in section 2.3. The composition of one particular PATO concept corresponds to all PATO concepts whose terms form a subset of the stemmed words contained in the concept name. After filtering special characters and removing stop words from the concept names and synonyms, the remaining textual content was stemmed using the Porter stemmer (Porter 1980) provided by Snowball (http://snowball.tartarus.org/). The stemmer was applied to all concept names and synonyms. Stemmed concept labels and synonyms were then compared pairwise, and each concept entirely contained in another (either label or synonym) was recorded. Applying this process, we retrieved 1,453 PATO concepts (out of 2,290) with a corresponding cross product.

2.3 Overview of the EQ-liser prototype
Figure 1 shows the processing steps to derive the EQ statement from an MP or HPO phenotype concept. Each of the steps is explained in more detail in the following paragraphs.

Fig. 1: The individual steps executed by EQ-liser to decompose a phenotype ontology based on concept names.

The first step (see figure 1) in processing an ontology's downloaded .tbl file is the filtering of special characters. To this end, the concept labels contained in the downloaded .tbl files of the ontologies were analysed for their orthographic correctness (Schober et al. 2009), i.e. special characters such as "%" or "-" were excluded. Such special characters - often special punctuation - potentially cause problems when matching differently punctuated concept labels from several ontologies. Stop words such as "in" or "the" are part of the common English language, are considered not to carry any discriminatory information, and can consequently be removed before the analysis to reduce noise and potential errors resulting from their inclusion.

After character filtering and stop word removal from all the concept labels and their synonyms, we used LingPipe (Carpenter 2007) to recognise entities and qualities in MP and HPO concepts. The dictionaries for LingPipe were compiled using the labels and synonyms provided by the ontology files for FMA, MA and PATO. For GO, we used an alternative approach, described in (Gaudan et al. 2008) but likewise implemented as a LingPipe annotation server. An individual tagging server was set up for each ontology. As those servers all work in parallel, they may assign overlapping annotations, which could potentially result in too many annotations being assigned by the automated method. E.g. in the case of enlarged dorsal root ganglion (MP:0008490), both an MA annotation for dorsal root ganglion (MA:0000232) and a PATO annotation for dorsal (PATO:0001233) are assigned. To avoid this behaviour, we ran a filter process after assigning the LingPipe annotations and removed those annotations that are entirely encapsulated in another. Filtering GO annotations is not yet possible due to the current implementation of this server, but will be supported in later versions.

The last step is the replacement of LingPipe's PATO annotations, summarising them, where possible, with the cross products of section 2.2. Consequently, for an example such as decreased palatal depth, the two LingPipe annotations would now be replaced with the single annotation decreased depth. In addition, absent (PATO:0000462) is replaced in all automatically generated EQ statements with lacks all parts of type (PATO:0002000), which is commonly used in the manually assigned EQ descriptions.
2.4 Evaluation
To evaluate our results, we introduced a two-step evaluation process. We first compared the obtained EQ statements to the available, manually assigned EQ representations of structural and process phenotypes. In a second step, we investigated a subset of 50 EQ statements from each ontology for which the automated method and the manual curator do not assign any shared concepts. Common patterns causing a disagreement in the automatic EQ representation were identified and are discussed in sections 3.3 and 3.4, for MP and HPO respectively.
3 RESULTS AND DISCUSSION
The transformation of the pre-composed representation into the post-composed representation requires decomposing the concept labels, identifying the entities and qualities, and correctly generating the post-composed representation. The entities have to be matched to ontological concepts, which are provided by other OBO Foundry ontologies. As a test scenario, we have tested the EQ-liser method on MP and HPO concept labels. Note that all transformations are only concerned with phenotype representations of structures and processes.
3.1 EQ-lising the Mammalian Phenotype Ontology
When decomposing structure and process MP phenotypes based
on their labels, 3,549 concept labels (out of 3,761) could be
transformed. When comparing those 3,549 concepts to the manually
assigned EQ statements, 23.7% received a correct post-composition
assigned by EQ-liser, based only on concept labels. Exploiting synonyms as well, we achieved an increase of 6.7%. If we relax
the criterion of what constitutes a correct match to allowing the
automated method to assign more annotations than a manual curator
would do, we achieve a correct EQ statement for 52.2% of MP
concepts. This relaxed criterion is safe to apply as the automatically
generated EQ statements will undergo curator approval and
removing additionally assigned annotations is not a problem.
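The relaxed criterion amounts to a superset test: an automatically generated EQ statement counts as correct if it contains at least every annotation the curator assigned, since any extras can be stripped during curator approval. A minimal sketch (the function names are ours; the annotation ids are taken from the abnormal otolithic membrane example):

```python
def exact_match(automated, manual):
    """Strict criterion: both methods assign exactly the same annotations."""
    return set(automated) == set(manual)

def relaxed_match(automated, manual):
    """Relaxed criterion: the automated method may assign extra annotations,
    because a curator can easily remove them during approval."""
    return set(manual) <= set(automated)

manual = {"PATO:0000460", "MA:0002842"}
automated = {"PATO:0000460", "MA:0002842", "PATO:0000001"}
print(exact_match(automated, manual), relaxed_match(automated, manual))  # False True
```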
Obtaining the required annotations for EQ statements from concept
names in more than half of MP’s structure and process phenotypes
constitutes a promising start for a generalised decomposition
method. Completely erroneous post-composed representations have
only been generated for 5.6 % of the concepts. These numbers
demonstrate that the pre-composed concept labels of MP are already
well structured and that the automatic transformation – with a grain
of salt – does generate post-composed representations that correctly
reflect the semantics of the pre-composed representation.
3.2 EQ-lising the Human Phenotype Ontology
To determine whether the transformation also performs well on
an alternative pre-composed phenotype ontology, we also applied
EQ-liser to the HPO concept labels. Since HPO and MP are both designed for mammalian species, we expect that both ontologies share a subset of similar phenotype concepts. Again, only structural and process phenotypes have been considered, represented in post-composition with FMA, GO and PATO.
We considered 3,268 pre-composed concepts, of which 2,731 obtained an automatically assigned EQ statement. Of these, 231 (8.5%) are exact matches to the annotations that have been manually assigned. We can increase the yield to 249 cases (9.5%) by including synonyms. Again relaxing the matching criterion, we obtain the correct annotation in 13.3% of the cases. However, in 25.8% of the
cases, automatically (EQ-liser) and manually (existing) assigned
EQ statements do not share any annotations. As the results show,
decomposing mouse phenotype concepts leads to better results than
decomposing human phenotype concepts based on lexical features
of their labels and synonyms.
3.3
Mismatches in EQ-lising MP
Manually investigating 50 MP concepts where automated and
manual EQ statements entirely disagree show common patterns
for all three constituents: structure, process and quality. A
number of mismatches were caused by assigning wrong PATO
annotations due to particular extension or replacement patterns
in the manually designed EQ statement which cannot yet be
picked up with the automated procedure. E.g. the quality of
increased mitochondrial proliferation (MP:0006038) is represented
in the manually assigned EQ statements with an increased
rate (PATO:0000912). However, the automated method assigns
instead increased (PATO:0000470) as quality for this particular
MP concept. Similarly, any concept name possessing the phrase
increased activity will be annotated in the manually assigned EQ
statements with increased rate (PATO:0000912), which is not yet possible automatically. Furthermore, any concept whose label contains the phrase increased ... number will be represented in the manual EQ statements with has extra parts of type
(PATO:0002001). The same examples hold true replacing increased
with decreased in the concept labels. All the examples provided here could be handled with conditional replacement rules for PATO
concepts and consequently lead to a reduction of the contradictory
cases and more correctly identified EQ statements.
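Such conditional replacement rules could take the shape of a small pattern table that is consulted before the default PATO quality is emitted. The sketch below is a hypothetical illustration, not part of EQ-liser: the phrase patterns and PATO ids are the ones named in the text, while the rule-table mechanism and the example labels are ours.

```python
import re

# (label pattern, PATO id to use as quality) - the patterns named in the text
RULES = [
    (re.compile(r"\bincreased\b.*\bnumber\b"), "PATO:0002001"),                    # has extra parts of type
    (re.compile(r"\bincreased\b.*\b(?:activity|proliferation)\b"), "PATO:0000912"),  # increased rate
]

def quality_for(label, default="PATO:0000470"):  # default: plain "increased"
    """Return the PATO quality for a concept label containing 'increased'."""
    for pattern, pato_id in RULES:
        if pattern.search(label):
            return pato_id
    return default

print(quality_for("increased mitochondrial proliferation"))  # PATO:0000912
print(quality_for("increased lumbar vertebrae number"))      # PATO:0002001
```

A symmetric table for decreased phenotypes would follow the same shape, as the text notes that the same patterns hold with increased replaced by decreased.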
Additional mismatches were mostly caused by not identifying, or by wrongly identifying, the structural component of the phenotype. This happens when the naming of the affected anatomical structure differs between MA and MP, mostly due to singular/plural
divergence. E.g. the MA annotation lumbar vertebra (MA:0000312)
is not automatically assigned to the MP concept increased lumbar
vertebrae number (MP:0004650), because vertebra is used in the singular in MA but in the plural in MP. Another source of mismatches
in the anatomical structure is due to shortened expressions, e.g. MP
uses coat while MA uses coat hair. The described mismatches could
be addressed by adding additional terms to the dictionary underlying
the LingPipe MA annotation server, or by applying stemming to both the concept labels and synonyms and the underlying annotation dictionary.
The third type of investigated mismatch concerns the process constituent of the EQ statements. Mismatches in the process constituent were caused, for example, by synonyms not being covered in the current implementation of the GO server. An
example falling into this category are concept names containing
salivation which are not recognised as saliva secretion. Other
mismatches resulted from different word forms used to express the
same concept, e.g. smooth muscle contractility and smooth muscle
contraction. A small fraction of the process constituent mismatches is also caused by singular/plural conflicts, e.g. MP uses the plural cilia while GO uses the singular cilium. Both synonyms and singular/plural conflicts could be addressed by adding those variants to the dictionary underlying the current GO server, or by applying stemming before entity recognition.
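The stemming fix suggested here amounts to indexing the annotation dictionary under normalised keys, so that singular and plural variants hit the same entry. A hypothetical sketch, assuming the lumbar vertebra example above (MA:0000312 is from the text; the normalisation rules are crude illustrative stand-ins for a real stemmer):

```python
def normalise(word):
    """Very crude singular normalisation - a stand-in for a real stemmer."""
    if word.endswith("ae"):                # Latin plural: vertebrae -> vertebra
        return word[:-1]
    if word.endswith("ia"):                # Latin plural: cilia -> cilium
        return word[:-1] + "um"
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

def key(label):
    return " ".join(normalise(w) for w in label.lower().split())

def build_dictionary(labels):
    """Index concept labels under their normalised form."""
    return {key(label): cid for cid, label in labels.items()}

ma = build_dictionary({"MA:0000312": "lumbar vertebra"})
print(ma[key("lumbar vertebrae")])  # MA:0000312
```

With such keys, the MP label increased lumbar vertebrae number would reach the MA entry for lumbar vertebra despite the plural form.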
Among those 50 concepts, we could also identify wrongly assigned manual EQ statements in two cases (corresponding to 4% of
the investigated cases). Those cases were reported to the developers of the EQ statements and have been corrected. These wrongly assigned manual definitions were mainly due to old construction patterns and newly added concepts in the constituent ontologies.
3.4 Mismatches in EQ-lising HPO
One source of mismatches in the quality-defining part of a phenotype is the difference between using a noun and an adjective to describe a certain aspect. E.g. all HPO concepts containing either
abnormality or abnormalities are not automatically annotated with
abnormal (PATO:0000460) due to the differences in word type used.
Furthermore, all concepts containing abnormality or abnormalities
in their names are manually assigned the PATO annotation quality
(PATO:0000001), which cannot be derived from the concept name as such. In addition, some words inside the HPO concept names are
extended in the manually assigned EQ statements. E.g. Irregular
epiphysis of the middle phalanx of the 4th finger (HP:0009219) is
manually annotated with irregular density (PATO:0002141). The
identified mismatches can be corrected by adding special handling
rules to the concept decomposition for particular HPO concept
names.
Mismatches in the structural components of phenotypes were
partially due to differences in naming of the anatomical component
in HPO and FMA. E.g., while FMA chose to name fingers (index
finger or ring finger), HPO assigns numbers to fingers, such as 2nd
finger or fourth finger. Further to that, HPO is not consistent in numbering entities: e.g. it uses thumb in all concepts concerning the first finger, but second toe versus 2nd finger. Furthermore, HPO is not well standardised in the choice between singular and plural nouns (phalanges versus phalanx). Another group of
mismatches in anatomical structures arises from contractions of the concept labels suggested by FMA, e.g. premolar instead of premolar tooth or metatarsal instead of metatarsal bone. Most of the currently identified mismatches in the
structural constituents of HPO concepts can be addressed by adding
terms to the dictionary of the LingPipe FMA annotation server.
In the chosen subset, mismatches in process constituents were, as was partially the case for the MP mismatches, due to synonyms not being supported in the current implementation of the GO server. E.g.
Abnormality of valine metabolism (HP:0010914) does not obtain the
GO annotation valine metabolic process (GO:0006573). This type of mismatch can be corrected in future versions of the EQ-liser method by including synonyms in the GO annotation server.
One group of mismatches, which did not occur in the MP concept name decomposition and occurred only rarely when applying the method to the HPO concept names, is the co-existence of identical concepts in different ontologies. Even though the OBO Foundry aims at the orthogonality of its ontologies, this criterion is not fulfilled in all cases. E.g. both FMA and GO host the concept
Chromosome (GO:0005694, FMA:67093), and the developer of the EQ statements is free to choose either one, which leads to inconsistencies in the resource. Another example of a doubly existing concept is Anosmia (HP:0000458, PATO:0000817). As
those concepts should be removed during the process of quality
assessment by the OBO Foundry, no action is required to address this aspect in the decomposition method.
For three manually analysed concepts (6% of the investigated cases), the manually assigned EQ statements were inconsistent. These inconsistencies were reported to, confirmed and corrected by the HPO EQ statement developers, and the corrections are available to the user community as a new version.
3.5 Towards a generalised phenotype decomposition
Even though the decomposition of HPO concepts does not yet work as well as the automatic generation of EQ statements for MP
concepts, the changes required for either ontology are similar and
with their rectification most of the mismatches investigated would
be addressed. Covering the correct set of annotations for 52% of the
structural and process phenotypes contained in MP is a promising
start to develop an automated method capable of automatically
deriving EQ statements from pre-composed phenotype ontologies.
However, given the closely related development of the MP and HPO EQ statements, the method has to be further tested on other pre-composed phenotype ontologies, such as WBPhenotype, for which a decomposed subset of concepts already exists. Once the method has been evaluated on another pre-composed phenotype ontology, we hypothesise that its performance will increase further and that we will be able to successfully decompose phenotype statements into their constituents for all species, as long as the constituent ontologies are available from the OBO Foundry⁵.
⁵ http://www.obofoundry.org/

4 CONCLUSION
Applying the suggested method to the generation of EQ statements
from MP concept labels of structural and process phenotypes yields
a strictly correct EQ statement in 30% of the cases. Assuming
that a curator will approve the EQ statements before they are used
community wide, additionally assigned annotations can be easily
removed from a correct EQ statement. Using this assumption to
relax the correctness criterion, we can identify the correct subset
of EQ statement constituents in over 52% of MP’s structural and
process phenotypes. To achieve a similar rate for the decomposition of HPO concepts based on their labels, the identified problems have to be
addressed. The correction of the identified problems will enable a
better identification of the EQ statements from concept labels. Once the flaws have been corrected, the method can be implemented in
a generalised manner to derive EQ statements from a variety of
pre-composed phenotype statements which will ease the integration
of species-specific pre-composed phenotype information into a
species-independent framework.
Independently of deriving decomposed phenotype expressions, the application of the method also allows for the identification of inconsistencies within the labels of the concepts. While MA and MP
follow a rigorous naming scheme and hence facilitate integration,
HPO and FMA diverge from each other, creating obstacles for
a possible integration. Furthermore, HPO does not consistently name its own concepts, which also hinders an integration based
on lexical attributes, confuses users of the ontology and prevents
easy integration of human data into other frameworks based on a
decomposed representation.
Besides allowing for the decomposition of concepts which have no EQ statement yet, the method has also proven useful for identifying flaws in the manually assigned EQ statements. By applying the
prototype of the method to concept labels of MP and HPO,
inconsistencies were identified and corrected accordingly. This
procedure improved the quality of the existing EQ statements and
consequently of all methods applying the EQ statements such as
PhenomeNET (Hoehndorf et al. 2011) and MouseFinder (Chen et al.
2012).
5 ACKNOWLEDGEMENTS
The authors thank Georgios V. Gkoutos for his close collaboration in
analysing potential errors of the EQ-liser method. He also provided
valuable explanations for patterns contained in the EQ statements
of the Mammalian and Human Phenotype Ontology. In addition,
the authors are also grateful to Irina Colgiu for her fast and reliable
implementation of the GO server, used in this study for annotation
purposes. Furthermore, the authors would like to thank Maria
Liakata for valuable input on the draft of this manuscript.
REFERENCES
Alison Abbott. Mouse project to find each gene’s role. Nature, 465
(7297):410, May 2010.
Joanna Amberger, Carol Bocchini, and Ada Hamosh. A new face
and new challenges for Online Mendelian Inheritance in Man
(OMIM®). Human mutation, 32(5):564-7, May 2011.
M Ashburner et al. Gene ontology: tool for the unification of
biology. The Gene Ontology Consortium. Nat Genet, 25(1):25–9,
May 2000.
Judith A Blake et al. The Mouse Genome Database (MGD): premier
model organism resource for mammalian genomics and genetics.
Nucleic Acids Res, 39(Database issue):D842–8, Jan 2011.
Molly A Bogue and Stephen C Grubb. The Mouse Phenome Project.
Genetica, 122(1):71–4, Sep 2004.
Bob Carpenter. LingPipe for 99.99% recall of gene mentions.
Proceedings of the 2nd BioCreative workshop, 2007.
Chao-Kung Chen et al. MouseFinder: candidate disease genes from
mouse phenotype data. Human mutation, Feb 2012.
Rachel Drysdale and FlyBase Consortium. FlyBase: a database
for the Drosophila research community. Methods Mol Biol, 420:
45–59, Jan 2008.
S Gaudan, A Jimeno Yepes, V Lee, and D Rebholz-Schuhmann.
Combining Evidence, Specificity, and Proximity towards the
Normalization of Gene Ontology Terms in Text. EURASIP
journal on bioinformatics & systems biology, page 342746, Jan
2008.
Terry F Hayamizu, Mary Mangan, John P Corradi, James A Kadin,
and Martin Ringwald. The Adult Mouse Anatomical Dictionary:
a tool for annotating and integrating data. Genome Biol, 6(3):
R29, Jan 2005.
Robert Hoehndorf, Paul N Schofield, and Georgios V Gkoutos.
PhenomeNET: a whole-phenome approach to disease gene
discovery. Nucleic Acids Res, 39(18):e119, Oct 2011.
Sabina Leonelli and Rachel A Ankeny. Re-thinking organisms: The
impact of databases on model organism biology. Stud Hist Philos
Biol Biomed Sci, 43(1):29–36, Mar 2012.
Paula M Mabee et al. Phenotype ontologies: the bridge between
genomics and evolution. Trends Ecol Evol (Amst), 22(7):345–50,
Jul 2007.
Christopher J Mungall et al. Integrating phenotype ontologies across
multiple species. Genome Biol, 11(1):R2, Jan 2010.
M F Porter. An algorithm for suffix stripping. Program, 1980.
Peter N Robinson et al. The Human Phenotype Ontology: a tool for
annotating and analyzing human hereditary disease. Am J Hum
Genet, 83(5):610–5, Nov 2008.
Cornelius Rosse and José L V Mejino. A reference ontology for
biomedical informatics: the Foundational Model of Anatomy.
Journal of biomedical informatics, 36(6):478–500, Dec 2003.
Gary Schindelman, Jolene Fernandes, Carol Bastiani, Karen Yook,
and Paul Sternberg. Worm Phenotype Ontology: integrating
phenotype data within and beyond the C. elegans community.
BMC Bioinformatics, 12(1):32, Jan 2011.
Daniel Schober et al. Survey-based naming conventions for use in
OBO Foundry ontology development. BMC Bioinformatics, 10:
125, Jan 2009.
Barry Smith et al. The OBO Foundry: coordinated evolution
of ontologies to support biomedical data integration. Nat
Biotechnol, 25(11):1251–5, Nov 2007.
Cynthia L Smith, Carroll-Ann W Goldsmith, and Janan T Eppig.
The Mammalian Phenotype Ontology as a tool for annotating,
analyzing and comparing phenotypic information. Genome Biol,
6(1):R7, Jan 2005.
Nicole L Washington et al. Linking human diseases to animal
models using ontology-based phenotype annotation. PLoS Biol,
7(11):e1000247, Nov 2009.
Paper E
The mouse pathology ontology, MPATH; structure and applications
Paul N. Schofield 1,2, John P. Sundberg 2, Beth A. Sundberg 2, Colin McKerlie 3, George V. Gkoutos 4
1 Dept of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge, CB2 3EG, UK
2 The Jackson Laboratory, 600 Main Street, Bar Harbor, ME 04609-1500, USA
3 Genetics and Genomic Biology Program, Hospital for Sick Children, Toronto, Canada
4 Dept of Computer Science, University of Aberystwyth, Old College, King Street, SY23 2AX, Wales
ABSTRACT
The advent of large-scale pathology-based phenotyping of mice requires a relatively simple nomenclature and coding system that can be integrated into data collection platforms (computerized medical record-keeping systems) to enable the pathologist to rapidly screen and accurately record observations. The data generated need to be easily and consistently retrieved in a form that can be analyzed computationally. MPATH provides such a platform which, when integrated into a medical records database, enables diagnoses to be automatically entered, spelled correctly and consistently (critical for data retrieval), and coded. Because MPATH is built on an ontology platform, diagnoses can be investigated in populations (epidemiology or genome-wide association mapping) at the level of a very specific diagnosis or of a class of diseases. This enables investigators to interrogate large datasets at a variety of depths, use semantic analysis to identify the relations between diseases in different species, and integrate pathology data with other data types, such as pharmacogenomics.
1 INTRODUCTION
Since the late eighteenth century when achromatic lenses
and reliable histological stains began to be available, investigators of anatomic pathology, and particularly in the mid nineteenth century the innovators of cellular pathology such
as Rudolf Virchow, developed and applied terminologies to
describe their observations. These depended on the “school”
to which the pathologists belonged, but more importantly on
the etiologic or mechanistic paradigm in which they were
working. Whilst one of the great achievements of the nineteenth century was the recognition of the universality of pathological processes and entities, that is, their occurrence in multiple species as recognisable manifestations of the same underlying processes (Sundberg & Schofield, 2009), it was a century before a broadly accepted and rationally structured
pathology was developed. The development of pathology
terminologies has to an extent occurred independently of
disease terminologies and nosologies, partly as a result of
the much longer history of classifying diseases, and partly
due to the inherited preconceptions of the nature of disease
in clinical medicine.
The distinction between pathological descriptions of disease, clinical descriptions of disease, disorders and predispositions is still not satisfactorily resolved, although in recent years there have been attempts to rationalise the definitions of these concepts (Scheuermann et al., 2009) and their
relation to each other as part of a broadly applicable model
of disease other than a “bag” of manifestations or phenotypes which are found in that class of individuals and form
the basis of a diagnosis. Issues about severity, time course,
organ involvement etc. are beginning to be addressed, but it
is remarkable that even treating diseases as a “bag” of
phenotypes has been shown to be a powerful approach in
establishing the relationships between diseases and the presence of related diseases in different organisms (Chen et al.,
2012; Hoehndorf et al., 2011; Oellrich et al., 2012; Washington et al., 2009). What has recently been identified as
important, nevertheless, is that the tissue-specific resolution
of the recording of lesions, and the ability to record the pattern of disease within an individual, have proved vital for
GWAS mapping of predisposing genetic variants in inbred
strains of mice, as each class of lesion can be analysed in
isolation (Berndt et al., 2011; Li et al., 2012).
The discipline of pathology may be broken down into clinical and anatomic pathology: the former is concerned with clinical chemistry, hematology, clinical microbiology and emerging subspecialities such as molecular diagnostics and proteomics; the latter with the histological, histochemical or immunohistochemical observations of the alterations
in tissue composition or architecture. Both branches of the
medical specialty, which are increasingly merging, may be
* To whom correspondence should be addressed.
Paper F
viewed as aspects of phenotyping, and both provide subtypes of the clinical signs associated with ongoing disease
processes, the results of developmental abnormalities, or the
historical presence of disease.
2 RESULTS AND DISCUSSION

2.1 Anatomic pathology nomenclature and its applications
The universality of the repertoire of responses to underlying
genetic or extrinsic insults means that the gross and histopathologically-defined phenotypes are some of the most
useful phenotypes for relating diseases between different
species and constitute the most species-agnostic phenotype
descriptors. This makes a pathologic term-based ontology a
crucial tool in experimental and clinical phenotype data capture (Schofield et al., 2011).
The development of systematic human pathologic nomenclatures has been driven by the efforts of the College of American Pathologists, initially with the development of the pathology-specific nomenclature (SNOP) over 40 years ago (Cornet & de Keizer, 2008), and subsequently the current SNOMED-CT, with cross-references to UMLS, the NCI Thesaurus and other terminologies. The ICD (World Health Organisation, 2008), now in its 11th revision, originally derived for epidemiological coding, and the associated ICD-O v3 for cancer also contain descriptions of many pathological lesions, and the latter is particularly useful for neoplasia.
The other driver for pathologic terminology standardisation has been the coding of lesions in toxicopathology; the American Society of Toxicologic Pathology (STP), working with the Registry of Industrial Toxicology Animal-data (RITA) database group in Europe, has produced several internationally accepted nomenclature systems, particularly focusing on proliferative lesions. Recently the STP has undertaken a major harmonization exercise for rodent pathology, the INHAND (International Harmonization of Nomenclature and Diagnostic Criteria for Lesions in Rats and Mice) initiative (Mann et al., 2012). So far this group has reported on
the hepatobiliary, respiratory, nervous and urinary systems
(Frazier et al., 2011; Kaufmann et al., 2011; Renne et al.,
2009; Thoolen et al., 2010). For some time the National
Cancer Institute’s Mouse Models of Human Cancer consortium (MMHCC) has been examining the classification of
tumours in genetically engineered mice and has itself produced a consensus-based terminology for neoplasias of the
major organ systems presented in a series of papers over the
last decade (Marks, 2009).
Despite the huge value of these resources, none is currently
constructed as an ontology with meaningful relations to
support inference and automated reasoning, and to that end
we developed MPATH as an ontology to describe lesions
that arise in laboratory mice.
2.2 A post-composition strategy for pathology coding
Traditionally pathologists have relied on a narrative form of
recording their definitive diagnoses, making use of morphologic, etiologic, and disease-based terms that collectively provide a diagnosis useful for clinical patient management. This is particularly important for non-neoplastic lesions, where it can be complex to capture important subtleties of, for example, distribution, severity, microscopic sub-type and anatomical location. Whilst this is the gold standard, it is not possible to compute on data recorded in this
way and it is very difficult to tabulate and quantitatively
analyse the collected information. There are strong arguments, mainly from experience in toxicologic pathology,
that a descriptive (anatomic) rather than diagnostic coding is
the most objective and useful way to code pathology-based
observations. This is particularly relevant to examination of
mutant mice where traditional etiologic or summative diagnostic terms are simply not available because of the novelty
of the lesion or its presentation. This is particularly the
case where mice are manipulated to model human conditions that have not been previously seen, for example lung
or mammary tumours (Berndt et al., 2011; Derksen et al.,
2006; Meuwissen & Berns, 2005) which have not previously been reported to occur spontaneously in mice. In
many cases, a disease diagnosis implies a particular pathogenesis or etiology based on the spontaneous disease, which
is not appropriate for disease caused by a genetic challenge, and sometimes by genetic and external challenges combined.
This latter issue is of particular concern to practicing pathologists and in the development of MPATH we have been
urged to include some diagnostic terms as well as descriptive anatomic ones.
The MPATH ontology was constructed ab initio by a
group of clinical and veterinary pathologists in 2000 and has
since been revised and augmented by an evolving group of
US and European pathologists on a regular basis. It is clear
from more than a decade of experience that expert input and
manual curation are essential to generate an accurate and
functional resource. One strategy for building the ontology
has been to integrate it into large-scale phenotyping and
diagnostic programs so that the pathologists use it on a daily
basis and have fields to add missing terms or synonyms that
they are more familiar with, thereby constantly increasing its
coverage and utilitarian value.
MPATH’s top level distinction is between pathophysiology (pathological processes) and anatomic pathology
(pathological entities). One issue which we met with frequently was the normal practice of referring to the observation
of a physical lesion by using the process term; for example
“necrosis” or “sclerosis”. Thus the noun describing the real-world entity observed is homonymous with the inferred
process. This problem has been addressed through the textual and logical definitions of terms but is a recurrent source
of confusion in formal treatments of pathology nomenclature. Pathologists using MPATH almost exclusively use the
anatomic pathology segment of the ontology with the exception of describing inflammation or other general processes
where the process is described using qualifiers such as
“acute” which are logically process-specific, consistent with
the use of PATO qualifiers (see below). The upper levels of
MPATH’s anatomic pathology branch include six broad
domains familiar to all traditions of pathology training and
comprehensively covering all known lesions: cell and tissue
damage, circulatory disorder, developmental and structural
abnormality, growth and differentiation defect, healing and
repair structure, and neoplasm, and are as orthogonal as is
feasible given the complexities of pathobiology. The upper
levels of MPATH’s pathophysiology branch denote pathological processes that underlie lesions and include six broad
domains: cell and tissue damage process, defective growth
and differentiation process, developmental process abnormalities, healing and repair process, immunopathology and
neoplasia. All pathological processes and entities can be
placed within these upper level domains, which will be familiar to all pathologists and are common to all amniotes.
MPATH contains 880 core pathology terms in an almost exclusively is_a hierarchy nine layers deep. Currently, almost 90% of the terms have textual definitions. Each class is in the mouse pathology namespace and is uniquely identified by a URI of the form http://purl.obolibrary.org/OBO/MPATH_n. The main ontology is available in both the OBO Flatfile Format and the
Web Ontology Language (OWL). MPATH is housed in a subversion repository and is made available via the OBO registry, BioPortal and the project’s website http://mpath.googlecode.com/.
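The practical benefit of an almost purely is_a hierarchy is that annotations made with very specific terms can be retrieved by queries at any broader level. A minimal sketch of this ancestor-closure query pattern follows; the term labels and IDs below are invented placeholders, not actual MPATH content:

```python
# Sketch of ancestor-closure queries over an is_a hierarchy.
# The terms below are illustrative placeholders, not real MPATH IDs.
IS_A = {
    "MPATH:t_adenoma": "MPATH:t_benign_neoplasm",
    "MPATH:t_benign_neoplasm": "MPATH:t_neoplasm",
    "MPATH:t_carcinoma": "MPATH:t_malignant_neoplasm",
    "MPATH:t_malignant_neoplasm": "MPATH:t_neoplasm",
}

def ancestors(term):
    """Return the set of all is_a ancestors of a term."""
    out = set()
    while term in IS_A:
        term = IS_A[term]
        out.add(term)
    return out

def annotated_with(records, query):
    """Select records annotated to the query class or any of its subclasses."""
    return [r for r in records if r["term"] == query or query in ancestors(r["term"])]

records = [
    {"mouse": "m1", "term": "MPATH:t_adenoma"},
    {"mouse": "m2", "term": "MPATH:t_carcinoma"},
]
# A query on the broad class retrieves both specific diagnoses.
hits = annotated_with(records, "MPATH:t_neoplasm")
```

This is what allows investigators to interrogate a dataset either on a very specific diagnosis or on a whole class of lesions, as described in the abstract.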
MPATH contains relationships and other logical axioms to
other ontologies such as the Gene Ontology (GO) (Ashburner
et al., 2000), Cell Type ontology (CL) (Bard et al., 2005)
and the Phenotype And Trait Ontology (PATO) (Gkoutos
et al., 2004). For example, the MPATH term transitional
cell metaplasia (MPATH:172) represents a metaplastic response of the transitional epithelium, for example in the
bladder to give squamous metaplasia and glandular metaplasia. To allow computational access to these relations, we use
the derives-from relation and relate metaplasia
(MPATH:549) (an MPATH term that denotes an abnormal
transformation of a differentiated adult cell or tissue of one
kind into a differentiated tissue of another kind) with the CL
term transitional epithelial cell (CL:0000244).
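Such cross-ontology links can be pictured as subject–relation–object triples. The sketch below encodes one plausible reading of the transitional cell metaplasia example from the text as plain data structures; it is an illustration, not the ontology's actual OBO or OWL serialization:

```python
# Cross-ontology axioms as (subject, relation, object) triples.
# The IDs follow the example in the text; the encoding is a sketch.
AXIOMS = [
    # transitional cell metaplasia is_a metaplasia (plausible placement)
    ("MPATH:172", "is_a", "MPATH:549"),
    # ... and derives-from the CL term transitional epithelial cell
    ("MPATH:172", "derives-from", "CL:0000244"),
]

def related(term, relation):
    """Return all terms linked from `term` by `relation`."""
    return [o for s, r, o in AXIOMS if s == term and r == relation]
```

A reasoner or a simple script can then follow `derives-from` edges to collect, say, all MPATH lesions that arise from a given cell type.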
Many tissue responses are common to multiple anatomical sites, and as far as possible the verbosity of specifying a particular response in multiple tissues has been avoided, with the additional topographical or anatomical information coming from an anatomy ontology, generally the MA (Hayamizu et al., 2005) or EMAP (Richardson et al., 2010) ontologies for the mouse. However, there is often an intrinsic anatomical element embedded in the term, or traditional pathology usage includes information about the cell type or tissue of origin. This is most frequent with the neoplasias, and we felt that such terms were best included in
their familiar form. Most descriptions are then post-composed from a combination of an MPATH term and an
anatomical (MA) or cell type (CL) (Bard et al., 2005) component.
In addition to the core terms it is important to describe organ-specific topography, distribution, microscopic character,
duration/chronicity, and severity. These qualifiers or modifiers are generally applicable across a wide range of organs
and lesions and so need to be coded separately from the core
terms to allow post-composition as required. The pattern we
have adopted is very close to that recommended by the
INHAND proposals and also includes “compound” terms
which lie beneath a definitive diagnosis or disease level of
description, but bundle defined sets of descriptive terms, for
example “nephropathy”, “alopecia”, “glomerulonephritis”, which are in common use and well understood. These qualifiers have been incorporated into PATO, and some examples are given in Table 1. The strategy for composing pathology descriptions in MPATH is summarised in Figure 1a and in an example illustrated in Figure 1b.
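The post-composition pattern just described can be sketched as a simple record that bundles a core MPATH term with an anatomical site and the PATO qualifiers. All labels below are illustrative stand-ins for real MPATH, MA and PATO identifiers:

```python
# Sketch of the post-composition pattern: one core MPATH term plus an
# anatomical site (MA) and PATO qualifiers for severity, duration and
# distribution. Labels stand in for real ontology identifiers.
def compose_observation(mpath_term, ma_site, severity, duration, distribution):
    """Bundle one pathology observation from its component ontology terms."""
    return {
        "lesion": mpath_term,          # core MPATH term, e.g. "inflammation"
        "site": ma_site,               # MA anatomy term, e.g. "lung"
        "severity": severity,          # PATO qualifier, e.g. "moderate"
        "duration": duration,          # PATO qualifier, e.g. "chronic"
        "distribution": distribution,  # PATO qualifier, e.g. "multifocal"
    }

obs = compose_observation("inflammation", "lung", "moderate", "chronic", "multifocal")
```

Because the qualifiers are coded separately from the core term, the same small qualifier vocabulary serves every organ and lesion class.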
2.3 Implementation of MPATH coding strategy
The strategy adopted was originally designed to describe
histopathology images for the Pathbase mouse pathology
database (Schofield et al., 2010), but lends itself readily to a
wide range of coding applications. The MPATH strategy
has been adopted by two major high-throughput studies. A
combination of MPATH and PATO is being used for the
capture of pathology data from the genome-wide mutant
mouse phenotyping project KOMP2, run as part of the International Mouse Phenotyping Consortium (Brown & Moore,
2012), where the MPATH approach is being used in the
primary phenotyping pipeline by the Toronto Centre for
Phenogenomics and other centres carrying out histopathology. MPATH has also been adopted for the MoDIS database (Sundberg et al., 2008) to capture and analyse pathology data from a massive aging study which has systematically phenotyped the 31 most important inbred mouse strains in current use.

Figure 1. Post-composition strategy: elements of the compound description are specified on the left-hand side of the figure, and specific examples are given for three observations which, taken together, are indicative of foreign body pneumonia.

Complete necropsies of mice were
carried out at 12 and 20 months of age (cross-sectional
study), and on moribund mice throughout their life span (longitudinal
study). Nearly 2000 mice were necropsied, generating more
than 50,000 slides (Sundberg et al., 2011). Lesion incidence
and severity data for all organs are now being applied in
highly successful GWAS studies of age-associated disease
(Sundberg et al., 2011).
MPATH has proved to be additionally useful in dealing
with the recoding of legacy data from non-standard nomenclatures, permitting integration of otherwise siloed data.
Examples are the ERA database (Tapio et al., 2008) and the Northwestern University Janus radiobiology database (http://janus.northwestern.edu/janus2/), which has coded 50,000 individual mouse records to MPATH to link the two datasets.
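Recoding legacy data of this kind typically rests on a curated mapping table from free-text legacy labels to MPATH classes. A minimal sketch, with invented labels and a placeholder target ID:

```python
# Sketch of recoding legacy free-text diagnoses to MPATH classes via a
# curated mapping table. The entries are invented for illustration.
LEGACY_TO_MPATH = {
    "lung tumour": "MPATH:lung_neoplasm_term",  # placeholder ID
    "lung tumor": "MPATH:lung_neoplasm_term",   # spelling variants map to one class
}

def recode(legacy_label):
    """Return the MPATH class for a legacy label, or None if unmapped.

    In practice unmapped labels would be queued for expert curation
    rather than silently dropped.
    """
    return LEGACY_TO_MPATH.get(legacy_label.strip().lower())
```

Once two archives are recoded to the same MPATH classes, the ancestor-closure queries described earlier work across both, which is what breaks down the data silos.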
2.4 MPATH as a core ontology for PATO-based logical definitions

The PATO framework was built with the intention of providing an integration platform for phenotype data between species and between data types (Gkoutos et al., 2004). According to the PATO framework, phenotype data can be described by utilising species-specific ontologies (such as the various anatomy ontologies) or species-agnostic ontologies such as GO, with the various qualities provided by the PATO ontology in order to describe affected entities in a phenotype manifestation. PATO can be used for annotation either directly, in a so-called post-composed (post-coordinated) manner, or for providing logical definitions (equivalence axioms) to ontologies containing a set of pre-composed (pre-coordinated) phenotype terms (Gkoutos et al., 2004; Gkoutos et al., 2009a; Gkoutos et al., 2009b; Mungall et al., 2009). For further discussion see (Schofield et al., 2012).

Rather than using a pre-composed phenotype ontology such as MPO (Smith & Eppig, 2009) or HPO (Robinson et al., 2008), phenotypes may be described using the Entity-Quality (EQ) formalism. In the EQ method, a phenotype is characterized by an affected Entity and a Quality (from PATO) that specifies how the entity is affected. The affected entity can either be a biological function or process such as specified in GO, or an anatomical entity. The phylogenetic conservation, at least within the amniotes, of most histopathologic lesions and processes makes MPATH an important core ontology for writing logical definitions, and we have used it extensively in defining classes in the major pre-composed phenotype ontologies. MPATH is also an important component ontology of our recently developed semantic approaches to comparative phenomics, PhenomeNET and MouseFinder (Chen et al., 2012; Hoehndorf et al., 2011).

Composition of logical definitions is a time-consuming task for which there are currently several approaches to automation using class label segmentation, entity recognition and lexical matching to core ontologies. This approach can be useful for suggesting definitions where the class label is a composite of, for example, anatomy and process (MA+GO). Automated decomposition of unilexical terms such as are found in the neoplasias is much more difficult, though approaches that text-mine definitions from other ontologies with lexically matching labels, such as NCIt (Sioutos et al., 2007), may be useful to expert curators in establishing simpler definitions for these classes.

Table 1. Examples of qualifiers now incorporated into PATO

Qualifier     PATO     Class name                Definition
Severity      0000461  normal                    no lesions
              0000394  mild                      lesion dependent; often size,
              0000395  moderate                  number and characteristics
              0000465  marked                    (applies to mild through severe)
              0000396  severe
Duration      0002387  per-acute                 extremely acute and aggressive
              0000389  acute                     beginning abruptly with marked intensity
              0002091  subacute                  between acute and chronic
              0001863  chronic                   slow progress and long continuance
              0002387  chronic-active            coexistence of chronic process and
                                                 superimposed acute process
Distribution  0000627  focal                     single well delineated lesion
              0002388  focally extensive         single lesion with expansion into
                                                 surrounding tissue
              0001791  multifocal                multiple lesions
              0002389  multifocal to coalescing  multiple lesions some interconnecting
                                                 with each other
              0000330  random                    no appreciable pattern
              0001566  diffuse                   not circumscribed or limited
              0000635  generalized               affecting all regions without
                                                 specificity of distribution
              0000634  unilateral                confined to one side only
              0000618  bilateral                 involving both sides
              0002389  segmental                 relating to a segment
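The EQ formalism described above can be pictured as a simple data structure pairing an affected entity (from an anatomy ontology or GO) with a PATO quality. The labels below are illustrative, not actual ontology identifiers:

```python
# Entity-Quality (EQ) sketch: a phenotype pairs an affected Entity with a
# PATO Quality. Labels are illustrative stand-ins for real term IDs.
def eq_phenotype(entity, entity_ontology, quality):
    """Describe a phenotype as an entity/quality pair."""
    return {"entity": entity, "ontology": entity_ontology, "quality": quality}

# e.g. an enlarged heart, post-composed rather than taken from a
# pre-composed ontology such as MPO or HPO:
p = eq_phenotype("heart", "MA", "increased size")
```

A logical definition for a pre-composed term works the same way in reverse: the pre-composed class is declared equivalent to such an entity/quality pair, with MPATH supplying the entity when the phenotype is a lesion.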
2.5 Future directions
Whilst MPATH was originally designed to support rodent, and particularly mouse, pathology, the extensive overlap with human clinical pathology means that most of the terms may be used in a human context and linked to the Foundational Model of Anatomy (FMA) (Rosse & Mejino, 2003) as the anatomy ontology. Extending MPATH to become a mammalian phenotype ontology encompassing human pathology is a major undertaking, but we have established that the current structure and upper level classes
would readily support the inclusion of human terminology.
Initially we will import terms for neoplasias from the
CINEAS codes (Central Information System for Hereditary
Diseases and Synonyms; http://www.cineas.org/; Prof Rolf
Sijmons, pers. comm.). SNOMED-CT, UMLS and ICD-O
v3 will be mined for terms not currently in MPATH which
relate to anatomic pathology. Terms already covered by
existing ontologies such as Disease Ontology (DO) (Schriml
et al., 2012) may be referenced using MIREOT (Courtot et
al., 2011). DO classifies diseases largely by anatomical site
and not by disease process or class, and overlaps only
slightly with MPATH as it is concerned with summative
diagnostic entities for the main part. For example, there is no “inflammation” superclass in DO for the tissue-specific inflammatory conditions it describes. Use of MPATH to construct logical definitions for DO classes would potentially
add a further dimension to the richness and applicability of
DO.
The power of the description of pathological lesions to
discriminate between diseases and therefore between models of human disease is substantial. We recently estimated
the information content (IC) of pre-composed MP ontology
terms used to code phenotypes in the EUMODIC mouse
phenotyping pipeline (Morgan et al., 2012) which included
or excluded anatomic pathology descriptions, using their
logical definitions. Pathology-related phenotypes were
shown to have a significantly greater discriminatory power
than other in vivo assays, strongly supporting the use of
these assays in the development of mouse models of human
diseases (Schofield et al. 2011).
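The information content used in that comparison follows the standard corpus-frequency definition, IC(t) = -log2 p(t). A minimal sketch with invented annotation counts:

```python
# Sketch of term information content: IC(t) = -log2(p(t)), where p(t) is
# the annotation frequency of term t in a corpus. Counts are invented.
import math

def information_content(term_counts, term):
    """IC of a term from raw annotation counts."""
    total = sum(term_counts.values())
    return -math.log2(term_counts[term] / total)

counts = {"pathology_term": 2, "common_term": 62}  # toy corpus: 64 annotations
# Rarer terms carry more information, hence more discriminatory power:
ic_rare = information_content(counts, "pathology_term")
ic_common = information_content(counts, "common_term")
```

The higher IC of rarely used pathology terms is exactly the sense in which pathology-related phenotypes discriminate better between diseases than more broadly applied assay terms.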
Further development and application of MPATH will inevitably depend on community engagement and we encourage anyone with an interest to provide feedback.
ACKNOWLEDGEMENTS
The authors would like to thank those who have contributed
to the development and application of MPATH over the
years. This work was funded by the European Commission (Contract QLRI-1999-00320), the Ellison Medical Foundation, and the National Institutes of Health (AG25707 for the Shock Aging Center, CA89713 and AR056635 to JPS, and HG004838-04 to PNS).
REFERENCES
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H.,
Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S.,
Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L.,
Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E.,
Ringwald, M., Rubin, G.M. & Sherlock, G. (2000). Gene
ontology: tool for the unification of biology. The Gene
Ontology Consortium. Nat Genet, 25, 25-9.
Bard, J., Rhee, S.Y. & Ashburner, M. (2005). An ontology for cell
types. Genome Biol, 6, R21.
Berndt, A., Cario, C.L., Silva, K.A., Kennedy, V.E., Harrison,
D.E., Paigen, B. & Sundberg, J.P. (2011). Identification
of fat4 and tsc22d1 as novel candidate genes for spontaneous pulmonary adenomas. Cancer Res, 71, 5779-91.
Brown, S.D. & Moore, M.W. (2012). Towards an encyclopaedia of
mammalian gene function: the International Mouse Phenotyping Consortium. Dis Model Mech, 5, 289-92.
Chen, C.K., Mungall, C.J., Gkoutos, G.V., Doelken, S.C., Kohler,
S., Ruef, B.J., Smith, C., Westerfield, M., Robinson,
P.N., Lewis, S.E., Schofield, P.N. & Smedley, D. (2012).
MouseFinder: Candidate disease genes from mouse phenotype data. Hum Mutat, 33, 858-66.
Cornet, R. & de Keizer, N. (2008). Forty years of SNOMED: a
literature review. BMC Med Inform Decis Mak, 8 Suppl
1, S2.
Courtot, M., Frank, G., Allyson, L.L., James, M., Daniel, S., Ryan,
R.B. & Alan, R. (2011). MIREOT: The minimum information to reference an external ontology term. Appl. Ontol., 6, 23-33.
Derksen, P.W., Liu, X., Saridin, F., van der Gulden, H., Zevenhoven, J., Evers, B., van Beijnum, J.R., Griffioen, A.W.,
Vink, J., Krimpenfort, P., Peterse, J.L., Cardiff, R.D.,
Berns, A. & Jonkers, J. (2006). Somatic inactivation of
E-cadherin and p53 in mice leads to metastatic lobular
mammary carcinoma through induction of anoikis resistance and angiogenesis. Cancer Cell, 10, 437-49.
Frazier, K.S., Seely, J.C., Hard, G.C., Betton, G., Burnett, R., Nakatsuji, S., Nishikawa, A., Durchfeld-Meyer, B. & Bube,
A. (2011). Proliferative and nonproliferative lesions of
the rat and mouse urinary system. Toxicol Pathol, 40,
14S-86S.
Gkoutos, G.V., Green, E.C.J., Mallon, A.-M., Hancock, J.M. &
Davidson, D. (2004). Building mouse phenotype ontologies. Pac. Symp. Biocomputing, 9, 178-189.
Gkoutos, G.V., Mungall, C., Dolken, S., Ashburner, M., Lewis, S.,
Hancock, J., Schofield, P., Kohler, S. & Robinson, P.N.
(2009a). Entity/quality-based logical definitions for the
human skeletal phenome using PATO. Conf Proc IEEE
Eng Med Biol Soc, 1, 7069-72.
Gkoutos, G.V., Mungall, C., Dolken, S., Ashburner, M., Lewis, S.,
Hancock, J., Schofield, P., Kohler, S. & Robinson, P.N.
(2009b). Entity/quality-based logical definitions for the
human skeletal phenome using PATO. Conf Proc IEEE
Eng Med Biol Soc, 2009, 7069-72.
Hayamizu, T.F., Mangan, M., Corradi, J.P., Kadin, J.A. & Ringwald, M. (2005). The adult mouse anatomical dictionary:
a tool for annotating and integrating data. Genome Biol,
6, R29.
Hoehndorf, R., Schofield, P.N. & Gkoutos, G.V. (2011). PhenomeNET: a whole-phenome approach to disease gene
discovery. Nucleic Acids Res, 39, e119.
Kaufmann, W., Bolon, B., Bradley, A., Butt, M., Czasch, S., Garman, R.H., George, C., Groters, S., Krinke, G., Little, P.,
McKay, J., Narama, I., Rao, D., Shibutani, M. & Sills, R.
(2011). Proliferative and nonproliferative lesions of the
rat and mouse central and peripheral nervous systems.
Toxicol Pathol, 40, 87S-157S.
Li, Q., Berndt, A., Guo, H., Sundberg, J.P. & Uitto, J. (2012). A
Novel Animal Model for Pseudoxanthoma Elasticum:
The KK/HlJ Mouse. Am J Pathol.
Mann, P.C., Vahle, J., Keenan, C.M., Baker, J.F., Bradley, A.E.,
Goodman, D.G., Harada, T., Herbert, R., Kaufmann, W.,
Kellner, R., Nolte, T., Rittinghausen, S. & Tanaka, T.
(2012). International harmonization of toxicologic pathology nomenclature: an overview and review of basic
principles. Toxicol Pathol, 40, 7S-13S.
Marks, C. (2009). Mouse Models of Human Cancers Consortium
(MMHCC) from the NCI. Dis Model Mech, 2, 111.
Meuwissen, R. & Berns, A. (2005). Mouse models for human lung
cancer. Genes Dev, 19, 643-64.
Morgan, H., Beck, T., Blake, A., Gates, H., Adams, N., Debouzy,
G., Leblanc, S., Lengger, C., Maier, H., Melvin, D.,
Meziane, H., Richardson, D., Wells, S., White, J., Wood,
J., de Angelis, M.H., Brown, S.D., Hancock, J.M. &
Mallon, A.M. (2012). EuroPhenome: a repository for
high-throughput mouse phenotyping data. Nucleic Acids
Res, 38, D577-85.
Mungall, C.J., Gkoutos, G.V., Smith, C.L., Haendel, M.A., Lewis,
S.E. & Ashburner, M. (2009). Integrating phenotype ontologies across multiple species. Genome Biol, 11, R2.
Oellrich, A., Hoehndorf, R., Gkoutos, G.V. & Rebholz-Schuhmann, D. (2012). Improving disease gene prioritization by comparing the semantic similarity of phenotypes in mice with those of human diseases. PLoS One,
7, e38937.
Renne, R., Brix, A., Harkema, J., Herbert, R., Kittel, B., Lewis, D.,
March, T., Nagano, K., Pino, M., Rittinghausen, S.,
Rosenbruch, M., Tellier, P. & Wohrmann, T. (2009).
Proliferative and nonproliferative lesions of the rat and
mouse respiratory tract. Toxicol Pathol, 37, 5S-73S.
Richardson, L., Venkataraman, S., Stevenson, P., Yang, Y., Burton, N., Rao, J., Fisher, M., Baldock, R.A., Davidson,
D.R. & Christiansen, J.H. (2010). EMAGE mouse embryo spatial gene expression database: 2010 update. Nucleic Acids Res, 38, D703-9.
Robinson, P.N., Kohler, S., Bauer, S., Seelow, D., Horn, D. &
Mundlos, S. (2008). The Human Phenotype Ontology: a
tool for annotating and analyzing human hereditary disease. Am J Hum Genet, 83, 610-5.
Rosse, C. & Mejino, J.L., Jr. (2003). A reference ontology for
biomedical informatics: the Foundational Model of
Anatomy. J Biomed Inform, 36, 478-500.
Scheuermann, R.H., Ceusters, W. & Smith, B. (2009). Toward an
ontological treatment of disease and diagnosis. Summit
on Translat Bioinforma, 2009, 116-20.
Schofield, P.N., Gruenberger, M. & Sundberg, J.P. (2010). Pathbase and the MPATH Ontology: Community Resources
for Mouse Histopathology. Vet Pathol, 47, 1016-20.
Schofield, P.N., Sundberg, J.P., Hoehndorf, R. & Gkoutos, G.V.
(2012). New approaches to the representation and analysis of phenotype knowledge in human diseases and their
animal models. Brief Funct Genomics, 10, 258-65.
Schofield, P.N., Vogel, P., Gkoutos, G.V. & Sundberg, J.P. (2011).
Exploring the elephant: histopathology in high-throughput phenotyping of mutant mice. Dis Model
Mech, 5, 19-25.
Schriml, L.M., Arze, C., Nadendla, S., Chang, Y.W., Mazaitis, M.,
Felix, V., Feng, G. & Kibbe, W.A. (2012). Disease Ontology: a backbone for disease semantic integration. Nucleic Acids Res, 40, D940-6.
Sioutos, N., de Coronado, S., Haber, M.W., Hartel, F.W., Shaiu, W.-L. & Wright, L.W. (2007). NCI Thesaurus: a semantic model integrating cancer-related clinical and molecular information. J Biomed Inform, 40, 30-43.
Smith, C.L. & Eppig, J.T. (2009). The mammalian phenotype ontology: enabling robust annotation and comparative
analysis. Wiley Interdiscip Rev Syst Biol Med, 1, 390-9.
Sundberg, J., Berndt, A., Sundberg, B., Silva, K.A., Kennedy, V., Bronson, R., Yuan, R., Paigen, B., Harrison, D. & Schofield, P.N. (2011). The mouse as a model for understanding chronic diseases of aging: the histopathologic basis of aging in inbred mice. Pathobiology of
Aging & Age-related Diseases, 1, 71719.
Sundberg, J.P. & Schofield, P.N. (2009). One medicine, one pathology, and the one health concept. J Am Vet Med Assoc, 234, 1530-1.
Sundberg, J.P., Sundberg, B.A. & Schofield, P. (2008). Integrating
mouse anatomy and pathology ontologies into a phenotyping database: tools for data capture and training.
Mamm Genome, 19, 413-9.
Tapio, S., Schofield, P.N., Adelmann, C., Atkinson, M.J., Bard,
J.L., Bijwaard, H., Birschwilks, M., Dubus, P., Fiette, L.,
Gerber, G., Gruenberger, M., Quintanilla-Martinez, L.,
Rozell, B., Saigusa, S., Warren, M., Watson, C.R. &
Grosche, B. (2008). Progress in updating the European
Radiobiology Archives. Int J Radiat Biol, 84, 930-6.
Thoolen, B., Maronpot, R.R., Harada, T., Nyska, A., Rousseaux,
C., Nolte, T., Malarkey, D.E., Kaufmann, W., Kuttler,
K., Deschl, U., Nakae, D., Gregson, R., Vinlove, M.P.,
The mouse pathology ontology, MPATH; structure and applications
Brix, A.E., Singh, B., Belpoggi, F. & Ward, J.M. (2010).
Proliferative and nonproliferative lesions of the rat and
mouse hepatobiliary system. Toxicol Pathol, 38, 5S-81S.
Washington, N.L., Haendel, M.A., Mungall, C.J., Ashburner, M.,
Westerfield, M. & Lewis, S.E. (2009). Linking human
diseases to animal models using ontology-based phenotype annotation. PLoS Biol, 7, e1000247.
World Health Organisation. (2008). International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10). WHO: Geneva.
Functions, roles and dispositions revisited.
A new classification of realizables
Johannes Röhl, Ludger Jansen
Institute of Philosophy, University of Rostock, Germany
ABSTRACT
The concept of a function is central both to biology and to technology. But there is an intricate debate about how functions and related entities such as dispositions and roles are to be represented in top-level ontologies and how they are to be related. We review important philosophical accounts and ontological models of functions and roles and discuss three models for the relation between functions and dispositions. We conclude that, mainly because of the need to account for malfunctioning, functions should be treated not as a subtype of dispositions, but as their sibling category.
1 INTRODUCTION

The concept of a function is central to biology as well as to psychology, technology and engineering. However, realizable entities like functions, dispositions and roles are notoriously difficult to understand, and there is no consensus on how to model them within a top-level ontology. Among other things, this is witnessed by the Basic Formal Ontology (BFO; http://www.ifomis.org/bfo). BFO versions up to 1.1.1 contained these three categories as immediate children of the category Realizable, which were jointly exhaustive and pairwise disjoint. However, in the transition to the new version BFO 2 it is planned to position Function as a subtype of Disposition (Table 1).

In order to shed new light on this problem, this paper discusses the relation of functions to plans (§ 2) and the relation of functions to dispositions (§ 3).

2 FUNCTIONS AND PLANS

2.1 Design functions of artifacts

In ancient Latin, the word “functio” was used to describe the duties of certain positions: It was the functio of, say, a quaestor to raise taxes. This was what being a quaestor was about. In general, not only official positions but all things created by humans could be ascribed the use for which they were created as their function or, more precisely, their design function. A screwdriver can be said to have the design function to drive screws because it is produced with the plan to be used for this purpose. A design function, then, is not a property that inheres in the functional artifact, but it is the content of an ascription by an agent or a group of agents involving a plan about the future use of this artifact (or of artifacts of this type).

Table 1: Definitions of the children of realizables in BFO 1.1.1 [1] and BFO 2 (‘Graz release’) [2]

Disposition (BFO 1.1.1): A realizable entity that essentially causes a specific process or transformation in the object in which it inheres, under specific circumstances and in conjunction with the laws of nature. A general formula for dispositions is: X (object) has the disposition D to (transform, initiate a process) R under conditions C.

Disposition (BFO 2): b is a disposition means: b is a realizable entity & b’s bearer is some material entity & b is such that if it ceases to exist, then its bearer is physically changed, & b’s realization occurs when and because this bearer is in some special physical circumstances, & this realization occurs in virtue of the bearer’s physical make-up.

Function (BFO 1.1.1): A realizable entity the manifestation of which is an essentially end-directed activity of a continuant entity in virtue of that continuant entity being a specific kind of entity in the kind or kinds of contexts that it is made for.

Function (BFO 2): A function is a disposition that exists in virtue of the bearer’s physical make-up and this physical make-up is something the bearer possesses because it came into being, either through evolution (in the case of natural biological entities) or through intentional design (in the case of artifacts), in order to realize processes of a certain sort.

Role (BFO 1.1.1): A realizable entity the manifestation of which brings about some result or end that is not essential to a continuant in virtue of the kind of thing that it is but that can be served or participated in by that kind of continuant in some kinds of natural, social or institutional contexts.

Role (BFO 2): b is a role means: b is a realizable entity & b exists because there is some single bearer that is in some special physical, social, or institutional set of circumstances in which this bearer does not have to be & b is not such that, if it ceases to exist, then the physical make-up of the bearer is thereby changed.

[1] http://jowl.ontologyonline.org/bfo.html (last access 12.09.2012).
[2] http://bfo.googlecode.com/svn/releases/2012-07-20-graz/owlgroup/bfo.owl (last access 12.09.2012).
They are made for a certain purpose, their function. Let us
call this the planning account of design function. On this
account, the truth-maker of a function ascription is a plan.
From the point of view of the planning account, we can
say the following about (design) functions:
• Design functions are grounded in designers’ function
ascriptions.
• Designers can decide on functions independently of
the physical structure of artifacts.
• Thus, independently of its physical structure and its dispositions, an artifact can have any function.
• It will, though, not be able to realize its function
unless it possesses a corresponding disposition to do
so.
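The separation of plan-based function ascription from disposition-based realizability can be sketched in a toy data model. This is our own hypothetical illustration in Python; all class and attribute names are invented and are not part of BFO or any ontology implementation:

```python
# Hypothetical sketch: a design function is an external ascription,
# while dispositions are internally grounded in the artifact.
# Realization of the function requires a corresponding disposition.

class Artifact:
    def __init__(self, name, dispositions):
        self.name = name
        self.dispositions = set(dispositions)  # internally grounded
        self.design_function = None            # externally ascribed

    def ascribe_function(self, function):
        # Designers can ascribe any function, regardless of structure.
        self.design_function = function

    def can_realize_function(self):
        # The ascription succeeds in any case; realization does not.
        return self.design_function in self.dispositions

screwdriver = Artifact("screwdriver", {"drive screws", "open paint cans"})
screwdriver.ascribe_function("drive screws")
assert screwdriver.can_realize_function()

# A failed design: the function is ascribed, but no disposition backs it.
perpetuum = Artifact("perpetuum mobile", set())
perpetuum.ascribe_function("run forever")
assert not perpetuum.can_realize_function()
```

The sketch mirrors the bullets above: the ascription never fails, but without a matching disposition the function cannot be realized.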
2.2 Use functions of artifacts

The planning account can easily be modified in order to deal with use functions of artifacts. Use functions are directed at those activities that users actually use things for. (Cf. (Mizoguchi et al. 2012, 110) for the distinction of design function and use function and (Preston 2003) for problems of this distinction.) If I use my screwdriver to open my paint cans, it has the use function to open paint cans. It has not been produced for this purpose, hence the use function can differ from the design function, though the two may coincide. Moreover, one and the same thing can have many different use functions on different occasions. This account can be extended to biomedical entities. If I use digitalis to kill my wife, I have a certain action plan that involves the participation of both a sample of digitalis and my wife, with a certain intended outcome.

2.3 The problem of biological functions

In pre-Darwinian biology, organisms and their parts were described as if they, too, were something created – either, allegorically speaking, by a personified Nature, or by God as a creator. In the latter case, ascribing functions to biological entities could be conceived of as reading the mind of God before the act of creation and as a reconstruction of the reasoning underlying his creation. From this point of view, the planning account can be extended to biological functions: The truth-maker of the ascription of a biological function is, in this framework, God’s plan for his creation. In the former case we have something like an as-if parlance, which can be found, e.g., in Aristotle: Although Aristotle rejected the idea that the universe or life had a beginning in time, he often says that Nature has well organized its creatures. We suggest reading this as a fictionalist account of biological function, meaning: Were this plant or animal brought about by Mother Nature (a very intelligent designer), she would have done so for good reason. Hence, the planning account can be upheld with a small modification: The truth-maker of the ascription of a biological function no longer is an actual plan, but a plan within the fiction of Mother Nature designing its creatures.

Both accounts show that pre-Darwinian biology had no problem with applying a variant of the planning account to biological functions. While biological entities are still treated as functional wholes, the planning account is no longer viable for the post-Darwinian biology of today. But then the problem arises: What is the ground for a biological function ascription? And this comes down to: What is a biological function?

2.4 Philosophical accounts of biological functions

The concept of a function has been controversially discussed in the philosophy of biology and elsewhere, and several accounts have been proposed to illuminate the nature of biological functions (cf. (Ariew et al. 2002) and (Krohs/Kroes 2009) for recent contributions to this debate, cf. also (Johansson et al. 2005) and (Burek et al. 2006) for an elaborate formal ontology of functions). Among the different approaches to functions the most straightforward is the causal role analysis: That X has function F simply means that X has the disposition to causally contribute to some output O of a complex system S ((Cummins 1975) as characterized by (Boorse 2002); see there for further references).

A well-known problem is that this account is extremely broad and admits many unintuitive functions: It implies, e.g., that clouds should be ascribed the function to produce rain, because they undoubtedly have a central causal role in the production of rain (more examples and further criticism in (Boorse 2002)). To avoid this broadness, further conditions have to be added in order to narrow down the possible functions of a thing. Intuitively, functions are connected either with some intention, as in artifactual functions, or with a (not necessarily intended or conscious) goal, as in biological functions. A formal characterization of such a goal-contribution approach has been suggested by (Röhl 2012). However, many philosophers of science see even such “deflationary” teleological accounts as untenable in post-Darwinian biology.

As an alternative, etiological accounts of biological functions have been suggested (Wright 1973). Instead of looking forward to a goal, a function is taken to be dependent on the history of a type of biological entity exemplifying the function, i.e. the development by evolutionary selection that accounts for its existence in the first place. However, according to the etiological account, in the first generation of a biological type there cannot be any functions for definitional reasons, although the actual structure of the organs would be “functional”, which is very counterintuitive. Moreover, a certain body part may acquire new uses and functions during the evolutionary history of a species, while the early history
and hence the reason for its existence remain the same (Boorse 2002: 66).

In face of these difficulties, Boorse’s “general goal-contribution” approach can be considered the “minimal core” of the concept of a function, with system S, system part X and goal G: “X performs function Z in the G-ing of S at t if and only if at t, the Z-ing of X is a causal contribution to G” (Boorse 2002: 70). This is still weak, because functions could be performed only once and accidentally and fulfill this definition. Boorse relies on distinctions like the normal function of a type as opposed to accidental functions or deviations of single tokens to avoid those counterintuitive accidental functions.

The concept of a systemic function as suggested by Mizoguchi et al. (2012: 109) is very similar to this account. With the introduction of a “systemic context” C in addition to a system S and a system part A they define: “C is a systemic context for S and according to C, A is a component of a subsystem of S, the goal of this subsystem is to realize the goal of C, and some behaviors of A play the (functional) role determined by C.” This can be applied both to biological and to technical functions. A systemic context C for the human liver (= A) would be the human digestive system (goal: digestion of food and extraction of nutrients), and within the subsystem of fat digestion the function of the liver is the production of bile. The point is that the systemic function is context-dependent in a very specific way, i.e. via a system its bearer is part of. Hence the function of a thing can change depending on the system it is a component of.

2.5 Functions in BFO

In the older versions of BFO, functions, dispositions and roles are sibling subclasses of the class “realizable dependent continuant” (Arp/Smith 2008). This common superclass implies that all three are:
• continuants, i.e. they are wholly present at any time of their existence;
• dependent (like qualities) on the independent continuant (some material thing or system) that is their bearer;
• realizable, which means that they are essentially connected to certain processes, their realizations, and
• such that their bearers participate in the realization processes. These are the processes they are roles, dispositions or functions for.

Note that realizables do not have to be (always or ever) realized (Röhl/Jansen 2011), as, e.g., in the case of a safety mechanism the function of which will only be realized if certain conditions obtain (and they may never obtain).

For the specific differences between functions, roles and dispositions, Arp and Smith draw on elements of the debates sketched above (Table 2). The realizations of functions and dispositions take place “in virtue of the bearer’s physical makeup”, whereas the realizations of roles are not grounded in the physical structure of the bearer, but dependent on circumstances. Functions are distinguished from dispositions by the additional condition that the function bearer possesses the physical structure that grounds the function because of how it came to be there in the first place: in the case of artifactual functions by intentional design and production, in the case of biological functions by a history of evolutionary selection. So BFO-roles are closer to optional, accidental “use functions”, whereas BFO-functions and BFO-dispositions are determined by their causally relevant internal structure and thus close to the goal-contribution account, as it is said that the realization of a (biological) function “helps to realize the characteristic physiology and life pattern for an organism of the relevant type” (Arp/Smith 2008: 2). The specific difference of dispositions and functions is again the historical (evolutionary) or the intentional (design) component, respectively. Similarly, these intentional and historical criteria are used in BFO 2 as the specific difference of functions as opposed to non-functional dispositions.

Table 2: Function, role, and disposition in BFO

                          Disposition   Artifact function   Biological function   Role
Grounding                 internal      internal            internal              external
Modal status              mixed         essential?         essential?            accidental
Relevance of history      no            yes                 yes                   yes
Dependent on intentions   no            yes                 no                    yes

3 FUNCTIONS AND DISPOSITIONS

3.1 Dispositions

A common philosophical position takes dispositions as a type of properties (Ellis/Lierse 1994). A disposition is a causal property that is linked to a realization, i.e. to a specific behaviour or process which the individual that bears the disposition will show under certain circumstances or as a response to a certain trigger. Something is water-soluble if it dissolves when put in water. In this fashion dispositions establish a link between (independent) continuants (stable things) and occurrents (processes), and the fundamental connection is the following: Continuant S has disposition D for realization P and, in case P occurs, S, the bearer of the disposition, is a participant of this process P. Dispositions are often treated as a special kind
of dependent continuants that are linked to a process of realization by a respective formal relationship to a realization process (Arp/Smith 2008), (Röhl/Jansen 2011), (Jansen 2007). Note that the terminology is often confusing, as dispositions and functions are sometimes named according to their realization processes. In the Gene Ontology (http://www.geneontology.org), e.g., many subclasses of the Molecular Function Ontology are described by the term “activity” (as in “bioactive lipid receptor activity”), which, paradoxically, does not denote the actual acting (process). In this paper, function, disposition and role will be held strictly distinct from the processes that are their realizations.

3.2 Are functions dispositions?

The central difference between dispositions on the one hand, and functions and roles on the other, lies in their context-dependence. Continuants may lose or acquire dispositions, but not without fundamental changes within the bearer. In contrast, many functions can be performed by different types of bearers, and an object may have different functions in different contexts without any change in itself. Chopsticks, for example, have the function to support eating. Similar sticks found in the woods do not have any such function, though they may have the very same physical structure and hence the same dispositions. Dispositions, that is, are purely internally grounded, while the function of the chopsticks is a historical property due to the way this artifact has been produced. On the other end of the spectrum, social functions and roles are context-dependent or “externally grounded” in the respective context. Biological functions, like those of organs, enzymes etc., are somewhere in between, with an entity usually fulfilling several functions in a certain range of contexts. They are objective systemic functions in the sense mentioned above and not merely ascribed by an agent; their context-dependence is fixed by the functional hierarchy of the respective physiological system. An organ like the liver has many functions, like production of bile, glycogen storage, cholesterol synthesis etc. But all these are fixed by the respective physiological systems the liver and its products are functionally involved in. They are not as arbitrary or flexible as the screwdriver that could serve the use functions (i.e. roles) of a can opener or a weapon.

Malfunctioning is (i) having a function but (ii) not living up to it. In a case of malfunctioning, the realization of the function either does not happen at all or only in an insufficient way. Technically speaking, the output of the thing or system is not in the standard or target range (Del Frate 2012). If the realization can be measured quantitatively, we can distinguish between hypofunction and hyperfunction, that is, staying below a lower threshold or exceeding an upper limit for the output parameter of the function. This occurs frequently in biology and medicine, and hypo- or hyperfunctions are often disorders, e.g., hypotension and hypertension are disorders of the cardiovascular system with regard to the parameter blood pressure. This definition of malfunction presupposes the very possibility of a function being present without the corresponding disposition. Hence they cannot be identical.

Malfunctioning is therefore a clear indicator of the normative dimension of functions. More generally, we can make value judgements about artifacts or body parts with respect to their function: Something may be a bad saw or a good heart. This normative aspect of functions is in general not shared by dispositions. Dispositions can be blocked or incompletely realized, but their bearers are not evaluated in a normative fashion. Therefore it does not seem to be appropriate to classify functions as dispositions. E.g., a lung with a carcinoma may still have the function to serve as an oxygen provider for the body, but the function may no longer be realized because the corresponding disposition (to be able to serve as an oxygen provider for the body) is no longer present. According to Vermaas/Houkes (2003), any theory of function has to give an account of the normative aspect of functions and of the possibility of malfunctions. Now there seem to be at least two distinct ways to interpret this case:

(1) The token lung has lost the disposition and, because functions (a) are dispositions or (b) are dependent on them, it has also lost the function. Malfunction for a token then means that this token has simply lost the function, rather than that it still has the function without being able to realize it. But we can speak of malfunction rather than nonfunction, because the type lung typically has both the disposition and the corresponding function to provide oxygen (Boorse 2002: 89; McLaughlin 2009). As McLaughlin points out, being a token of a type involves an evaluative dimension. Other possible sources of the normativity of functions are a means-end relationship or a hierarchical part-whole relation (McLaughlin 2009). These aspects come together in a systemic goal-contribution account, because here the functioning and the malfunctioning of a part in the functional hierarchy is evaluated with respect to its working as a means to the end/goal of the whole.

(2) The function is ontologically independent of the disposition. The function of a lung as a normative ascription is still there, but because the corresponding disposition is not, there is malfunction. The task of medicine would then be to restore the disposition so that the organ would be (fully) functional again.

While there is no clear-cut argument to decide between these two options, option (2) is better suited to account for
healing processes. For according to (2), a healing process consists in restoring a disposition where there is a function without its corresponding disposition. According to (1), however, there cannot be functions without a corresponding disposition (e.g., because functions are these dispositions). Thus a healing process according to (1) consists in restoring the very function.

3.3 Do functions depend on dispositions?

For all of these reasons, we should assume that functions are not identical to dispositions. Nevertheless, even if they are distinct entities, functions could ontologically depend on dispositions. On the planning account, functions of artifacts are clearly independent of the dispositions of their bearers, as due to the fallibility of human designers the one could occur without the other. From the point of view of the theistic extension of the planning account to biological functions, however, the existence of a function implies the existence of the corresponding disposition in typical cases, given the usual assumptions of God’s omniscience and benevolence. But even if a biological function is typically accompanied by a disposition, this concurrence is not universal, as proved by malfunctioning.

The dispositions as part of the internal structure of a thing determine whether it can fulfill the respective function in a given context. Johansson (2004) calls this the “substratum” of a function. While the function itself is independent of its substratum, its realization depends on the substratum’s existence. This dependence can be a generic one, because sometimes different dispositions or structures can ground the same function: E.g., the cooling function of a cooler can be implemented in different technical setups (Johansson 2004, 66).

As we know biological functions only through their actual realizations, we would have no reason to ascribe them, unless instances of a certain kind typically displayed that behaviour and, a fortiori, possessed a corresponding substratum disposition. How would we know the biological function of, say, a heart, if hearts did not typically have the disposition to pump blood and did not typically realize this function? So there should be some connection between the function and the disposition of the organ.

On the other hand, many diseases like heart insufficiency are characterized by the very contrast between functions and the lack of corresponding dispositions, and so is malfunctioning in general. Malfunctioning artifacts or diseased organs are characterised by the loss of the disposition to fulfill their function. We conclude that the corresponding disposition is only necessary for the realization of a function, not for the function itself. Because in biological (and many artifactual) cases we can evaluate the performance of token functions with respect to what is a normal realization for the function type, and because the normal realization is dependent on the corresponding disposition, we have a correspondence of function and disposition at the type level or for prototypical tokens. But this is to be distinguished from a token-level dependence of the function on the corresponding disposition. If we want to accommodate malfunctioning, we have to reject the latter.

4 CONCLUSION

We can summarize the discussion by suggesting a new classification of realizables. It concurs with BFO 1.1.1 in treating functions as siblings of dispositions rather than as special dispositions. It makes use of two independent criteria: essence optionality and structure optionality. A realizable can be optional given a certain physical structure of its bearer. All realizables that are externally grounded, i.e. in some context, are optional in this sense, e.g. all roles. In contrast, dispositions are internally grounded, based on the bearer’s physical structure, and therefore not optional given the bearer’s physical structure. Because a bearer can gain and lose dispositions, some dispositions are also optional. But what is at stake here is optionality given the essence of the bearer. If a disposition is optional in this sense, a bearer can lose it without ceasing to be. However, not all dispositions are optional given the essence of the bearer. Some dispositions, like the disposition of a proton to attract electrons, are essential: Losing this disposition would imply that the proton ceases to be a proton, i.e. that it ceases to exist. Functions are essential in this sense: Given the essence of being a heart, it is not optional to have the function to pump blood. And given the essence of being a screwdriver, it is not optional to have the function to manipulate screws. We thus end up with a cross-classification of realizables, presented in Table 3.

Table 3: A new cross-classification of realizables

                            Internally grounded        Externally grounded
                            (= non-optional given      (= optional given
                            the physical structure)    the physical structure)
Essential
(= non-optional
given the essence)          Essential disposition      Function

Accidental
(= optional given
the essence)                Accidental disposition     Role

Functions are externally grounded. We argued that there are good arguments not to treat functions as dispositions,
nor to make functions dependent on dispositions. This distinction is our central disagreement with the BFO 2 suggestion discussed above. We also define roles in a rather narrow way, different from Mizoguchi et al. (2012) and other papers (for a very wide notion of role cf. Loebe 2007): On our account, roles are never essential for their bearers. The way of speaking that would assign an essential “breather” or “eater” role to a human being is not to be taken ontologically seriously. Breathing and eating are processes, not functions or roles. Humans have to participate in breathing and eating processes on a regular basis, but it is not their role to breathe. Therefore, in our classification scheme, what are called “use functions” in the literature (cf. 2.2 above) are roles, in agreement with the BFO conception of roles.

All in all, we discussed three different models for the ontological analysis of functions:
• The planning account was able to deal with design and use functions of artifacts, but not with biological functions.
• Equating functions with dispositions leads to problems with malfunctions.
• Only treating functions as a sibling category of dispositions was able to circumvent these problems.

On this latter account, functions are not only disjoint from dispositions, they are also ontologically independent of dispositions. Functions are, however, normally accompanied by corresponding dispositions. This is the reason why it is so difficult to distinguish between these categories. Malfunctioning, however, requires them to be distinct categories: It occurs when the corresponding disposition is lacking.

ACKNOWLEDGMENTS

This work was supported by DFG grant JA 1904/2-1 within the project GoodOD. Many thanks to Andrew Spear, who provided us with a recent draft of the follow-up version of (Spear 2006), and to three anonymous referees for critical and helpful comments.

REFERENCES

Ariew, A./Cummins, R./Perlman, M. (eds.) (2002): Functions. New Essays in the Philosophy of Psychology and Biology, Oxford.
Arp, R./Smith, B. (2008): Function, Role, and Disposition in Basic Formal Ontology, Proceedings of Bio-Ontologies Workshop (ISMB 2008), 45-48.
Boorse, C. (2002): A Rebuttal on Functions, in Ariew et al. (eds.) 2002, 63-112.
Burek, P./Hoehndorf, R./Loebe, F./Visagie, J./Herre, H./Kelso, J. (2006): A top-level ontology of functions and its application in the open biomedical ontologies, Bioinformatics 22/14, e66-e73.
Cummins, R. (1975): Functional analysis, Journal of Philosophy 72, 741-765.
Del Frate, L. (2012): Preliminaries to a Formal Ontology of Failure in Engineering Artifacts, in Donnelly, M./Guizzardi, G. (eds.): Formal Ontology in Information Systems (FOIS 2012), IOS Press, 107-130.
Ellis, B./Lierse, C. (1994): Dispositional Essentialism, Australasian Journal of Philosophy 72, 27-45.
Jansen, L. (2007): Tendencies and other Realizables in Medical Information Sciences, The Monist 90/4, 534-555.
Johansson, I. (2004): Ontological Investigations, Heusenstamm.
Johansson, I./Smith, B./Munn, K./Tsikolia, N./Elsner, K./Ernst, D./Siebert, D. (2005): Functional Anatomy: A Taxonomic Proposal, Acta Biotheoretica 53, 153-166.
Krohs, U./Kroes, P. (eds.) (2009): Functions in biological and artificial worlds. Comparative philosophical perspectives, Cambridge, Mass.
Loebe, F. (2007): Abstract vs. social roles. Towards a general theoretical account of roles, Applied Ontology 2/2, 127-158.
McLaughlin, P. (2009): Functions and Norms, in Krohs/Kroes (eds.) 2009, 93-102.
Mizoguchi, R./Kitamura, Y./Borgo, S. (2012): Towards a Unified Definition of Function, in Donnelly, M./Guizzardi, G. (eds.): Formal Ontology in Information Systems (FOIS 2012), IOS Press, 103-116.
Preston, B. (2003): Of Marigold beer: a reply to Vermaas and Houkes, British Journal for the Philosophy of Science 54, 601-612.
Röhl, J. (2012): Mechanisms in biomedical ontology, Journal of Biomedical Semantics 3 (Suppl 2), S9.
Röhl, J./Jansen, L. (2011): Representing dispositions, Journal of Biomedical Semantics 2 (Suppl 4), S4.
Smith, B., et al. (2005): Relations in Biomedical Ontologies, Genome Biology 6, R46.
Spear, A. D. (2006): Ontology for the 21st Century. An Introduction with Recommendations (BFO Manual), http://www.ifomis.org/bfo/documents/manual.pdf.
Vermaas, P. E. (2009): On Unification: Taking Technical Functions as Objective (and Biological Functions as Subjective), in Krohs/Kroes (eds.) 2009, 70-87.
Vermaas, P./Houkes, W. (2003): Ascribing Functions to Technical Artefacts: A Challenge to Etiological Accounts of Functions, British Journal for the Philosophy of Science 54, 261-289.
Wright, L. (1973): Functions, The Philosophical Review 82, 139-168.
Function of Bio-Models: Linking Structure to Behaviour
Clemens Beckstein, Christian Knüpfer
Artificial Intelligence Group, University of Jena, Germany
ABSTRACT
Computational models in Systems Biology are used in simulation
experiments for addressing biological questions. Generally, these
questions are causal questions asking for mechanistic explanations of
biological phenomena. An epistemological analysis of the role and use
of models in Systems Biology is an important prerequisite for computer
support for answering these questions.
The notion of biological function can also be applied to computational models in Systems Biology. Models play a specific role
(teleological function) in the scientific process of finding explanations
for dynamical phenomena. In order to fulfil this role a model has to be
used in simulation experiments (pragmatical function). A simulation
experiment always refers to a specific situation and a state of the
model and the modelled system (conditional function).
We claim that the function of computational models refers to both
the simulation experiment executed by software (intrinsic function)
and the biological experiment which produces the phenomena under
investigation (extrinsic function).
In this paper we describe the different functional aspects of computational models in Systems Biology. This description is conceptually
embedded in our “meaning facets” framework which systematises the
interpretation of models in structural, functional and behavioural facets.
Briefly, function links the structure and the behaviour of a model. A thorough analysis of the function of bio-models therefore
is an important first step towards a comprehensive formal representation of the knowledge involved in the modelling process. Any progress
in this area will in turn improve computer-supported modelling. In this
paper we use our conceptual framework to review formal accounts
for functional aspects of models, such as checklists, ontologies, and
formal languages and outline a strategy for developing an ontology for
describing the intention of bio-models.
1 INTRODUCTION
Models in Systems Biology have two essential features: They are
dynamic and computational. “Dynamic” refers to the fact that the
models are used in simulation experiments in order to generate
temporal behaviours. “Computational” means that the simulation
experiments are executed by software. One conclusion from the latter
feature is that the models have to be encoded in some computer-understandable format. For short, we call a computational dynamic model of a biological system a “bio-model”.
Generally, a mathematical model establishes a relation between a
system under observation (what Rosen calls the “natural system”)
and a formal system. Rosen calls this the “Modeling Relation” [1]. In
order to be useful the modelling relation has to be an isomorphism:
the structure of the formal system can be mapped onto the structure
of the natural system.
The notion of “function” as used in physiology incorporates two
aspects of a biological entity: (1) The function states a role of an
entity played as a component of an encompassing process. The
biological function is therefore tied to a specific process. (2) The
function characterises the behaviour which the entity has to exhibit
for fulfilling its role. Biological function links system structure (the
entities and relations) to behaviour. The most famous example is the
function of genes which links the genotype to the phenotype.
The notion of function can also be applied to bio-models: The
function of a bio-model describes its role in the scientific process of
finding mechanistic explanations for biological phenomena. Beside
these teleological aspects of a model’s function there are pragmatical
aspects: The function of a bio-model also describes the use of the
model in simulation experiments for generating behaviours. Thus,
function links a model (structure) to its behaviours. To summarise,
the function of bio-models describes why and how to use models in
simulation experiments.
In this paper we investigate the function of bio-models (Sec. 3) and
whether and how the function can be formalised (Sec. 4). We claim
that the function of a bio-model links its structure to its behaviour.
Before we describe the function of a bio-model in detail we will
introduce the “meaning facets” (Sec. 2) which provide a framework
to systematically describe the knowledge involved in creating and
using bio-models.
2 MEANING FACETS OF BIO-MODELS
In our analysis of the modelling process in Systems Biology we make
two important observations about bio-models [2]. First, a computational model has a dual interpretation: In order to be used in computer
simulations the encoded model has to be intrinsically interpreted with
respect to the encoding format used. This can be done without referring to the modelled natural system. Furthermore, a model has to
be related to the natural system (cf. Rosen’s modelling relation [1]
mentioned above), i.e. it has to be extrinsically interpreted. Second,
dynamic models are considered on three levels: Models are systems
of components and relations (model structure). Models are used in
simulation experiments for answering biological questions (model
function). Models exhibit temporal changes (model behaviour). The
three levels of bio-models are inspired by functional modelling in
engineering (see, e.g., [3]).
A complete description of a bio-model has to encompass all six
“meaning facets” [2], i.e. the intrinsic and extrinsic sides on each of
the three levels. If we look at the scientific process of modelling a
biological system in order to explain data observed in experiments
we can clearly identify the three levels and the dual interpretation
(see Fig. 1). More details about the six meaning facets can be found in [4]. However, in this paper we will focus on the functional facets of
bio-models.
3 FUNCTION OF BIO-MODELS
A bio-model has an intrinsic and an extrinsic function (Sec. 2) which
can be understood in teleological and in pragmatical terms (Sec. 1).
There is a third perspective on the function of bio-models, which
we call “conditional”: The role played by an entity depends on the
[Figure 1: diagram relating the intrinsic side (computer: model, simulation, results; competence, intention, performance, dynamics) and the extrinsic side (reality: target system, experiment, data; modelling, explanation) across the structure, function, and behaviour levels]
Figure 1. Structure (blue/left), function (yellow/middle), and behaviour (green/right) of a bio-model. The model relates the (intrinsic) computer representation
with the (extrinsic) biological reality. (1) Structure: The biological target system is transferred into a model which can in turn be intrinsically interpreted as a
formal system. This establishes a modelling relation between the two systems. If there is a valid mapping between the components of the target system and the
formal system, we call the model a competence model. (2) Function: The intention of the model is its use in simulation experiments for explaining biological
phenomena observed in biological experiments. (3) Behaviour: The simulation experiments produce results which can be interpreted as the dynamics of the model. These dynamics can be related to the interpreted data of the biological experiments. If the behaviour of the model is similar to the behaviour of the biological system, we call the model a performance model with respect to the corresponding biological phenomena. Explanation uses a competence model in a simulation experiment, which makes it a performance model with respect to the biological phenomena to be explained.
context; in different situations or under different conditions an entity
may have different roles.
3.1 Intended Use (Teleological Function)
Bio-models are used in simulation experiments in order to answer
questions about the biological system under investigation. The questions may regard the explanation of observed behaviours or the
prediction of possible behaviours. What is an accepted explanation
or prediction depends on the scientific field and community [5]. Furthermore, specific assumptions restrict the permitted answers, e.g. by
stating that a specific reaction is very fast. The extrinsic teleological
function of bio-models refers to the questions addressed by the model
and the assumptions restricting the answers.
The intended use of the model in simulation experiments has to
reflect the questions. Depending on the kind of questions different
types of simulation experiments may be appropriate. Often, different
simulation experiments have to be combined in order to yield the
desired outcome. Constraints which are in line with the assumptions
are imposed on the simulation experiments. Such constraints may
include value restrictions, ratios between values, and conservation
rules. The intrinsic teleological function of a bio-model describes its
intended use and imposed constraints.
3.2 Model Instantiation (Conditional Function)
In general, the addressed questions refer to certain boundary conditions and to a specific initial state of the experimentally observed
biological system. The boundary conditions determine the environment of the biological system (e.g. temperature, pH, nutrition) and
may be reflected by corresponding kinetic data. Often, plausible
ranges are given for some of the conditional values instead of single
values. The extrinsic conditional function of a bio-model is expressed
in terms of boundary conditions and initial states.
A bio-model contains state variables and formal parameters. In order to be used in simulation experiments, the model must be fully instantiated, i.e. concrete values must be assigned to all parameters. Furthermore, the initial values for all state variables have to be chosen. The intrinsic conditional function of a bio-model makes the model ready to be used in simulation experiments by means of parameter instantiation and state initialisation.
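The instantiation step described above can be sketched in a few lines of Python. The model structure, parameter names, and values below are illustrative placeholders, not drawn from any actual bio-model:

```python
# Sketch of the intrinsic conditional function: a bio-model is made ready for
# simulation by assigning concrete values to all formal parameters and
# choosing initial values for all state variables. Names are hypothetical.

def instantiate(model, parameter_values, initial_state):
    """Return a fully instantiated model, or raise if anything is unassigned."""
    missing_params = set(model["parameters"]) - set(parameter_values)
    missing_states = set(model["state_variables"]) - set(initial_state)
    if missing_params or missing_states:
        raise ValueError(f"unassigned: {missing_params | missing_states}")
    return {**model,
            "parameter_values": dict(parameter_values),
            "initial_state": dict(initial_state)}

# Hypothetical two-species model: structure only, no values yet.
model = {"parameters": ["k_syn", "k_deg"],
         "state_variables": ["mRNA", "protein"]}

instance = instantiate(model,
                       parameter_values={"k_syn": 0.5, "k_deg": 0.1},
                       initial_state={"mRNA": 0.0, "protein": 0.0})
```

The check for unassigned names mirrors the requirement that the model be *fully* instantiated before any simulation experiment can run.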
3.3 Experimental Setup (Pragmatical Function)
As mentioned above, bio-models explain or predict the behaviour of
the modelled biological system. This teleological function requires a
complementary description of the experimental settings which lead
to the observed behaviour and allow for the verification of the predicted behaviour. Usually the experimental data is transformed into
Functional Aspect        Checklists   Languages   Ontologies
intended use
  intrinsic              MIASE        SED-ML      see Sec. 4.4
  extrinsic              unknown      unknown     see Sec. 4.4
model instantiation
  intrinsic              MIASE        SED-ML      useless
  extrinsic              MIBBI        SABIO-RK    XCO
experimental setup
  intrinsic              MIASE        SED-ML      KiSAO
  extrinsic              MIBBI        FuGE        MMO

Table 1. Formal approaches for functional aspects of bio-models. See main text for details.
the final observations by means of result calculations. The extrinsic
pragmatical function of a bio-model describes the experimental settings and result calculations related to the dynamical phenomena
under investigation.
Bio-models are used in simulation experiments. The setup of the
simulation experiments precisely describes the procedure applied
to the instantiated model. This involves the simulation algorithm
used and specific settings for this algorithm. In addition, the exact
steps, their order, and applied perturbations have to be specified. Post-processing of the raw data from the simulation experiments generates the desired outcome. The intrinsic pragmatical function of a bio-model describes the setup of the simulation experiments applied to the model structure and the post-processing which finally produces the model behaviour.
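As a rough illustration of this setup-plus-post-processing pipeline, the following Python sketch runs a hypothetical one-variable decay model with a fixed-step Euler solver (standing in for the precisely specified simulation algorithm) and post-processes the raw trajectory into the behaviour of interest. All names and numbers are invented for illustration:

```python
# Intrinsic pragmatical function, sketched: an exact experimental procedure
# (algorithm, step size, number of steps) applied to an instantiated model,
# followed by post-processing of the raw simulation data.

def simulate_euler(derivatives, state0, dt, n_steps):
    """Raw simulation: fixed-step Euler integration of the model equations."""
    trajectory = [dict(state0)]
    for _ in range(n_steps):
        s = trajectory[-1]
        trajectory.append({k: s[k] + dt * derivatives(s)[k] for k in s})
    return trajectory

# Hypothetical one-variable decay model dx/dt = -k * x with k = 0.5.
def decay(state):
    return {"x": -0.5 * state["x"]}

raw = simulate_euler(decay, {"x": 1.0}, dt=0.01, n_steps=1000)

# Post-processing turns the raw trajectory into the behaviour of interest:
# here, the final value and whether the variable decayed monotonically.
final_x = raw[-1]["x"]
monotone = all(a["x"] >= b["x"] for a, b in zip(raw, raw[1:]))
```

The split between `simulate_euler` and the two post-processing lines mirrors the paper's distinction between the experimental setup and the result calculations that finally produce the model behaviour.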
4 FORMAL REPRESENTATION OF FUNCTION
In this section we will briefly review existing approaches for formalising the different functional aspects of bio-models (see Sec. 3).
Tab. 1 provides an overview of the formal approaches. The classification of the formal approaches in checklists, languages and ontologies
is motivated by [6]. There are some gaps in Tab. 1. We are not aware
of any checklists or languages for the extrinsic teleological function of bio-models. In Sec. 4.4 we therefore propose an ontology
for teleological functions which could be a starting point for such
efforts. Ontologies for intrinsic model instantiation seem not to be
very useful: There is not much conceptual knowledge involved in
assigning values to parameters and variables. This is not the case for
the (extrinsic) boundary conditions. The Experimental Conditions
Ontology (XCO) [7] provides a rich vocabulary of experimental
conditions for phenotype experiments. The Measurement Method Ontology (MMO) [7] can be used for specifying the measurement
method in extrinsic experimental setup descriptions. Because XCO
and MMO are slightly out of our scope we will not provide further
details. All other formal approaches displayed in Tab. 1 are discussed
in the following sections.
4.1 MIASE: A Checklist for Simulation Experiments
To start the formalisation of a specific kind of scientific data, like bio-models, experiment protocols, or experimental results, the responsible community first of all has to agree on the information needed. So-called “Minimum Information Checklists” state what information has at least to be described. Checklists are semi-formal in that they structure the information. However, the information is still formulated in natural language.
The most important checklist for bio-models is the Minimum Information Requested In the Annotation of Models (MIRIAM, [8]).
MIRIAM describes what information must be provided for exchanging models. Although it mainly refers to model structure MIRIAM
also requests that a model should be able to be simulated and to reproduce relevant published results. The concrete intrinsic functional
description of the simulation experiments producing these results
is the main focus of the Minimum Information About a Simulation
Experiment (MIASE, [9]). MIASE requests information about model
instantiation (conditional function), the exact experimental setup, and
the necessary post-processing (pragmatical function). It also partly
concerns the intended use (teleological function) to the effect that
the type of simulation experiment has to be described.
For the description of the extrinsic function there are many specific checklists listed in the MIBBI portal (Minimum Information for Biological and Biomedical Investigations, [10]). The listed checklists concern information about the boundary conditions (conditional
function) and the experimental settings (pragmatical function) for
specific types of biological experiments.
4.2 SED-ML: A Language for Simulation Experiments
In most cases the checklists can be translated to an elementary data model which can then be extended to a formal language.
Such a formal language for describing simulation experiments is
SED-ML (Simulation Experiment Description Markup Language,
[9]) which is able to specify the type of simulation (intrinsic teleological function), the model instantiation and initial values (intrinsic
conditional function), and the setup and post-processing of the
simulation experiments (intrinsic pragmatical function). However,
SED-ML cannot fully describe the intended use and the imposed
constraints. The Systems Biology Markup Language (SBML, [11])
used for the encoding of the model structure is also able to determine
parameter values and initial values. But, in order to be able to reuse models in different simulation experiments, we suggest clearly separating descriptions of models (in SBML) from descriptions of their use (in SED-ML).
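The suggested separation can be sketched with two plain data structures. The field names below are illustrative stand-ins, not actual SBML or SED-ML syntax, and the algorithm identifier is only meant to suggest the KiSAO-style annotation that a real SED-ML file would carry:

```python
# Sketch of keeping the model description (SBML-like: structure only) apart
# from the experiment description (SED-ML-like: one particular use).
# All keys and values are hypothetical placeholders.

model_description = {            # structure only: species and reactions
    "species": ["S", "P"],
    "reactions": [{"id": "conv", "reactants": ["S"], "products": ["P"]}],
}

experiment_description = {       # one particular use of the model
    "model_ref": "model_description",
    "simulation": {"type": "time course",
                   "algorithm": "KISAO:0000019",   # KiSAO-style id, illustrative
                   "start": 0.0, "end": 10.0, "points": 100},
    "changes": {"S": 5.0},       # instantiation for this experiment only
    "post_processing": ["plot P versus time"],
}

# Because the experiment only *references* the model, the same model can be
# reused in a second experiment without touching model_description at all.
second_experiment = {**experiment_description,
                     "simulation": {**experiment_description["simulation"],
                                    "end": 100.0}}
```

The reuse argument from the text is visible directly: building `second_experiment` changes nothing in `model_description` or in the first experiment.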
There are languages for describing extrinsic experimental conditions and specific biological experiments. SABIO-RK (System for
the Analysis of Biochemical Pathways – Reaction Kinetics, [12]),
for example, describes experimental and environmental conditions
for measurements of kinetic data. The Functional Genomics Experiment Object Model (FuGE, [13]) describes experiments in functional
genomics. We will not go into detail here.
4.3 KiSAO: An Ontology for Simulation Algorithms
Ontologies formalise conceptual knowledge. They can provide
vocabularies for formal languages.
The Kinetic Simulation Algorithm Ontology (KiSAO, [14]) is employed within SED-ML to precisely specify the algorithms used
for the simulation experiment. Thus, KiSAO contributes to the
description of the intrinsic pragmatical function of bio-models.
4.4 An Ontology for the Intention of Bio-Models
What could an ontology for the teleological function of bio-models look like? The questions about the biological system under investigation which are addressed by the model are formulated in natural
language. There is a wide diversity of such questions. However, if
we focus on the actual tasks tackled by simulating the model, we
are able to identify patterns in the questions. Common tasks include:
(1) the approximation of observed behaviour, (2) the investigation of
the variability in behaviour, (3) the demonstration of the ability for
specific kinds of behaviour, and (4) the examination of the influence of parameters on the behaviour.
Each task requires a different corresponding simulation type. We could accordingly classify the intended uses of bio-models: (1) time series (possibly including parameter fitting), (2) bifurcation analysis, (3) stability analysis, and (4) parameter scan. This list of tasks
and corresponding intended uses is far from complete. However, it
outlines the strategy for developing an ontology for the intention of
bio-models.
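The task-to-simulation-type correspondence listed above can be written down directly as a small lookup table, a first step toward the kind of formalisation the proposed intention ontology would provide (the phrasing of the task labels is ours):

```python
# The four tasks and their corresponding simulation types from the text,
# sketched as a lookup that an intention ontology might later formalise.

TASK_TO_SIMULATION_TYPE = {
    "approximate observed behaviour": "time series (possibly with parameter fitting)",
    "investigate variability in behaviour": "bifurcation analysis",
    "demonstrate ability for specific behaviour": "stability analysis",
    "examine influence of parameters on behaviour": "parameter scan",
}

def intended_use(task):
    """Map a modelling task to its simulation type, if the task is known."""
    return TASK_TO_SIMULATION_TYPE.get(task, "unknown task")
```

An ontology would of course add structure this flat table lacks, e.g. subsumption between tasks and constraints linking tasks to admissible experiment types.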
5 RELATED WORK
The dual interpretation of bio-models is rooted in the “knowledge
representation hypothesis” from Artificial Intelligence:
“Any mechanically embodied intelligent process will be comprised of structural ingredients that a) we as external observers
naturally take to represent a propositional account of the knowledge that the overall process exhibits, and b) independent of
such external semantical attribution, play a formal but causal
and essential role in engendering the behavior that manifests
that knowledge.” – [15, p.15]
Simon generalises this duality to all kinds of artifacts which
serve as interfaces between an inner and an outer environment
[16]. Rosen’s modelling relation [1] is a congruence between an
extrinsic/outer natural system and an intrinsic/inner formal system
established by a model.
There are other conceptual frameworks for modelling and simulation in general [17; 18], and in particular for bio-modelling [6].
However, our meaning facets are more rigid and provide much more detail.
Zeigler’s notion of an “experimental frame” [18] resembles the
model instantiation (conditional function) and the experimental setup
(pragmatical function) presented in this paper. Furthermore, the
experimental frame “is the operational formulation of the objectives
that motivate a modeling and simulation project” [18, p.27], i.e. it
also describes the teleological function of a model.
The field of functional modelling relates structure, behaviour and
function of engineering artifacts. [3] reviews the different approaches
to formalise function and its relations to structure and behaviour.
Two different notions of function are employed in [3]: On the one
hand, function is mediating between structure and behaviour and
determines the “structural behaviours”, i.e. all possible behaviours
the model is able to show. This is more or less what we call conditional and pragmatical function. On the other hand, function refers
to the intention of the modeller and restricts the possible behaviours
to the “expected behaviours”. This notion of function as purpose corresponds to our teleological function. In short, functional modelling
addresses “the questions of what the device and its components do or what the purpose of the device and its components are” [3, p.149].
Joining these two sides, function becomes “the bridge between human intention and physical behavior of artifacts” [19, p.271]. The
distinction between structural and expected behaviours originates
from [20].
We transfer the notion of function in biology to modelling in
systems biology. There are some strong parallels between function in biology and function of bio-models. The idea that function
links between structure and behaviour is deeply rooted in molecular
biology, as, e.g., stated in [21, p.0712]:
“If one such behavior seems useful (to the organism), it becomes
a candidate for explaining why the network itself was selected,
i.e., it is seen as a potential purpose for the network.”
However, we will not discuss the notion of function in biology further
here. [22] compares the notion of function in biology and technology
and examines disanalogies and parallels.
There are some formal approaches to biological function. Ontologies like EcoCyc [23] and the Gene Ontology [24] list molecular
functions played by biological entities. [25] presents an ontology of
biological functions which formalises three functional aspects: the
so-called “function structure”, the realisation and the has-function
relation, which could be related to our teleological, pragmatical and
conditional function, respectively.
6 CONCLUSION
We have applied the notion of function to computational models in
Systems Biology. Function is the link between the model structure
and the model behaviour. The intrinsic function of bio-models describes three aspects of the model’s use: Why should the model be
used in simulation experiments (teleological function)? Which model
instance should be used in simulation experiments (conditional function)? How should the model be used in simulation experiments
(pragmatical function)? The extrinsic function of a bio-model refers
to the biological questions which are addressed by the model.
The systematisation of the functional aspects of bio-models was
used to review corresponding formal accounts. Some functional
aspects are well covered by checklists, languages and associated
ontologies. However, there is no ontology for the biological questions
and the intended use of bio-models. We outlined a strategy for the
development of such an ontology.
Our epistemological analysis of functional aspects of computational models in Systems Biology and their use in simulation
experiments provides an important prerequisite for formalising the
involved knowledge. Ultimately, this will improve any computer-supported research method for answering biological questions by
means of bio-models.
REFERENCES
[1]Robert Rosen. Anticipatory systems. Pergamon Press, Oxford,
UK, 1985.
[2]Christian Knüpfer, Clemens Beckstein, and Peter Dittrich. Towards a semantic description of bio-models: Meaning facets –
a case study. In Sophia Ananiadou and Juliane Fluck, editors,
Proceedings of the Second International Symposium on Semantic
Mining in Biomedicine (SMBM 2006), Jena, April 9-12, 2006,
CEUR-WS, pages 97–100, Aachen, 2006. RWTH University.
[3]Mustafa Suphi Erden, Hitoshi Komoto, Thom van Beek,
Valentina D’Amelio, Erika Echavarria, and Tetsuo Tomiyama. A
review of function modeling: Approaches and applications. AI
EDAM, 22(02):147–169, 2008.
[4]Christian Knüpfer, Clemens Beckstein, and Peter Dittrich. How
to formalise the meaning of a bio-model: A case study. In
BioSysBio 2007: Systems Biology, Bioinformatics, Synthetic
Biology, Meeting abstracts. 11.-13. January Manchester, UK,
volume 1, Suppl 1 of BMC Systems Biology, page 28, 2007.
[5]Peter Machamer, Lindley Darden, and Carl F. Craver. Thinking
about mechanisms. Philosophy of Science, 67(1):1–25, 2000.
[6]Vijayalakshmi Chelliah, Lukas Endler, Nick Juty, Camille Laibe,
Chen Li, Nicolas Rodriguez, and Nicolas Le Novère. Data
integration and semantic enrichment of systems biology models
and simulations. Data Integration in the Life Sciences, pages
5–15, 2009.
[7]Mary Shimoyama, Rajni Nigam, Leslie Sanders McIntosh,
Rakesh Nagarajan, Treva Rice, D. C. Rao, and Melinda R.
Dwinell. Three ontologies to define phenotype measurement
data. Front Genet, 3:87–87, 2012.
[8]Nicolas Le Novère, Andrew Finney, Michael Hucka, Upinder S
Bhalla, Fabien Campagne, Julio Collado-Vides, Edmund J
Crampin, Matt Halstead, Edda Klipp, Pedro Mendes, Poul
Nielsen, Herbert Sauro, Bruce Shapiro, Jacky L Snoep, Hugh D
Spence, and Barry L Wanner. Minimum information requested in
the annotation of biochemical models (MIRIAM). Nat Biotech,
23(12):1509–1515, December 2005.
[9]Dagmar Köhn and Nicolas Le Novère. SED-ML – an XML
format for the implementation of the MIASE guidelines. In
Computational Methods in Systems Biology. Proceedings of the
6th International Conference CMSB 2008, Rostock, Germany,
October 12-15, 2008., Lecture Notes in Computer Science, pages
176–190, Berlin, Heidelberg, 2008. Springer.
[10]Chris F. Taylor, Dawn Field, Susanna-Assunta Sansone, Jan
Aerts, Rolf Apweiler, Michael Ashburner, Catherine A. Ball,
Pierre-Alain Binz, Molly Bogue, Tim Booth, Alvis Brazma,
Ryan R. Brinkman, Adam Michael Clark, Eric W. Deutsch,
Oliver Fiehn, Jennifer Fostel, Peter Ghazal, Frank Gibson, Tanya
Gray, Graeme Grimes, John M. Hancock, Nigel W. Hardy, Henning Hermjakob, Randall K. Julian, Matthew Kane, Carsten
Kettner, Christopher Kinsinger, Eugene Kolker, Martin Kuiper,
Nicolas Le Novère, Jim Leebens-Mack, Suzanna E. Lewis, Phillip Lord, Ann-Marie Mallon, Nishanth Marthandan, Hiroshi
Masuya, Ruth McNally, Alexander Mehrle, Norman Morrison,
Sandra Orchard, John Quackenbush, James M. Reecy, Donald G. Robertson, Philippe Rocca-Serra, Henry Rodriguez,
Heiko Rosenfelder, Javier Santoyo-Lopez, Richard H. Scheuermann, Daniel Schober, Barry Smith, Jason Snape, Christian J.
Stoeckert, Keith Tipton, Peter Sterk, Andreas Untergasser,
Jo Vandesompele, and Stefan Wiemann. Promoting coherent
minimum reporting guidelines for biological and biomedical
investigations: the MIBBI project. Nat Biotech, 26(8):889–896,
August 2008.
[11]Michael Hucka, Andrew Finney, Herbert M. Sauro, Hamid Bolouri, John C. Doyle, Hiroaki Kitano, Adam P. Arkin, Benjamin J.
Bornstein, Dennis Bray, Athel Cornish-Bowden, Autumn A.
Cuellar, Serge Dronov, Ernst Dieter Gilles, Martin Ginkel,
Victoria Gor, Igor I. Goryanin, Warren J. Hedley, T. Charles
Hodgman, Jan-Hendrik S. Hofmeyr, Peter J. Hunter, Nick S. Juty,
Jay L. Kasberger, Andreas Kremling, Ursula Kummer, Nicolas
Le Novère, Leslie M. Loew, Daniel Lucio, Pedro Mendes, Eric
Minch, Eric D. Mjolsness, Yoichi Nakayama, Melanie R. Nelson,
Poul F. Nielsen, Takeshi Sakurada, James C. Schaff, Bruce E.
Shapiro, Thomas Simon Shimizu, Hugh D. Spence, Jörg Stelling,
Koichi Takahashi, Masaru Tomita, John M. Wagner, and Jian
Wang. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network
models. Bioinformatics, 19(4):524–531, Mar 2003.
[12]Ulrike Wittig, Renate Kania, Martin Golebiewski, Maja Rey,
Lei Shi, Lenneke Jong, Enkhjargal Algaa, Andreas Weidemann,
Heidrun Sauer-Danzwith, Saqib Mir, Olga Krebs, Meik Bittkowski, Elina Wetsch, Isabel Rojas, and Wolfgang Müller.
SABIO-RK – database for biochemical reaction kinetics. Nucleic
Acids Res, 40(Database issue):790–796, January 2012.
[13]Khalid Belhajjame, Andrew R. Jones, and Norman W. Paton. A
toolkit for capturing and sharing FuGE experiments. Bioinformatics, 24(22):2647–2649, November 2008.
[14]Mélanie Courtot, Nick Juty, Christian Knüpfer, Dagmar Waltemath, Anna Zhukova, Andreas Dräger, Michel Dumontier,
Andrew Finney, Martin Golebiewski, Janna Hastings, Stefan
Hoops, Sarah Keating, Douglas B Kell, Samuel Kerrien, James
Lawson, Allyson Lister, James Lu, Rainer Machne, Pedro
Mendes, Matthew Pocock, Nicolas Rodriguez, Alice Villeger,
Darren J Wilkinson, Sarala Wimalaratne, Camille Laibe, Michael Hucka, and Nicolas Le Novère. Controlled vocabularies
and semantics in systems biology. Molecular Systems Biology,
7, oct 2011.
[15]Brian C. Smith. Reflection and Semantics in a Procedural
Language. PhD thesis, Massachusetts Institute of Technology,
1982.
[16]Herbert A. Simon. The Sciences of the Artificial. MIT Press,
Cambridge, MA, 1969.
[17]George J. Klir. Architecture of Systems Problem Solving. Plenum
Press, New York, 1985.
[18]Bernard P. Zeigler, Herbert Praehofer, and Tag G. Kim. Theory
of Modeling and Simulation. Academic Press, 2 edition, 2000.
[19]Yasushi Umeda and Tetsuo Tomiyama. FBS modeling: modeling
scheme of function for conceptual design. In Working Papers of
the 9th Int. Workshop on Qualitative Reasoning About Physical
Systems, Amsterdam, pages 271–278, 1995.
[20]John S. Gero. Design prototypes: a knowledge representation
schema for design. AI Mag., 11(4):26–36, 1990.
[21]Arthur D Lander. A calculus of purpose. PLoS Biol, 2(6):e164,
06 2004.
[22]Ulrich Krohs and Peter Kroes. Philosophical perspectives on
organismic and artifactual functions. In Ulrich Krohs and
Peter Kroes, editors, Functions in Biological and Artificial
Worlds: Comparative Philosophical Perspectives, Vienna Series
in Theoretical Biology. MIT Press, 2009.
[23]Peter D. Karp. An ontology for biological function based on
molecular interactions. Bioinformatics, 16(3):269–285, 2000.
[24]Michael Ashburner, Catherine A. Ball, Judith A. Blake, David
Botstein, Heather Butler, J. Michael Cherry, Allan P. Davis, Kara
Dolinski, Selina S. Dwight, Janan T. Eppig, Midori A. Harris,
David P. Hill, Laurie Issel-Tarver, Andrew Kasarskis, Suzanna
Lewis, John C. Matese, Joel E. Richardson, Martin Ringwald,
Gerald M. Rubin, and Gavin Sherlock. Gene Ontology: tool
for the unification of biology. Nature Genetics, 25:25–29, 2000.
The Gene Ontology Consortium.
[25]Patryk Burek, Robert Hoehndorf, Frank Loebe, Johann Visagie,
Heinrich Herre, and Janet Kelso. A top-level ontology of
functions and its application in the open biomedical ontologies.
Bioinformatics, 22(14):e66–e73, 2006.
PhysioMaps of Physiological Processes and their Participants
Daniel L. Cook1,*, Maxwell L. Neal1, Robert Hoehndorf2, Georgios V. Gkoutos3 and John H. Gennari1
1Biomedical & Health Informatics, Univ. of Washington, USA, 2Univ. of Cambridge, UK, 3Bioinformatics, Aberystwyth University, UK
ABSTRACT
The biological meaning of physics-based simulations of biological processes is usually implicit, and often inaccessible to
biological researchers, or indeed to any researchers beyond
the authors of a particular model. To display and interrogate
the physiological content of such models, we have developed formal semantic models of biological process networks
that we call PhysioMaps. PhysioMaps are semiautomatically
derived by our SemGen software by parsing simulation
model code, visualized using our Chalkboard software, and
qualitatively interrogated for cause-effect relations using the
PathTracing feature of Chalkboard. PhysioMaps are compact,
semantic models of biological process networks through
which the functional content of biosimulation models and
other biomedical resources can be displayed, archived, queried, and integrated.
INTRODUCTION
Physics-based mathematical models encode and simulate
networks of physiological processes in order to test hypotheses and quantitatively predict system behavior. However,
the physical principles and mathematical language of models often render these models opaque to non-mathematical
biologists. If such models were re-cast to be easily accessible to biologists, we claim that the scientific endeavor could
be significantly accelerated. To bridge this gap, we have
developed and implemented “PhysioMaps” that are both (a)
a visualization of the biology implicit in mathematical biosimulation models, and (b) a computable view on the multi-scale, multi-domain “physiome” (Bassingthwaighte,
2000; Hunter et al., 2003). Here we briefly demonstrate our
approach for deriving PhysioMaps from available biosimulation models, for creating graphical displays of the processes and process participants that are implicit in the model
code, and for querying the PhysioMaps to display functional
links between participants.
Within the domain of chemical reaction pathways (“systems biology”), Systems Biology Markup Language
(SBML, 2012) and Systems Biology Graphical Notation
(SBGN, 2012) can represent chemical participants in reaction processes. However, extensions to such notations are
required to represent participants and processes within and
* To whom correspondence should be addressed.
across multiple scales and domains that are of critical biomedical interest such as cardiovascular function, membrane
electrophysiology, neuroendocrine control systems, fluid
and electrolyte balance, and respiratory function.
The success of SBML, SBGN, and their supporting tools builds on a semantics of biochemical reaction pathways, as do other resources such as EcoCyc (EcoCyc), KEGG (KEGG,
other resources such as EcoCyc (EcoCyc), KEGG (KEGG,
2005), and Reactome (Reactome, 2005). The PhysioMap
framework generalizes beyond the molecular scale to capture knowledge of multiscale/multidomain biological process based on the semantics of the Ontology of Physics for
Biology (OPB, Cook et al., 2008; Cook et al., 2011).
Here we define the semantic foundations of PhysioMaps,
demonstrate computational derivation of PhysioMaps from
available biosimulation model code, and show how PhysioMaps can visualize and interrogate complex physiological
systems. We discuss a PhysioMap workflow designed to
tighten the loop between physiological modelers and experimentalists for solving outstanding biomedical problems.
SEMANTICS OF OPB AND PHYSIOMAPS
PhysioMaps are cyclic graphs in which edges are operators that represent biological processes and nodes represent the biological physical entities that are participants in those processes. As a computational representation of a biological process network, a PhysioMap is analogous to a biochemical reaction network in that it can be semantically composed, decomposed, and integrated with PhysioMaps of other physiological systems. A PhysioMap can be merged with others, portions of it (graph subsets) can be extracted, or intersections with other maps can be computed, to create new PhysioMaps that describe different systems.
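The graph operations just mentioned can be sketched in a few lines. This is an illustrative model only, not the authors' implementation: a map is reduced to a set of (participant, process, participant) edges with invented labels, so merging, extraction, and intersection become set operations.

```python
# Illustrative sketch of PhysioMap-style graph operations. Edge labels
# are invented; a real PhysioMap carries ontology-grounded annotations.

cardiac = {
    ("Blood in LV", "flow through aortic valve", "Blood in aorta"),
    ("Blood in aorta", "flow through arteries", "Blood in capillaries"),
}
renal = {
    ("Blood in aorta", "flow through renal artery", "Blood in kidney"),
    ("Blood in LV", "flow through aortic valve", "Blood in aorta"),
}

merged = cardiac | renal   # integrate two PhysioMaps
shared = cardiac & renal   # intersection of two maps
# Extract a portion (graph subset): all edges touching aortic blood
aortic = {e for e in merged if "aorta" in e[0] or "aorta" in e[2]}

print(len(merged), len(shared), len(aortic))
```

In practice such operations would match nodes by their ontology annotations rather than by string labels, but the set-algebra view captures the compositional behaviour described above.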
The OPB is an ontology (Cook et al., 2011) that represents entities and relations of classical physics as applied in engineering systems dynamics (Karnopp et al., 1990; Borst et al., 1997) and biological network thermodynamics (Oster et al., 1973; Perelson, 1975). OPB leverages formal analogies between physical properties and their quantitative dependencies that apply across all scales and biophysical domains, including chemical reactions, fluid flow, diffusion, electrophysiology, etc. For example, just as chemical reaction fluxes depend on differences in concentration (or, more accurately, chemical potential), fluid flows are driven by differences in fluid pressure, ion fluxes by electrochemical gradients, and so forth.
Paper I
D.L.Cook et al.
Figure 1. Diagram showing separate concerns of the OPB ontological schemata for: (1) PhysioMaps, (2) SemSim models, and (3) biosimulation model code.
Figure 2. Prototype work-flow for producing, displaying, and querying PhysioMaps.
As in various upper-level ontologies (e.g., BFO), OPB physical entities (“continuants”) are participants (has_participant relation) in processes (“occurrents”)
that can be mapped to entities in available biomedical ontologies (e.g., FMA (FMA, 2011), CL (CL, 2012), GO (GO,
2012), ChEBI (ChEBI, 2006)). OPB is orthogonal to other
ontologies because entities and processes are defined in
thermodynamic terms, i.e., they are “dynamical” in accord
with usage in systems dynamics. Thus, for example,
OPB:Dynamical entity is defined, in part, as “the bearer of a
portion of thermodynamic energy” and OPB:Dynamical
process as “the flow of thermodynamic energy between
participating dynamical entities”.
As roadmaps for this paper, Figure 1 shows how OPB
classes and relations provide a class structure for both PhysioMaps and SemSim (semantic simulation) models, which
each have instances that are derived from simulation model
code. Figure 2 shows a workflow by which simulation model code is parsed and annotated into a SemSim model, and
then abstracted as a PhysioMap to be visualized and interrogated in our Chalkboard software (Cook et al., 2007). In the
next two sections we present in greater detail these two
steps in the workflow.
GENERATE PHYSIOMAPS FROM MODELS
PhysioMaps are semi-automatically derived from biosimulation model code using SemGen (Neal et al., 1998), an application for annotating, decomposing, merging, and encoding biosimulation models (Beard et al., 2012) in any of several languages (currently CellML (CellML, 2005), SBML (SBML, 2012), and JSim MML (JSim, 2006)). In a two-step
process, SemGen reads and parses model code and then,
under user guidance, maps each model variable (“Paorta”, “concGlucose”, for example) to instances of OPB:Dynamical property subclasses (OPB:Fluid pressure and OPB:Chemical concentration, respectively, for these examples). Depending on the physical domain of the dynamical property class (e.g., OPB:Fluid kinetic domain or OPB:Chemical kinetic domain), SemGen accesses (via web services) physical entity reference classes from the FMA, CL, or ChEBI (or other ontologies) and constructs a composite annotation (Gennari et al., 2009; Gkoutos et al., 2009) for each variable, by which the biological meaning of the mathematical variable is logically declared. In our example, the variable “PLV” could be annotated as OPB:Fluid pressure property_of FMA:Blood in left ventricle.
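The composite-annotation idea can be illustrated with a small sketch. The class and entity labels mirror the example above, but the data structure itself is hypothetical and is not SemGen's internal representation.

```python
# Hypothetical sketch of a SemGen-style composite annotation: each model
# variable is mapped to an OPB physical-property class plus a reference
# physical entity from a domain ontology (FMA, CL, ChEBI). The dataclass
# and the `property_of` rendering below are illustrative only.

from dataclasses import dataclass

@dataclass(frozen=True)
class CompositeAnnotation:
    variable: str         # name of the variable in the model code
    opb_property: str     # OPB:Dynamical property subclass
    entity_ontology: str  # ontology supplying the participant class
    entity: str           # reference class for the participant

    def __str__(self):
        return (f"{self.variable} = {self.opb_property} "
                f"property_of {self.entity_ontology}:{self.entity}")

annotations = {
    "PLV": CompositeAnnotation("PLV", "OPB:Fluid pressure",
                               "FMA", "Blood in left ventricle"),
    "concGlucose": CompositeAnnotation("concGlucose",
                                       "OPB:Chemical concentration",
                                       "ChEBI", "glucose"),
}

print(annotations["PLV"])
```

The point of the composite form is that neither OPB nor the domain ontology alone identifies the variable; it is the pairing of a property class with an entity class that pins down its biological meaning.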
Having annotated each model variable, SemGen then
maps their mathematical dependencies according to whether
variables are “left-hand sides” (lhs) of equations or are on
the “right-hand side” (rhs) of each equation. From these mappings, SemGen constructs and displays the mathematical dependencies by which each lhs-variable is mathematically determined by the rhs-variables. Of particular concern
here are the lhs-variables that have been annotated as subclasses of OPB:Flow rate property. For example, a typical
cardiovascular dynamic model will have several equations
that compute a fluid dynamic version of Ohm’s law for fluid
flow such as for blood flow from the lumen of the left ventricle (LV), through the aortic valve (with fluid resistance,
RAV), and into the aorta (A). According to Ohm’s law such
flows are driven by differences in pressure (PLV – PA), each
of which is annotated by SemGen to be an instance of
OPB:Fluid pressure which is the fluid kinetic subclass of
OPB:Force property. Thus:
F_LV,A = (P_LV − P_A) / R_AV    (Eq. 1)

PhysioMaps of Physiological Processes and their Participants

Figure 3. A prototype Chalkboard display of the PhysioMap of a cardiovascular-flow module extracted by SemGen from a JSim-coded simulation model (Kerckhoffs et al., 2007).

DISPLAY AND INTERROGATE PHYSIOMAPS
The definition of OPB:Dynamical process is “...the flow or the control of flow of thermodynamic energy within or between participating dynamical entities”. Thus, we are able to interpret a flow variable as an attribute of each participating entity (e.g., fluid flow rate out of the left ventricle) and, identically, as an attribute of the process by which blood flows from one entity to the other (e.g., fluid flow rate through the aortic valve). Given this identity, SemGen makes two inferences. First, for each variable that is a SemSim instance of OPB:Flow rate property, it infers the existence of an instance of the SemSim process class corresponding to OPB:Energy flow process. Then, for each SemSim instance of OPB:Force property (that represents a force variable on the right-hand side of an equation), SemGen infers the existence of a participating physical entity according to the composite annotations of the force variable. SemGen then creates a file that lists all participating entities and all processes linked to each of their participants. Although encompassed by OPB and SemSim, our prototype demonstration here does not yet include examples of modulating participants, such as the participation of the aortic valve as a flow path to which the resistance variable (RAV, of type OPB:Fluid resistance) applies.
Creating a PhysioMap from mathematical equations results in a major reduction in complexity. PhysioMaps retain the language-independent biological meaning of model code while excluding mathematical forms, parametric values, and program control code, which are generally opaque to non-mathematical experimental biologists or, even, to users of alternative modeling platforms. This is apparent in Figure 1, which shows the separation of concerns for model code, SemSim, and PhysioMap representations. Model code consists solely of (1) model variables, which represent the values of the physical properties of process participants, and (2) model equations, which encode the mathematical dependencies dictated by physical laws. The physical entities and processes modeled by the code are entirely implicit in the code and are explicated only, if at all, by superimposed annotations. SemSim models abstract model variables and equations as instances of OPB:Dynamical properties and of OPB:Physical property dependency and formally link the property instances, via the composite annotation mechanism, to formal representations of process participants, but not to processes themselves. A PhysioMap represents, specifically, processes and participants to the exclusion of the mathematical detail of how processes occur (i.e., according to a specific formulation of Ohm’s law used by a modeler). To recover such details of the source code, PhysioMaps can be linked to the originating SemSim model, and hence to the originating biosimulation code.
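The two SemGen inferences described above (an energy-flow process for each flow-rate variable, with participants drawn from the force variables on its right-hand side) can be sketched as follows. The equation and annotation structures are invented for illustration and do not reflect SemGen's actual data model.

```python
# Illustrative sketch: for each variable annotated as an OPB:Flow rate
# property, infer an energy-flow process; its participants are the
# entities annotated on the force variables appearing on the rhs of the
# flow equation. Function and field names are hypothetical.

def infer_processes(equations, annotations):
    """equations: lhs -> list of rhs variables;
    annotations: var -> (OPB kind, participant entity or None)."""
    processes = []
    for lhs, rhs_vars in equations.items():
        if annotations[lhs][0] != "OPB:Flow rate property":
            continue
        participants = [annotations[v][1] for v in rhs_vars
                        if annotations[v][0] == "OPB:Force property"]
        processes.append({"process": f"EnergyFlow({lhs})",
                          "participants": participants})
    return processes

equations = {"F_LV_A": ["P_LV", "P_A", "R_AV"]}   # structure of Eq. 1
annotations = {
    "F_LV_A": ("OPB:Flow rate property", None),
    "P_LV": ("OPB:Force property", "FMA:Blood in left ventricle"),
    "P_A": ("OPB:Force property", "FMA:Blood in aorta"),
    "R_AV": ("OPB:Fluid resistance", None),  # modulator, not a participant
}
print(infer_processes(equations, annotations))
```

Note how the resistance variable R_AV is skipped: as the paper remarks, modulating participants such as the aortic valve are not yet handled by the prototype.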
Once created, we export PhysioMaps from SemGen into Chalkboard, which graphically displays the PhysioMap and allows users to explore cause/effect relations amongst processes. Chalkboard (Cook et al., 2007) is an editor for the BioD biological description language (Cook et al., 2001) that incorporates an extensive graphical vocabulary for biological physical entities and processes. BioD differs only in detail from the SBGN Process Description language. In our
prototype demonstrations here, we have reprogrammed
Chalkboard to read PhysioMap files and to create process
icons (rectangles) linked by arrows to participating entity
icons (circles). Figure 3 is a PhysioMap of a cardiovascular
model extracted using SemGen from a more extensive model (Kerckhoffs et al., 2007), in which rectangles represent fluid-flow processes (each governed by a dependency as in Eq. 1)
for flow between portions of blood located in a circuit of
blood vessels and heart chambers.
Automatically-derived PhysioMaps (as in Fig. 3) present
two common user-interface challenges. First, graphical and
topological layouts, although semi-automated, require user
adjustments to be intuitively satisfying. Second, icon labels
(e.g., “Blood in left ventricle”), which are automatically derived and condensed from the formal SemSim composite
annotations, may be verbose.
Although not illustrated here, PhysioMaps in Chalkboard
can be interrogated using Chalkboard’s Path Tracing feature,
which makes qualitative outcome predictions based on the
mathematical dependencies derived in the originating
SemSim model. Path Tracing is a user-interface tool that
replicates the kind of “thought experiments” by which physiologists informally reason about the functional implications of their hypotheses. In such an experiment, a perturbation (e.g., an increment in the amount of aortic blood) is mentally propagated through a functional network as increments and decrements in the amounts or flow-rates of connected participants
and processes. As implemented in Chalkboard, Path Tracing
can trace “A-to-B” pathways in complex networks, detect
positive- and negative-feedback loops, and display the
qualitative (up or down) responses of all affected processes
and participants in the network.
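The qualitative propagation performed by Path Tracing can be sketched as a sign-propagating graph walk. The toy network and the single-visit policy below are illustrative assumptions, not Chalkboard's actual algorithm (which, per the text, also detects feedback loops).

```python
# Minimal sketch of qualitative "path tracing": propagate an up (+1) or
# down (-1) perturbation through a signed influence graph and report the
# resulting direction at each reachable node.

def trace(graph, start, direction):
    """graph: node -> list of (neighbour, sign); sign is +1 or -1."""
    result = {start: direction}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour, sign in graph.get(node, []):
            if neighbour not in result:   # visit each node once (no loops)
                result[neighbour] = result[node] * sign
                stack.append(neighbour)
    return result

# Toy network: more aortic blood raises aortic pressure, which lowers
# the LV-to-aorta pressure gradient and hence aortic-valve flow.
graph = {
    "Blood in aorta": [("P_aorta", +1)],
    "P_aorta": [("Flow through aortic valve", -1)],
}
print(trace(graph, "Blood in aorta", +1))
```

A production version would need to handle reconvergent paths with conflicting signs (reporting "ambiguous") and cycles, which is where the feedback-loop detection mentioned above comes in.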
We have developed these tools in the context of large-scale collaborative projects like the Virtual Physiological
Human (VPH, 2010) and the Virtual Physiological Rat
(VPR, 2012) with the aim to “tighten the loop” between
modelers and experimentalists. PhysioMaps, as derived
from model code, provide experimentalists with a familiar
descriptive view of the functional content of models while
retaining formal traceable links back to the originating code.
Combined with SemGen’s capabilities to decompose, recompose, and encode models, we anticipate that these
methods will facilitate the generation and testing of biomedical hypotheses.
SUMMARY AND CONCLUSIONS
We have introduced the idea of a PhysioMap as a high-level
view of a physiological system and demonstrated extensions
to our SemGen software which derive PhysioMaps from
biosimulation model code using the semantics of classical
physics as represented in the OPB. Thus, in addition to
SemGen’s capabilities for rapid prototyping of models by
decomposing, composing, and re-encoding legacy models,
SemGen can infer the biophysical processes and their participants from the mathematical dependencies coded in biosimulation models. In this first report, we have focused on
flow processes of the OPB:Energy flow process class but we
aim to identify modulation and control processes as subclasses of OPB:Constitutive coupling process (as might be
annotated as kinds of GO:Biological regulation).
Our claim is that we have established a computational
method by which physiological process knowledge, otherwise implicit in the code of multiscale biosimulation models, can be mined and formally expressed using available
biomedical ontologies. We have leveraged the biophysical
semantics of the OPB to infer the existence of such physical
processes along with their participating entities in biosimulation models created by a community of physiologists and
biophysicists to address outstanding problems in multiscale
biomedicine.
ACKNOWLEDGEMENTS
This work was partially funded by the VPH Network of Excellence, EC FP7 project #248502, and by the American
Heart Association.
REFERENCES
Bassingthwaighte (2000). Strategies for the physiome project. Ann
Biomed Eng 28(8): 1043-1058.
Beard, et al. (2012). Multiscale Modeling and Data Integration in
the Virtual Physiological Rat Project. Annals of Biomedical Engineering (in press).
Borst, et al. (1997). Engineering ontologies. Int. J. Human–
Computer Studies 46: 365-406.
CellML (CellML language, as of 2012) http://www.cellml.org
ChEBI (Chemical Entities of Biological Interest, as of 2012)
http://www.ebi.ac.uk/chebi/
CL (Cell Ontology, as of 2012) http://www.obofoundry.org/cgi-bin/detail.cgi?id=cell
Cook, et al. (2011). Physical Properties of Biological Entities: An
Introduction to the Ontology of Physics for Biology. PLoS ONE
6(12): e28708.
Cook, et al. (2001). A basis for a visual language for describing,
archiving and analyzing functional models of complex biological systems. Genome Biol 2(4): RESEARCH0012.
Cook, et al. (2007). Chalkboard: Ontology-Based Pathway Modeling And Qualitative Inference Of Disease Mechanisms. Pac
Symp Biocomput 12: 16-27.
Cook, et al. (2008). Bridging biological ontologies and biosimulation: the Ontology of Physics for Biology. AMIA Annu Symp
Proc: 136-140.
EcoCyc (Encyclopedia of Escherichia coli, as of 2012)
http://ecocyc.org/
FMA (Foundational Model of Anatomy, as of 2012)
http://sig.biostr.washington.edu/projects/fm/
Gennari, et al. (2009). Multiple ontologies in action: Composite
annotations for biosimulation models. J Biomed Inform 44(1):
146-154.
Gkoutos, et al. (2009). Entity/quality-based logical definitions for
the human skeletal phenome using PATO. Conf Proc IEEE Eng
Med Biol Soc 2009: 7069-7072.
GO (Gene Ontology, as of 2012) http://www.geneontology.org/
Hunter, et al. (2003). Integration from proteins to organs: the Physiome Project. Nat Rev Mol Cell Biol 4(3): 237-243.
JSim (JSim Home Page at NSR, as of 2012)
http://nsr.bioeng.washington.edu/
Karnopp, et al. (1990). System dynamics: a unified approach. New
York, Wiley.
KEGG (Kyoto Encyclopedia of Genes and Genomes, as of 2012)
http://www.genome.jp/kegg/
Kerckhoffs, et al. (2007). Coupling of a 3D finite element model of
cardiac ventricular mechanics to lumped systems models of the
systemic and pulmonic circulation. Ann Biomed Eng 35(1): 1-18.
Neal, et al. (1998). The digital anatomist structural abstraction: a
scheme for the spatial description of anatomical entities. Proc
AMIA Symp: 423-427.
OPB (Ontology of Physics for Biology, as of 2012)
http://bioportal.bioontology.org/ontologies/44872
Oster, et al. (1973). Network thermodynamics: dynamic modelling
of biophysical systems. Q Rev Biophys 6(1): 1-134.
Perelson (1975). Network thermodynamics. An overview. Biophys J 15(7): 667-685.
Reactome (Reactome, as of 2012) http://www.reactome.org
SBGN (Systems Biology Graphical Notation, as of 2012)
http://www.sbgn.org/
SBML (Systems Biology Markup Language, as of 2012)
http://sbml.org/
VPH (Virtual Physiological Human, as of 2012) http://www.vph-noe.eu/
VPR (Virtual Physiological Rat, as of 2012) http://virtualrat.org/
Tissue Motifs and Multi-scale Transport Physiology
de Bono, B.*[a,c,d], Kasteleyn, P.[b], Potikanond, D., Kokash, N.[b], Verbeek, F.[b] and Grenon, P.[a]
[a] European Bioinformatics Institute, Cambridge, UK;
[b] LIACS, Leiden University, Leiden, the Netherlands;
[c] CHIME Institute, UCL, UK
[d] Auckland Bioengineering Institute, New Zealand
ABSTRACT
Motivation: This work describes the underlying notion and
formal representation of tissue motifs in terms of simple
combinations of cell types that are co-located in the same
tissue. The composition of tissue motifs is characterised on
the basis that particular forms of biological interaction can
take place between the cells in the motif. One such biological
interaction involves the diffusion of gene products that are
secreted into the surrounding tissue fluid to reach binding
receptors attached to nearby plasma membranes.
The development of a public knowledgebase of motifs constrained for diffusive interactions would provide a key resource for the interpretation of gene expression data in the
context of inter-cellular communication. To that end, in this
work we also present a software tool prototype that supports
the supervised capture and annotation of histology image
data. The aim of this tool is to support the crowdsourcing of
tissue motif knowledge acquisition by the biomedical community.
1 INTRODUCTION
This work is concerned with the discussion of a particular
element in an overall system of ontology for transport physiology in multicellular organisms. The particular focus of
this paper is to support the functional representation of Tissue Motifs. In this paper, a motif is a recurring structural
pattern in biology that may span across multiple scales of
size.
An important feature of motifs of biological structure is that
many of them have a well-established functional significance. For instance, the association between (i) recurring
linear motifs in primary amino acid or DNA sequence and
(ii) functional properties associated with those biological
structures has been extensively documented (e.g. [1, 2]). At
a higher (i.e. tertiary) protein architecture level, linear motifs
of domain superfamily combinations that correlate with distinct functional categories are also known to be evolutionarily conserved [3]. The fact that, despite the extensive combinatorial space available for all possible linear combinations of molecular structure, only a very small fraction of motifs is consistently preserved over evolution indicates that such conserved patterns convey some functional advantage to the bearer of the motif (e.g. by conferring a stable, low-energy conformation of tertiary folding and quaternary binding).
* To whom correspondence should be addressed: bdb@ebi.ac.uk
At a higher size scale of biological structure, basic tissue
organization consists of different proportions of cell types,
as well as the extracellular matrix they secrete. Based on the
number of classes found in the Cell Type ontology [4], the
size of the cell type repertoire is roughly 1,500 (this number
is comparable to the number of superfamilies that constitute
domain motifs at the polypeptide scale [5]). The cellular
architecture of tissues is understood to be under a number of
distinct functional constraints. One of the most fundamental
of these constraints is the key biophysical requirement that
cells have to be within diffusion distance of at least one capillary to ensure appropriate rates of (i) delivery of supplies
(e.g. oxygen, glucose) and (ii) elimination of waste (e.g.
carbon dioxide, urea).
The density of capillary arborization is, in practice, commensurate with the level of metabolic activity typical of that
tissue. Given that the distance over which diffusion occurs is
approximately 100µm, it is not surprising that most mammalian cells are found within 50µm of a capillary [6]. The
same distance constraint applies to paracrine communication
between cells in which molecules secreted by one cell have
to diffuse to reach the plasma membrane surface of another
cell. The combination of these two distance constraints provides the biophysical basis for the structural definition of a
functional tissue unit consisting of cells that (i) are metabolically dependent on the same capillary and (ii) are in paracrine communication with one another. This tissue unit per
se has a cylindrical shape that shares its long axis with the
feeding capillary, has a radius of 50µm and a height of
100µm. The unordered set of cell types that are found in this
tissue unit provides the elements that constitute a Tissue Motif, a motif based on the process of diffusive communication between cells. This approach to developing
Tissue Motifs also has to account for two key properties of
capillary networks, namely that:
1) any particular cell may be within diffusive range of more
than one capillary, and
2) given that a capillary is about 500µm in length, it is possible that the constitution of the Tissue Motif may alter along the course of the capillary from the arteriolar to the venular end.
Paper J
de Bono, B. et al.
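The cylindrical tissue-unit constraint described above can be made concrete with a small sketch. The coordinates and cell types are invented; the motif is simply the unordered set of cell types co-located in the unit.

```python
# Sketch of the geometric constraint above: a tissue unit is a cylinder
# sharing its axis with a capillary (radius 50 um, height 100 um), and a
# Tissue Motif is the unordered set of cell types found inside it. The
# capillary axis is taken along z; all data below are hypothetical.

import math

RADIUS_UM, HEIGHT_UM = 50.0, 100.0

def in_tissue_unit(cell_xyz, axis_z_start):
    """True if the cell lies inside the cylinder around the z-axis
    segment starting at axis_z_start."""
    x, y, z = cell_xyz
    return (math.hypot(x, y) <= RADIUS_UM
            and axis_z_start <= z <= axis_z_start + HEIGHT_UM)

cells = [
    ("hepatocyte", (10.0, 20.0, 30.0)),
    ("Kupffer cell", (40.0, 10.0, 90.0)),
    ("hepatocyte", (80.0, 0.0, 50.0)),   # outside the radial range
]

# The unordered set of co-located cell types is the Tissue Motif
motif = frozenset(ct for ct, xyz in cells if in_tissue_unit(xyz, 0.0))
print(sorted(motif))
```

The two capillary-network caveats above carry over directly: a cell may satisfy this test for several capillary axes at once, and sliding `axis_z_start` along a 500µm capillary may change which cells (and hence which motif) the unit contains.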
In order to further develop and apply Tissue Motif
knowledge in support of our study of multi-scale transport
physiology, this paper presents the requirements and progress achieved in:
1) formalizing knowledge representation about Tissue Motifs (Section 2), and
2) acquiring Tissue Motif knowledge through the analysis of
histology images (Section 3).
The final part of the paper (Section 4) outlines the key motivation for building a crowdsourced knowledge repository of
tissue motifs in support of data analysis in molecular biology.
2 REPRESENTING KNOWLEDGE ABOUT TISSUE MOTIFS
We adopt a formal ontological framework, BFO [7], to provide a formal treatment of Tissue Motifs. BFO has been
chosen for its simplicity and clear-cut distinctions. Furthermore, as that framework has already been applied in related
areas of the biomedical domain, this choice facilitates some
degree of integration of the present treatment with related or forthcoming formalisations ab initio.
In BFO parlance, the world is made of two main kinds of
things: objects, such as material objects, and processes that
involve these objects. We find this high-level dichotomy
adequate for dealing with Tissue Motifs and their role in
physiological processes. According to this view, Tissue Motifs
are on the side of objects insofar as these motifs are patterns
of structural organisation of possibly complex objects (i.e.
tissues). But Tissue Motifs are of course not these objects:
they are not tissues; they are repeating patterns of tissue
structure. In BFO, entities such as patterns fall under a category of so-called Generically Dependent Continuant (which
means that Tissue Motifs need some other entity in order to
exist). We will adopt this view and thus separate (i) motifs
as entities in their own right from (ii) the entities (i.e. tissues) in which they recur as patterns.
While these considerations solve the question of the ontological status of Tissue Motifs, they do little to provide the
formal means for describing and registering the characteristics of Tissue Motifs in general and, in particular, for registering the differential characteristics between distinct motifs.
Certainly, as generically dependent entities, Tissue Motifs
can be characterised as the motifs of some tissue. This however does little more than secure a form of bookkeeping and,
while it is fundamental for some purpose to identify and
collect the association between tissues and their motifs,
more detail is needed. One reason why such associations are
important is that Tissue Motifs give a key to the classification of tissues. Furthermore, once the description of Tissue
Motifs includes enough of the physical characteristics from which to derive characterisations of the physiological processes they allow, richer characterisations of tissues can be achieved, including characterisations of physiological processes now occurring at the tissue level of granularity.
In this abstract, we sort the elements of a forthcoming tissue
knowledge framework according to the representation they
support, namely:
1) the characterisation of Tissue Motifs through (i) the type
of relationships in which they enter with other entities such
as tissues and material parts of these tissues (e.g. cells and
fluid compartments), as well as (ii) the way they are configured in virtue of presenting a given motif;
2) the elicitation of selected aspects of Tissue Motifs allowing for deriving the characteristics of the physiological processes they enable (spatial relationships and distances, in
particular), as well as the various types of processes in question (e.g. processes of flow, stress transfer, electrical transmission, etc.);
3) the description of emerging biological properties and
functions that tissues have in virtue of presenting given motifs or their combinations.
An interesting and also challenging aspect of such
knowledge representation is that it brings together, through
the central treatment of Tissue Motifs, treatments that are
traditionally circumscribed to areas of the biomedical domain but that lack the required articulation to support a multi-scale ontology of transport physiology. Given Tissue Motifs and their formal account, it is possible to articulate the
description of transport phenomena from scales that range
from the molecular to the organ level. Tissue Motifs, therefore, provide a key bridge for the representation of transport
physiology, which can now be traversed as a network of
connected and interdependent knowledge representations.
3 ACQUIRING KNOWLEDGE ABOUT TISSUE MOTIFS
The extraction of Tissue Motifs from both 2D and 3D histology images ensures that generated knowledge is linked to
independent and verifiable pictorial evidence about the tissue architecture in which a motif is found. In this section,
we briefly describe the methodology, implemented as a
software tool, that supports the collaborative generation of
Tissue Motif knowledge through a communal annotation
effort.
The annotation of histology image data requires segmentation of these images. Segmentation circumscribes regional elements that can be discerned in the image
and subsequently semantically annotated. To that end, we
developed a software tool that supports the application of
semantic annotation to histology image segments. Initially,
only the use of manually circumscribed segmentation is
permitted, with the aim of also supporting the application of
automated segmentation methodologies in the future.
Currently, the tool allows the annotation of image segments
with terms from either of two ontologies, namely: the Foundational Model of Anatomy (FMA) [8] and the Cell Type
ontology [4]. The Graphical User Interface (GUI) functionality in the software permits users to drag segments from the
image onto tiles in ApiNATOMY [9] treemap graphical
depictions of the two ontologies (Figure 1). Future plans for the tool include integrating RICORDO [10] webservices in support of storage and querying of the generated metadata.
The software tool per se consists of a Java Web Start application, based on a well-established atlasing tool [11]. The GUI displays two panels: (a) one panel shows image sections and their corresponding segment information, along
with a slider that scrolls through images of a particular object at various levels of resolution (left side of Figure 1). (b)
The other panel is used to display the treemap layout of each
relevant ontology via a series of tabbed views (right side of
Figure 1). Importantly, the application provides a user interface interaction that allows image segment identifier symbols in the image panel to be ‘dragged-and-dropped’ onto
the relevant tile in the treemap panel, by way of creating an
annotation binding. So far, the tool allows users to visually
identify capillaries and their surrounding tissue units as a
means to manually segment and annotate those cells that
make up the Tissue Motif.
The level of inferencing that the annotation tool currently allows is limited to assigning Cell Type ontology terms (associated with relevant image segments) to be contained_in the
FMA anatomical entity term with which the image as a
whole is annotated. Future work will also support mereotopological inferencing on the basis of geometrical containment
relations between segments in the same image.
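The contained_in inference currently supported can be sketched as follows. The term identifiers and the triple format are illustrative, not the tool's actual metadata schema.

```python
# Sketch of the inference described above: each cell-type term annotated
# on an image segment is asserted to be contained_in the FMA term with
# which the whole image is annotated. IDs are hypothetical examples.

def contained_in_triples(image_fma_term, segment_annotations):
    """segment_annotations: segment id -> cell-type ontology term."""
    return [(cl_term, "contained_in", image_fma_term)
            for cl_term in segment_annotations.values()]

segments = {"seg-001": "CL:hepatocyte", "seg-002": "CL:endothelial cell"}
for triple in contained_in_triples("FMA:liver", segments):
    print(triple)
```

The planned mereotopological extension would add a second rule of this kind, deriving part-of or adjacency triples from geometric containment between segments rather than from the image-level annotation.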
4 CONCLUSION
Having access to tissue-level networks of cell types that are able to communicate via diffusive processes is a key first step towards integrating cell-specific gene expression data with experimentally-determined protein-protein interaction networks. In particular, such integration would assist with providing a more cellular context to the interpretation of genome-wide association study findings, through the identification of altered cellular communication routes as a result of genetic mutation.
While both the representation of Tissue Motifs and the tools to acquire such knowledge are at a relatively early stage of development, the potential of the acquisition approach discussed above is very promising. Our aim is to build the requisite infrastructure for an open public knowledgebase of Tissue Motifs that bridges anatomy and cell-type ontologies in a systematic and comprehensive manner. Such a goal can only be reached if the crowdsourcing of both images and curation effort can be harnessed to achieve the required coverage of a wide range of mammalian tissues.
REFERENCES
1. Hodgman, T.C., The elucidation of protein function by sequence motif analysis. Comput Appl Biosci, 1989. 5(1): p. 1-13.
2. Ricke, D.O., et al., Nonrandom patterns of simple and cryptic triplet repeats in coding and noncoding sequences. Genomics, 1995. 26(3): p. 510-20.
3. Vogel, C., et al., Supra-domains: evolutionary units larger than single protein domains. J Mol Biol, 2004. 336(3): p. 809-23.
4. Bard, J., S.Y. Rhee, and M. Ashburner, An ontology for cell types. Genome Biol, 2005. 6(2): p. R21.
5. Andreeva, A., et al., Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res, 2008. 36(Database issue): p. D419-25.
6. Renkin, E.M. and C. Crone, Microcirculation and Capillary Exchange, in Comprehensive Human Physiology: From Cellular Mechanism to Integration, R. Greger and U. Windhorst, Editors. 1996, Springer: New York. p. 1965.
7. Grenon, P., B. Smith, and L. Goldberg, Biodynamic Ontology: Applying BFO in the Biomedical Domain, in Ontologies in Medicine. 2004, IOS Press: Amsterdam. p. 20-38.
8. Rosse, C. and J.L. Mejino, Jr., A reference ontology for biomedical informatics: the Foundational Model of Anatomy. J Biomed Inform, 2003. 36(6): p. 478-500.
9. de Bono, B., P. Grenon, and S.J. Sammut, ApiNATOMY: a novel toolkit for visualizing multiscale anatomy schematics with phenotype-related information. Hum Mutat, 2012. 33(5): p. 837-48.
10. de Bono, B., et al., The RICORDO approach to semantic interoperability for biomedical data and models: strategy, standards and solutions. BMC Res Notes, 2011. 4: p. 313.
11. Potikanond, D. and F.J. Verbeek, Visualization and analysis of 3D gene expression patterns in zebrafish using web services, in Visualization and Data Analysis, C.W. Pak, et al., Editors. 2012, SPIE: Bellingham, WA.
Figure 1. GUI screenshot for the histology image annotation tool.
Comparing closely related, semantically rich ontologies:
The GoodOD Similarity Evaluator
Niels Grewe[1]*, Daniel Schober[2] and Martin Boeker[2]
[1] University of Rostock, Rostock, Germany
[2] University Medical Center Freiburg, Freiburg, Germany
ABSTRACT
Objective To provide an integrated cross-platform ontology evaluation tool based on normalisation techniques and ontology similarity
measures.
Background Ontology similarity measures are extensively used in
ontology matching applications but can also be applied to ontology evaluation scenarios (e.g. in ontology learning) when a ‘gold
standard’ ontology is available against which similarity can be computed. Unfortunately, while there are software packages for similarity
measurement available that are well suited for more terminologically
oriented uses, there is no ready-to-use solution that copes well with
the requirements of more formal ontologies, namely the reliance on
top-level ontologies and the presence of semantically rich axiomatic
class definitions.
Methods We reviewed and applied several similarity measures for
the appraisal of data collected in an ontology teaching experiment.
We also optimised and applied ontology normalisation techniques to
pre-process ontology artefacts in order to produce more consistent
results.
Results We implemented an advanced normalisation procedure to
improve the usefulness of structural similarity measures in the presence of rich class definitions and provide a highly configurable, ready-to-use tool for performing comparisons of individual ontologies or
groups of ontologies.
Conclusion Similarity measurements as established in the ontology
alignment communities can also serve specific use-cases in ontology
evaluation, but their application to semantically richer ontologies, as exemplified by many biomedical ontologies, requires special considerations. We therefore believe that providing
an easily accessible tool for performing similarity measurement under
these conditions is of considerable value to the biomedical ontology
engineering community.
1 INTRODUCTION
Quality assurance and ontology evaluation have become important
topics in recent years. Successful applications of ontologies in real
world usage scenarios can only be expected if they are found to
be pragmatically and representationally adequate. Providing quantitative measures of such adequacy is a desideratum that is considerably hard to attain. One use case where quantifiable data about representational adequacy can be gathered is the comparison of (possibly
automatically derived) ontologies or ontology variants against a
pre-established ‘gold standard’ model, providing an assessment of
similarity for the compared ontologies. This question is also of high
relevance for ontology alignment.
∗ To whom correspondence should be addressed.
In this paper, we present a tool that supports gold standard
similarity measures, while taking into account the specific needs
of formal ontologies that make use of a large subset of the expressivity provided by OWL 2. We will describe the challenges
to similarity measurement arising from the pervasive use of top-level ontologies in many formal ontologies, such as OBO Foundry
compliant [11] biomedical ontologies, and the extensive usage of
complex axiomatic characterisation of classes in an ontology, both
of which introduce a considerable skew in similarity measures, thus
necessitating the development of mitigation strategies.
The original use case of our tool was the assessment of ontologies produced by students as part of a randomized controlled study on an ontology development curriculum [3] against gold standard models of the same modelling task created by experts according to established best practices.
2 PROBLEM STATEMENT
2.1 Use case
The intended use case of our tool was the evaluation of the effect of
ontology teaching on the quality of the artefacts produced. In a randomized controlled trial setting, we had two groups of twelve students
each subjected to different instructional regimes and asked them to
solve different small modelling tasks, such as partially modelling
the anatomy of the stomach. The students were asked to solve these
tasks using the ‘lite’-fragment of the BioTop upper domain ontology
[2] and a number of additionally provided predefined class IDs. For each of these exercises, we also provided a predefined expert model
to be used in the evaluation.
The project’s working hypothesis was that effective training
would empower the students to create more correct models and
would lead to increased similarity with the gold standard and also
increased consistency among the student artefacts. It was therefore necessary to measure, among other things, the similarity of
the student models to the expert model and the internal similarity
among the student ontologies in different groups. While we will not
present the results of this study here, we will describe the software
component we designed for performing the similarity analysis.
2.2
Requirements
Based on our use case, we derived a number of requirements for the
software component and the similarity measurement strategy to be
applied:
1. Since our curriculum relied heavily on the use of more expressive OWL 2 constructs, we required those to be taken into
account when computing similarity.
2. Mere syntactic variants should not be regarded as differences
in a relevant sense.
3. Since most class identifiers were specified beforehand, we
did not require sophisticated lexical matching, and especially
needed to avoid lexical similarities obscuring semantic differences. At the same time, we should be able to account for
unexpected new classes added by students.
4. Since use of top-level ontologies was pervasive both in the curriculum and the experimental setup, differences in the local
domain models should neither be obscured by the top-level, nor
should the effect of the top-level be ignored completely, since
the domain model might make use of the top-level to express
crucial semantic constraints.
Figure 1. Characteristic extracts of the class C from og and ot (two class hierarchies rooted in ⊤, containing the classes A–F)
3 BACKGROUND
3.1 Similarity Measures
A similarity measure, in the sense that is relevant here, is a quantitative estimate of the likeness between two classes in different
ontologies or, by some kind of aggregation, of those ontologies as
a whole or of specific parts of them. Formally, it is a function that
maps pairs of entities (classes or ontologies) into the interval
[0, 1], where 1 indicates sameness or equivalence and 0 indicates
that the entities are orthogonal to one another. An excellent review
of different metrics can be found in [8]. Often, these measures are
collectively described as specifying ‘semantic similarity’, but it is
useful to make some distinctions among them. We will review some
of the more well-known measures in light of our requirements.
3.1.1 Lexical Measures Purely lexical similarity measures can
be used to specifically compute the similarity between resource
identifiers (IRIs), IRI fragments, class labels or other annotations,
for example using the Levenshtein edit distance, which can be easily aggregated to cover the whole ontology. Alternatively, the lexical
information (class labels, IRIs etc.) in an ontology can be mapped to
dimensions to form a vector space model, in which similarity can be
computed as the cosine of the angle between the two vectors [8]. Since
they rely only on lexical information and take neither the taxonomic
structure of the ontology nor the explicit semantics of the class definitions into account, we judged those methods to be ill suited for our
approach, even though they partially fulfill requirement 3 insofar as
they allow identifying the most likely pairs of user-created classes.
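The two lexical techniques mentioned here can be illustrated with a small, self-contained Python sketch (illustrative only, not part of the tool): a normalised Levenshtein edit distance over class labels, and a cosine similarity over bag-of-token vectors built from each ontology's labels.

```python
from collections import Counter
from math import sqrt

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def label_similarity(a: str, b: str) -> float:
    """Normalise edit distance into [0, 1]; 1 means identical labels."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def cosine_similarity(labels_a, labels_b) -> float:
    """Treat each ontology's bag of label tokens as a vector and return
    the cosine of the angle between the two vectors."""
    va = Counter(t for l in labels_a for t in l.lower().split())
    vb = Counter(t for l in labels_b for t in l.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = sqrt(sum(v * v for v in va.values())) * \
           sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0
```

Both functions map into [0, 1] as required of a similarity measure, but, as noted above, neither sees the taxonomic structure or the class definitions.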
3.1.2 Structural Measures Structural similarity measures basically treat ontologies as graphs, thereby explicitly taking into
account their taxonomic and relational structure and allowing ontology similarity measurement to benefit from the vast literature in
graph similarity research. Apart from generic measures, such as
those based on graph edit distance [16] or maximal common subgraphs [4], there are several structural metrics specifically designed
for comparing ontologies:
Triple Based Entity Similarity Triple based entity similarity tries to
account for the similarity of two ontologies in terms of the similarity of the RDF triples they are composed of [5]. This works by
initially ‘seeding’ the similarity computation with lexical similarities
between two nodes in the RDF graph and iteratively refining this
measure by taking into account the overall similarity of the ontology
computed this way until the similarity difference between two iterations drops below a certain threshold. Obviously, the results of this
method depend on the choice of the lexical similarity measure (e.g.
Levenshtein or Jaro-Winkler distance) and the aggregation method; possible choices for both are described in [5, 8].
Since the translation of OWL 2 into RDF is conservative [14],
all the explicitly asserted semantics of the ontology are also represented in the graph and contribute to the overall similarity under
the triple based scheme, though some information that is implicit in
class definitions may not be considered. For this reason, we judge it to at least partially fulfill requirement 1. Additionally, even though this
method makes use of lexical similarity to seed the computation, we
still consider it to fulfill requirement 3 as well, because the lexical
similarity is only used to produce a probable initial mapping that is
later refined using the structure. It is somewhat susceptible to also
capturing syntactic variants (requirement 2), though, and does not
make special provisions for requirement 4.
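The seed-and-refine scheme behind triple based entity similarity can be sketched as a fixed-point iteration. The following is a deliberately simplified Python illustration (the actual method of [5] and its OntoSim implementation differ in detail, e.g. in how neighbour similarities are aggregated): each node pair's similarity is a blend of its lexical seed value and the average similarity of the nodes' graph neighbours, iterated until the largest change falls below a threshold.

```python
def iterative_node_similarity(edges_a, edges_b, seed, alpha=0.5,
                              threshold=1e-4, max_iter=100):
    """Simplified fixed-point node similarity: `edges_a`/`edges_b` map each
    node to its neighbours, `seed` maps node pairs to a lexical similarity
    in [0, 1]. Iterate until convergence and return the refined mapping."""
    sim = dict(seed)
    for _ in range(max_iter):
        new_sim, delta = {}, 0.0
        for (na, nb), s in sim.items():
            neigh_a = edges_a.get(na, [])
            neigh_b = edges_b.get(nb, [])
            if neigh_a and neigh_b:
                # average similarity over all neighbour pairings
                structural = sum(sim.get((x, y), 0.0)
                                 for x in neigh_a for y in neigh_b)
                structural /= len(neigh_a) * len(neigh_b)
            else:
                structural = s  # no structure to draw on: keep current value
            value = (1 - alpha) * seed[(na, nb)] + alpha * structural
            new_sim[(na, nb)] = value
            delta = max(delta, abs(value - s))
        sim = new_sim
        if delta < threshold:   # stop once two iterations barely differ
            break
    return sim

# Toy example: comparing an ontology fragment with itself
edges = {"A": ["B"], "B": []}
seed = {("A", "A"): 1.0, ("A", "B"): 0.0, ("B", "A"): 0.0, ("B", "B"): 1.0}
sim = iterative_node_similarity(edges, edges, seed)
```

As the toy example suggests, a perfect lexical seed on an identical structure is a fixed point of the refinement.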
OLA Similarity Another promising structural similarity measure that
is similar to the triple based method is the OLA (OWL-Lite Alignment) measure by Euzenat and Valtchev [7], which operates by
converting the ontology into a labelled graph that explicitly represents most features of the earlier OWL-lite language in terms of
nodes and edges. The similarity is then computed by a recursive
function that operates on the graph and decomposes complex structures in order to derive an overall similarity. While being a good
fit requirements-wise, OLA was not applicable to our use case because our ontology development tasks were formulated and solved
in OWL 2.
Semantic Cotopy and Common Semantic Cotopy The semantic cotopy (SC) and common semantic cotopy (CSC) measures were
defined by Dellschaft and Staab for the evaluation of ontology learning work against a gold standard model [6]. Both compute the so
called ‘taxonomic’ precision and recall, i.e. how well the computed
taxonomy covers the reference taxonomy. This works by extracting, for a given pair of classes, a characteristic extract from both
ontologies that encompasses all subclasses and superclasses of the
target class (cf. figure 1). For example, in figure 1, the characteristic
extracts of the class C from the ontologies og and ot would be
cesc(C, og) = {⊤, A, C, D, E}
cesc(C, ot) = {⊤, B, C, E}
The taxonomic precision of ot can then be computed as follows
(where, in this case, C1 and C2 refer to the same class name):
tp(C1, C2, og, ot) = |cesc(C1, og) ∩ cesc(C2, ot)| / |cesc(C1, og)|
Precision and its derivative measures such as recall and F-measure
can be used to assess how well both ontologies match.
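The computation can be illustrated with the characteristic extracts of class C from figure 1; the following Python sketch (illustrative only, not the tool's Java implementation) follows the taxonomic precision formula above and derives recall by swapping the reference extract, and the F-measure as their harmonic mean.

```python
def taxonomic_precision(ce_ref: set, ce_comp: set) -> float:
    """tp = |ce_ref ∩ ce_comp| / |ce_ref|, following the formula above,
    with ce_ref the characteristic extract from the reference ontology."""
    return len(ce_ref & ce_comp) / len(ce_ref)

def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

# Characteristic extracts of class C from figure 1 ("Top" stands for ⊤)
ce_og = {"Top", "A", "C", "D", "E"}
ce_ot = {"Top", "B", "C", "E"}

tp = taxonomic_precision(ce_og, ce_ot)   # |{Top, C, E}| / 5 = 0.6
tr = taxonomic_precision(ce_ot, ce_og)   # recall: |{Top, C, E}| / 4 = 0.75
f = f_measure(tp, tr)
```

For the extracts of figure 1 this yields a taxonomic precision of 0.6, a recall of 0.75 and an F-measure of 2/3.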
As opposed to semantic cotopy, common semantic cotopy is a
modification that takes into account the fact that some classes from
one ontology might not have a counterpart in the other and picks the
best estimate in this case.
With regard to requirement 1, (common) semantic cotopy will
only capture those aspects of class definitions which have an effect
on the taxonomy, and CSC does fare well with regard to requirement 3 because it can identify structurally similar classes if new
classes were introduced. It is also easy to adapt it to only consider
classes from top-level ontologies if these appear in the characteristic
extract of a class from the local model (requirement 4).
3.1.3 Semantical Measures Apart from lexical and structural
measures, there have been measures proposed that evaluate the similarity of ontologies in terms of their semantics [13, 1, 9]. These
measures strive to compare the models induced by the ontology, e.g.
by determining which consequences are entailed by both ontologies in question. Such assessments are difficult to conduct, because
the number of consequences of an ontology is in principle infinite.
For example, ‘A subClassOf: B’ also entails ‘A subClassOf:
(B and B)’, ‘A subClassOf: (B and B and B)’, and so on, none of which is particularly interesting. Hence a limited subset needs to be chosen. Naturally, this kind of measure is promising for our use case,
but since it has only recently found limited implementations [10],
we have concentrated on well-understood structural measures when
developing our comparison tool.
3.2
Normalisation
Naı̈ve lexical and structural measures are problematic when they
are put to the task of specifying the similarity of more complex
ontologies: The goal of ontologies is to provide a model that accurately represents some domain of reality. They thus thrive on
explicit semantics and it is debatable whether purely syntactical
comparisons are sufficient here, since they also capture superficial
syntactical differences between ontologies. In this light and in view
of our requirements, we chose triple based entity similarity and
common semantic cotopy to be the most fitting candidates for the
implementation of our software component.
Additionally, we were very interested in further reducing the sensitivity of the measures to syntactic variations (requirement 2). To
this end, Vrandečić and Sure [17, 18] have devised a normalisation
algorithm that transforms ontologies into a more predictable form
(while preserving the semantics), thus enhancing the utility of structural similarity measures. This concept of normalisation has to be
distinguished from the ‘untangling’ of poly-hierarchies, for which
the term ‘normalisation’ is also used and which is mainly interested in presenting ontologies in a way that is more manageable for
the ontology engineer [15]. Contrary to this, normalisation according to Vrandečić explicitly creates poly-hierarchies and proceeds as
follows:
1. Create names for all anonymous complex class expressions
2. Create names for all anonymous individuals
3. Reason over the ontology in order to materialise all missing
subsumption links
4. Clean redundant and circular subsumptions
5. Assert all individual and property instances to be of the most
specific type
6. Normalise property usage, i.e. avoid inverses
Step 1 modifies the class hierarchy so that all subClassOf: axioms are asserted exclusively between named classes. So whenever
a complex class expression appears on the left or right hand side
of a subClassOf: axiom, a new named class is introduced and
defined by an equivalentTo: axiom to be equivalent to the
original class expression. Step 2 similarly assigns explicit names
to anonymous individuals in the ABox, which might be shown to be
identical to other individuals by reasoning.
Steps 3–5 are the ones crucial to representing the semantics of
the graph, and thus rely on a description logics reasoner to be performed. Consequently, the reasoner is used in step 3 to create all
subsumption links that are entailed by the ontology but had not been
explicitly asserted. Since this produces a large number of redundant
subClassOf: axioms, step 4 removes all subsumption links that
hold by transitivity alone and replaces those which form a cycle
with equivalence axioms.
Step 5 uses the inferred information on the level of individuals and
assigns the inferred types to both property and individual instances,
while step 6 merely picks a canonical name for all properties used
(e.g. by replacing inverseOf property expressions with normal
names).
All these procedures preserve the semantics of the ontology while
at the same time making sure that much implicit information is
explicitly represented in the ontology graph, thus making it more
amenable to structural similarity analysis.
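Steps 1 and 4 can be illustrated with a toy Python sketch over a tuple-based axiom representation (an assumption for illustration: axioms as triples, complex class expressions as nested tuples such as `('and', 'B', 'C')`; the actual normalisation operates on OWL 2 ontologies and uses a reasoner for step 3).

```python
from itertools import count

def name_complex_expressions(axioms):
    """Step 1: wherever a complex class expression occurs in a subClassOf
    axiom, introduce a fresh named class plus an equivalence axiom."""
    fresh = (f"N{i}" for i in count(1))
    names, out = {}, []

    def named(expr):
        if isinstance(expr, str):       # already a named class
            return expr
        if expr not in names:
            names[expr] = next(fresh)
            out.append(("equivalentTo", names[expr], expr))
        return names[expr]

    for kind, lhs, rhs in axioms:
        if kind == "subClassOf":
            out.append(("subClassOf", named(lhs), named(rhs)))
        else:
            out.append((kind, lhs, rhs))
    return out

def remove_redundant_subsumptions(links):
    """Part of step 4: drop subClassOf links that already hold by
    transitivity through another class."""
    parents = {}
    for sub, sup in links:
        parents.setdefault(sub, set()).add(sup)

    def reachable(a, b, skip):
        stack = [x for x in parents.get(a, ()) if (a, x) != skip]
        seen = set()
        while stack:
            n = stack.pop()
            if n == b:
                return True
            if n not in seen:
                seen.add(n)
                stack.extend(x for x in parents.get(n, ()) if (n, x) != skip)
        return False

    return [(a, b) for a, b in links if not reachable(a, b, skip=(a, b))]

# Step 1 on 'A subClassOf: (B and C)'; step 4 on a redundant link A -> C
normalised = name_complex_expressions([("subClassOf", "A", ("and", "B", "C"))])
pruned = remove_redundant_subsumptions([("A", "B"), ("B", "C"), ("A", "C")])
```

Steps 3 and 5, by contrast, cannot be sketched without a reasoner, since they materialise entailed rather than asserted information.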
4 METHODS
4.1 Improvements to the Normalisation Algorithm
Since the explicit semantics of axiomatic class definitions were important for our evaluation, we regarded it as crucial to implement
a variant of the aforementioned ontology normalisation, though we
made some modifications in our implementation. On the one hand,
we restricted the scope of the normalisation to suit our use case.
Since our modelling experiments did not involve instance bases, we
did not implement steps 2 and 5. The experimental setup already
included a stringent regimen on the use of properties (especially the
students were required to use preexisting object properties), so step
6 was expendable as well. We also discovered that the explicit
cycle detection for the taxonomy in step 4 was not necessary, since
the reasoner already inferred subsumption in both directions for every pair of classes that were part of a cycle. These could be replaced
locally by equivalentTo: axioms.
On the other hand, the suggestions of Vrandečić and Sure proved
to be not sufficient in some cases. Among other things, we could not
distinguish between cases where different class expressions were
embedded in, for example, the range of an existential restriction
(for example ‘A subClassOf: has locus some (B and C)’
as opposed to ‘A subClassOf: has locus some (C and D)’),
for which the original normalisation proposal would only generate a named class that would very often be inferred to occupy a
structurally similar place in the class hierarchy.
We thus augmented the normalisation algorithm with a procedure that takes inspiration from OLA (cf. section 3.1.2), which uses a recursive algorithm to determine similarity, and
from [13] who extract ‘signatures’ from class definitions in order to
compare their similarity, where a signature is defined as the set of
primitive concepts and restrictions used in the axioms that define the
class.
Our procedure works by explicitly generating named classes for
the components of the class definition, so that they can be explicitly
represented in the ontology graph. We derive the set of classes to
be generated by computing what we call the decomposition set of
the class definition. We defined the decomposition set as follows
(presentation follows OWL 2 functional syntax):
Figure 2. Architecture of the GoodOD Evaluator (configuration management drives a workflow of normalisation and similarity measurement, drawing on an ontology cache and on similarity measures such as semantic cotopy, common semantic cotopy and OntoSim)

Definition 1 (Decomposition Set): Let C be a class expression; then the decomposition set DS(C) of C consists of C and the members of the following sets of classes:

• If C is an atomic class: {}
• If C is an ObjectIntersectionOf D1 ... Dn expression: the DS’s of all conjuncts D1 ... Dn
• If C is an ObjectUnionOf D1 ... Dn expression: the DS’s of all disjuncts D1 ... Dn
• If C is an ObjectComplementOf D expression: for every member M of DS(D), ObjectComplementOf M is a member of DS(C).
• If C is an ObjectOneOf E expression, where E is a set of individuals: the set of all ObjectOneOf expressions that can be constructed with members of the power set of E.
• If C is of the form K P D, where K is one of {ObjectSomeValuesFrom, ObjectAllValuesFrom, ObjectMinCardinality n, ObjectMaxCardinality n, ObjectExactCardinality n} and P is an object property expression: all class expressions that arise from combining K P with the members of DS(D).
• If C is an ObjectHasValue P I, ObjectHasSelf P, or data property restriction: {}

The connections between the original classes and the classes added this way are easily inferred by a reasoner. For example, the original expression ‘A subClassOf: has locus some (B and C)’ would decompose into the following set of additional axioms:

D1 equivalentTo: has locus some B
D2 equivalentTo: has locus some C
A subClassOf: D1
A subClassOf: D2

This kind of decomposition is of course far from complete. It does, for example, not take into account the property hierarchy of the ontology. One solution would be to also materialise the subsumption hierarchy of object property restrictions (e.g. if has part is a subproperty of has locus, ‘has locus some B’ subsumes ‘has part some B’). We decided not to incorporate this kind of decomposition rule because it seemed to unfairly penalise the absence of property restrictions for many similarity measures.

4.2 Implementation

4.2.1 Architecture Our evaluation tool is implemented as a Java application with a command-line interface. It relies on the OWLAPI [12] for processing the OWL 2 ontologies and on the HermiT reasoner for reasoning over the ontologies, though we also make use of the JENA and OntoSim libraries (cf. 4.2.3).

We provide rich facilities for customisation of the comparison process, either through command-line switches or through configuration files in a property list format (plain text or XML), allowing users to quickly tailor comparisons to their needs.

The main workflow that is constructed based on the configuration consists of normalisation and subsequent comparison of two individual ontologies or groups of ontologies. For performance reasons, the normalisation process is embedded in a caching mechanism, which avoids resource-intensive, redundant normalisation passes. The normalisation of individual ontologies, as well as the comparison of pairs of ontologies, is implemented in a multi-threaded fashion, so that modern multi-core hardware can be used efficiently.

As output, our tool provides a command-line summary of the comparisons conducted and also produces comma-separated value tables with the individual results.
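The decomposition set of Definition 1 lends itself to a direct recursive implementation. The following is a toy Python sketch, assuming class expressions are modelled as nested tuples with atomic classes as plain strings (the tool's actual implementation operates on OWLAPI objects):

```python
def decomposition_set(expr):
    """Compute the decomposition set DS(C) of a class expression,
    following Definition 1. Expressions are nested tuples, e.g.
    ('some', 'has_locus', ('and', 'B', 'C')); atomic classes are strings."""
    if isinstance(expr, str):
        return {expr}                     # atomic class: DS(C) = {C}
    op = expr[0]
    if op in ("and", "or"):               # intersection / union
        ds = {expr}
        for operand in expr[1:]:
            ds |= decomposition_set(operand)
        return ds
    if op == "not":                       # complement
        return {expr} | {("not", m) for m in decomposition_set(expr[1])}
    if op in ("some", "only"):            # quantified restrictions (K P D)
        prop = expr[1]
        return {expr} | {(op, prop, m) for m in decomposition_set(expr[2])}
    return {expr}                         # hasValue, hasSelf, data: DS = {C}

# The example from the text: 'has_locus some (B and C)'
ds = decomposition_set(("some", "has_locus", ("and", "B", "C")))
```

For the example expression, the sketch yields exactly the components that the additional axioms D1 and D2 name, namely ‘has_locus some B’ and ‘has_locus some C’, alongside the original expression itself. Cardinality restrictions are omitted here for brevity.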
4.2.2 Normalisation Modules Different aspects of normalisation
are implemented by separate classes, which can be dynamically
composed by instances of the NormalizerChain class to form
the core normalisation component of the workflow. This allows
for easy customisation of the normalisation process and facilitates
extension. In detail, we provide classes for minor tasks such as
rewriting IRIs or rerouting imports for the ontology. Different steps
of the Vrandečić normalisation algorithm (creating names for class
expressions, subsumption materialisation) are also implemented
separately and can be plugged into the workflow as needed, as
can the class responsible for implementing the decomposition set
procedure described above.
Since normaliser classes are loaded dynamically based on a configuration file, users can easily implement their own normalisation
modules by having their classes conform to the Normalizer
interface.
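The dynamic composition described here is essentially a chain pattern, which can be sketched as follows in Python (stand-ins for the tool's Java Normalizer interface; the two concrete normaliser classes shown are hypothetical examples, and ontologies are reduced to sets of class IRIs for illustration):

```python
class Normalizer:
    """Minimal stand-in for the tool's Normalizer interface."""
    def normalize(self, ontology):
        raise NotImplementedError

class RewriteIRIs(Normalizer):
    """Hypothetical module: rewrite one IRI namespace into another."""
    def normalize(self, ontology):
        return {c.replace("http://old#", "http://new#") for c in ontology}

class DropDeprecated(Normalizer):
    """Hypothetical module: discard classes flagged as deprecated."""
    def normalize(self, ontology):
        return {c for c in ontology if "deprecated" not in c}

class NormalizerChain(Normalizer):
    """Composes normalisation modules; each step receives the output of
    the previous one, mirroring the dynamically composed chain."""
    def __init__(self, steps):
        self.steps = steps
    def normalize(self, ontology):
        for step in self.steps:
            ontology = step.normalize(ontology)
        return ontology

chain = NormalizerChain([RewriteIRIs(), DropDeprecated()])
result = chain.normalize({"http://old#A", "http://old#deprecatedB"})
```

The design keeps each module oblivious of its neighbours, which is what makes configuration-driven loading of user-supplied modules straightforward.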
4.2.3 Comparison Modules For comparison, we in part rely on
the OntoSim library [8], which implements many similarity measures, often in a very API-agnostic way. This allowed us to easily
plug into its functionality to provide lexical similarity measurement
based on a cosine vector model. We also integrated triple based
entity similarity using OntoSim, which was only possible by serialising our OWLAPI data structures in memory and importing them
into JENA before passing them into OntoSim. Since we also provide an abstract class for interfacing with OntoSim, it is easily possible to extend our tool to cover more of the measures implemented therein.

                 unrestricted              restricted
                 F-Measure (n = 312)       F-Measure (n = 312)
Normalisation    Mean        SD            Mean        SD
None             0.2552      0.2874        0.2552      0.2874
With Imports     0.9385      0.0342        0.7580      0.1230
Vrandečić        0.8641      0.0505        0.7509      0.1066
GoodOD           0.8353      0.0616        0.7220      0.1626

Table 1. Mean taxonomic F-measure for comparisons of student-generated ontologies against expert models, using various normalisation procedures.
Unfortunately, a general limitation of OntoSim is that it only
takes into account classes defined in the ontologies for which the
similarity is being computed, neglecting the impact of top-level ontology classes which were frequently used in our ontologies. For that
reason, we also implemented both semantic cotopy and common semantic cotopy based taxonomic precision and recall to supplement
OntoSim. This implementation can be configured to either take into
account classes from imported ontologies or to ignore them. Also,
including the entire imported top-level ontology was often not desirable since it was shared by all ontologies in our data set, so we added
facilities for comparing only a limited set of pre-defined classes.
As with the normalisation modules, comparison modules are resolved at runtime based on the configuration, so users can easily
supplement the available similarity measures by writing classes
implementing the Comparator interface.
5 RESULTS AND DISCUSSION
We are making available the GoodOD Similarity Evaluator under the GNU General Public License (GPL) v3, from our website
at http://purl.org/goodod/evaluator; the source code can be downloaded from https://github.com/goodod/evaluator.
In general, we found that we were able to cope even with large
numbers of comparisons in reasonable amounts of time on commodity hardware. In our tests, a machine with two 1.6GHz x86-64 CPU
cores and 4GB RAM computed 1369 similarity values for different
pairs from a set of 325 ontologies using a complex normalisation
chain in about 100 minutes wall time. These ontologies contained
between 60 and 70 classes (including top-level classes), with about the same
number of classes being generated during normalisation. Since we
have not yet exploited some of the more obvious avenues of optimisation (for example, not only different ontologies could be compared
in parallel, but also individual classes), we are confident that the tool
can be useful for larger ontologies as well.
On the whole, the most resource-intensive part of the process seems to be reasoning over ontologies that are enriched with a
large number of additional class expressions during the normalisation process, which is a common problem for all applications that
rely on reasoning over highly formal ontologies. We have not timed
the individual steps of the computation, though.
To assess the effect of the normalisation procedures, we performed an analysis of 13 × 24 comparisons of student ontologies
against expert models from our dataset (table 1). The comparisons
were performed both without restrictions on the scope of classes
being compared (i.e. including the imports closure) and restricted
to just the classes defined in the original ontology using common
semantic cotopy (CSC) under a variety of normalisation procedures.
The results show that not considering the imports closure of the
ontology leads to similarity measures that are hardly credible,
which is probably due to the limited number of classes (usually 10–
15) in the artefacts considered, so that every variation has a large
impact on the overall result. Also, computing similarity over the entire imports closure in an unrestricted way obscures all meaningful
differences because the shared top-level ontology dwarfs the ontology fragments containing the actual domain models by an order of
magnitude.
Considering only the restricted results, the Vrandečić normalisation procedure – while having no marked effect on the mean
similarity – slightly reduces the standard deviation (SD). This might,
for example, be attributed to the fact that it eliminates differences
arising where some ontologies were explicitly asserting taxonomic
links, while others were leaving them implicit.
Our augmented (‘GoodOD’) version that performs decomposition of class definitions reduces the observed similarity while at the
same time increasing the standard deviation. This means that the differences between the ontologies in the field are slightly emphasised,
but not in an implausible way. We attribute this effect to the fact that
the decomposition set procedure brings to light subsumption relationships that were only implicit in the original taxonomy and thus
would not be taken into account when computing similarity based
on CSC.
A principled evaluation of the merits of these measures is difficult,
but they generally seem to be in line with the manual appraisal of a
sample of the ontologies. Nonetheless, based on these results we conclude that our initial requirements were fulfilled by the evaluation
software by a combination of the following factors:
• By generating new classes using the decomposition set procedure, we could account for more expressive features of OWL
2 even though we only used structural similarity measures.
(requirement 1)
• By implementing the Vrandečić normalisation algorithm, we
successfully suppressed a great deal of skew created by semantically equivalent syntactic differences. (requirement 2)
• By using a structural similarity measure (such as triple based
entity similarity or common semantic cotopy), we could successfully ignore lexical information as far as possible. (requirement 3).
• By restricting the aggregation of individual similarity values to
those obtained from classes in the local domain models, while
still considering top-level classes as part of the characteristic
extract in the CSC algorithm, we faithfully represented the effect of the top-level ontology without obscuring the differences
we were trying to assess. (requirement 4)
We have used our tool for a detailed evaluation of our ontology
teaching experiments [3], which generally produced meaningful results. But although our measurement facilities were initially tailored
for this use-case, they should be transferable to other, more conservative, uses (for example in ontology learning) and are provided in
a highly configurable, ready-to-use way.
6 CONCLUSION
We do not believe that measuring similarity is in general a useful
tool for solving quality problems in biomedical ontologies because
it assumes the existence of a known good model of the domain covered by the ontology. The problem would hence just be shifted to
verifying the quality of that model.
Nonetheless, we believe that the availability of ready-to-use software for performing such measurements can be beneficial at least
in some cases. It has already proven to be very useful for examining the status of a student cohort in an ontology teaching setting,
where students usually work on small, clearly delineated problems
for which a canonical solution is already available.
Another obvious application is quality assessment on ontologies
generated by machine learning, since it can make a big difference for
the outcome of an evaluation whether the effect of class definitions
has been taken into account or not.
Our work on the evaluation tool also shows that generic graph-based similarity measures have tremendous limitations when applied to ‘heavy duty’ ontologies that make extensive use of the more
expressive subsets of OWL 2 (i.e. they only capture information
that is explicitly represented in the graph structure). And while we
have shown that there are workarounds for some of these issues,
we believe that the entire enterprise will tremendously benefit from
further research in and the implementation of bona-fide semantical
similarity measures.
ACKNOWLEDGMENTS
This work is supported by the German Science Foundation (DFG) as
part of the research project JA 1904/2-1, SCHU 2515/1-1 GoodOD
(Good Ontology Design).
REFERENCES
[1] Rudi Araújo and Helena Sofia Pinto. ‘Towards Semantics-based Ontology Similarity’. In: Proceedings of the 2nd International Workshop on Ontology Matching (OM-2007), collocated with the 6th International Semantic Web Conference (ISWC-2007) and the 2nd Asian Semantic Web Conference (ASWC-2007). Ed. by Pavel Shvaiko et al. 2007. URL: http://ceur-ws.org/Vol-304/paper4.pdf.
[2] Elena Beisswanger et al. ‘BioTop: An Upper Domain Ontology for the Life Sciences’. In: Applied Ontology 3.4 (2008), pp. 205–212. DOI: 10.3233/AO-2008-0057.
[3] Martin Boeker et al. ‘Teaching Good Biomedical Ontology Design’. In: Proceedings of the 3rd International Conference on Biomedical Ontology (ICBO). Ed. by Ronald Cornet and Robert Stevens. 2012. URL: http://ceur-ws.org/Vol-897/sessionJ-paper25.pdf.
[4] Horst Bunke and Kim Shearer. ‘A graph distance metric based on the maximal common subgraph’. In: Pattern Recognition Letters 19.3–4 (1998), pp. 255–259. DOI: 10.1016/S0167-8655(97)00179-7.
[5] Jérôme David and Jérôme Euzenat. ‘Comparison between ontology distances (preliminary results)’. In: Proceedings of the 7th International Semantic Web Conference (ISWC). Ed. by Amit P. Sheth et al. 2008, pp. 245–260.
[6] Klaas Dellschaft and Steffen Staab. ‘On How to Perform a Gold Standard Based Evaluation of Ontology Learning’. In: Proceedings of the 5th International Semantic Web Conference (ISWC). Ed. by Isabel Cruz et al. 2006, pp. 228–241.
[7] Jérôme Euzenat and Petko Valtchev. ‘Similarity-based ontology alignment in OWL-Lite’. In: Proceedings of the 16th European Conference on Artificial Intelligence (ECAI). Ed. by Ramon López De Mántaras and Lorenza Saitta. 2004, pp. 333–337.
[8] Jérôme Euzenat et al. D3.3.4: Ontology Distances for Contextualisation. Technical report. NeOn Consortium, 2009.
[9] Jérôme Euzenat. ‘Semantic Precision and Recall for Ontology Alignment Evaluation’. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI). Ed. by Manuela Veloso. 2007, pp. 348–353.
[10] Daniel Fleischhacker and Heiner Stuckenschmidt. ‘A Practical Implementation of Semantic Precision and Recall’. In: Proceedings of the Sixth International Conference on Complex, Intelligent, and Software Intensive Systems (CISIS). Ed. by Leonard Barolli et al. 2010, pp. 986–991. DOI: 10.1109/CISIS.2010.97.
[11] OBO Foundry. OBO Foundry Principles. URL: http://www.obofoundry.org/crit.shtml (visited on 03/08/2012).
[12] Matthew Horridge and Sean Bechhofer. ‘The OWL API: A Java API for OWL Ontologies’. In: Semantic Web Journal 2.1 (2011), pp. 11–21.
[13] Bo Hu et al. ‘Semantic metrics’. In: Proceedings of the 15th International Conference on Knowledge Engineering and Knowledge Management (EKAW). Ed. by Steffen Staab and Vojtěch Svátek. 2006, pp. 166–181.
[14] Peter F. Patel-Schneider and Boris Motik, eds. OWL 2 Web Ontology Language: Mapping to RDF Graphs. 2009. URL: http://www.w3.org/TR/2009/REC-owl2-mapping-to-rdf-20091027/ (visited on 06/09/2012).
[15] Alan L. Rector. ‘Normalisation of ontology implementations: Towards modularity, re-use, and maintainability’. In: Proceedings of the Workshop on Ontologies for Multi-Agent Systems (OMAS), in conjunction with the European Knowledge Acquisition Workshop. 2002.
[16] Linda G. Shapiro and Robert M. Haralick. ‘Structural Descriptions and Inexact Matching’. In: IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-3.5 (1981), pp. 504–519. DOI: 10.1109/TPAMI.1981.4767144.
[17] Denny Vrandečić and York Sure. ‘How to Design Better Ontology Metrics’. In: Proceedings of the 4th European Semantic Web Conference (ESWC). Ed. by Enrico Franconi, Michael Kifer and Wolfgang May. 2007, pp. 311–325.
[18] Denny Vrandečić. ‘Ontology Evaluation’. PhD thesis. Karlsruhe: KIT, Fakultät für Wirtschaftswissenschaften, 2010.
Paper L
ZooAnimals.owl: A didactically sound example-ontology for
teaching description logics in OWL 2
Daniel Schober1*, Niels Grewe2, Johannes Röhl2, Martin Boeker1
1 Institute of Medical Biometry and Medical Informatics (IMBI), University Medical Center, Freiburg, Germany
2 University of Rostock, Rostock, Germany
* To whom correspondence should be addressed.
ABSTRACT
Motivation: Over the years several OWL ontologies have
been published to serve as training examples in teaching the
principles of description logics and ontology development to
a wide audience.
Problem: As some of these resources make no commitment to the biomedical target domain, they raise the learning threshold by adding the burden of an unfamiliar topic area on top of the technicalities and the intended domain-independent description logics principles.
Solution: We argue that the outstanding role of formal ontologies in the life sciences is important enough to warrant a
robust teaching ontology that is optimized for the biomedical
ontology novice intending to learn the basics of description
logics in OWL 2.
Result: We present a list of requirements for user-compliant teaching ontologies. We also exemplify these in the ZooAnimal Ontology, justifying design decisions with intuitive teaching examples and exercises, carried out and tested in a didactic teaching effort, the GoodOD summer school.
1 INTRODUCTION
Drawbacks of existing teaching ontologies
Over the years several OWL DL ontologies have been published to serve as training examples in teaching the principles of description logics. However, most existing resources
display one or more of the following drawbacks1:
· They are often insufficiently aligned to the targeted recipients in the biomedical domain. The modelling domains do not match user requirements particularly well, as biomedical researchers are not necessarily experts in wines, pizzas or marsupials. Learning description logic (DL) principles is thus hampered by having to learn domain knowledge in remote fields, which, although entertaining, sometimes misleads the novice into modelling unicorns where they are not needed.
· Existing teaching artefacts are also built according to diverse, sometimes conflicting ontological commitments and lack a clearly delineated usage scope that would justify the exact expressivity they use. In the Protégé tutorials, for example, the expressivity seems to be guided solely by the intent to explain all the capabilities of the Protégé tools.
· They make no explicit commitment to certain modelling policies. Most teaching ontologies do not clarify what kinds of entities they commit to, i.e. they fail to state their criteria for discriminating classes from individuals, or fail to discuss whether information content entities are admissible in an ontology, etc.
· They do not make use of established top-level or upper-level ontologies (TLOs), but invent their own upper level and their own relations, again increasing the burden of orientation within the artefact.
· They use overly granular TLOs with hard-to-grasp class names like independent continuant and occurrent.
· They are insufficiently restricted to a particular modeling flavor and OWL expressivity, and this choice is rarely specified or justified. Ideally, it would be made according to a targeted application use case.
· They lack instance data that could serve as examples.
· They are oversimplified: they are not realistic in terms of the granularity and complexity of axiomatic descriptions. Such simplicity can mislead modelers into judging ontology engineering to be trivial.
To provide intuitive teaching examples within the GoodOD project2, a content domain had to be selected that alleviates the drawbacks mentioned above. Building on our earlier analysis of ontology simplification3, we here define a list of requirements for such a didactic artefact. An OWL ontology was implemented as a practically usable example along these requirements.
MATERIALS AND METHODS
The domain to be represented in our didactic ontology was
chosen according to our target user group, life science students with some IT background.
The resource was first developed as one comprehensive artefact containing all the teaching examples in one file importing BioTop4, which is compliant with the established TLOs BFO and DOLCE. BioTop, however, is too complex and distracting for the purposes of a teaching ontology; important classes like ‘Human’, for example, reside ten hierarchy levels below the root, which makes orientation difficult for the novice. We therefore simplified the standard BioTop into a BioTop light version and re-implemented solely the needed parts
of it under the ZooAnimal namespace. Classes and relations
not used were deleted, e.g. TaxonValueRegion and CanonicityValueRegion. ImmaterialThreeDimensionalPhysicalEntity was deleted, but its more intuitive subclass Place was
retained under root. The class Living was moved from BiologicalLife under Action and BiologicalLife was deleted.
Inverse relations were only added where they served examples. Domain and range constraints, as well as relation properties like transitivity, were re-added manually. Some names were simplified, e.g. from hasProperPhysicalPart to hasProperPart.
We derived ubiquitously known animal individual names from a suitable wiki page5 (e.g. Keiko the Orca), but also
generated typical anonymized instance names, e.g. Cow
#213, and even photos of individuals. Class labeling and metadata completeness were checked with the OntoCheck tool6.
To keep the working examples simple, single modules were refactored out for the curriculum exercises and presented to the students in small, self-standing OWL files.
2 RESULTS
Informative entity naming: We have elaborated on intuitive and explicit class labeling best practices extensively in
another paper8.
Cognitive ergonomics aligned with the mesocosm: Most modeling cases in the Zoo ontology are from the traceable mesocosm9. The scale of all participating individuals falls within a human being’s ‘Lebenswelt’ (lifeworld). These can further be separated along the dimensions of the major top-level categories: for material objects it is the meso-scale size/length, for processes the meso-scale duration/time. ‘Meso-world facts’ are likely to be close to the way humans think: most of this world is directly and immediately accessible to our human senses, and our brains were trained on, and co-evolved with, cognitive models10 in coherence with this directly and immediately perceivable meso-world during childhood and socialization11. Hence many facts in such a domain are likely to be aligned with categorizations that already exist in our heads. This enables the modeler to memorize and verify class definitions and drawn inferences quickly, without being distracted from the didactic principles to be conveyed. Our animal domain is hence cognitively intuitive and ergonomic12 in comparison to domains from the micro-level, such as genetics or metabolomics.
Requirements for domain-aligned teaching ontologies
We here outline some requirements for good teaching ontologies.
Artefact size: The ontology should be as small as possible
to foster fast memorization and orientation.
Domain alignment: We chose to model the area of zoo animal taxonomy at the easy level of what is written, e.g., on the signposts any zoo displays for each of its animals, listing their characteristic features, peculiarities, geographic origin and some contextual knowledge. This is likely to be common-sense knowledge in the life-science domain, and it is something any biomedical scientist can immediately relate to, i.e. understand right away, without the time-consuming need to refresh their knowledge from external sources. A lack of domain knowledge is hence not distracting from the didactics in implementation and modeling tasks.
Delineated background ontology commitment: No matter whether one is a realist or a pragmatist, some words should be given on how this conflict is handled in the intended teaching artefact. We in fact subscribe to an intermediate position of pragmatic realism and have no problem representing unicorns as an InformationEntity.
Coverage and granularity: The domain should be rich enough to cover all classic modeling challenges within the scope of our curriculum. The granularity should fit the compositional approach taken in DL-based languages.
Complexity: The ontology should not contain too many overly complicated, long or nested restrictions. The simpler the axiomatizations, the more easily they will be grasped by the novice, fostering ontology understanding7.
The ZooAnimal.owl teaching ontology
The ZooAnimal.owl ontology, in OWL 2 expressivity SHO, currently consists of 123 classes, the root classes being Action, Disposition, InformationObject, MaterialObject, Place, Quality and ValueRegion. 11 individuals, 159 subclass axioms and 42 equivalent-class axioms were asserted (with a hidden GCI count of 36). Of the 11 object properties (all from BioTop light), two are subproperties and three are transitive. All but one object property, the hasPart superproperty, have domain and range specified. The ontology contains no number or value restrictions, as these would disturb tableau-based reasoning and better fit an advanced curriculum.
The OWL file is available at our webpages:
http://www.imbi.uni-freiburg.de/ontology/ZooAnimals
The actual didactic exercises of the curriculum will be described in another paper.
Coverage of topics and examples
The ontology contains all standard expressions most likely
to occur in our domain. We here list examples to show how
they are expressed correctly in OWL 2 in our teaching ontology.
DL language constructs:
Set-theoretic basics and taxonomy: Venn diagrams were used to introduce sets of individuals and to derive classes from common properties. The meaning of the SubClassOf relation, the only relation between classes used in building the taxonomic class hierarchy, was explained here.
Logical Operators: We introduced intersections, unions
and complements, e.g. KoalaBear subClassOf (Mammal
and Herbivore).
BirdThatCanNotFly equivalentTo Bird and (not (bearerOf
some FlyingDisposition))
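The extensional reading behind these operators can be mirrored with plain set operations. The following sketch uses illustrative placeholder individuals, not instance data from the ontology file:

```python
# Extensional reading of the class operators: classes as sets of
# individuals (all names below are illustrative placeholders).
mammals    = {"Keiko", "Flipper", "KoalaSam"}
herbivores = {"KoalaSam", "Cow213"}
birds      = {"Tweety", "Pingu"}
flyers     = {"Tweety"}
domain     = mammals | herbivores | birds   # all individuals under consideration

koala_bears = mammals & herbivores                  # Mammal and Herbivore
birds_that_cannot_fly = birds & (domain - flyers)   # Bird and (not Flyer)

print(koala_bears)             # {'KoalaSam'}
print(birds_that_cannot_fly)   # {'Pingu'}
```

Note that the complement is taken relative to the domain of all individuals, mirroring the fact that OWL’s `not` is interpreted against the whole interpretation domain.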
Relations and their properties: Domain and range, as well as transitivity, were explained using partonomies, e.g. VertebralColumn hasProperPart some Vertebra, Vertebrata hasProperPart some VertebralColumn, Wolf SubClassOf Vertebrata → Wolf hasProperPart some Vertebra. Inverses like hasProperPart / properPartOf were explained.
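The transitive entailment can be mirrored by a naive transitive closure over a toy fact base (the part-of pairs below are illustrative, not the ontology’s actual assertions):

```python
# Naive transitive closure over a toy hasProperPart fact base,
# mirroring what a reasoner entails for a transitive property.
def transitive_closure(pairs):
    closure = set(pairs)
    while True:
        # compose every chain (a, b), (b, d) into the derived pair (a, d)
        derived = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if derived <= closure:
            return closure
        closure |= derived

has_proper_part = {
    ("Wolf", "VertebralColumn"),
    ("VertebralColumn", "Vertebra"),
}
print(("Wolf", "Vertebra") in transitive_closure(has_proper_part))  # True
```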
Defined classes, axiomatisation: Canonical insects have 6 legs, spiders have 8 legs. As insects can survive with fewer than 6 legs, this example can also serve to discuss the need to introduce more granular but unintuitive modeling constructs such as dispositions.
Property hierarchies: E.g. hasPart with its subproperties
hasGranularPart and hasProperPart.
Restriction types: EquivalentTo vs. SubClassOf. E.g. Bird SubClassOf has_part some Beak; but the reverse statement, that everything that has some beak as a part is a bird, is falsified by the Platypus, a mammal with a beak.
Existential vs. universal quantifiers: E.g. a KoalaBear eats only EucalyptusPlant and nothing else. Usually the universal quantifier is accompanied by a closure: in this case we also need to assert an existential quantification.
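The difference between the two quantifiers, and why the universal one needs an accompanying existential, can be mirrored over explicit filler lists. The instance data and helper names below are illustrative: `any` gives the existential reading, `all` the universal (and vacuously true) one:

```python
# 'eats some X' vs 'eats only X' over explicit filler lists
# (illustrative instance data, not taken from the ontology file).
eats = {
    "KoalaSam": ["EucalyptusPlant", "EucalyptusPlant"],
    "HungryRock": [],   # an individual that eats nothing at all
}

def eats_some(individual, cls):
    # existential: at least one filler of the given type
    return any(f == cls for f in eats[individual])

def eats_only(individual, cls):
    # universal: vacuously true when there are no fillers at all
    return all(f == cls for f in eats[individual])

print(eats_some("KoalaSam", "EucalyptusPlant"))   # True
print(eats_only("HungryRock", "EucalyptusPlant")) # True, vacuously!
```

The vacuous truth of the last line is exactly why a closure built from ‘only’ alone is too weak: an individual that eats nothing would satisfy it, so an existential conjunct must be added.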
Reasoning for inference and consistency checks: E.g.
Bird equivalentTo Vertebrate and hasProperPhysicalPart some BirdWing
Penguin subClassOf Vertebrate and hasProperPhysicalPart some BirdWing → Inference: Penguin subClassOf Bird.
Dolphin SubClassOf Fish (Fish and Mammal disjoint) → the reasoner detects Dolphin as inconsistent, as it has mammal characteristics.
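These two reasoning tasks can be imitated extensionally; the dictionaries and helper functions below are illustrative stand-ins, not the behaviour of any particular reasoner:

```python
# Extensional imitation of the two reasoner checks above; the data
# structures and helper names are illustrative, not a reasoner's API.

def is_bird(individual):
    # Bird equivalentTo Vertebrate and hasProperPhysicalPart some BirdWing:
    # the sufficient condition lets us classify the individual under Bird.
    return individual["vertebrate"] and "BirdWing" in individual["parts"]

penguin = {"vertebrate": True, "parts": {"BirdWing", "Beak"}}
print(is_bird(penguin))  # True: Penguin is classified under Bird

def clashes(asserted_classes, disjoint_pairs):
    # an individual asserted under two disjoint classes is inconsistent
    return any({a, b} <= asserted_classes for (a, b) in disjoint_pairs)

dolphin_classes = {"Dolphin", "Fish", "Mammal"}  # Fish asserted, Mammal via traits
print(clashes(dolphin_classes, [("Fish", "Mammal")]))  # True: clash detected
```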
Modeling best practices and patterns:
Partitions: We introduced partitioning nodes for different classification schemes, i.e. AnimalByAnatomyPartition, AnimalByLocomotionPartition, AnimalByHabitatPartition and AnimalByNutritionPartition. E.g. we added an "and (bearerOf some FlyingDisposition)" restriction to FlyingFish to let a reasoner also classify it under FlyingAnimal within the AnimalByLocomotionPartition.
Phylogenetics (derivesFrom, hasLocus): A common phylogenetic origin is a good approximation for similarities, as the homologous development of a bat wing and a bird wing shows; the reverse, however, is not true, as the analogous development of the hydrodynamic shapes of the shark (a fish) and the dolphin (a mammal) illustrates. We also defined African and Indian elephant restrictions to reflect different countries of origin, e.g. IndianElephant EquivalentTo (derivesFrom some (Elephant and (participatesIn some (Living and (hasLocus value India)))))
Qualities: We added a color quality to the green grasshopper: GreenGrasshopper SubClassOf Hexapoda and bearerOf some (Color and (qualityLocated some Green))
Ontology Design Patterns (ODP), e.g. entailments for an
exception pattern that classifies a Penguin SubClassOf
AtypicalBird:
Bird EquivalentTo AtypicalBird or TypicalBird
Penguin SubClassOf Bird
Penguin SubClassOf not (bearerOf some FlyingDisposition)
TypicalBird EquivalentTo Bird and (bearerOf some FlyingDisposition).
Atypicality: E.g. introducing a class Beak (SubClassOf BodyPart) and including ‘hasProperPart some Beak’ in the definition of Bird, then introducing the atypical Platypus SubClassOf (Mammal and hasProperPart some Beak). Regarding typical classes, e.g. ‘Bird’, and their exceptions, e.g. ‘Ostrich’ SubClassOf ‘BirdThatCannotFly’, one has to admit that ‘typicality’ is hard to model in OWL and needs some form of non-monotonic reasoning.
Class border problems: Is Archaeopteryx a bird and a reptile? Does it exist at all, being extinct?
Identity: This issue can merely be discussed by means of workarounds, as OWL cannot handle temporal dynamics and reclassification, i.e. cases where an instance of Caterpillar becomes an instance of Butterfly after metamorphosis (individual stages of the same insect). External rules or the Ontology Pre-Processing Language OPPL1 might be introduced here.
Collections and Grains: The StandardFood for a Horse
(UngulateFood) consists of Water, Hay, Cereal and HappyHorsePowder, which in turn consists of Mineral and Vitamin. The cereal constitution varies according to the Type
of Horse. Horses get Oat cereals; Zebras get a mix of Oat
and Wheat.
ImmaterialObjects and Boundaries: A Cage consists of a
CageFrame and InteriorOfCage, which borders with the
Cage. InhabitedCage can be defined when the interior is
occupied by some animal. The habitat of a Jaguar can be
described via the place a JaguarPopulation lives in.
InformationObjects: Animals are identified via a unique code on an implanted RFID chip. A FeedingPlan, capturing food, time and frequency for different animals, is represented as a classifiable InformationObject.
ClosurePattern: A KoalaBearPopulation consists of at least one (some) KoalaBear in the range of the hasGranularPart relation, and all further fillers for this same range can only be of the type KoalaBear (only, closure): KoalaPopulation equivalentTo (hasGranularPart some KoalaBear) and (hasGranularPart only KoalaBear).
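The two conjuncts of this closure pattern can again be checked over explicit fillers; with illustrative instance data, they correspond to an `any` and an `all` test:

```python
# Extensional check of the closure pattern: KoalaPopulation equivalentTo
# (hasGranularPart some KoalaBear) and (hasGranularPart only KoalaBear).
# The instance data below is illustrative, not from the ontology file.
types = {"KoalaSam": "KoalaBear", "KoalaKim": "KoalaBear", "Keiko": "Orca"}

def is_koala_population(grains):
    some = any(types[g] == "KoalaBear" for g in grains)  # existential conjunct
    only = all(types[g] == "KoalaBear" for g in grains)  # universal (closure) conjunct
    return some and only

print(is_koala_population(["KoalaSam", "KoalaKim"]))  # True
print(is_koala_population(["KoalaSam", "Keiko"]))     # False: a non-koala grain
print(is_koala_population([]))                        # False: 'only' alone is vacuous
```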
3 DISCUSSION
Comparison with existing artefacts: The ZooAnimal Ontology differs from classic biological taxonomy in a massive reduction of classes to a minimal set that is easy to learn and grasp, the differentiae being clear and intuitive, as only animals everybody knows are included. Another factor increasing ontological ergonomics is the use of colloquial rather than Latin naming. The ontology’s capability of automatically inferring multiple parenthood renders it superior to standard taxonomy, where Carnivore is an order under the class Mammal, neglecting the fact that there are carnivorous animals, like the alligator, that are not mammals.
1 http://oppl2.sourceforge.net/
Embedding into the ontology engineering tutorial: Some practical exercises were carried out before introducing ontology engineering in Protégé 4 (P4). One was to group cards of individuals according to common properties (set theory). Here the usual traps could be provided (Dolphin is not a Fish, Kiwi is a Bird without wings), and the limitation of physical representations, which lack the possibility of multiple parenthood (Flipper as Mammal and WaterAnimal), was illustrated practically. A next exercise was to assign class names to the generated sets of individuals. These practical examples were later implemented in Protégé using the Zebra Fish Anatomy ontology (ZAO).
In another practical exercise, a semantic network from the ZAO was presented as a graph of classes connected by unlabeled relations, which had to be labeled with object properties from the ZAO.
Not all of the curriculum topics and examples have been implemented in the ZooAnimal ontology yet, but these can successively be transferred from the available exercise OWL module files. Some more difficult problems have not yet been modeled in our artefact but might be modeled later for an advanced course.
Limitations: Top-level-specific granularity assumptions, like the use of dispositions, and the addition of closure axioms can render axiomatic definitions quite complex and ill-suited for the novice:
HerbivoreAnimal equivalentTo AnimalByNutritionPartition
and (bearerOf some (Disposition and (hasRealization
only (Eating and ((hasPatient some Plant) and (hasPatient
only Plant))))))2
Such precise but complex and long axiomatizations were only presented at the end of the curriculum and in the self-standing complete ontology. In the earlier teaching modules they were simplified.
In addition, tool support for ontology understanding might be investigated, e.g. as discussed by Ernst et al.13.
Another example of didactically distracting complexity is the labeling and use of artificial-sounding partition classes, e.g. MarineAnimal equivalentTo AnimalByHabitatPartition and (agentIn some (Living and (hasLocus some MarineHabitat)))
This expression should possibly be simplified to: MarineAnimal equivalentTo (Animal and livesIn some Sea), using the partition superclass and more intuitive labels and relations.
2 The ‘hasPatient’ relation refers to any participant in a process rather than a hospital patient.
4 CONCLUSION
We have presented characteristics of DL ontologies that serve as teaching resources and introduced a first draft of a didactically sound teaching ontology. Although our resource is aligned to the particular needs of the novice in biomedical ontology engineering and intended for teaching within the description logics OWL 2 regime, we believe it can be used to teach a novice from any application domain background. Adhering strictly to the mesocosmic modelling level and to a domain almost everybody is more or less familiar with makes it almost universally applicable. Practical use of this teaching resource in a two-week curriculum has shown promising results. It was difficult to balance the need for formal rigour dictated by the chosen semantics and expressivity with the requirements of simplicity, ergonomics and intuitiveness. One solution was the separation of the teaching resource into small modules and fragments that are aligned to a given stage of the curriculum. This way the teaching resource can grow more complex in a recursive and incremental manner, allowing the students to follow the curriculum more easily.
ACKNOWLEDGEMENTS
This work is supported by the German Science Foundation
(DFG) as part of the research project JA 1904/2-1, SCHU
2515/1-1 GoodOD (Good Ontology Design).
REFERENCES
1. Boeker M, et al.: Teaching Good Biomedical Ontology Design. In: Proceedings of the 3rd International Conference on Biomedical Ontology (ICBO), Graz, 2012.
2. The GoodOD Project, http://www.iph.uni-rostock.de/Good-Ontology-Design.902.0.html, last accessed 20.07.2012.
3. Schober D, Boeker M: Ontology Simplification: New Buzzword or Real Need? In: Herre H, Hoehndorf R, Kelso J, Schulz S (eds.): OBML 2010 Workshop Proceedings, Mannheim. IMISE-Report, Institut für Medizinische Informatik, Statistik und Epidemiologie (IMISE), Leipzig, 2010, M1-5. http://www.onto-med.de/obml/ws2010/obml2010report.pdf
4. Beißwanger E, et al.: BioTop: An Upper Domain Ontology for the Life Sciences - A Description of its Current Structure, Contents, and Interfaces to OBO Ontologies. In: Applied Ontology 3:4, 2008, pp. 205–212.
5. List of known animals, http://de.wikipedia.org/wiki/Liste_bekannter_Tiere, last accessed 20.07.2012.
6. Schober D, Tudose I, Svatek V, Boeker M: OntoCheck: Verifying Ontology Naming Conventions and Metadata Completeness in Protégé 4. Journal of Biomedical Semantics (JBMS), invited submission, OBML 2011, in print August 2012. http://www.jbiomedsem.com/
8. Schober D, Smith B, Lewis S, et al.: Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics, Vol. 10, Issue 1, 2009.
9. Gerhard Vollmer: Evolutionäre Erkenntnistheorie. Hirzel, Stuttgart, 1st ed. 1975, 2nd ed. 1980, p. 161.
10. Evolutionary Epistemology, http://en.wikipedia.org/wiki/Evolutionary_epistemology, last accessed 20.07.2012.
11. Konrad Lorenz: Die Rückseite des Spiegels. Versuch einer Naturgeschichte des menschlichen Erkennens. 1973.
12. Cognitive Ergonomics, http://en.wikipedia.org/wiki/Cognitive_ergonomics, last accessed 20.07.2012.
13. Ernst NA, Storey M-A, Allen P: Cognitive support for ontology modeling. Int. J. Hum.-Comput. Stud., 62(5):553–577, 2005.
Published to date in the IMISE-REPORTS series:
2002
1/2002
Barbara Heller, Markus Löffler
Telematics and Computer-Based Quality Management in
a Communication Network for Malignant Lymphoma
2/2002
Barbara Heller, Katrin Kühn,
Kristin Lippoldt
Report OntoBuilder
3/2002
Barbara Heller, Katrin Kühn,
Kristin Lippoldt
Handbuch OntoBuilder
4/2002
Barbara Heller, Katrin Kühn,
Kristin Lippoldt
Leitfaden für die Eingabe von Begriffen in den
OntoBuilder
5/2002
Mitarbeiter des IMISE
Skriptenheft für Medizinstudenten
Medizinische Biometrie
Medizinische Statistik und Informatik
(Kursus zum Ökologischen Stoffgebiet)
2003
1/2003
Birgit Brigl, Thomas Wendt,
Alfred Winter
Ein UML-basiertes Meta-Modell zur Beschreibung von
Krankenhausinformationssystemen
2/2003
Thomas Wendt
Modellierung von Architekturstilen mit dem 3LGM²
3/2003
Birgit Brigl, Thomas Wendt,
Alfred Winter
Requirements on tools for modeling hospital information
systems
4/2003
Madlen Dörschmann
Evaluation der Fehlerhäufigkeit im Rahmen einer
Klinischen Studie
5/2003
Mohammad Zaino
Statistische Analyse zur Aufdeckung von neurotoxischen
Störungen infolge langjähriger beruflicher
Schadstoffexposition
2004
1/2004
Mitarbeiter des IMISE
Skriptenheft zum SPSS-Kurs
Kurs zur Auswertung medizinischer Daten unter Verwendung des Statistikprogramms SPSS
2/2004
Renate Abelius, Barbara Heller, Luisa Mantovani, Frank Meineke, Roman Mishchenko, Jan Ramsch
Standardisierung von Studienkurzprotokollen - Qualitätsgesicherte rechnerbasierte Erfassung, Verarbeitung und Speicherung
3/2004
Jan Ramsch, Renate Abelius, Barbara Heller, Luisa Mantovani, Frank Meineke, Roman Mishchenko
Therapieschemata - Qualitätsgesicherte vereinheitlichte rechnerbasierte Erfassung, Verarbeitung und Speicherung
4/2004 Jan Ramsch
Variabilität beim Einsatz von onkologischen
Therapieschemata - Erkennung von Ausnahmen und
resultierenden Therapieänderungen
5/2004 André Wunderlich (Diss.)
Prognostische Faktoren für chemotherapieinduzierte
Toxizität in der Behandlung von Malignomen speziell
bei aggressiven Non-Hodgkin-Lymphomen
6/2004
Mitarbeiter des IMISE
Skriptenheft für Medizinstudenten
Methodensammlung zur Auswertung klinischer und epidemiologischer Daten
7/2004
Grit Meyer (Diss.)
Charakterisierung der zellkinetischen Wirkungen bei
exogener Applikation von Erythropoetin auf die
Erythropoese des Menschen mit Hilfe eines
mathematischen Kompartimentmodells
2005
1/2005
Ingo Röder (Diss.)
Dynamic Modeling of Hematopoietic Stem Cell Organization – Design and Validation of the New Concept of Within-Tissue Plasticity

2/2005
Katrin Braesel (Dipl.)
Modellierung klonaler Kompetitionsprozesse hämatopoetischer Stammzellen mit Hilfe von Computersimulationen

3/2005
Dr. Barbara Heller (Habil.)
Knowledge-Based Systems and Ontologies in Medicine

2006
1/2006
Alexander Strübing, Ulrike Müller
Evaluation des 3LGM² Baukastens - Studienplan - Ergebnisse - Auswertung

2/2006
Marc Junger (Diss.)
Benutzermodellierung bei der Qualitätssicherung im onkologischen Studienmanagement

3/2006
Thomas Wendt (Diss.)
Modellierung und Bewertung von Integration in Krankenhausinformationssystemen

2007
1/2007
Markus Kreuz (Dipl.)
Entwicklung und Implementierung eines Auswertungswerkzeuges für Matrix-CGH-Daten

2/2007
Mitarbeiter des IMISE
Skriptenheft für Studenten
Methodensammlung zur Auswertung klinischer und epidemiologischer Daten
3/2007
Frank Meineke (Diss.)
Räumliche Modellierung und Simulation der Organisations- und Wachstumsprozesse biologischer Zellverbände am Beispiel der Dünndarmkrypte der Maus

2008
1/2008
Daniel Müller-Briel (Dipl.)
Standardisierung klinischer Studienprotokolle unter Berücksichtigung der Therapieplanung

2010
1/2010
A. Winter, L. Ißler, F. Jahn, A. Strübing, T. Wendt
Das Drei-Ebenen-Metamodell für die Modellierung und Beschreibung von Informationssystemen (3LGM² V3)

2/2010
H. Herre, R. Hoehndorf, J. Kelso, S. Schulz
OBML 2010 Workshop Proceedings, Mannheim, September 9-10, 2010

2011
1/2011
H. Herre, R. Hoehndorf, F. Loebe
OBML 2011 Workshop Proceedings, Berlin, October 6-7, 2011