Overview

The project consists of seven work packages addressing different parts of the overall system, as shown in Fig. 1. Each work package is assigned to several partners, with one or two of the partners having overall responsibility (underlined in the WP heading). All work packages contribute to the overall system architecture of the PersOnS system.

WP1: Medical Data Warehouse

The goal of WP1 is the establishment of a medical data warehouse, which will be implemented as a live mirroring system of the clinical database CentraXX. CentraXX will collect relevant medical data from various sources provided by the partners and enable simple searches and statistical evaluations. The data that will be collected includes clinical data, referenced imaging data, sample information, basic research data, and patient data. To fully address all data protection concerns, an external ID management will be used for the pseudonymization of all sensitive information. User rights and roles for the participating organizations will be assigned via a centralized LDAP service, which also controls access to the data warehouse. Interfaces (e.g., HL7) and generic bulk-import facilities (XML via XSLT) will be established and provided to the partners for the integration and dissemination of data, including the results of the text mining in WP2. Before all of the medical information can be integrated, it must be structured and harmonized. It will therefore be essential to develop or adapt ontologies which describe and group the data and allow an unambiguous description.

In the second phase of the project, the clinical data warehouse will allow extensive queries to be run over all data stored in the database. HT data will not be stored directly. Instead, selected information will be included from the HT data warehouse (WP3). The medical data warehouse will be made aware of (a) the existence of HT data sets for a patient in the HT data warehouse, and (b) a condensed version of the information containing actionable variants or known biomarkers (proteomics, metabolomics). These data will be made searchable after import from XML (see Figure 2 and WP4). Analysis of each of the data sets that will be stored and/or made searchable is vital for the enhancement of decision support and personalized medicine. Reinserting therapy data and findings into the data warehouse will create a virtuous cycle that facilitates future decision support; statistical evaluations will become more powerful as this knowledge base expands.
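To illustrate the planned XML-by-XSLT bulk import, the following minimal Python sketch transforms a partner's XML export into the CentraXX import format. It assumes the lxml library is available; the stylesheet and file names are hypothetical placeholders, not part of the specification.

    from lxml import etree

    # Hypothetical file names: a partner's raw XML export and an XSLT
    # stylesheet mapping it onto the CentraXX bulk-import schema.
    stylesheet = etree.XSLT(etree.parse("partner_to_centraxx.xslt"))
    source = etree.parse("partner_export.xml")

    # Apply the stylesheet and write the resulting import document.
    centraxx_doc = stylesheet(source)
    with open("centraxx_import.xml", "wb") as fh:
        fh.write(etree.tostring(centraxx_doc, pretty_print=True,
                                xml_declaration=True, encoding="UTF-8"))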

WP2: Text Mining

In this work package, we will develop, optimize and apply text-mining techniques for two applications:
(1) Extraction of relevant information from medical documentation, and (2) extraction
of relevant information from published scientific articles. Both areas here require adaptation
and extension of existing methods, development of specific novel algorithms, and training/
validation of machine learning methods with the specific texts and disease entities under
study.
Medical text (e.g., visit protocols, reports from imaging, or discharge summaries) will be analyzed
to extract phenotype descriptions, treatments and their effects, and patient history in
general. Analysis will be performed using extensive medical ontologies (human phenotype
ontology (Robinson et al., 2008), mammalian phenotype ontology (Smith et al., 2004), MeSH
(http://www.nlm.nih.gov/mesh/), and disease ontology (Schriml et al., 2012)) combined with
error-tolerant dictionary matching (see the sketch at the end of this work package). The set
of relevant terms will be generated by merging and selecting
relevant parts from the above-mentioned ontologies, leading to subtype-specific cancer ontologies.
Identified concepts will be grouped by sentences to reflect causal or temporal relationships
and presented to a human expert curator (from the teams of UKT and CHARITE)
for confirmation, refutation, or correction. After treating an initial set of texts in this manner,
we will train machine learning models (conditional random fields for concept recognition,
SVM for binary relationship detection) to improve extraction performance. Iterative training,
application and validation will be carried out throughout the project, leading to a growing
body of expert-annotated anonymized texts that can be used to improve extraction models
and algorithms. For analyzing scientific articles, we will extract text-mined data from the GeneView system
(Thomas et al., 2012), especially mutations, diseases, and genes. Extracted information will
be ranked based on its relevance for the cancer entities under study. This text classification
will be based on training data developed together with the clinicians in the project. As for
information extracted from medical texts, only data verified by a human expert curator will be
used for clinical applications. All data extracted from medical or scientific texts and confirmed
by expert curation will be injected into the medical data warehouse (see WP1), always annotated
with links to the text from which the information was extracted. The information will be
associated with the respective entities in the warehouse, such as a patient or a mutation-disease
combination. Through this warehouse, the information becomes available for the applications
(and hence clinical practice) developed in WP6 and WP7. Work in this work package will
continue in the translational phase, but with decreased intensity. The goals during this second
phase are continuous improvements of the text mining routines and the curation tools
based on expert feedback, enhancements to the dictionaries and ontologies which are the
basis of the information extraction, and publication of all methods and data sets/ontologies
created within the project.
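As a minimal sketch of the error-tolerant dictionary matching mentioned above, the following Python fragment tolerates a single character edit between a text token and an ontology term. The term list and concept IDs are purely illustrative; the real dictionaries will be derived from the merged, subtype-specific ontologies.

    # Error-tolerant dictionary matching via Levenshtein distance <= 1.
    def edit_distance(a: str, b: str) -> int:
        """Classic dynamic-programming Levenshtein distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def match_terms(text: str, dictionary: dict, max_dist: int = 1):
        """Yield (token, concept) pairs for tokens close to a dictionary term."""
        for token in text.lower().split():
            token = token.strip(".,;:()")
            for term, concept in dictionary.items():
                if edit_distance(token, term) <= max_dist:
                    yield token, concept

    # Illustrative mini-dictionary (concept IDs for illustration only).
    terms = {"melanoma": "DOID:1909", "hepatocellular": "DOID:684"}
    report = "Histology confirmed a nodular melanome of the left shoulder."
    print(list(match_terms(report, terms)))   # -> [('melanome', 'DOID:1909')]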

WP3: HT Data Extraction and Integration

The work package aims at the automated extraction of structured data from omics raw data
(genomics, transcriptomics, proteomics, metabolomics). Data sets are available from CHARITE
and UKT in different numbers, but in sufficient quantity to set up and refine the data processing pipelines.
Even though standard pipelines for exome sequencing, RNA-Seq, proteomics, and
metabolomics have already been established at the Quantitative Biology Center (QBiC, Tübingen),
these pipelines will need to be adapted. They will ensure a consistent and automated
(and thus rapid) processing of incoming data. We expect the amount of data to grow drastically
within the next few years with hundreds of datasets becoming available every year.
Automated processing is thus pivotal to success. Note that the HT data sets will be associated
with patient pseudonyms only; no connection with specific patients can be made without
the pseudonym information provided by CentraXX. The analysis pipelines will use available
open-source software (e.g., Bowtie 2, ANNOVAR, Cufflinks 2, OpenMS, R) and result in standardized
data formats (e.g., mzTab, VCF), which will enable convenient processing with other
tools and integration into the HT data warehouse UniPAX developed in the Kohlbacher lab
(http://unipax.sourceforge.net/). UniPAX provides a caching middleware that permits very
rapid queries on network data (e.g., queries like pathway enrichment, all neighboring deregulated
genes, etc.). A wide range of existing annotations will be integrated, ranging from pathway
information (e.g., KEGG) to actionable variants and associated drugs (e.g., from DrugBank).
Data importers for some of these databases need to be developed within this work
package. High-throughput data will be stored on the infrastructure of the Quantitative Biology
Center (QBiC) in an access-controlled and secure manner compliant with local regulations.
High-throughput data available in Berlin will be transferred automatically to Tübingen by a
dedicated ‘data mover’ service and processed immediately after transfer. Pipelines
dealing with genomics data will be implemented first, in accordance with the data already
available (see Section 4). In the second half of the translational phase, additional processing
pipelines will be set up (transcriptomics, proteomics, metabolomics) to deal with the additional
omics modalities becoming available then. Availability of HT data sets for a patient and
selected extracted information (see WP1) will be exported regularly as an XML document for
import into the medical warehouse.
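The regular XML export announcing HT data availability and condensed actionable findings could, for instance, be generated as sketched below with Python's standard library. All element and attribute names are assumptions for illustration; the actual exchange schema will be defined jointly with WP1.

    import xml.etree.ElementTree as ET

    def export_availability(pseudonym: str, datasets: list, variants: list) -> bytes:
        # Root element carries only the patient pseudonym, never the identity.
        root = ET.Element("htDataAvailability", patientPseudonym=pseudonym)
        for ds in datasets:
            ET.SubElement(root, "dataset", type=ds)       # e.g. "exome-seq"
        for v in variants:                                # condensed findings
            ET.SubElement(root, "actionableVariant",
                          gene=v["gene"], change=v["change"])
        return ET.tostring(root, encoding="utf-8", xml_declaration=True)

    print(export_availability(
        "PSN-0042",                                       # pseudonym from CentraXX
        ["exome-seq", "rna-seq"],
        [{"gene": "BRAF", "change": "V600E"}],
    ).decode())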

WP4: HT Data Query Module

This work package will focus on the interface between the HT data warehouse and the clients
(study interface, Interactive Tumor Conference Table). The HT query module is a convenient
and easy-to-use API to the HT data warehouse. It provides access by means of a
rich set of canned query services, which shields the user of the API from database internals
and helps to keep maintainability of the entire system high (in the spirit of a Service-Oriented
Architecture). All queries can be parameterized upon calling the respective service, where
we restrict conditions to range and equality predicates. This approach is similar to recent
entity query languages, such as the EJB query language or Microsoft’s Entity SQL language.
Note that user authorization is controlled for every query; therefore, user credentials are sent
along with each query object and passed to the Medical Data Warehouse, which manages
access rights and roles. To ensure low latency, the primary result of a query is a callback
token with which the client can subsequently poll the results item by item (or in batches); this
pattern is drawn from the cursor concept in typical database APIs. To offer all clients a
user-friendly API, the interface will be based on REST paradigms, which work in most existing
IT infrastructures over the secured HTTPS transfer protocol. The HTTP Basic Authentication header is
used to make sure only authorized users and systems access the REST API. The rights and
roles concept used by CentraXX will limit the data visibility to authorized users only.
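From the client side, the query pattern described above could look as follows. This is a sketch only: the endpoint paths, parameter names, and token/polling fields are assumptions for illustration, not the final API.

    import requests

    BASE = "https://ht-warehouse.example.org/api"   # hypothetical host
    AUTH = ("alice", "secret")                      # HTTP Basic Authentication

    # Call a canned query service; conditions are restricted to range and
    # equality predicates, passed as a parameter object.
    resp = requests.post(
        f"{BASE}/queries/patientsWithVariant",
        json={"gene": {"eq": "BRAF"}, "alleleFrequency": {"range": [0.05, 1.0]}},
        auth=AUTH, timeout=30,
    )
    resp.raise_for_status()
    token = resp.json()["callbackToken"]   # primary result: a callback token

    # Poll the results in batches using the token (cursor-style access).
    while True:
        batch = requests.get(f"{BASE}/results/{token}",
                             params={"limit": 100}, auth=AUTH, timeout=30).json()
        for item in batch["items"]:
            print(item["patientPseudonym"])
        if not batch.get("hasMore"):
            break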
The query interface will support a number of queries of different complexity in order to support
both the study interface and the interactive tumor conference table. Simple queries include
requests for the data present for a given patient pseudonym, or for all patient pseudonyms with
a given variant, overexpressed gene/protein, or up-regulated metabolite. These queries can
be executed trivially on an RDBMS with interactive response times. More complex are similarity
queries (e.g., “show me the patients who are most similar with respect to the variants in
the Wnt pathway”) and pathway queries. For the similarity queries, we will create and maintain
similarity networks for variants, gene/protein expression, and metabolite concentrations for
all samples. Pathway queries permit the navigation of a patient’s data in the context of regulatory/
metabolic networks (e.g., “what are the next druggable targets neighboring this gene”).
Since graph-based queries are somewhat difficult to implement in an RDBMS, we will make use
of the integrated network mapping and caching mechanisms of UniPAX. The system keeps a
large graph with all neighborhood information in memory and can thus answer these queries
within a few milliseconds, ensuring the interactive response times required for the interactive
tumor conference table.
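The kind of pathway query answered from the in-memory graph can be sketched as a breadth-first search over an adjacency list. The network and the druggability annotations below are toy data; in the actual system both come from UniPAX and its integrated annotations (e.g., DrugBank).

    from collections import deque

    network = {                       # toy adjacency list of a signaling network
        "KRAS": ["BRAF", "PIK3CA"],
        "BRAF": ["KRAS", "MAP2K1"],
        "MAP2K1": ["BRAF", "MAPK1"],
        "PIK3CA": ["KRAS", "AKT1"],
        "MAPK1": ["MAP2K1"],
        "AKT1": ["PIK3CA"],
    }
    druggable = {"MAP2K1", "AKT1"}    # e.g., derived from DrugBank annotations

    def nearest_druggable(start: str, max_hops: int = 3):
        """Return druggable genes reachable from start within max_hops."""
        seen, hits = {start}, []
        queue = deque([(start, 0)])
        while queue:
            gene, dist = queue.popleft()
            if gene in druggable:
                hits.append((gene, dist))
            if dist < max_hops:
                for nb in network.get(gene, []):
                    if nb not in seen:
                        seen.add(nb)
                        queue.append((nb, dist + 1))
        return hits

    print(nearest_druggable("KRAS"))  # -> [('MAP2K1', 2), ('AKT1', 2)]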

WP5: Clinical Specification Development and Evaluation

From the start of the project, a close interaction between the technical development team
and the clinical partners will ensure utility, effectiveness and user acceptance of the final interfaces.
To this end, early in the initial phase, a series of specification meetings will be
held to guide the design process. Experienced clinicians (UKT, CHARITE) will present their
current clinical decision making processes regarding complex oncological cases not covered
by guidelines (melanoma and HCC) and detail the requirements needed for data provisioning
in the data warehouse, for the study interface (WP6) and for the interactive tumor conference
interface (WP7). These requirements will be formalized in specification documents. Exemplary
use cases will be conceived, centered around anonymized clinical data sets, each illustrating a
specific application (e.g., selection of cohorts based on genomic information and clinical
inclusion criteria, or therapeutic decisions based on complex clinical and HT data). The interface
teams (KAIROS, IWM) will support this process by providing mock-ups of potential interface
types and an analysis of the user navigation for those use cases. An iterative process
will refine these specifications based on the mock-ups through expert user feedback. The
requirements for the query engines resulting from the necessary queries will be communicated
to WP1 and WP4 to guide the development of the query engines early on. In the late initial
phase and the translational phase (years 3-5) the focus of this work package will shift
towards design validation via usability studies. To this end, IWM, supported by EKUT and
HUB, will create structured interview forms and perform empirical user studies, focusing on
provided functionality, ease of usage, scenarios of usage, and time spent with the tool. Results
from these qualitative studies will be fed back into the development and revision process.

WP6: Study Interface

An important aspect of personalized medicine lies in the selection of accurately stratified patient
cohorts for clinical studies. While large medical centers such as the Charité or University
Hospital Tübingen have data on hundreds of thousands of patients, the selection of patients
based on inclusion and exclusion criteria poses major issues. This work package will focus
on creating a study interface integrating study register functionality into a medical information
system (CentraXX, WP1). The study interface will extend existing functionality in CentraXX to
permit selection based on HT data as well.
The interface will be able to import, register, and document all studies/trials that are/were
carried out within the project’s framework. Study profiles can be created, which are based on
different documentation field types (e.g., free text, single selection fields [with selection options
from a system-controlled vocabulary], date fields, Boolean values, etc.). These types of
entry fields can be arranged flexibly, allowing the creation of various types of study profiles
(e.g., oncological profiles, pre-clinical profiles, etc.). Also included are references/links to HT
data sets or analyses done in WP3. Results of the HT analyses with known actionable genetic
variants or relevant biomarkers could also be described with controlled vocabularies as dedicated
documentation points and thus become part of a study's integrated, searchable knowledge.
Studies and structured HT data can be imported into the system via XML, following a specified
XSD schema. This will make bulk import of existing, relevant, and cleared studies into
the data warehouse possible. Additional queries for specific HT properties (e.g., specific variants
not yet known to be actionable) can be delegated by the interface to the HT query interface
(WP4) and can then populate additional fields. In this way, it is possible to check whether
a study cohort can be assembled with specific molecular profiles. Reporting functionality
will permit the regular creation and dissemination of specific pre-defined reports, such as the number
of recruits per month, or lists of trials filtered and sorted by phase, tumor localization, and the
like, to specific users. Reports can be exported in PDF and Excel formats according to
user rights. Inclusion and exclusion criteria can be mapped to individual studies/trials. These
criteria can then automatically be applied for recruiting searches over all patients/test persons
within the data warehouse. A new study-recruiting algorithm will be added to the data
warehouse, which will enable automatic suggestions by the system for suitable patients for
open studies. In the second phase of the project, it will be possible not only to recruit patients/
test persons for specific studies, but also to include samples exclusively. The system will
verify the existence of patient consent before allowing the inclusion of patients or their
samples into studies/experiments.
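The recruiting check itself can be sketched as the evaluation of simple inclusion/exclusion predicates over a patient record, with consent verified before inclusion. All field names below are hypothetical.

    def eligible(patient: dict, inclusion, exclusion) -> bool:
        """True if consent is given, all inclusion criteria hold, and no
        exclusion criterion applies."""
        if not patient.get("consent", False):
            return False
        return (all(pred(patient) for pred in inclusion)
                and not any(pred(patient) for pred in exclusion))

    inclusion = [
        lambda p: p["diagnosis"] == "melanoma",      # equality predicate
        lambda p: "BRAF V600E" in p["variants"],     # molecular profile
    ]
    exclusion = [
        lambda p: p["age"] < 18,                     # range predicate
    ]

    patient = {"diagnosis": "melanoma", "variants": ["BRAF V600E"],
               "age": 54, "consent": True}
    print(eligible(patient, inclusion, exclusion))   # -> True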

WP7: Interactive Tumor Conference Table

This WP focuses on designing and empirically investigating an innovative touch-based interface
for interdisciplinary tumor conferences that supports an intuitive and interactive exploration
of clinical data in combination with HT and public data. These data are made accessible
through a large variety of external representations (e.g., visualizations, tables, graphs, text).
Technically, the system will be implemented as a multi-touch tabletop application that can be
operated by multiple experts simultaneously to navigate available information sources using
intuitive interaction gestures and other multimodal devices (e.g., tangible objects that can be
placed on the table). The work in this WP is structured along five milestones:
M7.1 – Task analyses regarding current practice during medical decision making: Understanding
how doctors reach patient management decisions is a necessary prerequisite for
designing a system that will adequately support this process (Schwartz & Griffin, 1986). To
this end, a procedural task analysis will be carried out to provide a comprehensive description
of current medical practices in oncology (e.g., reasoning steps during decision making,
use of external resources, barriers). Interviews will also be conducted to determine what
would be needed to improve current practice (e.g., timely access to additional external resources,
integration of multiple resources including HT data and public databases).
M7.2 – Task analyses focusing on additional external resources: The task leading to
M7.1 is likely to reveal an infrequent and unsystematic use of external resources related to
HT data, since these resources are not yet readily accessible in clinical practice. To get a better
impression of their use, a first and simplified implementation of a multi-touch tabletop interface
will provide basic access to these external resources via multiple representations. Despite
the interface's limited interaction capabilities, studying how physicians use it during
decision making will provide important insights for the design of a more advanced interface:
(a) how and when would medical practitioners incorporate different HT-related data sources
in their decision making?, (b) which information is extracted from different information
sources?, (c) are there preferences for certain types of external representations for HT-related
data, while others are ignored or cause problems in their interpretation?, and
(d) which information types would be discussed conjointly and thus should be presented in
conjunction?
M7.3 – Conceptual development of prototypical interface designs: The information gathered
in steps 1 and 2 will be used to inform the conceptual development of prototypical interface
designs. The design will aim at (a) providing just-in-time access to all relevant information,
(b) connecting all data sources in a meaningful way to provide easy and quick access,
(c) highlighting relevant information and their relationships, (d) enabling search for and
comparison of similar cases, (e) enabling user annotations. Appropriate decisions with regard
to information visualization and interaction design will play a fundamental role in
achieving these aims.
M7.4 – Prototype Implementation: Based on existing technical frameworks for multi-touch
tabletop applications, a prototype interface for the interactive tumor conference table will be
implemented in close collaboration with HUB and EKUT. This interface
will integrate the information structures provided by the other WPs. Implementation decisions
depend strongly on the results of the tasks associated with M7.1-3 (which parts of the
available information will be included, which types of representations and navigation elements
will be provided). The prototype implementation will be subject to formative and summative
evaluations.
M7.5 – Experimental validation: Different versions of the interface design will be implemented
to empirically evaluate their usability in a clinical decision-making context with respect
to criteria such as the efficiency of the decision-making process, confidence in one's
own decision, level of agreement between different experts, performance decreases under time
pressure, amount of relevant information considered, amount of relevant information overlooked
(errors of omission), and degree of elaboration (i.e., the degree to which, for instance,
treatment decisions can be justified against the background of the information made available).
These usability studies will also involve time pressure manipulations, thereby mimicking
relevant conditions of medical practice.