 
ISBN 978-3-642-19405-4 e-ISBN 978-3-642-19406-1
DOI 10.1007/978-3-642-19406-1
Studies in Computational Intelligence ISSN 1860-949X
Library of Congress Control Number: 2011923523
Preface
The emerging problem of data fusion offers plenty of opportunities and also raises
many interdisciplinary challenges in computational biology. Currently, developments
in high-throughput technologies generate terabytes of genomic data
at a remarkable rate. How to combine and leverage this mass of data sources
to obtain significant and complementary high-level knowledge is a topic of
state-of-the-art interest in the statistics, machine learning and bioinformatics communities.
Incorporating various learning methods with multiple data sources is a
rather recent topic. In the first part of the book, we theoretically investigate
a set of learning algorithms in statistics and machine learning. We find that
many of these algorithms can be formulated in a unified mathematical model,
the Rayleigh quotient, and can be extended to dual representations on the
basis of kernel methods. Using the dual representations, the task of learning
with multiple data sources is related to kernel-based data fusion, which
has been actively studied over the past five years.
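As a rough illustration of these two ideas, the following minimal sketch
computes a Rayleigh quotient and fuses two kernel matrices by a convex
combination. It is not the formulation developed in the book: the function
names, the random toy data, the linear kernels and the equal weights are all
assumptions made for this example only.

```python
import numpy as np


def rayleigh_quotient(Q, w):
    """Return (w^T Q w) / (w^T w) for a symmetric matrix Q and a nonzero vector w."""
    return float(w @ Q @ w) / float(w @ w)


def combine_kernels(kernels, weights):
    """Fuse kernel matrices by a convex combination (hypothetical helper).

    The weights must be nonnegative and sum to one, so the result stays
    positive semidefinite whenever every input kernel is.
    """
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be nonnegative and sum to 1")
    return sum(t * K for t, K in zip(weights, kernels))


# Toy data: two "views" of the same six samples (assumed, not from the book).
rng = np.random.default_rng(0)
X1 = rng.normal(size=(6, 3))
X2 = rng.normal(size=(6, 4))

# Linear kernels, one per data source.
K1 = X1 @ X1.T
K2 = X2 @ X2.T

# Equal weights are an arbitrary choice for this sketch.
K = combine_kernels([K1, K2], [0.5, 0.5])

# The Rayleigh quotient of a symmetric matrix is maximized by the eigenvector
# of its largest eigenvalue; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(K)
w_star = eigvecs[:, -1]
```

In this sketch, rayleigh_quotient(K, w_star) equals the largest eigenvalue of
the fused kernel up to rounding, and no other nonzero vector gives a larger value.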
In the second part of the book, we create several novel algorithms for supervised
learning and unsupervised learning. We center our discussion on the
feasibility and the efficiency of multi-source learning on large-scale heterogeneous
data sources. These new algorithms show promise for solving a wide
range of emerging problems in bioinformatics and text mining.
In the third part of the book, we substantiate the value of the proposed algorithms
in several real bioinformatics and journal scientometrics applications.
These applications are algorithmically categorized as ranking problems and
clustering problems. In ranking, we develop a multi-view text mining methodology
to combine different text mining models for disease-relevant gene prioritization.
Moreover, we consolidate our data sources and algorithms in a gene
prioritization software, which is characterized as a novel kernel-based approach
to combine text mining data with heterogeneous genomic data sources using
phylogenetic evidence across multiple species. In clustering, we combine multiple
text mining models and multiple genomic data sources to identify the disease-relevant
partitions of genes. We also apply our methods in the field of
scientometrics to reveal the topic patterns of scientific publications. Using text mining
techniques, we create multiple lexical models for more than 8000 journals retrieved
from the Web of Science database. We also construct multiple interaction
graphs by investigating the citations among these journals. These two types
of information (lexical and citation) are combined to automatically construct
the structural clustering of journals. According to a systematic benchmark
study, in both ranking and clustering problems, the machine learning
performance is significantly improved by the thorough combination of heterogeneous
data sources and data representations.
The topics presented in this book are meant for researchers, scientists
or engineers who use Support Vector Machines or, more generally, statistical
learning methods. Several topics addressed in the book may also be of interest
to computational biologists or bioinformaticians who want to tackle data
fusion challenges in real applications. This book can also be used as reference
material for graduate courses such as machine learning and data mining.
The background required of the reader is a good knowledge of data mining,
machine learning and linear algebra.
This book is the product of our years of work in the Bioinformatics group of
the Electrical Engineering department of the Katholieke Universiteit Leuven.
It has been an exciting journey full of learning and growth, in a relaxing
and quite Gothic town. We have been accompanied by many interesting colleagues
and friends. This will go down as a memorable experience, as well
as one that we treasure. We would like to express our heartfelt gratitude to
Johan Suykens for his introduction of kernel methods in the early days. The
mathematical expressions and the structure of the book were significantly
improved due to his concrete and rigorous suggestions. We were inspired by
the interesting work presented by Tijl De Bie on kernel fusion. Since then,
we have been attracted to the topic, and Tijl has had many insightful discussions
with us on various topics; the communication has continued even after he
moved to Bristol. Next, we would like to convey our gratitude and respect
to some of our colleagues. We wish to particularly thank S. Van Vooren, B.
Coessen, F. Janssens, C. Alzate, K. Pelckmans, F. Ojeda, S. Leach, T. Falck,
A. Daemen, X. H. Liu, T. Adefioye, and E. Iacucci for their insightful contributions
on various topics and applications. We are grateful to W. Glänzel for
his contribution of the Web of Science data set to several of our publications.
This research was supported by the Research Council KUL (ProMeta, GOA
Ambiorics, GOA MaNet, CoE EF/05/007 SymBioSys, KUL PFV/10/016),
FWO (G.0318.05, G.0553.06, G.0302.07, G.0733.09, G.082409), IWT (Silicos,
SBO-BioFrame, SBO-MoKa, TBM-IOTA3), FOD (Cancer plans), the Belgian
Federal Science Policy Office (IUAP P6/25 BioMaGNet, Bioinformatics and
Modeling: from Genomes to Networks), and the EU-RTD (ERNSI: European
Research Network on System Identification, FP7-HEALTH CHeartED).