Cedric Gondro, Primer to Analysis of Genomic Data Using R, Springer International Publishing (2015)
Just about any text written on the analysis of genomic data will begin by mentioning
the rapid pace of change in the field: how the technology is frantically moving
forward and how datasets are getting bigger and bigger. A huge experiment one
year is just a tiny proof of concept the following year. Databases are growing
exponentially. The literature on even quite specific subjects is overwhelming and
we have to decide if we are going to keep up to date or actually get some of the
work done.
It feels as if just a few years ago, a genome scan with 300–400 microsatellite
markers was a pretty big deal (in truth, it really was just a few years ago)! Then along
came the 10k SNP chip, then the 50k, the 500k, and then, of course, the one, two,
and three million SNP arrays. Full individual sequence data is rapidly becoming the
platform of choice. At 10x coverage of a genome of roughly 3 billion bases, that is
around 30 billion sequenced nucleotides per patient/animal/sample in the unprocessed
fastq files.
Of course, we cannot operate manually at this kind of scale anymore. Data
analysis has become heavily dependent on computers and efficient algorithms to sift
through the sea of data and make sense of it all.
A plethora of computational tools have been written to cope with this high
volume of data. Most of these have been developed to tackle specific problems,
and even if they excel at their specific task, they may not be ideal for automated
processes: the output of one tool is often not in a suitable format for another tool
further down the analysis pipeline. This leaves us with the task of finding out which
tools are available for each step in an analysis, choosing the ones that meet our
needs, figuring out how each one works, and stitching them together. Alternatively,
some software (usually commercial, i.e., it costs money) will seamlessly handle
a full analysis from beginning to end, but the user is restricted to the
algorithms coded into the program, there is less flexibility in what can be done, and
there is always a lingering feeling of a black box about it. In recent years, R [90],
a statistical programming language and environment, has become popular for the
analysis of genomic data and has even become the de facto tool for the
analysis of gene expression data. R provides an integrated development environment
for analysis and, at the same time, flexibility and full control of the analytic workflow.
In this book, we will focus on using R for the analysis of genomic data and how
to set up routines to automate the analytical steps. We will not cover all that R can
do (that in itself would be a rather large book and there are some very good ones
already), but we will focus on some of the key points relevant to the analysis of
genomic data: less emphasis on the theory and more emphasis on a practical, hands-on,
how-to-get-the-job-done approach. This book is intended as a
companion text for advanced undergraduate and graduate units in genomic analysis
and bioinformatics, and it can be used as the practical component in lab sessions. The
book should also be of use to researchers who want to use R for the analysis of
genomic data.
Strictly speaking, no previous knowledge of R is necessary, since the first chapter
covers some of the basics, but readers will definitely benefit from some prior
exposure to R. Familiarity with undergraduate-level biostatistics and genetics is
assumed.