Features
A key feature of this book, which dierentiates it from many other very
good textbooks on data mining, is the focus on the hands-on end-to-end
process for data mining. We cover data understanding, data preparation,
model building, model evaluation, data renement, and practical deployment.
Most data mining textbooks have their primary focus on just the
model building|that is, the algorithms for data mining. This book, on
the other hand, shares the focus with data and with model evaluation
and deployment.
In addition to presenting descriptions of approaches and techniques
for data mining using modern tools, we provide a very practical resource
with actual examples using Rattle. Rattle is easy to use and is built on top
of R. As mentioned above, we also provide excursions into the command
line, giving numerous examples of direct interaction with R. The reader
will learn to rapidly deliver a data mining project using software obtained
for free from the Internet. Rattle anI Explorations 1
1 Introduction 3
1.1 Data Mining Beginnings . . . . . . . . . . . . . . . . . . . 5
1.2 The Data Mining Team . . . . . . . . . . . . . . . . . . . 5
1.3 Agile Data Mining . . . . . . . . . . . . . . . . . . . . . . 6
1.4 The Data Mining Process . . . . . . . . . . . . . . . . . . 7
1.5 A Typical Journey . . . . . . . . . . . . . . . . . . . . . . 8
1.6 Insights for Data Mining . . . . . . . . . . . . . . . . . . . 9
1.7 Documenting Data Mining . . . . . . . . . . . . . . . . . . 10
1.8 Tools for Data Mining: R . . . . . . . . . . . . . . . . . . 10
1.9 Tools for Data Mining: Rattle . . . . . . . . . . . . . . . . 11
1.10 Why R and Rattle? . . . . . . . . . . . . . . . . . . . . . . 13
1.11 Privacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.12 Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2 Getting Started 21
2.1 Starting R . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Quitting Rattle and R . . . . . . . . . . . . . . . . . . . . 24
2.3 First Contact . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 Loading a Dataset . . . . . . . . . . . . . . . . . . . . . . 26
2.5 Building a Model . . . . . . . . . . . . . . . . . . . . . . . 28
2.6 Understanding Our Data . . . . . . . . . . . . . . . . . . . 31
2.7 Evaluating the Model: Confusion Matrix . . . . . . . . . . 35
2.8 Interacting with Rattle . . . . . . . . . . . . . . . . . . . . 39
2.9 Interacting with R . . . . . . . . . . . . . . . . . . . . . . 43
xv2.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.11 Command Summary . . . . . . . . . . . . . . . . . . . . . 55
3 Working with Data 57
3.1 Data Nomenclature . . . . . . . . . . . . . . . . . . . . . . 58
3.2 Sourcing Data for Mining . . . . . . . . . . . . . . . . . . 61
3.3 Data Quality . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.4 Data Matching . . . . . . . . . . . . . . . . . . . . . . . . 63
3.5 Data Warehousing . . . . . . . . . . . . . . . . . . . . . . 65
3.6 Interacting with Data Using R . . . . . . . . . . . . . . . 68
3.7 Documenting the Data . . . . . . . . . . . . . . . . . . . . 71
3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.9 Command Summary . . . . . . . . . . . . . . . . . . . . . 74
4 Loading Data 75
4.1 CSV Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.2 ARFF Data . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.3 ODBC Sourced Data . . . . . . . . . . . . . . . . . . . . . 84
4.4 R Dataset|Other Data Sources . . . . . . . . . . . . . . 87
4.5 R Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.6 Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.7 Data Options . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.8 Command Summary . . . . . . . . . . . . . . . . . . . . . 97
5 Exploring Data 99
5.1 Summarising Data . . . . . . . . . . . . . . . . . . . . . . 100
5.1.1 Basic Summaries . . . . . . . . . . . . . . . . . . . 101
附件列表