Preface
As data grows in volume and variety, data mining is becoming essential for extracting
insights from large datasets. This is because organizations feel the need to get a return on
investment (ROI) from large-scale data implementations. The fundamental purpose of data
mining is to uncover the hidden treasure in large databases so that business stakeholders
can act on likely future business outcomes. Data mining processes not only help
organizations reduce costs and increase profits but also help them find new avenues.
In this book, I am going to explain the fundamentals of data mining using R, an open source
tool and programming language. R is a freely available language and environment for
statistical computation, graphical data visualization, predictive modeling, and integration
with other tools and platforms. I am going to explain the data mining concepts by working
through example datasets in the R programming language.
For each topic, I am going to explain its mathematical formulation, its implementation in a
software environment, and how it helps in solving a business problem. The book is designed
so that the reader can start with data management techniques, exploratory data analysis,
data visualization, and modeling, and progress to advanced predictive models such as
recommendation engines, neural network models, and so on. It also gives an overview of the
concept of data mining and its relationship with data science, analytics, statistical
modeling, and visualization.
So let’s have a look at the chapters briefly!
What this book covers
Chapter 1, Data Manipulation Using In-built R Data, gives a glimpse of programming basics
using R: how to read and write data, programming notation, and syntax, with the help of a
real-world case study. It also includes R scripts for practice so that you can get hands-on
experience of the concepts, terminology, and underlying reasons for performing certain
tasks. The chapter is designed so that any reader with little programming knowledge should
be able to execute R commands to perform various data mining tasks. We will briefly discuss
the meaning of data mining and its relationship with other domains such as data science,
analytics, and statistical modeling; apart from this, we will start the data management
topics using R.
Chapter 2, Exploratory Data Analysis with Automobile Data, helps learners understand
exploratory data analysis. It involves numerical as well as graphical representation of the
variables in a dataset, for easy understanding and quick conclusions about the dataset. It
is important to understand the dataset, the types of variables considered for analysis, the
associations between variables, and so on. The chapter also covers creating
cross-tabulations to understand the relationship between categorical variables and
performing classical statistical tests to verify various hypotheses about the data.
Chapter 3, Visualize Diamond Dataset, covers the basics of data visualization along with
how to create advanced data visualizations using existing libraries in the R programming
language. Numbers and statistics may tell a similar story for the variables we examine
across different cuts; however, when we look visually at the relationship between variables
and factors, a different story altogether can emerge. Hence, data visualization conveys a
message that numbers and statistics alone fail to deliver.
Chapter 4, Regression with Automobile Data, introduces the basics of predictive analytics
using regression methods, including various linear and nonlinear regression methods, using
R programming. You will be able to understand the theoretical background as well as get
practical hands-on experience with all the regression methods using R.
Chapter 5, Market Basket Analysis with Groceries Data, shows a method of product
recommendation popularly known as Market Basket Analysis (MBA), which is also known as
association rules. It is about associating items purchased at the transaction level,
finding sub-segments of users who buy similar products, and hence recommending products.
MBA can also be used to form upsell and cross-sell strategies.
Chapter 6, Clustering with E-commerce Data, teaches the basics of segmentation using
various clustering methods: what segmentation is, how clustering can be applied to perform
segmentation, what methods are used for clustering, and a comparative view of the various
methods for segmentation.
Chapter 7, Building a Retail Recommendation Engine, covers what recommendation is and how
it works, the types and methods of recommendation, and the implementation of product
recommendation using the R programming language.
Chapter 8, Dimensionality Reduction, implements dimensionality reduction techniques such
as principal component analysis (PCA), singular value decomposition (SVD), and iterative
feature selection methods using a practical dataset and R. With the growth of data in
volume and variety, the dimensionality of data has been continuously on the rise.
Dimensionality reduction techniques have many applications in different industries, such as
image processing, speech recognition, recommendation engines, text processing, and so on.
Chapter 9, Applying Neural Networks to Healthcare Data, teaches you the various types,
methods, and variants of neural networks, along with the different functions that control
the training of artificial neural networks, for performing standard data mining tasks such
as these: prediction of real-valued output using regression-based methods, prediction of
output levels in a classification-based task, forecasting future values of a numerical
attribute based on historical data, and compressing features to recognize the important
ones in order to perform prediction or classification.