Data Science with R Decision Trees
Decision trees are widely used in data mining and well supported in R (R Core Team, 2014). Decision tree learning employs a divide-and-conquer approach known as recursive partitioning. It is usually implemented as a greedy search, using information gain or the Gini index to select the best input variable on which to partition the dataset at each step.
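To make the split criteria concrete, here is a minimal sketch (not the rpart implementation) of the two impurity measures just mentioned. The function names `gini` and `info.gain` are illustrative choices, not part of any package.

```r
# Gini index of a vector of class labels:
# 1 minus the sum of squared class proportions.
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Entropy (in bits), the basis of information gain.
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p * log2(p))
}

# Information gain from splitting labels y by a logical condition:
# parent entropy minus the weighted entropy of the two children.
info.gain <- function(y, split) {
  n <- length(y)
  entropy(y) -
    sum(split)  / n * entropy(y[split]) -
    sum(!split) / n * entropy(y[!split])
}

# Example: a perfectly pure split of a balanced binary outcome
# recovers the full parent entropy of 1 bit.
y <- rep(c("yes", "no"), each = 5)
info.gain(y, y == "yes")  # 1
```

A greedy tree learner simply evaluates such a measure for every candidate split and keeps the best one, then recurses on each child partition.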
This module introduces rattle (Williams, 2014) and rpart (Therneau and Atkinson, 2014) for building decision trees. We begin with a step-by-step example of building a decision tree using Rattle, and then illustrate the process using R, beginning with Section 14. We cover both classification trees and regression trees.
The required packages for this module include:
- library(rattle) # GUI for building trees and fancy tree plot
- library(rpart) # Popular decision tree algorithm
- library(rpart.plot) # Enhanced tree plots
- library(party) # Alternative decision tree algorithm
- library(partykit) # Convert rpart objects to party objects
- library(RWeka) # Weka decision tree J48.
- library(C50) # Original C5.0 implementation.
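As a taste of what follows, the snippet below builds and plots a first classification tree with rpart() and rpart.plot(). The choice of the built-in iris dataset and the formula are illustrative assumptions, not taken from the text.

```r
library(rpart)       # decision tree algorithm
library(rpart.plot)  # enhanced tree plots

# Build a classification tree predicting Species from all
# other variables in the built-in iris dataset.
model <- rpart(Species ~ ., data = iris, method = "class")

print(model)       # text summary of the splits
rpart.plot(model)  # enhanced plot of the fitted tree
```

Later sections walk through each of these steps in detail, including tuning the tree and interpreting its output.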
As we work through this module, new R commands will be introduced. Be sure to review each command's documentation and understand what the command does. You can ask for help using the ? operator, as in:
?read.csv
We can obtain documentation on a particular package using the help= option of library():
library(help=rattle)
This module is intended to be hands-on. To learn effectively, you are encouraged to have R running (e.g., in RStudio) and to run all the commands as they appear here. Check that you get the same output, and that you understand the output. Try some variations. Explore.