Statistical Learning
A. Tools for understanding data
Supervised learning: building a statistical model for predicting, or estimating, an output based on one or more inputs
Applications: business, medicine, astrophysics, and public policy
Unsupervised learning: there are inputs but no supervising output; the goal is to learn relationships and structure from the data
B. Three data sets
- Wage Data: to understand how an employee’s age and education, as well as the calendar year, are associated with his wage
- Stock Market Data: to predict whether the index will increase or decrease on a given day using the past 5 days’ percentage changes in the index
- Gene Expression Data: to understand which cancer cell lines are similar to each other by grouping them according to their observed gene expression measurements
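To make the Stock Market setup concrete, here is a minimal sketch of how each day's direction could be paired with the previous 5 days' percentage changes. The numbers, the function name make_lagged_rows, and the choice to label a zero change as "Down" are all assumptions for illustration; they are not taken from the actual data set.

```python
# Hypothetical daily percentage changes in an index (made-up values).
pct_changes = [0.4, -0.2, 1.1, -0.7, 0.3, 0.9, -1.2, 0.5]


def make_lagged_rows(changes, n_lags=5):
    """Pair each day's direction with the previous n_lags percentage changes."""
    rows = []
    for t in range(n_lags, len(changes)):
        # Most recent day first: lag 1, lag 2, ..., lag n_lags.
        lags = changes[t - n_lags:t][::-1]
        # Assumption: a change of exactly zero is labeled "Down".
        direction = "Up" if changes[t] > 0 else "Down"
        rows.append((lags, direction))
    return rows


rows = make_lagged_rows(pct_changes)
for lags, direction in rows:
    print(lags, direction)
```

Each row is one observation: the five lagged changes are the inputs and the Up/Down label is the qualitative output to be predicted.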
C. Three problem types
- A regression problem: predicting a continuous or quantitative output value
- A classification problem: predicting a non-numerical value, that is, a categorical or qualitative output
- A clustering problem: there is no output variable to predict; the goal is to group observations according to their observed characteristics
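The three problem types can be contrasted on toy one-dimensional data. The sketch below uses only the Python standard library; the data values and the helper names (fit_line, fit_class_means, classify, two_means) are hypothetical, and the nearest-class-mean rule and the two-cluster k-means loop are simple stand-ins chosen for brevity, not the specific methods used with the data sets above.

```python
from statistics import mean


def fit_line(xs, ys):
    """Regression: least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def fit_class_means(xs, labels):
    """Classification, step 1: compute the mean input value of each class."""
    return {c: mean(x for x, l in zip(xs, labels) if l == c) for c in set(labels)}


def classify(x, class_means):
    """Classification, step 2: predict the class whose mean is closest to x."""
    return min(class_means, key=lambda c: abs(x - class_means[c]))


def two_means(xs, n_iter=10):
    """Clustering: split xs into two groups with a basic k-means loop (k = 2)."""
    centers = [min(xs), max(xs)]  # assumption: start from the two extremes
    groups = [[], []]
    for _ in range(n_iter):
        groups = [[], []]
        for x in xs:
            nearest = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            groups[nearest].append(x)
        # Keep the old center if a group happens to be empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return groups


# Regression: quantitative output.
intercept, slope = fit_line([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])

# Classification: qualitative output.
class_means = fit_class_means([1.0, 1.5, 4.0, 4.5], ["Down", "Down", "Up", "Up"])
predicted = classify(2.0, class_means)

# Clustering: no output at all, only inputs to be grouped.
groups = two_means([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])

print(intercept, slope, predicted, groups)
```

Note that the first two functions need an observed output (ys or labels) to learn from, while two_means receives only inputs.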
D. Brief history
- linear regression: 19th century, Legendre and Gauss
- linear discriminant analysis: 1936, Fisher
- logistic regression: 1940s
- generalized linear models: early 1970s, Nelder and Wedderburn
- classification and regression trees: mid-1980s, Breiman, Friedman, Olshen and Stone
- generalized additive models: 1986, Hastie and Tibshirani
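- machine learning: since then, inspired in part by the advent of machine learning, statistical learning has emerged as a new subfield of statistics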