The data that you use, and how you use it, will likely define the success of your predictive modeling problem.
Data and the framing of your problem may be the point of biggest leverage on your project.
Choosing the wrong data or the wrong framing for your problem may lead to a model with poor performance or, at worst, a model that cannot converge.
It is not possible to analytically calculate what data to use or how to use it, but it is possible to use a trial-and-error process to discover how to best use the data that you have.
In this post, you will discover how to get the most from your data on your machine learning project.
After reading this post, you will know:
The importance of exploring alternate framings of your predictive modeling problem.
The need to develop a suite of “views” on your input data and to systematically test each.
The notion that feature selection, engineering, and preparation are ways of creating more views on your problem.
Let’s get started.
How to Get the Most From Your Machine Learning Data
Photo by Jean-Marc Bolfing, some rights reserved.
Overview
This post is divided into 8 parts; they are:
Problem Framing
Collect More Data
Study Your Data
Training Data Sample Size
Feature Selection
Feature Engineering
Data Preparation
Go Further
1. Problem Framing
Brainstorm multiple ways to frame your predictive modeling problem.
The framing of the problem means the combination of:
Inputs
Outputs
Problem Type
For example:
Can you use more or less data as inputs to the model?
Can you predict something else instead?
Can you change the problem to be regression/classification/sequence/etc.?
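As one illustration of changing the problem type, a regression target can be binned into classes so the same data can also be tested as a classification problem. This is a minimal sketch, not a prescribed method: the `to_classes` helper, the thresholds, the labels, and the price values are all hypothetical.

```python
from bisect import bisect_right

def to_classes(values, thresholds, labels):
    """Map each continuous value to a label using sorted thresholds."""
    if len(labels) != len(thresholds) + 1:
        raise ValueError("need exactly one more label than thresholds")
    # bisect_right counts how many thresholds are <= the value,
    # which is the index of the bin the value falls into.
    return [labels[bisect_right(thresholds, v)] for v in values]

# Hypothetical regression target: prices in thousands.
prices = [95.0, 150.0, 210.0, 480.0]
classes = to_classes(prices, thresholds=[100.0, 300.0],
                     labels=["low", "mid", "high"])
print(classes)
```

Whether the binned framing is more useful than the original one is something to test, not assume.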
The more creative you get, the better.
Use ideas from other projects, papers, and the domain itself.
Brainstorm. Write down all of the ideas, even if they are crazy.
I have some frameworks that will help with brainstorming the framing here:
2. Collect More Data
Get more data than you need, even data that is tangentially related to the outcome being predicted.
We cannot know how much data will be needed.
Data is the currency spent during model development. It is the oxygen needed by the project to breathe. Each observation you spend on one task is no longer available for the others.
You need to spend data on tasks like:
Model training.
Model evaluation.
Model tuning.
Model validation.
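One common way to budget data across these tasks is to partition it up front. The sketch below assumes a three-way split (a training set for fitting, a validation set for tuning, and a test set for final evaluation); the `split_data` helper and the 60/20/20 proportions are illustrative assumptions, not a recommendation from this post.

```python
import random

def split_data(rows, train=0.6, val=0.2, seed=0):
    """Shuffle rows and split them into train/validation/test partitions."""
    if not (0 < train < 1 and 0 < val < 1 and train + val < 1):
        raise ValueError("train and val must be fractions that sum to less than 1")
    shuffled = list(rows)                   # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed for a repeatable split
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])     # whatever remains is the test set

rows = list(range(100))                     # stand-in for 100 observations
train_rows, val_rows, test_rows = split_data(rows)
print(len(train_rows), len(val_rows), len(test_rows))
```

Each observation lands in exactly one partition, which makes the cost of every task visible: a larger test set means less data left for training and tuning.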
Further, the project is new. No one has done your specific project before, modeled your specific data. You don’t really know what features will be useful yet. You might have ideas, but you don’t know. Collect them all; make them all available at this stage.
3. Study Your Data
Use every data visualization you can think of to look at your data from every angle.
Looking at raw data helps. You will notice things.
Looking at summary statistics helps. Again, you will notice things.
Data visualization is like a beautiful combination of these two ways of learning. You will notice a lot more things.
Spend a long time with your raw data and summary statistics. Then move on to the visualizations last as they can take more time to prepare.
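A quick pass over summary statistics can be sketched with only the standard library. The `summarize` helper and the two small columns below are hypothetical and exist only to show the idea; in practice you would run this over your own variables.

```python
import statistics

def summarize(columns):
    """Return basic summary statistics for each numeric column."""
    summary = {}
    for name, values in columns.items():
        summary[name] = {
            "count": len(values),
            "mean": statistics.fmean(values),
            "stdev": statistics.stdev(values),   # sample standard deviation
            "min": min(values),
            "median": statistics.median(values),
            "max": max(values),
        }
    return summary

# Hypothetical data, for illustration only.
data = {
    "age": [23, 35, 31, 52, 46],
    "income": [31.0, 48.5, 44.0, 90.0, 61.5],
}
summary = summarize(data)
for name, stats in summary.items():
    print(name, stats)
```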
Use every data visualization you can think of and glean from books and papers on your data.
Review plots.
Save plots.
Annotate plots.
Show plots to domain experts.
You are seeking a little more insight into the data: ideas that you can use to better select, engineer, and prepare data for modeling. It will pay off.
4. Training Data Sample Size
Perform a sensitivity analysis with your data sample to see how much (or little) data you actually need.
You do not have all observations. If you did, you would not need to make predictions for new data.
Instead, you are working with a sample of the data. Therefore, there is an open question as to how much data will be needed to fit the model.
Don’t assume that more is better. Test.
Design experiments to see how model skill changes with sample size.
Use statistics to see how important trends and tendencies change with sample size.
Without this knowledge, you won’t know enough about your test harness to comment on model skill sensibly.
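A sample-size sensitivity analysis can be sketched as follows: fit the same model on random subsamples of increasing size, score each on a fixed hold-out set, and repeat to see the spread. Everything here is an assumption made for illustration: the synthetic data (a noisy straight line), the simple least-squares model, the chosen sizes, and the number of repeats.

```python
import math
import random
import statistics

def make_data(n, rng):
    """Synthetic data (an assumption): y = 2x + 1 plus Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + rng.gauss(0, 1.0)) for x in xs]

def fit_line(points):
    """Ordinary least squares for a single input variable."""
    mean_x = statistics.fmean(x for x, _ in points)
    mean_y = statistics.fmean(y for _, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmse(model, points):
    """Root mean squared error of a (slope, intercept) model."""
    slope, intercept = model
    return math.sqrt(statistics.fmean(
        (y - (slope * x + intercept)) ** 2 for x, y in points))

rng = random.Random(1)
pool = make_data(1000, rng)      # data available for training
holdout = make_data(500, rng)    # fixed evaluation set, never trained on

results = {}
for size in (5, 20, 100, 1000):
    # Repeat each size on different random subsamples to estimate the spread.
    scores = [rmse(fit_line(rng.sample(pool, size)), holdout) for _ in range(10)]
    results[size] = (statistics.mean(scores), statistics.stdev(scores))
    print(f"n={size:5d}  mean RMSE={results[size][0]:.3f}  stdev={results[size][1]:.3f}")
```

Reading the means and standard deviations together shows where extra data stops paying for itself and how noisy the skill estimate is at each size.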
Learn more about sample size in this post:
5. Feature Selection
Create many different views of your input features and test each one.
You don’t know what variables will be helpful or most helpful in your predictive modeling problem.
You can guess.
You can use advice from domain experts.
You can even use suggestions from feature selection methods.
But they are all just guesses.
Each set of suggested input features is a “view” on your problem: an idea of which features might be useful for modeling and predicting the output variable.
Brainstorm, compute, and collect as many different views of your input data as you can.
Design experiments and carefully test and compare each view. Use data to inform you which features and which view are the most predictive.
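Comparing views can be sketched as running every candidate feature subset through the same test harness. The example below is only an illustration under stated assumptions: the data is synthetic (column 0 carries signal, column 1 is pure noise), the model is a simple nearest-centroid classifier, and the helper names and view names are hypothetical.

```python
import random
import statistics

def make_rows(n, rng):
    """Synthetic rows (an assumption): column 0 is informative, column 1 is noise."""
    rows = []
    for i in range(n):
        label = i % 2
        rows.append(([rng.gauss(3.0 * label, 1.0), rng.gauss(0.0, 1.0)], label))
    rng.shuffle(rows)
    return rows

def squared_distance(x, centroid, columns):
    return sum((x[c] - centroid[j]) ** 2 for j, c in enumerate(columns))

def nearest_centroid_accuracy(train, test, columns):
    """Fit class centroids on train using only `columns`, then score on test."""
    centroids = {}
    for label in {y for _, y in train}:
        members = [x for x, y in train if y == label]
        centroids[label] = [statistics.fmean(m[c] for m in members) for c in columns]
    correct = 0
    for x, y in test:
        predicted = min(centroids,
                        key=lambda label: squared_distance(x, centroids[label], columns))
        correct += predicted == y
    return correct / len(test)

def evaluate_view(rows, columns, k=5):
    """Mean accuracy over k folds; every view is scored on the same folds."""
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(nearest_centroid_accuracy(train, test, columns))
    return statistics.fmean(scores)

rows = make_rows(200, random.Random(7))
views = {"signal only": [0], "noise only": [1], "both": [0, 1]}
view_scores = {name: evaluate_view(rows, cols) for name, cols in views.items()}
for name, score in view_scores.items():
    print(f"{name:12s} accuracy={score:.3f}")
```

Keeping the folds and the model fixed across views means differences in score can be attributed to the features rather than to the harness.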
For more on feature selection, see this post:
6. Feature Engineering
Use feature engineering to create additional features and views on your predictive modeling problem.
Sometimes you have all of the data you can get, but a given feature or set of features locks up knowledge that is too dense for the machine learning methods to learn and map to the outcome variable.
Examples include:
Date/Times.
Transactions.
Descriptions.
Break down these data into simpler additional component features, such as counts, flags, and other elements.
Make things as simple as you can for the modeling process.
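For the date/time case, the decomposition might look like the sketch below. The `datetime_features` helper, the choice of components, and the 9-to-5 weekday definition of business hours are assumptions for illustration; the right components depend on your domain.

```python
from datetime import datetime

def datetime_features(timestamp):
    """Break an ISO 8601 timestamp into simpler component features."""
    dt = datetime.fromisoformat(timestamp)
    return {
        "year": dt.year,
        "month": dt.month,
        "day_of_week": dt.weekday(),            # Monday is 0, Sunday is 6
        "hour": dt.hour,
        "is_weekend": int(dt.weekday() >= 5),   # flag feature
        # Assumed definition: weekdays from 09:00 up to (not including) 17:00.
        "is_business_hours": int(dt.weekday() < 5 and 9 <= dt.hour < 17),
    }

features = datetime_features("2024-03-16T14:30:00")
print(features)
```

Each component is a plain number or flag that a model can use directly, instead of having to learn the structure of a raw timestamp.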
For more on feature engineering, see the post: