As the target variable is not continuous, a binary classification model predicts the probability of the target variable being Yes/No. To evaluate such a model, a tool called the confusion matrix is used, also known as the classification or coincidence matrix. With the help of a confusion matrix, we can calculate important performance measures:

  • True Positive Rate (TPR) or Hit Rate or Recall or Sensitivity = TP / (TP + FN)
  • False Positive Rate (FPR) or False Alarm Rate = 1 - Specificity = 1 - (TN / (TN + FP))
  • Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • Error Rate = 1 - Accuracy, or (FP + FN) / (TP + TN + FP + FN)
  • Precision = TP / (TP + FP)
  • F-measure: 2 / ( (1 / Precision) + (1 / Recall) )
  • ROC (Receiver Operating Characteristic) curve = plot of TPR against FPR
  • AUC (Area Under the Curve)
  • Kappa statistics

You can find more details about these measures here: The Best Metric to Measure Accuracy of Classification Models.
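
As a quick illustration, here is a minimal sketch that computes several of the measures above directly from the four confusion-matrix counts. The counts (TP, FP, TN, FN) are made-up numbers for illustration only, and the code is plain Python rather than any particular library's API:

    # Hypothetical confusion-matrix counts (illustration only)
    TP, FP, TN, FN = 40, 10, 930, 20

    tpr = TP / (TP + FN)                          # Recall / Sensitivity / Hit Rate
    specificity = TN / (TN + FP)
    fpr = 1 - specificity                         # False Alarm Rate
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    error_rate = 1 - accuracy
    precision = TP / (TP + FP)
    f_measure = 2 / ((1 / precision) + (1 / tpr)) # harmonic mean of precision and recall

    print(f"TPR={tpr:.3f} FPR={fpr:.3f} Accuracy={accuracy:.3f} "
          f"Precision={precision:.3f} F1={f_measure:.3f}")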

All of these measures should be applied with domain knowledge and balanced against one another; for example, if you only achieve a high TPR in predicting patients who don't have cancer, it will not help at all in diagnosing cancer.

In the same cancer-diagnosis example, if only 2% or less of the patients have cancer, this is a case of class imbalance, as the percentage of cancer patients is very small compared to the rest of the population. There are two main approaches to handling this issue:

  • Use of a cost function: In this approach, the cost associated with misclassifying data is evaluated with the help of a cost matrix (similar to the confusion matrix, but concerned mainly with False Positives and False Negatives). The main aim is to reduce the total cost of misclassification. In a setting like cancer diagnosis, the cost of a False Negative is higher than the cost of a False Positive, e.g. wrongly predicting a cancer patient to be cancer-free is more dangerous than wrongly predicting a cancer-free patient to have cancer. A short sketch of the cost calculation follows this list.
   Total Cost = Cost of FN * Count of FN + Cost of FP * Count of FP
  • Use of different sampling methods: In this approach, you can use over-sampling, under-sampling, or hybrid sampling. In over-sampling, minority-class observations are replicated to balance the data; replicating observations can lead to overfitting, giving good accuracy on training data but lower accuracy on unseen data. In under-sampling, majority-class observations are removed, which causes a loss of information; it reduces processing time and storage, but is only useful if you have a large data set. A sketch of both sampling options appears after the class-imbalance link below.
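
A minimal sketch of the cost calculation above, assuming illustrative per-error costs and counts (all numbers here are hypothetical):

    # Hypothetical misclassification costs and counts (illustration only);
    # correct predictions are assumed to cost nothing.
    cost_fn, count_fn = 10.0, 20   # e.g. a missed cancer diagnosis
    cost_fp, count_fp = 1.0, 50    # e.g. a false alarm

    total_cost = cost_fn * count_fn + cost_fp * count_fp
    print(f"Total misclassification cost: {total_cost}")   # 10*20 + 1*50 = 250.0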

Find more about class imbalance here.
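
To make the two sampling options concrete, here is a sketch of random over-sampling and random under-sampling using NumPy. The data X, y are randomly generated stand-ins for an imbalanced data set (roughly 2% minority class), not real data:

    import numpy as np

    rng = np.random.default_rng(42)

    # Stand-in imbalanced data: ~98% class 0, ~2% class 1
    X = rng.normal(size=(1000, 3))
    y = (rng.random(1000) < 0.02).astype(int)

    minority_idx = np.where(y == 1)[0]
    majority_idx = np.where(y == 0)[0]

    # Over-sampling: replicate minority observations (sampling with
    # replacement) until both classes have the same number of rows.
    over_idx = np.concatenate([
        majority_idx,
        rng.choice(minority_idx, size=len(majority_idx), replace=True),
    ])
    X_over, y_over = X[over_idx], y[over_idx]

    # Under-sampling: keep all minority observations and a random subset
    # of the majority class of the same size.
    under_idx = np.concatenate([
        minority_idx,
        rng.choice(majority_idx, size=len(minority_idx), replace=False),
    ])
    X_under, y_under = X[under_idx], y[under_idx]

    print(np.bincount(y_over), np.bincount(y_under))

In practice, the imbalanced-learn package provides ready-made over-, under-, and hybrid samplers, but the underlying idea is the same as in this sketch.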

If there are multiple classes in the target variable, a confusion matrix with dimensions equal to the number of classes is formed, and all performance measures can be calculated for each class. This is called a multiclass confusion matrix. e.g. if there are 3 classes X, Y, Z in the response variable, recall for each class is calculated as below:

   Recall_X = TP_X / (TP_X + FN_X)
   Recall_Y = TP_Y / (TP_Y + FN_Y)
   Recall_Z = TP_Z / (TP_Z + FN_Z)
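
As a small sketch, per-class recall can be read directly off a multiclass confusion matrix. The 3x3 matrix below is made up, and rows are assumed to be actual classes while columns are predicted classes:

    import numpy as np

    # Hypothetical 3x3 confusion matrix for classes X, Y, Z
    # (rows = actual class, columns = predicted class)
    cm = np.array([
        [50,  3,  2],   # actual X
        [ 4, 45,  6],   # actual Y
        [ 1,  5, 60],   # actual Z
    ])

    for i, label in enumerate(["X", "Y", "Z"]):
        tp = cm[i, i]                  # predicted as class i and actually class i
        fn = cm[i, :].sum() - tp       # actually class i but predicted otherwise
        print(f"Recall_{label} = {tp / (tp + fn):.3f}")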
