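I ran into this term while reading a paper and looked it up. This is just my own rough understanding, so corrections are welcome.
Data mining bias is a bias that stems from the data mining method itself.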
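“Data mining” is a method of predicting future trends by searching historical data for significant patterns. In essence, a pattern found this way can only reflect relationships that held in the past (ex-ante relevance), and whether the same pattern can be used to predict the future needs careful consideration. Data mining bias refers to the error that arises when an analysis relies too heavily on the mining process and so treats features of the data that may be pure coincidence as economic regularities that will recur.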
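Take the “January effect” as an example. By studying 50 to 70 years of past stock market data, researchers found that stock returns in January were higher than in other months, and so predicted that the pattern would persist. But once everyone is aware of the pattern and trades on it, then in an efficient market that trading weakens the January effect, and the prediction mined from the historical data no longer holds. This is data mining bias.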
Source: http://www.investopedia.com/exam-guide/cfa-level-1/quantitative-methods/sampling-bias.asp
The original text follows:
“Data mining is the practice of searching through historical data in an effort to find significant patterns, with which researchers can build a model and make conclusions on how this population will behave in the future.”
“Data-mining bias refers to the errors that result from relying too heavily on data-mining practices. In other words, while some patterns discovered in data mining are potentially useful, many others might just be coincidental and are not likely to be repeated in the future - particularly in an "efficient" market.”
“For example, the so-called January effect, where stock market returns tend to be stronger in the month of January, is a product of data mining: monthly returns on indexes going back 50 to 70 years were sorted and compared against one another, and the patterns for the month of January were noted.”
“For example, we may not be able to continue to profit from the January effect going forward, given that this phenomenon is so widely recognized. As a result, stocks are bid for higher in November and December by market participants anticipating the January effect, so that by the start of January, the effect is priced into stocks and one can no longer take advantage of the model. Intergenerational data mining refers to the continued use of information already put forth in prior financial research as a guide for testing the same patterns and overstating the same conclusions.”
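To make the "coincidental pattern" point concrete, here is a small simulation. It is my own sketch and not from the Investopedia article. It assumes monthly returns are pure noise with the same mean in every month, so no real calendar effect exists. The function names and the parameters `mu`, `sigma`, `n_years`, and `n_trials` are made up for illustration. In each trial it "mines" the best month from one 60-year history and then checks how that same month does in a fresh 60-year history.

```python
import random
import statistics


def simulate_month_means(n_years, rng, mu=0.01, sigma=0.05):
    """Average return per calendar month over n_years of pure-noise returns.

    Every month is drawn from the same normal distribution (assumed values
    mu and sigma), so any month that looks special does so by chance.
    """
    by_month = [[] for _ in range(12)]
    for _ in range(n_years):
        for month in range(12):
            by_month[month].append(rng.gauss(mu, sigma))
    return [statistics.fmean(returns) for returns in by_month]


def mined_month_experiment(n_trials=500, n_years=60, seed=0):
    """Compare the mined best month in-sample vs. out-of-sample.

    Returns (average in-sample return of the best month,
             average out-of-sample return of that same month).
    """
    rng = random.Random(seed)
    in_sample = []
    out_of_sample = []
    for _ in range(n_trials):
        past = simulate_month_means(n_years, rng)
        future = simulate_month_means(n_years, rng)
        # The "data mining" step: pick whichever month looked best in the past.
        best = max(range(12), key=lambda month: past[month])
        in_sample.append(past[best])
        out_of_sample.append(future[best])
    return statistics.fmean(in_sample), statistics.fmean(out_of_sample)


if __name__ == "__main__":
    mined, realized = mined_month_experiment()
    print(f"best month, in-sample mean return:   {mined:.4f}")
    print(f"same month, out-of-sample mean return: {realized:.4f}")
```

Under these assumptions the mined month should look clearly better than the true mean of 1% in the history it was selected from, and fall back to roughly 1% in the new history. The gap between the two numbers is the data mining bias. Note that this only shows the selection effect. It does not model the efficient-market mechanism from the quote, where traders arbitrage a real effect away.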