A problem with Chinese text classification using fastrtext

2019-04-18

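I posted this question as an issue on GitHub (https://github.com/pommedeterresautee/fastrtext/issues/34). It is about using fastrtext for text classification prediction. The English text below is copied directly from that issue. If anyone experienced could take a look, I would be very grateful. Many thanks!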

I ran into an issue with a Chinese text classification prediction model, as follows:

test_sentences$text2[9]
[1] "蛋白粉开封后两个月在次食用味道发苦"
predict(model,test_sentences$text2[9])
[[1]]
__label__262
0.5312194

predict(model, "蛋白粉开封后两个月在次食用味道发苦")
[[1]]
__label__314
0.9935217

Basically, after training a model with "fastrtext", predicting on tokenized Chinese text that is stored in an object (test_sentences$text2[9] in my case) gives a wrong label with a low probability. If I instead paste the same tokenized Chinese text directly into the predict() call as a string literal, as in the second example above, it gives the correct label with a high probability. I am really confused by this. Can anyone help? Much appreciated!
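One way to narrow this down is to check whether the stored value and the typed literal are really the same string. The sketch below uses only base R. The variable names `typed`, `stored`, and `cleaned` are hypothetical, and `stored` is built here as a factor with trailing whitespace purely as an assumption for illustration. The issue does not confirm that this is the actual cause, so treat it as a diagnostic checklist rather than a diagnosis.

```r
# Hypothetical stand-ins for the two inputs in the question.
typed  <- "蛋白粉开封后两个月在次食用味道发苦"
# Assumption for illustration only: the data frame column is a factor
# and the value carries trailing whitespace.
stored <- factor(paste0(typed, " "))

# 1. Type: a factor is not a plain character vector.
print(class(stored))

# 2. Exact equality after converting to character.
print(identical(as.character(stored), typed))

# 3. Declared encoding of each string ("UTF-8", "unknown", ...).
print(Encoding(as.character(stored)))
print(Encoding(typed))

# 4. Raw bytes: differing lengths reveal hidden characters or a different encoding.
print(length(charToRaw(as.character(stored))))
print(length(charToRaw(typed)))

# 5. Normalise: convert to character, trim whitespace, force UTF-8.
cleaned <- enc2utf8(trimws(as.character(stored)))
print(identical(cleaned, typed))
```

If the last check passes on the real data, passing the cleaned string to predict(model, ...) should at least be worth trying. If the strings are already identical, the difference has to come from somewhere else.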