2017-05-01
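While cleaning data for sentiment analysis, several statements fail with "incorrect number of dimensions". I am a complete beginner and have never studied R before. Could anyone explain what is causing this?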

> train<- read.csv("C:\\Users\\Administrator\\Desktop\\新建文件夹\\1.csv",quote = "",sep = "\"", header = F,col.names = 'msg', stringsAsFactors = F)
> neg <- read.csv("C:\\Users\\Administrator\\Desktop\\新建文件夹\\neg.csv", header = F, sep = ",", stringsAsFactors = F)
> weight <- rep(-1, length(neg[,1]))
> neg <- cbind(neg, weight)
> pos <- read.csv("C:\\Users\\Administrator\\Desktop\\新建文件夹\\pos.csv", header = F, sep = ",", stringsAsFactors = F)
> weight <- rep(1, length(pos[,1]))
> pos <- cbind(pos, weight)
> posneg <- rbind(pos, neg)
> names(posneg) <- c("term", "weight")
> posneg <- posneg[!duplicated(posneg$term), ]
> dict <- posneg[, "term"]
> library(Rwordseg)
> sentence <- as.vector(train$msg)
> sentence <- gsub("[[:digit:]]*", "", sentence)
> sentence <- gsub("[a-zA-Z]", "", sentence)
> sentence <- gsub("\\.", "", sentence)
> train<- train[!is.na(sentence), ]
> sentence <- sentence[!is.na(sentence)]
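
A likely cause of the error on the next statement, `train[!nchar(sentence) < 2, ]`: `train` was read with a single column (`msg`), and row-subsetting a one-column data frame with `train[rows, ]` uses the default `drop = TRUE`, which collapses the result to a plain vector. A vector has no second dimension, so the following `[rows, ]` fails. The sketch below reproduces this with made-up placeholder strings (the values are assumptions, not the contents of 1.csv) and shows `drop = FALSE` as one way to keep the data frame.

```r
# Hypothetical stand-in for the single-column `train` from the post
train <- data.frame(msg = c("good", NA, "x", "not bad"), stringsAsFactors = FALSE)
sentence <- as.vector(train$msg)

# Default drop = TRUE: a one-column data frame collapses to a character vector
collapsed <- train[!is.na(sentence), ]
is.data.frame(collapsed)

# Indexing that vector with [rows, ] raises the dimension error; capture its message
err <- tryCatch(collapsed[nchar(collapsed) >= 2, ],
                error = function(e) conditionMessage(e))

# drop = FALSE keeps the data frame, so later [rows, ] calls still work
kept <- train[!is.na(sentence), , drop = FALSE]
sentence <- sentence[!is.na(sentence)]

keep <- nchar(sentence) >= 2          # same filter as !nchar(sentence) < 2
kept <- kept[keep, , drop = FALSE]
sentence <- sentence[keep]
```

Computing `keep` once and applying it to both objects keeps `kept` and `sentence` aligned row for row.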

> train <- train[!nchar(sentence) < 2, ]    # R reports "incorrect number of dimensions" here, and I really cannot find what is wrong
> sentence <- sentence[!nchar(sentence) < 2]
> system.time(x <- segmentCN(strwords = sentence))
> temp <- lapply(x, length)
> temp <- unlist(temp)
> id <- rep(train[, "id"], temp)   # "incorrect number of dimensions" is reported here too
> label <- rep(train[, "label"], temp)   # and here as well
> term <- unlist(x)  
> testterm <- as.data.frame(cbind(id, term, label), stringsAsFactors = F)
> stopword <- read.csv("C:\\Users\\Administrator\\Desktop\\新建文件夹\\stopword.csv", header = F, sep = ",", stringsAsFactors = F)
> stopword <- stopword[!stopword$term %in% posneg$term,]
> testterm <- testterm[!testterm$term %in% stopword,]
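
Two further points in the code above look worth checking. First, `rep(train[, "id"], temp)` needs `train` to be a data frame that actually has `id` and `label` columns, but it was read with `col.names = 'msg'` only. Second, a file read with `header = F` gets the column name `V1`, so `stopword$term` would be `NULL`. The sketch below is a minimal illustration under stated assumptions: the `id`, `label`, and sample texts are invented, and `strsplit` on spaces stands in for `segmentCN` so the example runs without Rwordseg.

```r
# Hypothetical training data: `id` and `label` are assumed columns,
# since the `train` in the post only has `msg`
train <- data.frame(
  id    = c(1, 2),
  label = c("pos", "neg"),
  msg   = c("room clean staff friendly", "bed hard"),
  stringsAsFactors = FALSE
)

# Stand-in tokenizer (assumption): split on spaces instead of segmentCN
x <- strsplit(train$msg, " ", fixed = TRUE)

# Number of terms per document, then repeat each id/label that many times
temp <- lengths(x)
testterm <- data.frame(
  id    = rep(train$id, temp),
  term  = unlist(x),
  label = rep(train$label, temp),
  stringsAsFactors = FALSE
)

# Name the stopword column explicitly so that `$term` exists
stopword <- data.frame(term = c("staff", "hard", "the"), stringsAsFactors = FALSE)
posneg   <- data.frame(term = c("clean", "friendly", "hard"),
                       weight = c(1, 1, -1), stringsAsFactors = FALSE)

# Keep sentiment words out of the stopword list, as a character vector
stopword <- stopword$term[!stopword$term %in% posneg$term]

# Drop stopwords from the term table
testterm <- testterm[!testterm$term %in% stopword, ]
```

Building `testterm` with `data.frame()` rather than `as.data.frame(cbind(...))` keeps `id` numeric, because `cbind` on mixed vectors coerces every column to character.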

Attachments

1.xlsx (157.82 KB)

All replies
2023-11-28 10:42:53