前情说明:需要将地址(C1)字段中的国家提取出来,C1为地址字段(此字段中包括作者、机构、国家,我需要提取国家出来)现编辑程序如下:
#read file
wos.file.read <- function(file.name) {
file.items <- read.delim(file = file.name, header = F, row.names = NULL, encoding = "UTF-8", sep = "\t", dec = ".", quote = "", comment.char = "", colClasses = "character")
headers <- file.items[1,]
for (i in 1:ncol(file.items)) {
colnames(file.items)
<- headers
}
row.names(file.items) <- NULL
return(file.items[-1,])
}
#由于合著行为,一条记录的地址字段会有多个作者,相应多个国家,故要将地址字段按“;”分割,但由于此字段也包含作者,作者信息中也有“;”,所以要先替换“[]”中的“;”为“#”
wos.string.replace <- function(line, oldchar, newchar) {
reg <- "\\[(\\s|\\S)*?\\]"
pos <- str_locate_all(line,reg)
lines <- str_sub(line, start = pos[[1]][, "start"], end = pos[[1]][, "end"])
for (i in seq_along(lines)) {
pos.new <- str_locate_all(lines, oldchar)
pos.new <- pos.new[[1]] + pos[[1]][i, "start"] - 1
for (i in seq_along(pos.new[, "start"])) {
str_sub(line, start = pos.new[i, "start"], end = pos.new[i, "end"]) <- newchar
}
return(lines[[1]])
}
return(lines)
}
#现按 ";"进行分割
wos.item.addresses <- function(line) {
#不为空或者包括;
splitchar<-";"
if (length(line)!=0) {
if(str_detect(line,splitchar)>0){
terms <- strsplit(line,splitchar)
terms <- as.character(terms)
return(terms[[1]])
}
}
return(terms)
}
#提取国家
wos.item.countries <- function(terms) {
locatechar <- ","
if (length(terms)!=0){
allcomma <-str_locate_all(terms,locatechar)
rownum<-nrow(allcomma)
pos<-allcomma[rownum]
pos <- as.numeric(pos)
extractions<-str_sub(terms,pos+1)
extractions<-str_trim(extractions)
#return(extractions[[1]])
}
return(extractions)
}
#统一调用以上函数(每条记录的国家对应每条记录的年份)
wos.items.countries <- function(items) {
uts <- vector()
years <- vector()
countries <- vector()
for (i in seq_along(items[, "C1"])) {
line = items[i, "C1"]
if (length(line)>0) {
year <- as.integer(items[i, "PY"])
line <- wos.string.replace(line,";","#")
addresses <- wos.item.addresses(line)
extraction <- wos.item.countries(addresses)
extractions <- c(extraction)
years <- c(years, rep(year, times = length(addresses)))
}
}
countries = data.frame(extractions, years, stringsAsFactors = FALSE)
colnames(countries) <- c("country", "year")
return(countries)
}
这段程序有一些问题,比如:
1、运行后出现
Error in stri_locate_all_regex(string, pattern, omit_no_match = TRUE, :
argument `str` should be a character vector (or an object coercible to)
我把样本数据上传至附件,希望能够得到解答。
无限感谢!要是我的问题或者要求没说清楚,请告知我,我补充或者改正。