全部版块 我的主页
论坛 计量经济学与统计论坛 五区 计量经济学与统计软件 HLM专版
1339 0
2014-04-19

I vaguely recall from grad school that the following is a valid approach to do a weighted sampling without replacement:

  • Start with an initially empty "sampled set".
  • Draw a (single) weighted sample with replacement with whatever method you have.
  • Check whether you have already picked it.
  • If you did, ignore it and move to the next sample. If you didn't, add it to your sampled set.
  • Repeat 2-4 until the sampled set is of the desired size.

Will this give me a valid weighted sample?

(Context: for very large samples, R's sample(x,n,replace=FALSE,p) takes a really long time, much longer than implementing the strategy above.)

UPDATE 2012-01-05

Thank you everyone for your comments and responses! Regarding code, here's an example that faithfully reproduces what I'm trying to do (although I'm actually calling R from Python with RPy, but it seems the runtime characteristics are unchanged).

First, define the following function, which assumes a sufficiently long sequence of with-replacement samples as input:

get_rejection_sample <- function(replace_sample, n) {    top = length(replace_sample)    bottom = 1    mid = (top+bottom)/2    u = unique(replace_sample[1:mid])    while(length(u) != n) {        if (length(u) < n) {            bottom = mid        }        else {            top = mid        }        mid = (top+bottom)/2        u = unique(replace_sample[1:mid])        if (mid == top || mid == bottom) {            break        }    }    u}

It simply does a binary search for the number of non-unique samples needed to get enough unique samples. Now, in the interpreter:

> x = 1:10**7> w = rexp(length(x))> system.time(y <- sample(x, 500, FALSE, w))   user  system elapsed  12.820   0.048  12.902 > system.time(y <- get_rejection_sample(sample(x, 1000, TRUE, w), 500))   user  system elapsed   0.311   0.066   0.378 Warning message:In sample(x, 1000, TRUE, w) :  Walker's alias method used: results are different from R < 2.2.0

As you can see, even for just 500 elements the difference is dramatic. I'm actually trying to sample 106 indices from a list of about 1.2×107, which takes hours using replace=FALSE.


二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝

相关推荐
栏目导航
热门文章
推荐文章

说点什么

分享

扫码加好友,拉您进群
各岗位、行业、专业交流群