How do I sample without replacement using a sampling-with-replacement function?

Lisrelchen

1391

收藏 2014-04-19

I vaguely recall from grad school that the following is a valid approach to do a weighted sampling without replacement:

Start with an initially empty "sampled set".
Draw a (single) weighted sample with replacement with whatever method you have.
Check whether you have already picked it.
If you did, ignore it and move to the next sample. If you didn't, add it to your sampled set.
Repeat 2-4 until the sampled set is of the desired size.

Will this give me a valid weighted sample?

(Context: for very large samples, R's sample(x,n,replace=FALSE,p) takes a really long time, much longer than implementing the strategy above.)

UPDATE 2012-01-05

Thank you everyone for your comments and responses! Regarding code, here's an example that faithfully reproduces what I'm trying to do (although I'm actually calling R from Python with RPy, but it seems the runtime characteristics are unchanged).

First, define the following function, which assumes a sufficiently long sequence of with-replacement samples as input:

get_rejection_sample <- function(replace_sample, n) { top = length(replace_sample) bottom = 1 mid = (top+bottom)/2 u = unique(replace_sample[1:mid]) while(length(u) != n) { if (length(u) < n) { bottom = mid } else { top = mid } mid = (top+bottom)/2 u = unique(replace_sample[1:mid]) if (mid == top || mid == bottom) { break } } u}

It simply does a binary search for the number of non-unique samples needed to get enough unique samples. Now, in the interpreter:

> x = 1:10**7> w = rexp(length(x))> system.time(y <- sample(x, 500, FALSE, w)) user system elapsed 12.820 0.048 12.902 > system.time(y <- get_rejection_sample(sample(x, 1000, TRUE, w), 500)) user system elapsed 0.311 0.066 0.378 Warning message:In sample(x, 1000, TRUE, w) : Walker's alias method used: results are different from R < 2.2.0

As you can see, even for just 500 elements the difference is dramatic. I'm actually trying to sample 106 indices from a list of about 1.2×107, which takes hours using replace=FALSE.

扫码加我拉你入群

请注明：姓名-公司-职位

以便审核进群资格，未注明则拒绝

扫码加我 拉你入群

分享

扫码加好友，拉您进群

扫码加我拉你入群