A JavaScript/ECMAScript 6-specific solution is desired.
I want to generate a random sample from an array of objects using an array of weighted values for each object. The population list contains the actual members of the population - not the types of members. Once one is selected for a sample, it can't be selected again.
An analogous problem to the one I'm working on would be simulating a probable outcome for a chess tournament. Each player's rating would be their weight. A player can only place once (1st, 2nd, or 3rd place) per tournament.
Picking a likely list of the top 3 winners could look like:
let winners = wsample(chessPlayers,  // population
                      playerRatings, // weights
                      3);            // sample size
The weights may, or may not, be integer values. They could be floats like [0.2, 0.1, 0.7, 0.3], or integers like [20, 10, 70, 30]. The weights do not have to add up to a value that represents 100%.
Peter, below, gave me a good reference on a general algorithm, though it's not specific to JS: https://stackoverflow.com/a/62459274/7915759. It may be a good point of reference.
Solutions that rely on generating a second population list, with each member copied weight number of times, may not be practical: each weight in the weights array could be a very large number, or fractional; basically, any non-negative value.
Some additional questions:
- Is there an accumulate() function available in JS?
- Is there a bisect() type function in JS that does a binary search of sorted lists? (A rough sketch of both follows.)
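As far as I know there are no direct built-ins for either in JS/ES6, but both are only a few lines to write. A minimal sketch (the names accumulate and bisectLeft are placeholders, not standard APIs):

function accumulate(values) {
  // running totals, like Python's itertools.accumulate
  const sums = [];
  let total = 0;
  for (const v of values) {
    total += v;
    sums.push(total);
  }
  return sums;
}

function bisectLeft(sorted, target) {
  // leftmost insertion point in a sorted array, like Python's bisect.bisect_left
  let lo = 0;
  let hi = sorted.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (sorted[mid] < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

With these, a single weighted draw (with replacement, and ignoring exact-boundary ties) would be population[bisectLeft(accumulate(weights), Math.random() * total)], where total is the sum of all the weights.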
The following implementation selects k out of n elements, without replacement, with weighted probabilities, in O(n + k log n), by keeping the accumulated weights of the remaining elements in a sum heap:
function sample_without_replacement<T>(population: T[], weights: number[], sampleSize: number) {
  // size = smallest power of two >= weights.length;
  // the leaves of the sum heap live at indices [size, size + weights.length)
  let size = 1;
  while (size < weights.length) {
    size = size << 1;
  }
  // construct a sum heap for the weights: each inner node holds the
  // total weight of the leaves below it; the root holds the grand total
  const root = 1;
  const w = [...new Array(size) as number[], ...weights, 0];
  for (let index = size - 1; index >= 1; index--) {
    const leftChild = index << 1;
    const rightChild = leftChild + 1;
    w[index] = (w[leftChild] || 0) + (w[rightChild] || 0);
  }
  // retrieves the element with weight-index r
  // from the part of the heap rooted at index
  const retrieve = (r: number, index: number): T => {
    if (index >= size) {
      // reached a leaf: zero its weight so it cannot be selected again
      w[index] = 0;
      return population[index - size];
    }
    const leftChild = index << 1;
    const rightChild = leftChild + 1;
    try {
      // binary search down the heap: descend left if r falls within
      // the left subtree's total weight, otherwise descend right
      if (r <= w[leftChild]) {
        return retrieve(r, leftChild);
      } else {
        return retrieve(r - w[leftChild], rightChild);
      }
    } finally {
      // on the way back up, refresh the partial sums along the search
      // path to account for the removed leaf (O(log n) updates per draw)
      w[index] = w[leftChild] + w[rightChild];
    }
  };
  // and now retrieve sampleSize random elements without replacement
  const result: T[] = [];
  for (let k = 0; k < sampleSize; k++) {
    result.push(retrieve(Math.random() * w[root], root));
  }
  return result;
}
The code is written in TypeScript. You can transpile it to whatever version of ECMAScript you need in the TypeScript Playground.
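Applied to the chess example from the question (the player names and ratings here are made up for illustration), a call could look like:

const chessPlayers = ["Ana", "Boris", "Carla", "Dmitri"];
const playerRatings = [2810, 2750, 2650, 2300];
const winners = sample_without_replacement(chessPlayers, playerRatings, 3);
// winners might be e.g. ["Ana", "Carla", "Boris"], in 1st/2nd/3rd order;
// higher-rated players are more likely to appear, but nothing is guaranteed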
Test code:
const n = 1E7;
const k = n / 2;
const population: number[] = [];
const weight: number[] = [];
for (let i = 0; i < n; i++) {
  population[i] = i;
  weight[i] = i;
}
console.log(`sampling ${k} of ${n} elements without replacement`);
const sample = sample_without_replacement(population, weight, k);
console.log(sample.slice(0, 100)); // logging everything takes forever on some consoles
console.log("Done");
Executed in Chrome, this samples 5 000 000 out of 10 000 000 entries in about 10 seconds.
What known algorithm did you model this after, and do you know if it's statistically accurate or an approximation, like a reservoir sample? If you have time, could you please add a few comments to the code to identify the different parts (heap insert, pop, etc.)?
It's pretty much the same algorithm you use, except that I accumulate weights in a sum heap so I can update partial sums in O(log n) rather than O(n). The statistical distribution should therefore be identical. I added some comments and a link to a blog post explaining the technique.
Hmm... I don't know how to get around updating all the subsequent accumulated weights after the current selection. I ran your code several times to analyze it, and it performs far fewer updates than mine, even after I modified my code to reduce the accumulated-weight updates. I'm not sure how a heap could reduce that overhead without missing some elements. Your code seems to produce valid results though =/
Then you should read the blog post I linked :-). The basic idea is that rather than calculating the sum from left to right, like ((((((a+b)+c)+d)+e)+f)+g)+h, I calculate it like ((a+b)+(c+d))+((e+f)+(g+h)). This means that when an input changes, only O(log n) intermediary results are affected, allowing me to calculate the new sum quickly by keeping track of those intermediary results. Also, I can use these intermediary results to quickly recalculate the O(log n) partial sums the binary search visits, bringing the overall runtime per element retrieved down to O(log n).

Yes... I was just looking at that. Great reference - thanks!
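For anyone who wants the pairwise-sum idea from that last comment in isolation, here is a minimal standalone sketch of such a sum tree with O(log n) point updates (makeSumTree and setWeight are illustrative names, not part of the answer above):

// sum tree over `size` slots, where `size` is assumed to be a power of two;
// tree[1] holds the grand total, the leaves live at tree[size .. 2*size-1]
function makeSumTree(size) {
  const tree = new Array(2 * size).fill(0);
  return {
    // change one leaf, then recompute only the O(log size)
    // partial sums on the path from that leaf up to the root
    setWeight(i, value) {
      let index = size + i;
      tree[index] = value;
      while (index > 1) {
        index >>= 1;
        tree[index] = tree[2 * index] + tree[2 * index + 1];
      }
    },
    total() {
      return tree[1];
    },
  };
}

Updating one weight touches only the nodes directly above it, which is exactly why the answer's retrieve() can zero out a selected leaf and fix up the partial sums on its way back out of the recursion.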