Warm tip: This article is reproduced from serverfault.com, please click

Weighted random sample of array items *without replacement*

发布于 2020-11-28 01:42:58

Javascript/ECMAScript 6 specific solution desired.

I want to generate a random sample from an array of objects using an array of weighted values for each object. The population list contains the actual members of the population - not the types of members. Once one is selected for a sample, it can't be selected again.

An analogous problem to the one I'm working on would be simulating a probable outcome for a chess tournament. Each player's rating would be their weight. A player can only place once (1st, 2nd, or 3rd place) per tournament.

To pick a likely list of the top 3 winners could look like:

let winners = wsample(chessPlayers,  // population
                      playerRatings, // weights
                      3);            // sample size

The weighted list may, or may not, be integer values. It could be floats like [0.2, 0.1, 0.7, 0.3], or it could be integers like [20, 10, 70, 30]. The weights do not have to add up to a value that represents 100%.

Peter below gave me a good reference on a general algorithm, however It's not specific to JS: https://stackoverflow.com/a/62459274/7915759 it may be a good point of reference.

Solutions to the problem that rely on generating a second population list with each member copied weight number of times may not be a practical solution. Each weight in the weights array could be very high numbers, or they could be fractional; basically, any non-negative value.

Some additional questions:

  • Is there already an accumulate() function available in JS?
  • Is there a bisect() type function in JS that does a binary search of sorted lists?
  • Are there any efficient and low memory footprint JS modules available with statistical functions that include solutions for the above?
Questioner
Todd
Viewed
0
meriton 2020-11-30 13:29:00

The following implementation selects k out of n elements, without replacement, with weighted probabilities, in O(n + k log n) by keeping the accumulated weights of the remaining elements in a sum heap:

function sample_without_replacement<T>(population: T[], weights: number[], sampleSize: number) {

    let size = 1;
    while (size < weights.length) {
        size = size << 1;
    }

    // construct a sum heap for the weights
    const root = 1;
    const w = [...new Array(size) as number[], ...weights, 0];
    for (let index = size - 1; index >= 1; index--) {
        const leftChild = index << 1;
        const rightChild = leftChild + 1;
        w[index] = (w[leftChild] || 0) + (w[rightChild] || 0);
    }

    // retrieves an element with weight-index r 
    // from the part of the heap rooted at index
    const retrieve = (r: number, index: number): T => {
        if (index >= size) {
            w[index] = 0;
            return population[index - size];
        } 
        
        const leftChild = index << 1;
        const rightChild = leftChild + 1;

        try {
            if (r <= w[leftChild]) {
                return retrieve(r, leftChild);
            } else {
                return retrieve(r - w[leftChild], rightChild);
            }
        } finally {
            w[index] = w[leftChild] + w[rightChild];
        }
    }

    // and now retrieve sampleSize random elements without replacement
    const result: T[] = [];
    for (let k = 0; k < sampleSize; k++) {
        result.push(retrieve(Math.random() * w[root], root));
    }
    return result;
}

The code is written in TypeScript. You can transpile it to whatever version of EcmaScript you need in the TypeScript playground.

Test code:

const n = 1E7;
const k = n / 2;
const population: number[] = [];
const weight: number[] = [];
for (let i = 0; i < n; i++) {
    population[i] = i;
    weight[i] = i;
}

console.log(`sampling ${k} of ${n} elments without replacement`);
const sample = sample_without_replacement(population, weight, k);
console.log(sample.slice(0, 100)); // logging everything takes forever on some consoles
console.log("Done")

Executed in Chrome, this samples 5 000 000 out of 10 000 000 entries in about 10 seconds.