A JavaScript/ECMAScript 6-specific solution is desired.
I want to generate a random sample from an array of objects using an array of weighted values for each object. The population list contains the actual members of the population - not the types of members. Once one is selected for a sample, it can't be selected again.
An analogous problem to the one I'm working on would be simulating a probable outcome for a chess tournament. Each player's rating would be their weight. A player can only place once (1st, 2nd, or 3rd place) per tournament.
Picking a likely list of the top 3 winners could look like:
let winners = wsample(chessPlayers,  // population
                      playerRatings, // weights
                      3);            // sample size
The weights may, or may not, be integer values. They could be floats like [0.2, 0.1, 0.7, 0.3], or integers like [20, 10, 70, 30]. The weights do not have to add up to a value that represents 100%.
Peter, below, gave me a good reference on a general algorithm, though it's not specific to JS: https://stackoverflow.com/a/62459274/7915759. It may be a good point of reference.
Solutions that rely on generating a second population list, with each member copied weight number of times, may not be practical: each weight in the weights array could be a very large number, or fractional; basically, any non-negative value.
Some additional questions:
- Is there an accumulate() function available in JS?
- Is there a bisect() type function in JS that does a binary search of sorted lists? (A rough sketch of both follows.)
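As far as I know there are no direct built-ins for either in JS/ES6, but both are only a few lines to write. A minimal sketch (the names accumulate and bisectLeft are placeholders, not standard APIs):

function accumulate(values) {
  // running totals, like Python's itertools.accumulate
  const sums = [];
  let total = 0;
  for (const v of values) {
    total += v;
    sums.push(total);
  }
  return sums;
}

function bisectLeft(sorted, target) {
  // leftmost insertion point in a sorted array, like Python's bisect.bisect_left
  let lo = 0;
  let hi = sorted.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (sorted[mid] < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

With these, a single weighted draw (with replacement, and ignoring exact-boundary ties) would be population[bisectLeft(accumulate(weights), Math.random() * total)], where total is the sum of all the weights.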
The following implementation selects k out of n elements, without replacement, with weighted probabilities, in O(n + k log n), by keeping the accumulated weights of the remaining elements in a sum heap:
function sample_without_replacement<T>(population: T[], weights: number[], sampleSize: number) {
  // size = smallest power of two >= weights.length;
  // the leaves of the sum heap live at indices [size, size + weights.length)
  let size = 1;
  while (size < weights.length) {
    size = size << 1;
  }
  // construct a sum heap for the weights: each inner node holds the
  // total weight of the leaves below it; the root holds the grand total
  const root = 1;
  const w = [...new Array(size) as number[], ...weights, 0];
  for (let index = size - 1; index >= 1; index--) {
    const leftChild = index << 1;
    const rightChild = leftChild + 1;
    w[index] = (w[leftChild] || 0) + (w[rightChild] || 0);
  }
  // retrieves the element with weight-index r
  // from the part of the heap rooted at index
  const retrieve = (r: number, index: number): T => {
    if (index >= size) {
      // reached a leaf: zero its weight so it cannot be selected again
      w[index] = 0;
      return population[index - size];
    }
    const leftChild = index << 1;
    const rightChild = leftChild + 1;
    try {
      // binary search down the heap: descend left if r falls within
      // the left subtree's total weight, otherwise descend right
      if (r <= w[leftChild]) {
        return retrieve(r, leftChild);
      } else {
        return retrieve(r - w[leftChild], rightChild);
      }
    } finally {
      // on the way back up, refresh the partial sums along the search
      // path to account for the removed leaf (O(log n) updates per draw)
      w[index] = w[leftChild] + w[rightChild];
    }
  };
  // and now retrieve sampleSize random elements without replacement
  const result: T[] = [];
  for (let k = 0; k < sampleSize; k++) {
    result.push(retrieve(Math.random() * w[root], root));
  }
  return result;
}
The code is written in TypeScript. You can transpile it to whatever version of ECMAScript you need in the TypeScript Playground.
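Applied to the chess example from the question (the player names and ratings here are made up for illustration), a call could look like:

const chessPlayers = ["Ana", "Boris", "Carla", "Dmitri"];
const playerRatings = [2810, 2750, 2650, 2300];
const winners = sample_without_replacement(chessPlayers, playerRatings, 3);
// winners might be e.g. ["Ana", "Carla", "Boris"], in 1st/2nd/3rd order;
// higher-rated players are more likely to appear, but nothing is guaranteed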
Test code:
const n = 1E7;
const k = n / 2;
const population: number[] = [];
const weight: number[] = [];
for (let i = 0; i < n; i++) {
  population[i] = i;
  weight[i] = i;
}
console.log(`sampling ${k} of ${n} elements without replacement`);
const sample = sample_without_replacement(population, weight, k);
console.log(sample.slice(0, 100)); // logging everything takes forever on some consoles
console.log("Done");
Executed in Chrome, this samples 5 000 000 out of 10 000 000 entries in about 10 seconds.
What known algorithm did you model this after, and do you know if it's statistically accurate or an approximation, like a reservoir sample? If you have time, could you please add a few comments to the code to identify the different parts (heap insert, pop, etc.)?
It's pretty much the same algorithm you use, except that I accumulate weights in a sum heap so I can update partial sums in O(log n) rather than O(n). The statistical distribution should therefore be identical. I added some comments and a link to a blog post explaining the technique.
Hmm... I don't know how to get around updating all the subsequent accumulated weights after the current selection. I ran your code several times to analyze it, and it performs far fewer updates than mine, even after I modified my code to reduce the accumulated-weight updates. I'm not sure how a heap could reduce that overhead without missing some elements. Your code seems to produce valid results though =/
Then you should read the blog post I linked :-). The basic idea is that rather than calculating the sum from left to right, like ((((((a+b)+c)+d)+e)+f)+g)+h, I calculate it like ((a+b)+(c+d))+((e+f)+(g+h)). This means that when an input changes, only O(log n) intermediary results are affected, allowing me to calculate the new sum quickly by keeping track of those intermediary results. Also, I can use these intermediary results to quickly recalculate the O(log n) partial sums the binary search visits, bringing the overall runtime per element retrieved down to O(log n).

Yes... I was just looking at that. Great reference - thanks!
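For anyone who wants the pairwise-sum idea from that last comment in isolation, here is a minimal standalone sketch of such a sum tree with O(log n) point updates (makeSumTree and setWeight are illustrative names, not part of the answer above):

// sum tree over `size` slots, where `size` is assumed to be a power of two;
// tree[1] holds the grand total, the leaves live at tree[size .. 2*size-1]
function makeSumTree(size) {
  const tree = new Array(2 * size).fill(0);
  return {
    // change one leaf, then recompute only the O(log size)
    // partial sums on the path from that leaf up to the root
    setWeight(i, value) {
      let index = size + i;
      tree[index] = value;
      while (index > 1) {
        index >>= 1;
        tree[index] = tree[2 * index] + tree[2 * index + 1];
      }
    },
    total() {
      return tree[1];
    },
  };
}

Updating one weight touches only the nodes directly above it, which is exactly why the answer's retrieve() can zero out a selected leaf and fix up the partial sums on its way back out of the recursion.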