Warm tip: This article is reproduced from serverfault.com, please click

Minimum absolute difference between vector pairs (greedy algorithm)

发布于 2020-11-21 20:14:50

Given a numeric vector, I'd like to find the smallest absolute difference in combinations of size 2. However, the point of friction comes with the use of combn to create the matrix holding the pairs. How would one handle issues when a matrix/vector is too large?

When the number of resulting pairs (number of columns) using combn is too large, I get the following error:

Error in matrix(r, nrow = len.r, ncol = count) : invalid 'ncol' value (too large or NA)

This post states that the size limit of a matrix is roughly one billion rows and two columns.

Here is the code I've used. Apologies for the use of cat in my function output -- I'm solving the Minimum Absolute Difference in an Array Greedy Algorithm problem in HackerRank and R outputs are only counted as correct if they're given using cat:

minimumAbsoluteDifference <- function(arr) {
  combos <- combn(arr, 2)
  
  cat(min(abs(combos[1,] - combos[2,])))
}

# This works fine
input0 <- c(3, -7, 0)

minimumAbsoluteDifference(input0) #returns 3

# This fails
inputFail <- rpois(10e4, 1)

minimumAbsoluteDifference(inputFail) 
  #Error in matrix(r, nrow = len.r, ncol = count) : 
  # invalid 'ncol' value (too large or NA)
Questioner
Chris Aguilar
Viewed
0
Joseph Wood 2020-12-01 06:41:30

TL;DR

No need for combn or the like, simply:

min(abs(diff(sort(v))))

The Nitty Gritty

Finding the difference between every possible combinations is O(n^2). So when we get to vectors of length 1e5, the task is burdensome both computationally and memory-wise.

We need a different approach.

How about sorting and taking the difference only with its neighbor?

By first sorting, for any element vj, the difference min |vj - vj -/+ 1| will be the smallest such difference involving vj. For example, given the sorted vector v:

v = -9 -8 -6 -4 -2  3  8

The smallest distance from -2 is given by:

|-2 - 3| = 5
|-4 - -2| = 2

There is no need in checking any other elements.

This is easily implemented in base R as follows:

getAbsMin <- function(v) min(abs(diff(sort(v))))

I'm not going to use rpois as with any reasonably sized vector, duplicates will be produces, which will trivially give 0 as an answer. A more sensible test would be with runif or sample (minimumAbsoluteDifference2 is from the answer provided by @RuiBarradas):

set.seed(1729)
randUnif100 <- lapply(1:100, function(x) {
    runif(1e3, -100, 100)
})

randInts100 <- lapply(1:100, function(x) {
    sample(-(1e9):(1e9), 1e3)
})

head(sapply(randInts100, getAbsMin))
[1]  586 3860 2243 2511 5186 3047

identical(sapply(randInts100, minimumAbsoluteDifference2),
          sapply(randInts100, getAbsMin))
[1] TRUE

options(scipen = 99)
head(sapply(randUnif100, getAbsMin))
[1] 0.00018277206 0.00020549633 0.00009834766 0.00008395873 0.00005299225 0.00009313226

identical(sapply(randUnif100, minimumAbsoluteDifference2),
          sapply(randUnif100, getAbsMin))
[1] TRUE

It's very fast as well:

library(microbenchmark)
microbenchmark(a = getAbsMin(randInts100[[50]]),
               b = minimumAbsoluteDifference2(randInts100[[50]]),
               times = 25, unit = "relative")
Unit: relative
expr      min       lq     mean   median       uq      max neval
   a   1.0000   1.0000   1.0000   1.0000  1.00000  1.00000    25
   b 117.9799 113.2221 105.5144 107.6901 98.55391 81.05468    25

Even for very large vectors, the result is instantaneous;

set.seed(321)
largeTest <- sample(-(1e12):(1e12), 1e6)

system.time(print(getAbsMin(largeTest)))
[1] 3
 user  system elapsed 
0.083   0.003   0.087