Warm tip: This article is reproduced from stackoverflow.com, please click
algorithm python

Algorithm to interleave clusters

发布于 2020-03-27 10:27:38

I need to interleave clusters of items onto a 1D array in way that maximizes the spread between items of the same cluster.

Example:

clusterA = [A, A, A, A, A, A, A, A]
clusterB = [B, B, B, B]
clusterC = [C, C]

interleaved_example = [A, B, A, C, A, A, B, A, B, A, C, A, B, A]

The spread between A values is (2, 2, 1, 2, 2, 2, 2), mean of ~1.9

The spread between B values is (5, 2, 4), mean of ~3.7

The spread between C values is (7) = mean of 7

Mean spread is =~ 2.8, minimum spread = 1

Ideally the minimum spread would be optimized first with the mean spread being optimized second.

My first shot at an algorithm is to take the largest cluster and place it in an array, X, then take the second largest cluster and insert the values into X at positions linearly spaced (rounding as necessary), then repeating for each subsequent smaller cluster, but that's easily demonstrated to be sub optimal, though decent on average.

I've struggled to model this as a convex optimization problem in hopes of using scipy.optimize.minimize on it.

I wonder if there are existing principled algorithms that achieve this.

Questioner
David Parks
Viewed
99
Alain T. 2019-07-04 08:12

I think you will get the best spread by inserting progressively at bisecting positions. Applying this from the smallest to the largest set should result in an optimal spread (or close to it):

First, you need a function that will give you the bisecting insertion points for m source elements in a list of N target elements (where N >= m). The function should start out with the widest possible spread of the first 3 insertions (first, last, middle) and then use bisection from the middle for the rest of the insertion points.

def iPoints(N,m):
    d = N//2
    result = [0,N,d]
    if m==N: result[1] = N-1
    while len(result)<m:
        d = max(1,d//2)
        for r in result[2:]:
            for s in [-1,1]:
                p = r+s*d
                if p in result : continue
                result.append(p)
    result = sorted(result[:m])
    result = [ p + sum(p>r for r in result[:i]) for i,p in enumerate(result)]
    return result

Using this, you can run through the list of clusters, from largest to smallest and perform the insertions:

clusterA  = ["A", "A", "A", "A", "A", "A", "A", "A"]
clusterB  = ["B", "B", "B", "B"]
clusterC  = ["C", "C"]

clusters  = [clusterA,clusterB,clusterC]
totalSize = sum(map(len,clusters))
order     = -1 if all((totalSize-len(c))//(len(c)-1) for c in clusters) else 1
clusters  = sorted(clusters,key=lambda c: order*(totalSize-len(c))//(len(c)-1))
merged    = clusters[0]
for cluster in clusters[1:]:
    target = cluster.copy()
    source = merged
    if len(source) > len(target):
        source,target = target,source
    indexes = iPoints(len(target),len(source))
    for c,p in zip(source,indexes):
        target.insert(p,c)
    merged  = target

print(merged)

# ['C', 'B', 'A', 'A', 'B', 'A', 'A', 'B', 'A', 'A', 'A', 'A', 'B', 'C']

Analysis of this result shows that it is a little better for this set of clusters. Unfortunately it doesn't always give the optimal solution.

from statistics import mean
m = "".join(merged)
spreadA = [ d+1 for d in map(len,m.split("A")[1:-1])]
spreadB = [ d+1 for d in map(len,m.split("B")[1:-1])]
spreadC = [ d+1 for d in map(len,m.split("C")[1:-1])]
print("A",spreadA,mean(spreadA))
print("B",spreadB,mean(spreadB))
print("C",spreadC,mean(spreadC))
print("minimum spread",min(spreadA+spreadB+spreadC))
print("average spread", round(mean(spreadA+spreadB+spreadC), 1))

# A [1, 2, 1, 2, 1, 1, 1] 1.3
# B [3, 3, 5] 3.7
# C [13] 13
# minimum spread 1
# average spread 3

Experimenting with other cluster sizes, I found that the order of cluster processing matters. The order I used is based on the maximum spread of each cluster. Ascending if at least one is larger than the rest, descending otherwise.

clusterA = ["A", "A", "A", "A", "A"]
clusterB = ["B", "B", "B", "B"]
clusterC = ["C", "C"]


# ['A', 'C', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'C', 'A']
# A [3, 2, 2, 3] 2.5
# B [2, 2, 2] 2
# C [8] 8
# minimum spread 2
# average spread 3