Algorithm to interleave clusters

Alain T. 2019-07-04 08:12

I think you will get the best spread by progressively inserting elements at bisecting positions. Applying this one cluster at a time, in an order that depends on the cluster sizes (see the end of this answer), should result in an optimal spread (or close to it):

First, you need a function that will give you the bisecting insertion points for m source elements in a list of N target elements (where N >= m). The function should start out with the widest possible spread of the first 3 insertions (first, last, middle) and then use bisection from the middle for the rest of the insertion points.

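def iPoints(N,m):
    # insertion indexes for m source elements spread over N target elements (N >= m)
    d = N//2
    result = [0,N,d]                 # widest spread first: first, last, middle
    if m==N: result[1] = N-1         # equal sizes: use N-1 instead of N for the last point
    while len(result)<m:
        d = max(1,d//2)              # halve the step on each pass (never below 1)
        for r in result[2:]:         # bisect around every point except the two ends
            for s in [-1,1]:
                p = r+s*d
                if p in result : continue
                result.append(p)
    result = sorted(result[:m])      # keep the first m points generated, in position order
    # each earlier insertion shifts the later ones one position to the right
    result = [ p + sum(p>r for r in result[:i]) for i,p in enumerate(result)]
    return result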
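
To see why the last step of iPoints adjusts the positions, here is a minimal sketch of that step on its own. The helper name shift_for_sequential_inserts is hypothetical, and it assumes the raw positions are distinct (in which case the adjustment reduces to adding each point's rank):

```python
def shift_for_sequential_inserts(points):
    # Assumes distinct positions: every earlier insertion pushes the
    # remaining target positions one step to the right.
    ordered = sorted(points)
    return [p + i for i, p in enumerate(ordered)]

target = list("AAAAAAAA")
raw = [0, 2, 4, 8]            # bisecting positions in the original 8-element list
for p in shift_for_sequential_inserts(raw):
    target.insert(p, "B")     # insert left to right using the shifted indexes

print("".join(target))        # BAABAABAAAAB
```

Without the shift, the later B's would land too far to the left, because each insert lengthens the list.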

Using iPoints, you can run through the list of clusters in order of their maximum spread (for this example, that means from largest to smallest) and perform the insertions:

clusterA  = ["A", "A", "A", "A", "A", "A", "A", "A"]
clusterB  = ["B", "B", "B", "B"]
clusterC  = ["C", "C"]

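clusters  = [clusterA,clusterB,clusterC]
totalSize = sum(map(len,clusters))
# maximum spread of a cluster: (elements of the other clusters) // (gaps in this cluster)
# max(1,...) avoids a division by zero for single-element clusters
# ascending when some cluster has a maximum spread of 0, descending otherwise
order     = -1 if all((totalSize-len(c))//max(1,len(c)-1) for c in clusters) else 1
clusters  = sorted(clusters,key=lambda c: order*(totalSize-len(c))//max(1,len(c)-1))
merged    = clusters[0].copy()   # copy so the input cluster is not modified
for cluster in clusters[1:]:
    target = cluster.copy()
    source = merged
    # always insert the smaller list into the larger one
    if len(source) > len(target):
        source,target = target,source
    indexes = iPoints(len(target),len(source))
    for c,p in zip(source,indexes):
        target.insert(p,c)
    merged  = target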

print(merged)

# ['C', 'B', 'A', 'A', 'B', 'A', 'A', 'B', 'A', 'A', 'A', 'A', 'B', 'C']

Analysis of this result shows that it is a little better for this set of clusters. Unfortunately it doesn't always give the optimal solution.

from statistics import mean
m = "".join(merged)
spreadA = [ d+1 for d in map(len,m.split("A")[1:-1])]
spreadB = [ d+1 for d in map(len,m.split("B")[1:-1])]
spreadC = [ d+1 for d in map(len,m.split("C")[1:-1])]
print("A",spreadA,round(mean(spreadA),1))
print("B",spreadB,round(mean(spreadB),1))
print("C",spreadC,round(mean(spreadC),1))
print("minimum spread",min(spreadA+spreadB+spreadC))
print("average spread", round(mean(spreadA+spreadB+spreadC), 1))

# A [1, 2, 1, 2, 1, 1, 1] 1.3
# B [3, 3, 5] 3.7
# C [13] 13
# minimum spread 1
# average spread 3
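
The split-based analysis above relies on single-character labels. As an alternative sketch (the spreads helper is a hypothetical name, not part of the original code), the same gaps can be computed from element positions, which should also work for longer labels:

```python
from collections import defaultdict
from statistics import mean

def spreads(sequence):
    # Record the positions of each label, then take the distance
    # between consecutive occurrences of the same label.
    positions = defaultdict(list)
    for i, item in enumerate(sequence):
        positions[item].append(i)
    return {item: [b - a for a, b in zip(idx, idx[1:])]
            for item, idx in positions.items()}

merged = list("CBAABAABAAAABC")   # the merged result shown above
gaps = spreads(merged)
for label in sorted(gaps):
    print(label, gaps[label], round(mean(gaps[label]), 1))
print("minimum spread", min(g for values in gaps.values() for g in values))

# A [1, 2, 1, 2, 1, 1, 1] 1.3
# B [3, 3, 5] 3.7
# C [13] 13
# minimum spread 1
```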

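Experimenting with other cluster sizes, I found that the order in which the clusters are processed matters. The order I used is based on the maximum spread of each cluster, i.e. (totalSize - size) // (size - 1). The clusters are sorted in ascending order of that value if at least one cluster is too large for the remaining elements to separate it (its maximum spread is 0), and in descending order otherwise.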

clusterA = ["A", "A", "A", "A", "A"]
clusterB = ["B", "B", "B", "B"]
clusterC = ["C", "C"]


# ['A', 'C', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'C', 'A']
# A [3, 2, 2, 3] 2.5
# B [2, 2, 2] 2
# C [8] 8
# minimum spread 2
# average spread 3
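
The ordering rule can be isolated to check which order a given set of sizes produces. This is only a sketch: processing_order and its dict input are hypothetical, and the max(1, ...) guard for single-element clusters is an assumption added here.

```python
def processing_order(sizes):
    # sizes: dict mapping a cluster label to its number of elements
    total = sum(sizes.values())

    def max_spread(n, sign=1):
        # elements of the other clusters divided by the gaps inside this one;
        # max(1, ...) is an assumed guard for clusters with a single element
        return sign * (total - n) // max(1, n - 1)

    # descending (sign = -1) only when every cluster can be separated at least once
    sign = -1 if all(max_spread(n) for n in sizes.values()) else 1
    return sorted(sizes, key=lambda label: max_spread(sizes[label], sign))

print(processing_order({"A": 8, "B": 4, "C": 2}))   # ['A', 'B', 'C']
print(processing_order({"A": 5, "B": 4, "C": 2}))   # ['C', 'B', 'A']
```

In the first case the 8-element cluster cannot be fully separated by the 6 remaining elements, so the largest cluster is processed first. In the second case every cluster can be separated, so the smallest goes first.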