algorithm python

python - Algorithm for interleaving clusters

```
clusterA = [A, A, A, A, A, A, A, A]
clusterB = [B, B, B, B]
clusterC = [C, C]

interleaved_example = [A, B, A, C, A, A, B, A, B, A, C, A, B, A]
```

The spread between A values is `(2, 2, 1, 2, 2, 2, 2)`, mean ~1.9.

The spread between B values is `(5, 2, 4)`, mean ~3.7.

The spread between C values is `(7)`, mean 7.
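For reference, those spreads can be computed mechanically; the small helper below is mine, not part of the question:

```python
def spreads(seq, ch):
    # Distances between consecutive occurrences of ch in seq.
    idx = [i for i, c in enumerate(seq) if c == ch]
    return [b - a for a, b in zip(idx, idx[1:])]

example = list("ABACAABABACABA")
print(spreads(example, "A"))  # [2, 2, 1, 2, 2, 2, 2]
print(spreads(example, "B"))  # [5, 2, 4]
print(spreads(example, "C"))  # [7]
```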

My first attempt at an algorithm is to take the largest cluster and place it in an array `X`, then insert the second-largest cluster into `X` at linearly spaced positions (rounding as necessary), repeating for each subsequently smaller cluster. That is easily shown to be suboptimal, though decent on average.
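A minimal sketch of that baseline (my own naming, assuming cluster items are plain strings):

```python
def linear_interleave(clusters):
    # Naive baseline: start from the largest cluster, then insert each
    # smaller cluster at (rounded) linearly spaced positions.
    clusters = sorted(clusters, key=len, reverse=True)
    X = list(clusters[0])
    for cluster in clusters[1:]:
        n = len(cluster)
        positions = [round(i * len(X) / n) for i in range(n)]
        for offset, (pos, item) in enumerate(zip(positions, cluster)):
            X.insert(pos + offset, item)  # +offset: earlier inserts shift later ones
    return X

print(linear_interleave([["A"] * 8, ["B"] * 4, ["C"] * 2]))
```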

I've struggled to model this as a convex optimization problem in hopes of using `scipy.optimize.minimize` on it.

I wonder if there are existing principled algorithms that achieve this.

David Parks

Alain T. 2019-07-04 08:12

I think you will get the best spread by inserting progressively at bisecting positions. Applying this from the smallest to the largest set should result in an optimal spread (or close to it):

```python
def iPoints(N, m):
    # Returns the m successive insertion positions that spread m items
    # as evenly as possible into a sequence of length N.
    d = N // 2
    result = [0, N, d]
    if m == N: result[1] = N - 1
    while len(result) < m:
        d = max(1, d // 2)
        for r in result[2:]:
            for s in [-1, 1]:
                p = r + s * d
                if p in result: continue
                result.append(p)
    result = sorted(result[:m])
    # Shift each position to account for the items inserted before it.
    result = [p + sum(p > r for r in result[:i]) for i, p in enumerate(result)]
    return result

# e.g. iPoints(8, 4) -> [0, 3, 6, 11]: insert 4 items into a length-8
# sequence at these successive positions for an even final mix
```

```python
clusterA  = ["A", "A", "A", "A", "A", "A", "A", "A"]
clusterB  = ["B", "B", "B", "B"]
clusterC  = ["C", "C"]

clusters  = [clusterA, clusterB, clusterC]
totalSize = sum(map(len, clusters))
order     = -1 if all((totalSize - len(c)) // (len(c) - 1) for c in clusters) else 1
clusters  = sorted(clusters, key=lambda c: order * (totalSize - len(c)) // (len(c) - 1))
merged    = clusters[0]
for cluster in clusters[1:]:
    target = cluster.copy()
    source = merged
    if len(source) > len(target):
        source, target = target, source
    indexes = iPoints(len(target), len(source))
    for c, p in zip(source, indexes):
        target.insert(p, c)
    merged = target

print(merged)

# ['C', 'B', 'A', 'A', 'B', 'A', 'A', 'B', 'A', 'A', 'A', 'A', 'B', 'C']
```

```python
from statistics import mean

m = "".join(merged)
spreadA = [d + 1 for d in map(len, m.split("A")[1:-1])]
spreadB = [d + 1 for d in map(len, m.split("B")[1:-1])]
spreadC = [d + 1 for d in map(len, m.split("C")[1:-1])]

print("A", spreadA, round(mean(spreadA), 1))
print("B", spreadB, round(mean(spreadB), 1))
print("C", spreadC, round(mean(spreadC), 1))

# A [1, 2, 1, 2, 1, 1, 1] 1.3
# B [3, 3, 5] 3.7
# C [13] 13
# minimum spread 1
# average spread 3
```
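The "minimum spread" and "average spread" figures quoted above can be reproduced by pooling the per-symbol spreads; a self-contained sketch (the `merged` list is copied from the printed output above):

```python
from statistics import mean

# Output of the merge above, repeated here so this snippet runs on its own.
merged = ['C', 'B', 'A', 'A', 'B', 'A', 'A', 'B', 'A', 'A', 'A', 'A', 'B', 'C']
m = "".join(merged)

# Gap length + 1 between consecutive occurrences of each symbol.
spreads = {s: [d + 1 for d in map(len, m.split(s)[1:-1])] for s in "ABC"}
pooled = [v for vals in spreads.values() for v in vals]

print("minimum spread", min(pooled))   # minimum spread 1
print("average spread", mean(pooled))  # average spread 3
```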

```python
clusterA = ["A", "A", "A", "A", "A"]
clusterB = ["B", "B", "B", "B"]
clusterC = ["C", "C"]

# ['A', 'C', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'C', 'A']
# A [3, 2, 2, 3] 2.5
# B [2, 2, 2] 2
# C [8] 8
# minimum spread 2
# average spread 3
```