python - Algorithm to interleave clusters
(translated from stackoverflow.com)

Tags: algorithm, python

Asked 2020-03-27 11:44:59

I need to interleave the items of several clusters into a one-dimensional array in a way that maximizes the spread between items of the same cluster.

Example:

clusterA = [A, A, A, A, A, A, A, A]
clusterB = [B, B, B, B]
clusterC = [C, C]

interleaved_example = [A, B, A, C, A, A, B, A, B, A, C, A, B, A]

The spread between A values is (2, 2, 1, 2, 2, 2, 2), mean of ~1.9

The spread between B values is (5, 2, 4), mean of ~3.7

The spread between C values is (7), mean of 7

Average spread = ~2.8, minimum spread = 1

Ideally, the minimum spread should be optimized first, then the average spread.
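This objective can be made concrete with a small scoring helper (a sketch of my own; the names `spreads` and `spread_score` are not from the question):

```python
from statistics import mean

def spreads(arrangement, label):
    """Distances between consecutive occurrences of `label`."""
    positions = [i for i, x in enumerate(arrangement) if x == label]
    return [b - a for a, b in zip(positions, positions[1:])]

def spread_score(arrangement):
    """(minimum spread, average spread) over all labels; compare lexicographically."""
    all_spreads = []
    for label in set(arrangement):
        all_spreads += spreads(arrangement, label)
    return min(all_spreads), mean(all_spreads)

example = ["A","B","A","C","A","A","B","A","B","A","C","A","B","A"]
mn, avg = spread_score(example)
print(mn, round(avg, 1))   # -> 1 2.8
```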

My first shot at an algorithm is to take the largest cluster and place it in an array X, then take the second-largest cluster and insert its values into X at linearly spaced positions (rounding as necessary), repeating for each subsequent smaller cluster. That's easily shown to be suboptimal, though decent on average.
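A rough sketch of that baseline (my own reconstruction of the linear-spacing idea, not the question's actual code):

```python
def linear_interleave(clusters):
    # Start with the largest cluster, then insert each smaller one
    # at linearly spaced (rounded) positions.
    clusters = sorted(clusters, key=len, reverse=True)
    result = list(clusters[0])
    for cluster in clusters[1:]:
        n, m = len(result), len(cluster)
        positions = [round(i * n / m) for i in range(m)]
        # Insert from the back so earlier positions are not shifted.
        for item, pos in sorted(zip(cluster, positions), key=lambda t: -t[1]):
            result.insert(pos, item)
    return result

merged = linear_interleave([["A"] * 8, ["B"] * 4, ["C"] * 2])
print(merged)
```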

I've struggled to model this as a convex optimization problem in hopes of using scipy.optimize.minimize on it.

I wonder if there are existing principled algorithms that achieve this.

Asked by David Parks · 156 views
Alain T. 2019-07-04 08:12

I think you will get the best spread by inserting progressively at bisecting positions. Applying this from the smallest to the largest set should result in an optimal spread (or close to it):

First, you need a function that gives the bisecting insertion points for m source elements into a list of N target elements (where N >= m). The function should start with the widest possible spread for the first three insertion points (first, last, middle), then use bisections of the intervals for the remaining insertion points.

def iPoints(N,m):
    # Bisecting insertion points for m source elements into a list of
    # N target elements (N >= m).
    d = N//2
    result = [0,N,d]                  # first, last and middle insertion points
    if m==N: result[1] = N-1
    while len(result)<m:
        d = max(1,d//2)               # halve the step at each refinement pass
        for r in result[2:]:
            for s in [-1,1]:          # bisect on both sides of each known point
                p = r+s*d
                if p in result : continue
                result.append(p)
    result = sorted(result[:m])
    # Shift each point up by the number of strictly smaller points before it,
    # compensating for the target list growing during successive insert() calls.
    result = [ p + sum(p>r for r in result[:i]) for i,p in enumerate(result)]
    return result

With this function, you can go through the list of clusters from largest to smallest and perform the insertions:

clusterA  = ["A", "A", "A", "A", "A", "A", "A", "A"]
clusterB  = ["B", "B", "B", "B"]
clusterC  = ["C", "C"]

clusters  = [clusterA,clusterB,clusterC]
totalSize = sum(map(len,clusters))
# Sort clusters by their maximum achievable spread: descending when every
# cluster's key is non-zero, ascending otherwise.
order     = -1 if all((totalSize-len(c))//(len(c)-1) for c in clusters) else 1
clusters  = sorted(clusters,key=lambda c: order*(totalSize-len(c))//(len(c)-1))
merged    = clusters[0]
for cluster in clusters[1:]:
    target = cluster.copy()
    source = merged
    if len(source) > len(target):     # always insert the smaller list into the larger
        source,target = target,source
    indexes = iPoints(len(target),len(source))
    for c,p in zip(source,indexes):
        target.insert(p,c)
    merged  = target

print(merged)

# ['C', 'B', 'A', 'A', 'B', 'A', 'A', 'B', 'A', 'A', 'A', 'A', 'B', 'C']

Analyzing this result shows that it is somewhat better for this set of clusters. Unfortunately, it does not always produce the optimal solution.

from statistics import mean
m = "".join(merged)
# Spread between consecutive occurrences = length of the gap between them + 1.
spreadA = [ d+1 for d in map(len,m.split("A")[1:-1])]
spreadB = [ d+1 for d in map(len,m.split("B")[1:-1])]
spreadC = [ d+1 for d in map(len,m.split("C")[1:-1])]
print("A",spreadA,round(mean(spreadA),1))
print("B",spreadB,round(mean(spreadB),1))
print("C",spreadC,round(mean(spreadC),1))
print("minimum spread",min(spreadA+spreadB+spreadC))
print("average spread", round(mean(spreadA+spreadB+spreadC), 1))

# A [1, 2, 1, 2, 1, 1, 1] 1.3
# B [3, 3, 5] 3.7
# C [13] 13
# minimum spread 1
# average spread 3.0

By trying other cluster sizes, I found that the order in which the clusters are processed matters. The order I used is based on the maximum spread of each cluster: ascending if at least one is greater than the rest, otherwise descending.
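The sort key used above, `(totalSize - len(c)) // (len(c) - 1)` (roughly the number of other elements that can fit in each gap of a cluster), can be inspected directly for both examples (a quick check of my own, assuming the same formula):

```python
def max_spreads(clusters):
    total = sum(map(len, clusters))
    # The ordering key for each cluster: other elements per gap.
    return [(total - len(c)) // (len(c) - 1) for c in clusters]

print(max_spreads([["A"] * 8, ["B"] * 4, ["C"] * 2]))  # -> [0, 3, 12]
print(max_spreads([["A"] * 5, ["B"] * 4, ["C"] * 2]))  # -> [1, 2, 9]
```

In the first example the A key is 0, so the `all(...)` test fails and the clusters are sorted ascending (largest cluster first); in the second example all keys are non-zero, so the order flips.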

clusterA = ["A", "A", "A", "A", "A"]
clusterB = ["B", "B", "B", "B"]
clusterC = ["C", "C"]


# ['A', 'C', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'C', 'A']
# A [3, 2, 2, 3] 2.5
# B [2, 2, 2] 2
# C [8] 8
# minimum spread 2
# average spread 3.0
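Since the heuristic is not guaranteed optimal, tiny instances can be checked by brute force (my own addition; exponential in the input size, so only practical for very small cases):

```python
from itertools import permutations
from statistics import mean

def score(seq):
    """(minimum spread, average spread); higher is lexicographically better."""
    gaps = []
    for label in set(seq):
        pos = [i for i, x in enumerate(seq) if x == label]
        gaps += [b - a for a, b in zip(pos, pos[1:])]
    return min(gaps), mean(gaps)

def best_interleaving(clusters):
    items = [x for c in clusters for x in c]
    # Deduplicate permutations of the multiset, then keep the best-scoring one.
    return max(set(permutations(items)), key=score)

print(best_interleaving([["A"] * 3, ["B"] * 2]))  # -> ('A', 'B', 'A', 'B', 'A')
```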