Warm tip: This article is reproduced from serverfault.com, please click

Best parse tree given spans using maximum sum

发布于 2020-11-28 15:47:01

I want to find a way in which we can score each tree by summing the scores of its spans and choose the highest scoring binary tree that maximizes the sum by backtracking.

Assume, I have the following sentence:

string = "Revenue was 19.9 million"

The spans array represents the scores of span at the corresponding (i, j) position. Here: j=i+1

For example,

score(0, 1) = -100.0 corresponds to the span "Revenue";

score(0, 4) = 100.0 corresponds to the span "Revenue was 19.9 million";

score(1, 4) = 100.0 corresponds to the span "was 19.9 million";

And so on ...

                                               #  1 2 3 4
spans = [[-100.0, -100.0, -100.0,  100.0],     # 0
         [ 0,     -100.0, -100.0,  100.0],     # 1
         [ 0,       0 ,   -100.0,  100.0],     # 2
         [ 0,       0,      0,    -100.0]]     # 3
         

The expected tree is - (Revenue (was (19.9 million))) corresponding to the spans (0, 4), (1, 4), (2, 4).

Although, I am getting - (Revenue was 19.9 million) with the following code.

I tried to do:

class MaxSpanCalculator:

    def __init__(self, s, spans):
        self.sentence = s.split(" ")
        self.spans = spans
        self.splits = []

    def get_max_span_sentence(self):
        n = len(self.sentence)
        # Initialize the splits array
        for i in range(n):
            s = []
            for j in range(n):
                s.append(-1)
            self.splits.append(s)

        # Here Using spans array itself as dp array
        for l in range(1, n+1):
            i = 0
            j = i + l - 1
            while j<n:
                if i!=j:
                    for p in range(i,j):
                        if self.spans[i][j] < self.spans[i][p] + self.spans[p+1][j]:
                            self.splits[i][j] = p
                            self.spans[i][j] = self.spans[i][p] + self.spans[p+1][j]
                i = i + 1
                j = j + 1

        print("Max span value: " + str(self.spans[0][n-1]))
        print("Max span sentence is : " + self.get_sentence(0, n-1))

    def get_sentence(self, start, end):
        # print(str(start) + " " + str(end))
        if start==end: return self.sentence[start]
        if self.splits[start][end] == -1:
            return " ".join(x for x in self.sentence[start:end+1])
        return "(" + self.get_sentence(start, self.splits[start][end]) + ")" + "(" + self.get_sentence(self.splits[start][end]+1, end) + ")"
Questioner
nikinlpds
Viewed
0
btilly 2020-11-29 08:19:39

This is a dynamic programming problem.

First of all, you want to turn your text string into an array of words. Because that is what your cost is.

Second, you need to calculate two data structures.

  1. aggregate_cost[(start, end)] is the total cost of the best parse of the interval from start to end.
  2. best_parse[(start, end)] is the best parse of the interval from start to end.

As with all dynamic programming problems there is a top down and a bottom up way to do it. Either way the key insight is that the aggregate cost is the largest of:

cost(start, end),
cost(start, end) + aggregate_cost[(start, i)],
cost(start, end) + aggregate_cost[(i+1, end)],
cost(start, end) + aggregate_cost[(start, i)] + aggregate_cost[(i+1, end)]

where i can be anywhere between start and end. And the best_parse[(start, end)] is determined by which of these you pick (and, of course, best_parse of the subintervals whose aggregate cost were chosen).