
python - Removing only specific full stops fails unexpectedly in a text

Posted on 2020-11-28 07:38:07

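I need to preprocess transcripts in two different file formats, namely SRT and WebVTT files. My goal is to remove punctuation from the text lines, but not from the timestamps. Because the timestamps in WebVTT files contain full stops rather than commas (as opposed to SRT files), the preprocessing differs when it comes to removing full stops.

The full stops in the timestamps must stay untouched, while the full stops in the text lines should be removed.

The input file looks like this: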

00:00:09.761 --> 00:00:13.864
The Sahara Desert is one of the least hospitable climates on Earth.


00:00:14.340 --> 00:00:23.670
Its barren plateaus, rocky peaks and shifting sands envelop the northern third of Africa, which sees very little rain, vegetation and life.


00:00:24.440 --> 00:00:29.100
Meanwhile, across the Atlantic Ocean, thrives the world's largest rainforest.

This is my code:

import re

class Prep:
    def __init__(self, transcript_filename):
        self.transcript_filename = transcript_filename
        self.transcript = self.read_file()
    
    def read_file(self):
        f = open(self.transcript_filename, "r")
        data = f.read()
        f.close()
        
        return data
    
    def preprocessing(self):
        # Remove noisy punctuation from the transcript.
        prep_transcript = self.transcript.replace("'", '')
        prep_transcript = prep_transcript.replace(';', '')
        prep_transcript = prep_transcript.replace('!', '')
        prep_transcript = prep_transcript.replace('?', '')
        prep_transcript = re.sub(r",\D\b", " ", prep_transcript,
                                 flags=re.MULTILINE)
        prep_transcript = re.sub(r",\n", "\n", prep_transcript,
                                 flags=re.MULTILINE)
        """Handle full stops differently in .vtt and .srt files to remove
        varyingly structured timestamps."""
        if self.transcript_filename.endswith(".vtt"):
            pattern = re.compile(r"\d{2}\.\d{3}")
            if pattern.search(prep_transcript):
                pass
            else:
                prep_transcript = prep_transcript.replace('.', '')
        elif self.transcript_filename.endswith(".srt"):
            prep_transcript = prep_transcript.replace('.', '')

        return prep_transcript
    
inst = Prep("sample_transcript.vtt")
print(inst.preprocessing())

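On SRT transcript files, the preprocessing steps above work as expected. On WebVTT files, however, they only work for commas, question marks and so on. For whatever reason they do not work for full stops, which are still present in the output: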

00:00:07.318 --> 00:00:15.654
The Sahara Desert is one of the least hospitable climates on Earth.


00:00:17.310 --> 00:00:25.679
Its barren plateaus rocky peaks and shifting sands envelop the northern third of Africa which sees very little rain vegetation and life.


00:00:26.440 --> 00:00:29.100
Meanwhile across the Atlantic Ocean thrives the worlds largest rainforest.

Instead, the output should look like this:

00:00:07.318 --> 00:00:15.654
The Sahara Desert is one of the least hospitable climates on Earth


00:00:17.310 --> 00:00:25.679
Its barren plateaus rocky peaks and shifting sands envelop the northern third of Africa which sees very little rain vegetation and life


00:00:26.440 --> 00:00:29.100
Meanwhile across the Atlantic Ocean thrives the worlds largest rainforest

Can anyone tell me what I am doing wrong? Thanks for your help and hints!

Questioner
MareikeP
The fourth bird 2020-11-28 19:48:42

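You can shorten the first four replace statements under the comment "# Remove noisy punctuation from the transcript." by using re.sub with a single character class.

To keep the dots in the timestamps, you could, for example, match a dot only when it is not directly followed by a digit.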
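As a small illustration of that idea, the sketch below applies only the negative lookahead \.(?!\d) to a few strings. The helper name drop_sentence_dots and the sample strings are made up for this example. Note that a decimal number inside a text line, such as 3.5, would keep its dot as well.

```python
import re

# Matches a dot only when the next character is not a digit.
DOT_NOT_BEFORE_DIGIT = re.compile(r"\.(?!\d)")


def drop_sentence_dots(text):
    """Remove every dot that is not directly followed by a digit (hypothetical helper)."""
    return DOT_NOT_BEFORE_DIGIT.sub("", text)


print(drop_sentence_dots("00:00:09.761 --> 00:00:13.864"))  # timestamp dots are kept
print(drop_sentence_dots("on Earth."))                      # trailing dot is removed
print(drop_sentence_dots("about 3.5 million"))              # decimal dot is kept too
```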

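Since all of these statements replace the match with an empty string, you can combine them into one pattern using an alternation |.

The updated line could look like this: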

# Remove noisy punctuation from the transcript.
prep_transcript = re.sub(r"[';!?]|\.(?!\d)", '', self.transcript)

Output

00:00:07.318 --> 00:00:15.654
The Sahara Desert is one of the least hospitable climates on Earth


00:00:17.310 --> 00:00:25.679
Its barren plateaus rocky peaks and shifting sands envelop the northern third of Africa which sees very little rain vegetation and life


00:00:26.440 --> 00:00:29.100
Meanwhile across the Atlantic Ocean thrives the worlds largest rainforest
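As background on why the original .vtt branch left every full stop in place: pattern.search is run against the whole transcript, and a WebVTT transcript always contains at least one timestamp, so the pass branch is taken and the replace call is never reached. The sketch below shows that check on a shortened sample, followed by a different technique from the one in the answer: going line by line and leaving untouched any line that contains the "-->" separator. The names SAMPLE_VTT and strip_text_lines are hypothetical, and the sketch assumes that all commas in text lines may be dropped, which is slightly broader than the comma handling in the question.

```python
import re

# Shortened sample in the WebVTT layout shown in the question.
SAMPLE_VTT = (
    "00:00:09.761 --> 00:00:13.864\n"
    "The Sahara Desert is one of the least hospitable climates on Earth.\n"
    "\n"
    "00:00:24.440 --> 00:00:29.100\n"
    "Meanwhile, across the Atlantic Ocean, thrives the world's largest rainforest.\n"
)

# The check from the question: it searches the whole transcript, so a single
# timestamp anywhere is enough to make it match and skip the replace call.
whole_text_has_timestamp = re.search(r"\d{2}\.\d{3}", SAMPLE_VTT) is not None
print(whole_text_has_timestamp)


def strip_text_lines(transcript):
    """Remove punctuation from text lines only; timing lines are kept as they are.

    Assumption: a timing line is any line containing "-->", which holds for
    both the SRT and the WebVTT layout.
    """
    cleaned_lines = []
    for line in transcript.split("\n"):
        if "-->" in line:
            cleaned_lines.append(line)  # timestamp line, keep untouched
        else:
            cleaned_lines.append(re.sub(r"[.,;!?']", "", line))
    return "\n".join(cleaned_lines)


print(strip_text_lines(SAMPLE_VTT))
```

Because the decision is made per line, this variant needs no separate branch for .srt and .vtt files, at the cost of also stripping dots and commas from numbers that appear inside text lines.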