python - Removing only specific full stops fails unexpectedly in a text

Posted on 2020-11-28 07:38:07

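I need to preprocess transcripts in two different file formats, SRT and WebVTT. My goal is to remove punctuation from the text lines but not from the timestamps. Because timestamps in WebVTT files contain full stops instead of commas (unlike SRT files), the preprocessing has to handle full stops differently for the two formats.

The full stops in the timestamps must remain untouched, whereas the full stops in the text lines should be removed.

The input file looks like this: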

00:00:09.761 --> 00:00:13.864
The Sahara Desert is one of the least hospitable climates on Earth.


00:00:14.340 --> 00:00:23.670
Its barren plateaus, rocky peaks and shifting sands envelop the northern third of Africa, which sees very little rain, vegetation and life.


00:00:24.440 --> 00:00:29.100
Meanwhile, across the Atlantic Ocean, thrives the world's largest rainforest.

Here is my code:

import re

class Prep:
    def __init__(self, transcript_filename):
        self.transcript_filename = transcript_filename
        self.transcript = self.read_file()
    
    def read_file(self):
        # Read the whole transcript; the context manager closes the file.
        with open(self.transcript_filename, "r") as f:
            return f.read()
    
    def preprocessing(self):
        # Remove noisy punctuation from the transcript.
        prep_transcript = self.transcript.replace("'", '')
        prep_transcript = prep_transcript.replace(';', '')
        prep_transcript = prep_transcript.replace('!', '')
        prep_transcript = prep_transcript.replace('?', '')
        prep_transcript = re.sub(r",\D\b", " ", prep_transcript,
                                 flags=re.MULTILINE)
        prep_transcript = re.sub(r",\n", "\n", prep_transcript,
                                 flags=re.MULTILINE)
        # Handle full stops differently in .vtt and .srt files, because
        # their timestamps are structured differently.
        if self.transcript_filename.endswith(".vtt"):
            pattern = re.compile(r"\d{2}\.\d{3}")
            if pattern.search(prep_transcript):
                pass
            else:
                prep_transcript = prep_transcript.replace('.', '')
        elif self.transcript_filename.endswith(".srt"):
            prep_transcript = prep_transcript.replace('.', '')

        return prep_transcript
    
inst = Prep("sample_transcript.vtt")
print(inst.preprocessing())

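On SRT transcript files, the preprocessing steps above work as expected. For WebVTT files, however, they only work for commas, question marks and so on. For whatever reason they do not work for full stops, which still remain in the output: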

00:00:07.318 --> 00:00:15.654
The Sahara Desert is one of the least hospitable climates on Earth.


00:00:17.310 --> 00:00:25.679
Its barren plateaus rocky peaks and shifting sands envelop the northern third of Africa which sees very little rain vegetation and life.


00:00:26.440 --> 00:00:29.100
Meanwhile across the Atlantic Ocean thrives the worlds largest rainforest.

Instead, the output should look like this:

00:00:07.318 --> 00:00:15.654
The Sahara Desert is one of the least hospitable climates on Earth


00:00:17.310 --> 00:00:25.679
Its barren plateaus rocky peaks and shifting sands envelop the northern third of Africa which sees very little rain vegetation and life


00:00:26.440 --> 00:00:29.100
Meanwhile across the Atlantic Ocean thrives the worlds largest rainforest

Can anyone tell me what I am doing wrong? Thanks for your help and hints!
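A likely cause, offered as a sketch rather than a definitive answer: in the `.vtt` branch, `pattern.search(prep_transcript)` is run against the whole transcript at once. Every WebVTT file contains timestamps, so the search always finds a match, the `pass` branch is taken, and no full stop is ever removed. One way around this is to decide line by line. The helper name `strip_full_stops` and the pattern `TIMESTAMP_LINE` below are hypothetical and not part of the original `Prep` class; the pattern assumes cue timing lines of the form `hh:mm:ss.ttt --> hh:mm:ss.ttt`, with the hours part optional.

```python
import re

# Hypothetical pattern (an assumption, not from the original code): matches a
# cue timing line such as "00:00:09.761 --> 00:00:13.864". The hours part is
# optional, and "," is accepted too so SRT timestamps are also recognised.
TIMESTAMP_LINE = re.compile(
    r"(?:\d{2,}:)?\d{2}:\d{2}[.,]\d{3}\s+-->\s+(?:\d{2,}:)?\d{2}:\d{2}[.,]\d{3}"
)


def strip_full_stops(transcript):
    """Remove '.' from every line except timestamp lines."""
    cleaned = []
    for line in transcript.split("\n"):
        if TIMESTAMP_LINE.search(line):
            # Timestamp line: keep it exactly as it is.
            cleaned.append(line)
        else:
            # Text line: drop all full stops.
            cleaned.append(line.replace(".", ""))
    return "\n".join(cleaned)


sample = (
    "00:00:09.761 --> 00:00:13.864\n"
    "The Sahara Desert is one of the least hospitable climates on Earth.\n"
)
print(strip_full_stops(sample))
```

Under these assumptions, the whole `if`/`elif` on the file extension in `preprocessing` could be replaced by a single call such as `prep_transcript = strip_full_stops(prep_transcript)`, since SRT timestamp lines contain no full stops and pass through unchanged anyway.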

Questioner

MareikeP
