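I need to preprocess transcripts in two different file formats, SRT and WebVTT. My goal is to remove punctuation from the text lines but not from the timestamps. Because timestamps in WebVTT files contain full stops rather than commas (unlike SRT files), the preprocessing differs when it comes to removing full stops.
The full stops in the timestamps must stay unchanged, while those in the text lines should be removed.
The input file looks like this: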
00:00:09.761 --> 00:00:13.864
The Sahara Desert is one of the least hospitable climates on Earth.
00:00:14.340 --> 00:00:23.670
Its barren plateaus, rocky peaks and shifting sands envelop the northern third of Africa, which sees very little rain, vegetation and life.
00:00:24.440 --> 00:00:29.100
Meanwhile, across the Atlantic Ocean, thrives the world's largest rainforest.
Here is my code:
import re

class Prep:
    def __init__(self, transcript_filename):
        self.transcript_filename = transcript_filename
        self.transcript = self.read_file()

    def read_file(self):
        f = open(self.transcript_filename, "r")
        data = f.read()
        f.close()
        return data

    def preprocessing(self):
        # Remove noisy punctuation from the transcript.
        prep_transcript = self.transcript.replace("'", '')
        prep_transcript = prep_transcript.replace(';', '')
        prep_transcript = prep_transcript.replace('!', '')
        prep_transcript = prep_transcript.replace('?', '')
        prep_transcript = re.sub(r",\D\b", " ", prep_transcript,
                                 flags=re.MULTILINE)
        prep_transcript = re.sub(r",\n", "\n", prep_transcript,
                                 flags=re.MULTILINE)
        """Handle full stops differently in .vtt and .srt files to remove
        varyingly structured timestamps."""
        if self.transcript_filename.endswith(".vtt"):
            pattern = re.compile(r"\d{2}\.\d{3}")
            if pattern.search(prep_transcript):
                pass
            else:
                prep_transcript = prep_transcript.replace('.', '')
        elif self.transcript_filename.endswith(".srt"):
            prep_transcript = prep_transcript.replace('.', '')
        return prep_transcript

inst = Prep("sample_transcript.vtt")
print(inst.preprocessing())
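On SRT transcript files, the preprocessing steps above work fine. On WebVTT files, however, they only work for commas, question marks and so on. For whatever reason they do not work for full stops, which remain in the output: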
00:00:07.318 --> 00:00:15.654
The Sahara Desert is one of the least hospitable climates on Earth.
00:00:17.310 --> 00:00:25.679
Its barren plateaus rocky peaks and shifting sands envelop the northern third of Africa which sees very little rain vegetation and life.
00:00:26.440 --> 00:00:29.100
Meanwhile across the Atlantic Ocean thrives the worlds largest rainforest.
Instead, the output should look like this:
00:00:07.318 --> 00:00:15.654
The Sahara Desert is one of the least hospitable climates on Earth
00:00:17.310 --> 00:00:25.679
Its barren plateaus rocky peaks and shifting sands envelop the northern third of Africa which sees very little rain vegetation and life
00:00:26.440 --> 00:00:29.100
Meanwhile across the Atlantic Ocean thrives the worlds largest rainforest
Can anyone tell me what I am doing wrong? Thanks for your help and hints!
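The full stops stay because pattern.search runs against the whole transcript: as soon as a single timestamp is present it matches, the pass branch is taken, and the replace('.', '') in the else branch is never reached.
You can shorten the first four replace calls into a single re.sub that uses a character class.
To keep the dots in the timestamps, you can, for example, match a dot only when it is not directly followed by a digit.
Since every statement replaces the match with an empty string, you can combine them with an alternation (|).
The updated line could look like this: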
# Remove noisy punctuation from the transcript.
prep_transcript = re.sub(r"[';!?]|\.(?!\d)", '', self.transcript)
Output:
00:00:07.318 --> 00:00:15.654
The Sahara Desert is one of the least hospitable climates on Earth
00:00:17.310 --> 00:00:25.679
Its barren plateaus rocky peaks and shifting sands envelop the northern third of Africa which sees very little rain vegetation and life
00:00:26.440 --> 00:00:29.100
Meanwhile across the Atlantic Ocean thrives the worlds largest rainforest
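As a self-contained sketch of the pattern above, the following applies it to a short in-memory sample instead of a file. The helper name strip_punctuation and the sample string are made up for illustration. Note that the lookahead also keeps any dot in the text that is directly followed by a digit (for example in "3.5"), and that this sketch leaves commas in place.

```python
import re

# Pattern from the answer: drop ' ; ! ? anywhere, and drop a dot
# only when it is NOT directly followed by a digit.
PUNCT = re.compile(r"[';!?]|\.(?!\d)")

def strip_punctuation(transcript: str) -> str:
    """Hypothetical helper: remove noisy punctuation, keep timestamp dots."""
    return PUNCT.sub("", transcript)

sample = (
    "00:00:09.761 --> 00:00:13.864\n"
    "The Sahara Desert is one of the least hospitable climates on Earth.\n"
    "00:00:24.440 --> 00:00:29.100\n"
    "Meanwhile, across the Atlantic Ocean, thrives the world's largest rainforest.\n"
)

cleaned = strip_punctuation(sample)
print(cleaned)
```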
Nice. Just add ',' to the character class, changing
r"[';!?]|\.(?!\d)"
to
r"[',;!?]|\.(?!\d)"
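One caveat with putting ',' in the character class: it removes every comma, which is fine for WebVTT but would also strip the comma from SRT timestamps such as 00:00:09,761. A possible alternative, which is a different technique from the lookahead above, is to process the file line by line and leave timing lines alone. The sketch below assumes that a timing line is any line containing "-->"; the function name strip_text_punctuation and the sample strings are hypothetical. Unlike the lookahead, it also removes dots inside numbers in the text.

```python
import re

# Assumption: a cue timing line is any line that contains "-->".
# This holds for both SRT ("00:00:09,761 --> ...") and
# WebVTT ("00:00:09.761 --> ...") timing lines.
PUNCT = re.compile(r"[',;!?.]")

def strip_text_punctuation(transcript: str) -> str:
    """Strip punctuation from text lines; pass timing lines through unchanged."""
    out = []
    for line in transcript.split("\n"):
        out.append(line if "-->" in line else PUNCT.sub("", line))
    return "\n".join(out)

srt = "1\n00:00:09,761 --> 00:00:13,864\nIts barren plateaus, rocky peaks and sands.\n"
vtt = "00:00:09.761 --> 00:00:13.864\nIts barren plateaus, rocky peaks and sands.\n"

srt_clean = strip_text_punctuation(srt)
vtt_clean = strip_text_punctuation(vtt)
print(srt_clean)
print(vtt_clean)
```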