I need to preprocess transcripts in two different file formats, namely in SRT and WebVTT files. My goal is to remove punctuation marks from the text lines - but not from the timestamps. Because the timestamps in the WebVTT file include full stops instead of commas (as opposed to SRT files), the preprocessing differs in terms of removing the full stops.
The full stops within the timestamps have to remain untouched, whereas those in the text lines shall be removed.
The input file looks like this:
00:00:09.761 --> 00:00:13.864
The Sahara Desert is one of the least hospitable climates on Earth.
00:00:14.340 --> 00:00:23.670
Its barren plateaus, rocky peaks and shifting sands envelop the northern third of Africa, which sees very little rain, vegetation and life.
00:00:24.440 --> 00:00:29.100
Meanwhile, across the Atlantic Ocean, thrives the world's largest rainforest.
This is my code:
import re


class Prep:
    def __init__(self, transcript_filename):
        self.transcript_filename = transcript_filename
        self.transcript = self.read_file()

    def read_file(self):
        f = open(self.transcript_filename, "r")
        data = f.read()
        f.close()
        return data

    def preprocessing(self):
        # Remove noisy punctuation from the transcript.
        prep_transcript = self.transcript.replace("'", '')
        prep_transcript = prep_transcript.replace(';', '')
        prep_transcript = prep_transcript.replace('!', '')
        prep_transcript = prep_transcript.replace('?', '')
        prep_transcript = re.sub(r",\D\b", " ", prep_transcript,
                                 flags=re.MULTILINE)
        prep_transcript = re.sub(r",\n", "\n", prep_transcript,
                                 flags=re.MULTILINE)
        """Handle full stops differently in .vtt and .srt files to remove
        varyingly structured timestamps."""
        if self.transcript_filename.endswith(".vtt"):
            pattern = re.compile(r"\d{2}\.\d{3}")
            if pattern.search(prep_transcript):
                pass
            else:
                prep_transcript = prep_transcript.replace('.', '')
        elif self.transcript_filename.endswith(".srt"):
            prep_transcript = prep_transcript.replace('.', '')
        return prep_transcript


inst = Prep("sample_transcript.vtt")
print(inst.preprocessing())
On SRT transcript files, the above preprocessing steps work just fine. On WebVTT files, however, they only remove commas, question marks, etc. For whatever reason, the full stops still remain in the output:
00:00:07.318 --> 00:00:15.654
The Sahara Desert is one of the least hospitable climates on Earth.
00:00:17.310 --> 00:00:25.679
Its barren plateaus rocky peaks and shifting sands envelop the northern third of Africa which sees very little rain vegetation and life.
00:00:26.440 --> 00:00:29.100
Meanwhile across the Atlantic Ocean thrives the worlds largest rainforest.
Instead, the output should look like this:
00:00:07.318 --> 00:00:15.654
The Sahara Desert is one of the least hospitable climates on Earth
00:00:17.310 --> 00:00:25.679
Its barren plateaus rocky peaks and shifting sands envelop the northern third of Africa which sees very little rain vegetation and life
00:00:26.440 --> 00:00:29.100
Meanwhile across the Atlantic Ocean thrives the worlds largest rainforest
Can anyone tell me what I'm doing wrong? I am thankful for any help and tips!
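In your .vtt branch, pattern.search(prep_transcript) scans the whole transcript, so it finds a timestamp every time; the pass branch is always taken and the full stops are never removed.
You can shorten the first four replace statements under "Remove noisy punctuation from the transcript." into a single character class used with re.sub.
To keep the dots in the timestamps, you can, for example, match a dot only if it is not directly followed by a digit.
As all of these statements replace the match with an empty string, you can combine them with an alternation |.
The updated line could look like: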
# Remove noisy punctuation from the transcript.
prep_transcript = re.sub(r"[';!?]|\.(?!\d)", '', self.transcript)
Output
00:00:07.318 --> 00:00:15.654
The Sahara Desert is one of the least hospitable climates on Earth
00:00:17.310 --> 00:00:25.679
Its barren plateaus rocky peaks and shifting sands envelop the northern third of Africa which sees very little rain vegetation and life
00:00:26.440 --> 00:00:29.100
Meanwhile across the Atlantic Ocean thrives the worlds largest rainforest
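A minimal, self-contained sketch of this approach, run on two cues from the question's input. The function name remove_punctuation and the inline sample string are assumptions made for illustration; the two comma substitutions are kept from the question's code.

```python
import re

# Quotes, semicolons, '!' and '?' anywhere, plus any dot NOT followed by a digit.
PUNCT = re.compile(r"[';!?]|\.(?!\d)")


def remove_punctuation(transcript):
    """Strip punctuation from text lines while keeping the dots in timestamps."""
    text = PUNCT.sub("", transcript)
    # Comma handling kept from the question's code.
    text = re.sub(r",\D\b", " ", text)
    text = re.sub(r",\n", "\n", text)
    return text


# Hypothetical inline sample; the real code would read this from the .vtt file.
sample = (
    "00:00:09.761 --> 00:00:13.864\n"
    "The Sahara Desert is one of the least hospitable climates on Earth.\n"
    "00:00:24.440 --> 00:00:29.100\n"
    "Meanwhile, across the Atlantic Ocean, thrives the world's largest rainforest.\n"
)

print(remove_punctuation(sample))
```

Note that the lookahead also keeps a dot inside the text if a digit follows it directly (for example a decimal such as 3.5), which may or may not be what you want.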
Good. To remove the commas as well, just add ',' to the character class, changing
r"[';!?]|\.(?!\d)"
to r"[',;!?]|\.(?!\d)"
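One caveat with putting ',' in the character class: it works for .vtt files, but SRT timestamps use a comma before the milliseconds, so an unconditional comma match would also strip those. A possible variant, sketched here as an assumption rather than something from the answer, puts the comma under the same lookahead as the dot. The names PUNCT_BOTH, vtt_cue and srt_cue are illustrative, and srt_cue is simply the question's cue rewritten with SRT-style timestamps.

```python
import re

# Assumed variant: '.' and ',' are both removed only when no digit follows,
# so "09.761" (WebVTT) and "09,761" (SRT) are left intact.
PUNCT_BOTH = re.compile(r"[';!?]|[.,](?!\d)")

text_line = "Meanwhile, across the Atlantic Ocean, thrives the world's largest rainforest.\n"
vtt_cue = "00:00:24.440 --> 00:00:29.100\n" + text_line
srt_cue = "00:00:24,440 --> 00:00:29,100\n" + text_line

print(PUNCT_BOTH.sub("", vtt_cue))
print(PUNCT_BOTH.sub("", srt_cue))

# For comparison: the pattern with ',' inside the character class
# removes the comma from SRT timestamps as well.
print(re.sub(r"[',;!?]|\.(?!\d)", "", srt_cue))
```

With this variant the same substitution can be applied to both file types, so the branch on the file extension is no longer needed.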