Note: This article is reproduced from serverfault.com.

Removing only specific full stops fails unexpectedly in a text

Posted on 2020-11-28 07:38:07

I need to preprocess transcripts in two file formats, SRT and WebVTT. My goal is to remove punctuation marks from the text lines but not from the timestamps. Because WebVTT timestamps use full stops where SRT timestamps use commas, the full stops have to be handled differently in the two formats.

The full stops within the timestamps have to remain untouched, whereas those in the text lines shall be removed.

The input file looks like this:

00:00:09.761 --> 00:00:13.864
The Sahara Desert is one of the least hospitable climates on Earth.


00:00:14.340 --> 00:00:23.670
Its barren plateaus, rocky peaks and shifting sands envelop the northern third of Africa, which sees very little rain, vegetation and life.


00:00:24.440 --> 00:00:29.100
Meanwhile, across the Atlantic Ocean, thrives the world's largest rainforest.

This is my code:

import re

class Prep:
    def __init__(self, transcript_filename):
        self.transcript_filename = transcript_filename
        self.transcript = self.read_file()
    
    def read_file(self):
        f = open(self.transcript_filename, "r")
        data = f.read()
        f.close()
        
        return data
    
    def preprocessing(self):
        # Remove noisy punctuation from the transcript.
        prep_transcript = self.transcript.replace("'", '')
        prep_transcript = prep_transcript.replace(';', '')
        prep_transcript = prep_transcript.replace('!', '')
        prep_transcript = prep_transcript.replace('?', '')
        prep_transcript = re.sub(r",\D\b", " ", prep_transcript,
                                 flags=re.MULTILINE)
        prep_transcript = re.sub(r",\n", "\n", prep_transcript,
                                 flags=re.MULTILINE)
        """Handle full stops differently in .vtt and .srt files to remove
        varyingly structured timestamps."""
        if self.transcript_filename.endswith(".vtt"):
            pattern = re.compile(r"\d{2}\.\d{3}")
            if pattern.search(prep_transcript):
                pass
            else:
                prep_transcript = prep_transcript.replace('.', '')
        elif self.transcript_filename.endswith(".srt"):
            prep_transcript = prep_transcript.replace('.', '')

        return prep_transcript
    
inst = Prep("sample_transcript.vtt")
print(inst.preprocessing())

On SRT transcript files, the above preprocessing steps work just fine. On WebVTT files, however, they only work for commas, question marks, etc. For whatever reason, the full stops still remain in the output:

00:00:07.318 --> 00:00:15.654
The Sahara Desert is one of the least hospitable climates on Earth.


00:00:17.310 --> 00:00:25.679
Its barren plateaus rocky peaks and shifting sands envelop the northern third of Africa which sees very little rain vegetation and life.


00:00:26.440 --> 00:00:29.100
Meanwhile across the Atlantic Ocean thrives the worlds largest rainforest.

Instead, the output should look like this:

00:00:07.318 --> 00:00:15.654
The Sahara Desert is one of the least hospitable climates on Earth


00:00:17.310 --> 00:00:25.679
Its barren plateaus rocky peaks and shifting sands envelop the northern third of Africa which sees very little rain vegetation and life


00:00:26.440 --> 00:00:29.100
Meanwhile across the Atlantic Ocean thrives the worlds largest rainforest

Can anyone tell me what I'm doing wrong? I am thankful for any help and tips!

Questioner: MareikeP
Answer by The fourth bird (2020-11-28 19:48:42)

You can shorten the first four replace statements under the comment "Remove noisy punctuation from the transcript." into a single re.sub call that uses a character class.

To keep the dots in the timestamps, you can, for example, match a dot only if it is not directly followed by a digit.
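
As a quick illustration of that negative lookahead on its own, here is a minimal sketch. The two-line `sample` string is made up for the demonstration and is not taken from the question's file.

```python
import re

# Hypothetical sample: one WebVTT timestamp line and one text line.
sample = "00:00:09.761 --> 00:00:13.864\nThe Sahara Desert is hot.\n"

# \.(?!\d) matches a dot only when the next character is not a digit,
# so the dots inside "09.761" and "13.864" are left alone.
cleaned = re.sub(r"\.(?!\d)", "", sample)
print(cleaned)
```

The dot after "hot" is followed by a newline rather than a digit, so it is the only one removed.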

As all the statements replace the match with an empty string, you can use an alternation | to combine them.

The updated line could look like:

# Remove noisy punctuation from the transcript.
prep_transcript = re.sub(r"[';!?]|\.(?!\d)", '', self.transcript)

Output

00:00:07.318 --> 00:00:15.654
The Sahara Desert is one of the least hospitable climates on Earth


00:00:17.310 --> 00:00:25.679
Its barren plateaus rocky peaks and shifting sands envelop the northern third of Africa which sees very little rain vegetation and life


00:00:26.440 --> 00:00:29.100
Meanwhile across the Atlantic Ocean thrives the worlds largest rainforest
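
For context on why the question's code leaves the full stops in place: in the .vtt branch, pattern.search() is applied to the whole transcript at once. Any WebVTT file contains at least one timestamp such as 09.761, so the search always succeeds, the pass branch runs, and replace('.', '') is never reached. The following is a minimal, self-contained sketch of the suggested fix combined with the comma handling from the question. The function name `strip_punctuation` and the shortened `sample` string are assumptions made for this example, and the re.MULTILINE flag is dropped because none of the patterns use ^ or $.

```python
import re


def strip_punctuation(transcript: str) -> str:
    """Remove punctuation from text lines while keeping timestamp dots."""
    # Pattern from the answer: quotes, ;, !, ? and any dot that is
    # not directly followed by a digit.
    cleaned = re.sub(r"[';!?]|\.(?!\d)", "", transcript)
    # Comma handling carried over unchanged from the question's code.
    cleaned = re.sub(r",\D\b", " ", cleaned)
    cleaned = re.sub(r",\n", "\n", cleaned)
    return cleaned


# Shortened sample in the style of the question's input (assumed here).
sample = (
    "00:00:09.761 --> 00:00:13.864\n"
    "The Sahara Desert is one of the least hospitable climates on Earth.\n"
    "\n"
    "00:00:24.440 --> 00:00:29.100\n"
    "Meanwhile, across the Atlantic Ocean, thrives the world's largest rainforest.\n"
)

result = strip_punctuation(sample)
print(result)
```

With this approach the separate .vtt and .srt branches should no longer be needed, since SRT timestamps contain no dots. One caveat: the lookahead also keeps dots inside numbers in the text lines, such as "3.5", which may or may not be what you want.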