Removing only specific full stops fails unexpectedly in a text

Posted on 2020-11-28 07:38:07

I need to preprocess transcripts in two different file formats, SRT and WebVTT. My goal is to remove punctuation marks from the text lines, but not from the timestamps. Because WebVTT timestamps use full stops where SRT timestamps use commas, the two formats need different handling when removing full stops.

The full stops within the timestamps have to remain untouched, whereas those in the text lines shall be removed.

The input file looks like this:

00:00:09.761 --> 00:00:13.864
The Sahara Desert is one of the least hospitable climates on Earth.


00:00:14.340 --> 00:00:23.670
Its barren plateaus, rocky peaks and shifting sands envelop the northern third of Africa, which sees very little rain, vegetation and life.


00:00:24.440 --> 00:00:29.100
Meanwhile, across the Atlantic Ocean, thrives the world's largest rainforest.

This is my code:

import re

class Prep:
    def __init__(self, transcript_filename):
        self.transcript_filename = transcript_filename
        self.transcript = self.read_file()
    
    def read_file(self):
        # The context manager closes the file even if reading fails.
        with open(self.transcript_filename, "r") as f:
            return f.read()
    
    def preprocessing(self):
        # Remove noisy punctuation from the transcript.
        prep_transcript = self.transcript.replace("'", '')
        prep_transcript = prep_transcript.replace(';', '')
        prep_transcript = prep_transcript.replace('!', '')
        prep_transcript = prep_transcript.replace('?', '')
        prep_transcript = re.sub(r",\D\b", " ", prep_transcript,
                                 flags=re.MULTILINE)
        prep_transcript = re.sub(r",\n", "\n", prep_transcript,
                                 flags=re.MULTILINE)
        # Handle full stops differently in .vtt and .srt files, because
        # their timestamps are structured differently.
        if self.transcript_filename.endswith(".vtt"):
            pattern = re.compile(r"\d{2}\.\d{3}")
            if pattern.search(prep_transcript):
                pass
            else:
                prep_transcript = prep_transcript.replace('.', '')
        elif self.transcript_filename.endswith(".srt"):
            prep_transcript = prep_transcript.replace('.', '')

        return prep_transcript
    
inst = Prep("sample_transcript.vtt")
print(inst.preprocessing())

On SRT transcript files, the above preprocessing steps work just fine. On WebVTT files, however, they only work for commas, question marks, etc. For some reason, the full stops still remain in the output:

00:00:07.318 --> 00:00:15.654
The Sahara Desert is one of the least hospitable climates on Earth.


00:00:17.310 --> 00:00:25.679
Its barren plateaus rocky peaks and shifting sands envelop the northern third of Africa which sees very little rain vegetation and life.


00:00:26.440 --> 00:00:29.100
Meanwhile across the Atlantic Ocean thrives the worlds largest rainforest.
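
A likely explanation for the output above, offered as a minimal sketch rather than a definitive diagnosis: `pattern.search(prep_transcript)` is applied to the whole transcript at once. Any WebVTT transcript that contains a timestamp makes the search succeed, so the `pass` branch runs and the `else` branch with `replace('.', '')` is never reached. The `sample` string below is an illustrative stand-in for the file contents.

```python
import re

# Illustrative stand-in for the contents of a .vtt file (assumption).
sample = (
    "00:00:09.761 --> 00:00:13.864\n"
    "The Sahara Desert is one of the least hospitable climates on Earth.\n"
)

pattern = re.compile(r"\d{2}\.\d{3}")

# The search runs over the entire string, not line by line, so it finds
# the timestamp and is truthy for any transcript that contains one.
whole_text_matches = pattern.search(sample) is not None

# Checked per line, only the timestamp line matches.
per_line_matches = [pattern.search(line) is not None
                    for line in sample.splitlines()]

print(whole_text_matches)  # True
print(per_line_matches)    # [True, False]
```

This suggests that the decision to keep or remove full stops has to be made per line, not once for the whole file.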

Instead, the output should look like this:

00:00:07.318 --> 00:00:15.654
The Sahara Desert is one of the least hospitable climates on Earth


00:00:17.310 --> 00:00:25.679
Its barren plateaus rocky peaks and shifting sands envelop the northern third of Africa which sees very little rain vegetation and life


00:00:26.440 --> 00:00:29.100
Meanwhile across the Atlantic Ocean thrives the worlds largest rainforest
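
One way to arrive at the output above, as a hedged sketch: process the transcript line by line and strip full stops only from lines that are not cue timings. The helper name `remove_full_stops` is hypothetical, and treating every line that contains `-->` as a timing line is an assumption that holds for the sample shown here.

```python
def remove_full_stops(transcript: str) -> str:
    """Remove '.' from text lines, leaving cue timing lines untouched."""
    cleaned = []
    for line in transcript.split("\n"):
        if "-->" in line:
            # Assumed to be a timing line such as
            # "00:00:09.761 --> 00:00:13.864"; keep it as-is.
            cleaned.append(line)
        else:
            cleaned.append(line.replace(".", ""))
    return "\n".join(cleaned)


# Illustrative stand-in for the contents of a .vtt file (assumption).
sample = (
    "00:00:09.761 --> 00:00:13.864\n"
    "The Sahara Desert is one of the least hospitable climates on Earth.\n"
)
print(remove_full_stops(sample))
```

In `Prep.preprocessing`, a call such as `prep_transcript = remove_full_stops(prep_transcript)` could stand in for the `.vtt` branch. SRT timing lines also contain `-->`, so the same helper might serve both formats.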

Can anyone tell me what I'm doing wrong? I am thankful for any help and tips!

Questioner: MareikeP
