Reproduced from stackoverflow.com.
python text python-pptx

Preserve text format on read/write to shape text python pptx

Posted on 2020-04-08 23:41:11

I am looking to perform text replacements in a shape's text. I am using code similar to the snippet below:

# define search keys and their replacement values
SRKeys, SRVals = ['x', 'y', 'z'], [1, 2, 3]

# read the shape's text as one plain string
text = shape.text

# substitute each key with its value
for key, val in zip(SRKeys, SRVals):
    text = text.replace(key, str(val))

# write the substituted text back to the shape
shape.text = text

However, if the initial shape.text contains formatted characters (bold, for example), the formatting is removed on the read. Is there a solution for this?

The only thing I could think of is to iterate over the characters and check for formatting, then add these formats before writing to shape.text.

Questioner: Michael Berk
Viewed: 61

Answer by scanny, 2020-02-02 06:05

@usr2564301 is on the right track. Character formatting (a.k.a. "font") is specified at the run level. That is what a run is: a "run" (sequence) of characters that all share the same character formatting.

When you assign to shape.text you replace all the runs that used to be there with a single new run having default formatting. If you want to preserve formatting you need to preserve whatever runs are not directly involved in the text replacement.

This is not a trivial problem because there is no guarantee runs break on word boundaries. Try printing out the runs for a few paragraphs and I think you'll see what I mean.
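
To see where the run boundaries actually fall, something like the following could be used. It is only a sketch: describe_runs is a hypothetical helper name, and it assumes shape is a python-pptx shape that has a text frame. It relies only on shape.text_frame.paragraphs, paragraph.runs, run.text and run.font.bold.

```python
def describe_runs(shape):
    """Return (paragraph_idx, run_idx, text, bold) for every run in a shape.

    Hypothetical helper, not part of python-pptx. Assumes `shape` has a
    text frame (check shape.has_text_frame first on real slides).
    """
    rows = []
    for p_idx, paragraph in enumerate(shape.text_frame.paragraphs):
        for r_idx, run in enumerate(paragraph.runs):
            # run.font.bold is True, False, or None (None means "inherited")
            rows.append((p_idx, r_idx, run.text, run.font.bold))
    return rows
```

Printing each row (for row in describe_runs(shape): print(row)) will often show a single word spread over two or more runs, which is why a plain str.replace() on one run at a time can miss a match.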

In rough pseudocode, I think this is the approach you would need to take:

  • Search for the target text in the paragraph's full text to determine the offset of its first character.
  • Traverse all the runs in the paragraph, keeping a running total of how many characters come before each run, maybe something like (run_idx, prefix_len, length): (0, 0, 8), (1, 8, 4), (2, 12, 9), etc.
  • Identify which runs are the starting, ending, and in-between runs that involve your search string.
  • Split the first run at the start of the search term, split the last run at the end of the search term, and delete all but the first of the "middle" runs (the runs that now hold only matched text).
  • Change the text of the remaining middle run to the replacement text and clone its formatting from the prior (original start) run. Maybe this last bit is done at split-start time.

This preserves any runs that do not involve the search string and preserves the formatting of the "matched" word in the "replaced" word.
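
The offset bookkeeping in the steps above can be sketched in plain Python. This is a simplified variant, not the exact split-and-delete procedure: rather than splitting runs and deleting the middle ones, it writes the replacement into the run where the match starts and trims the matched characters out of the later runs, which may leave empty runs behind. The names run_spans and replace_in_runs are hypothetical, and the only assumption about each run is that it has a readable and writable .text attribute, as python-pptx runs do.

```python
def run_spans(runs):
    """Return (run_idx, prefix_len, length) for each run."""
    spans, offset = [], 0
    for idx, run in enumerate(runs):
        spans.append((idx, offset, len(run.text)))
        offset += len(run.text)
    return spans


def replace_in_runs(runs, search, replacement):
    """Replace the first occurrence of `search` across a sequence of runs.

    Only run.text is modified, so each run keeps its own formatting and the
    replacement takes on the formatting of the run where the match starts.
    Returns True if a replacement was made, False otherwise.
    """
    if not search:
        return False
    full_text = "".join(run.text for run in runs)
    start = full_text.find(search)
    if start == -1:
        return False
    end = start + len(search)

    inserted = False
    for idx, prefix_len, length in run_spans(runs):
        run_start, run_end = prefix_len, prefix_len + length
        if run_end <= start or run_start >= end:
            continue  # this run does not overlap the match; leave it alone
        text = runs[idx].text
        before = text[: max(start - run_start, 0)]      # kept text ahead of the match
        after = text[min(end, run_end) - run_start:]    # kept text behind the match
        if inserted:
            runs[idx].text = before + after
        else:
            # first overlapping run: this is where the replacement goes
            runs[idx].text = before + replacement + after
            inserted = True
    return True
```

With python-pptx this might be called once per paragraph, for example replace_in_runs(paragraph.runs, 'x', '1'), and repeated while it returns True if every occurrence should be replaced (provided the replacement does not itself contain the search string).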

This requires a few operations that are not directly supported by the current API. For those you'd need to use lower-level lxml calls to manipulate the XML directly, although you can get hold of all the existing elements you need from python-pptx objects without ever having to parse the XML yourself.
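
As a rough illustration of those lower-level calls, the two helpers below clone and delete a run through its underlying XML element. This is a sketch under assumptions: clone_run_after and delete_run are hypothetical names, and run._r is a private python-pptx attribute (the run's <a:r> element) that could change between versions. The addnext(), getparent() and remove() calls are standard lxml element methods.

```python
import copy


def clone_run_after(run):
    """Insert a copy of the run's <a:r> element right after the original.

    The copy carries the same run properties (formatting) and the same text.
    Returns the new lxml element; re-read paragraph.runs afterwards to get a
    run object for it.
    """
    new_r = copy.deepcopy(run._r)   # private attribute: the run's XML element
    run._r.addnext(new_r)           # lxml: insert as the next sibling
    return new_r


def delete_run(run):
    """Remove the run's <a:r> element from its parent paragraph."""
    r = run._r
    r.getparent().remove(r)
```

Splitting a run at a character offset could then be done by cloning it and setting the text of the original and of the clone to the two halves.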