I am trying to find an efficient way of parsing files that hold fixed-width lines. For example, the first 20 characters represent one column, characters 21 through 30 another, and so on.
Assuming that each line holds 100 characters, what would be an efficient way to parse it into several components?
I could use string slicing per line, but it's a little bit ugly if the line is big. Are there any other fast methods?
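For example, the slicing approach I have in mind looks something like this (hypothetical column boundaries):

col1 = line[0:20]
col2 = line[20:30]
# ...one slice per column, which gets unwieldy when there are many columns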
Using the Python standard library's struct module would be fairly easy as well as extremely fast, since it's written in C.
Here's how it could be used to do what you want. It also allows columns of characters to be skipped by specifying negative values for the number of characters in the field.
import struct

fieldwidths = (2, -10, 24)  # negative widths represent ignored padding fields
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                     for fw in fieldwidths)
fieldstruct = struct.Struct(fmtstring)
parse = fieldstruct.unpack_from
print('fmtstring: {!r}, recsize: {} chars'.format(fmtstring, fieldstruct.size))

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fields = parse(line)
print('fields: {}'.format(fields))
Output:
fmtstring: '2s 10x 24s', recsize: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')
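As a usage sketch, the parser can then be applied line by line (assuming a hypothetical file named data.txt whose lines are each at least fieldstruct.size characters wide; this form works as-is in Python 2, and the Python 2/3 adaptation follows below):

with open('data.txt') as f:
    for line in f:
        fields = parse(line)  # one tuple of column values per line
        print(fields)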
The following modifications would adapt it to work in Python 2 or 3 (and handle Unicode input):
import struct
import sys

fieldstruct = struct.Struct(fmtstring)
if sys.version_info[0] < 3:
    parse = fieldstruct.unpack_from
else:
    # converts unicode input to a byte string and the results back to unicode
    unpack = fieldstruct.unpack_from
    parse = lambda line: tuple(s.decode() for s in unpack(line.encode()))
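A quick usage sketch: with that in place, the same call works under either version (in Python 3 it accepts a str and returns a tuple of strs):

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
print(parse(line))  # ('AB', 'MNOPQRSTUVWXYZ0123456789')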
Here's a way to do it with string slices, as you were considering but were concerned might get too ugly. The nice thing about it, besides not being all that ugly, is that it works unchanged in both Python 2 and 3, and it can handle Unicode strings. Speed-wise it is, of course, slower than the versions based on the struct module, but it could be sped up slightly by removing the ability to have padding fields (a sketch of that simplification follows the output below).
try:
    from itertools import izip_longest  # added in Py 2.6
except ImportError:
    from itertools import zip_longest as izip_longest  # name change in Py 3.x

try:
    from itertools import accumulate  # added in Py 3.2
except ImportError:
    def accumulate(iterable):
        'Return running totals (simplified version).'
        total = next(iterable)
        yield total
        for value in iterable:
            total += value
            yield total

def make_parser(fieldwidths):
    cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
    pads = tuple(fw < 0 for fw in fieldwidths)  # bool values for padding fields
    flds = tuple(izip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one
    parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)
    # optional informational function attributes
    parse.size = sum(abs(fw) for fw in fieldwidths)
    parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                               for fw in fieldwidths)
    return parse
line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fieldwidths = (2, -10, 24) # negative widths represent ignored padding fields
parse = make_parser(fieldwidths)
fields = parse(line)
print('format: {!r}, rec size: {} chars'.format(parse.fmtstring, parse.size))
print('fields: {}'.format(fields))
Output:
format: '2s 10x 24s', rec size: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')
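Here's a minimal sketch of the simplification mentioned above, assuming all field widths are positive (i.e. no padding fields to skip); make_simple_parser is just an illustrative name:

def make_simple_parser(fieldwidths):
    # precompute the (start, stop) slice bounds once, up front
    cuts, pos = [], 0
    for fw in fieldwidths:
        cuts.append((pos, pos + fw))
        pos += fw
    return lambda line: tuple(line[i:j] for i, j in cuts)

parse = make_simple_parser((2, 10, 24))  # every field is kept, none skipped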
+1 that's nice. In a way, I think this is similar to my approach (at least when you're getting the results), but obviously way faster.
How would that work with unicode, or a UTF-8 encoded string? struct.unpack seems to operate on binary data. I can't get this working.
@Reiner Gerecke: The struct module is designed to operate on binary data. Files with fixed-width fields are legacy jobs which are also highly likely to pre-date UTF-8 (in mindset, if not in chronology). Bytes read from files are binary data. You don't have unicode in files. You need to decode the bytes to get unicode.
@Reiner Gerecke: Clarification: In those legacy file formats, each field is a fixed number of bytes, not a fixed number of characters. Although unlikely to be encoded in UTF-8, they can be encoded in an encoding that has a variable number of bytes per character, e.g. gbk, big5, euc-jp, shift-jis, etc. If you wish to work in unicode, you can't decode the whole record at once; you need to decode each field separately.
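For example (a hypothetical sketch with made-up field boundaries, decoding each fixed-byte-width field on its own):

raw = b'\x93\xfa\x96\x7bABCDEFGH'  # two double-byte Shift-JIS characters, then 8 ASCII bytes
fields = tuple(raw[i:j].decode('shift-jis') for i, j in ((0, 4), (4, 12)))
print(fields)  # the 4-byte field decodes to 2 characters, the 8-byte field to 8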
This breaks down entirely when you try to apply it to Unicode values (as in Python 3) with text outside the ASCII character set, where 'fixed width' means a fixed number of characters, not bytes.