I am trying to find an efficient way of parsing files that hold fixed-width lines. For example, the first 20 characters represent one column, characters 21 through 30 another, and so on.
Assuming that each line holds 100 characters, what would be an efficient way to parse it into several components?
I could use string slicing per line, but it's a little bit ugly if the line is big. Are there any other fast methods?
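For example, the slicing approach I have in mind looks something like this (hypothetical column boundaries):

col1 = line[0:20]
col2 = line[20:30]
# ...one slice per column, which gets unwieldy when there are many columns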
Using the Python standard library's struct module would be fairly easy as well as extremely fast, since it's written in C.
Here's how it could be used to do what you want. It also allows columns of characters to be skipped by specifying negative values for the number of characters in the field.
import struct

fieldwidths = (2, -10, 24)  # negative widths represent ignored padding fields
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                     for fw in fieldwidths)
fieldstruct = struct.Struct(fmtstring)
parse = fieldstruct.unpack_from
print('fmtstring: {!r}, recsize: {} chars'.format(fmtstring, fieldstruct.size))

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fields = parse(line)
print('fields: {}'.format(fields))
Output:
fmtstring: '2s 10x 24s', recsize: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')
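As a usage sketch, the parser can then be applied line by line (assuming a hypothetical file named data.txt whose lines are each at least fieldstruct.size characters wide; this form works as-is in Python 2, and the Python 2/3 adaptation follows below):

with open('data.txt') as f:
    for line in f:
        fields = parse(line)  # one tuple of column values per line
        print(fields)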
The following modifications would adapt it to work in Python 2 or 3 (and handle Unicode input):
import struct
import sys

fieldstruct = struct.Struct(fmtstring)
if sys.version_info[0] < 3:
    parse = fieldstruct.unpack_from
else:
    # converts unicode input to a byte string and the results back to unicode
    unpack = fieldstruct.unpack_from
    parse = lambda line: tuple(s.decode() for s in unpack(line.encode()))
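A quick usage sketch: with that in place, the same call works under either version (in Python 3 it accepts a str and returns a tuple of strs):

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
print(parse(line))  # ('AB', 'MNOPQRSTUVWXYZ0123456789')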
Here's a way to do it with string slices, as you were considering but were concerned might get too ugly. The nice thing about it, besides not being all that ugly, is that it works unchanged in both Python 2 and 3, and it can handle Unicode strings. Speed-wise it is, of course, slower than the versions based on the struct module, but it could be sped up slightly by removing the ability to have padding fields (a sketch of that simplification follows the output below).
try:
    from itertools import izip_longest  # added in Py 2.6
except ImportError:
    from itertools import zip_longest as izip_longest  # name change in Py 3.x

try:
    from itertools import accumulate  # added in Py 3.2
except ImportError:
    def accumulate(iterable):
        'Return running totals (simplified version).'
        total = next(iterable)
        yield total
        for value in iterable:
            total += value
            yield total

def make_parser(fieldwidths):
    cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
    pads = tuple(fw < 0 for fw in fieldwidths)  # bool values for padding fields
    flds = tuple(izip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one
    parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)
    # optional informational function attributes
    parse.size = sum(abs(fw) for fw in fieldwidths)
    parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                               for fw in fieldwidths)
    return parse
line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fieldwidths = (2, -10, 24) # negative widths represent ignored padding fields
parse = make_parser(fieldwidths)
fields = parse(line)
print('format: {!r}, rec size: {} chars'.format(parse.fmtstring, parse.size))
print('fields: {}'.format(fields))
Output:
format: '2s 10x 24s', rec size: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')
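Here's a minimal sketch of the simplification mentioned above, assuming all field widths are positive (i.e. no padding fields to skip); make_simple_parser is just an illustrative name:

def make_simple_parser(fieldwidths):
    # precompute the (start, stop) slice bounds once, up front
    cuts, pos = [], 0
    for fw in fieldwidths:
        cuts.append((pos, pos + fw))
        pos += fw
    return lambda line: tuple(line[i:j] for i, j in cuts)

parse = make_simple_parser((2, 10, 24))  # every field is kept, none skipped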
+1 that's nice. In a way, I think this is similar to my approach (at least when you're getting the results), but obviously way faster.
How would that work with unicode, or a UTF-8 encoded string? struct.unpack seems to operate on binary data. I can't get this working.
@Reiner Gerecke: The struct module is designed to operate on binary data. Files with fixed-width fields are legacy jobs which are also highly likely to pre-date UTF-8 (in mindset, if not in chronology). Bytes read from files are binary data. You don't have unicode in files. You need to decode the bytes to get unicode.
@Reiner Gerecke: Clarification: In those legacy file formats, each field is a fixed number of bytes, not a fixed number of characters. Although unlikely to be encoded in UTF-8, they can be encoded in an encoding that has a variable number of bytes per character, e.g. gbk, big5, euc-jp, shift-jis, etc. If you wish to work in unicode, you can't decode the whole record at once; you need to decode each field separately.
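For example (a hypothetical sketch with made-up field boundaries, decoding each fixed-byte-width field on its own):

raw = b'\x93\xfa\x96\x7bABCDEFGH'  # two double-byte Shift-JIS characters, then 8 ASCII bytes
fields = tuple(raw[i:j].decode('shift-jis') for i, j in ((0, 4), (4, 12)))
print(fields)  # the 4-byte field decodes to 2 characters, the 8-byte field to 8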
This breaks down entirely when you try to apply it to Unicode values (as in Python 3) with text outside the ASCII character set, where 'fixed width' means a fixed number of characters, not bytes.