Note: This article is reproduced from serverfault.com.

High RAM usage by eval function in Python

Posted on 2020-11-30 10:34:43

I read line 1 (not line 0) as a string from each of 2 files (the first ~30MB, the second ~50MB); line 0 only holds some information which I don't need at the moment. Line 1 is a string representation of an array which contains around 1.3E6 smaller arrays such as ['I1000009', 'A', '4024', 'A'].

[[['I1000009', 'A', '4024', 'A'], ['I1000009', 'A', '6734', 'G'],...],[['H1000004', 'B', '4024', 'A'], ['L1000009', 'B', '6734', 'C'],...],[and so on],...]

Both files are filled in the same way, which is why they are between 30 and 50MB in size. I read the files with my .py script to get access to the individual pieces of information I need:

import sys

myID        = sys.argv[1]
otherID     = sys.argv[2]

samePath        = '/home/srv/Dokumente/srv/' 
FolderName      = 'checkArrays/'
finishedFolder  = samePath+'finishedAnalysis/'
myNewFile       = samePath+FolderName+myID[0]+'/'+myID+'.txt'
otherFile       = samePath+FolderName+otherID[0]+'/'+otherID+'.txt'
nameFileOKarray = '_array_goodData.txt'

import csv 
import os 
import re #for regular expressions
# Text 2 - Start
import operator # for sorting the csv files
# Text 2 - End

whereIsMyArray    = 1
text_file         = open(finishedFolder+myID+nameFileOKarray, "r")
line              = text_file.readlines()[whereIsMyArray:];
myGoodFile        = eval(line[0])
text_file.close()

text_file         = open(finishedFolder+otherID+nameFileOKarray, "r")
line              = text_file.readlines()[whereIsMyArray:];
otherGoodFile     = eval(line[0])
text_file.close()

print(str(myGoodFile[0][0][0]))
print(str(otherGoodFile[0][0][0]))

The problem I have is that when I start my .py script from the shell:

python3 checkarr_v1.py 44 39

the memory usage on my 4GB Pi server climbs to the limit of RAM and swap, and the server dies. I then tried running the .py script on a server with 32GB of RAM, and it worked, but the RAM usage is really huge. See the pictures:

(slack mode) Overview of normal RAM and CPU usage: slackmode

(start sequence) Overview at the highest usage of RAM (~6GB) and CPU: highest point

After that, the usage goes up and down for ~1min: from 1.2GB to 3.6GB, then to 1.7GB, then to 1GB. The script then finishes after ~1min and shows the correct output.

Can you help me understand whether there is a better way to solve this on a 4GB Raspberry Pi? Is there a better way to write the 2 files, since the [ " , ] characters also take up space in the file? Is there a better solution than the eval function for turning that string into an array? Sorry for the questions, but I can't understand why 80MB of files push the RAM usage to around 6GB. It sounds like I am doing something wrong. Best regards and thanks.

Questioner: user2
Mattias Nilsson 2020-11-30 22:07:56

1.3E9 arrays is going to be lots and lots of bytes if you read that into your application, no matter what you do.

I don't know if your code does what you actually want to do, but you're only ever using the first data item. If that's what you want to do, then don't read the whole file, just read that first part.
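If only the first entry is needed, a minimal sketch could skip line 0 and read just the first few hundred characters of line 1 instead of the whole line. The function name `first_record`, the chunk size, and the assumption that line 1 starts with a record of single-quoted strings containing no quotes or brackets are illustrative and not taken from the question.

```python
import ast
import re

# Matches the first innermost list of single-quoted strings,
# e.g. ['I1000009', 'A', '4024', 'A'].
# Assumption: the strings themselves contain no quotes or brackets.
RECORD = re.compile(r"\['[^'\[\]]*'(?:,\s*'[^'\[\]]*')*\]")

def first_record(path, chunk_size=256):
    """Return the first inner list on line 1 without loading the whole line."""
    with open(path, "r") as f:
        f.readline()                 # skip line 0 (header)
        chunk = f.read(chunk_size)   # only the start of line 1
    match = RECORD.search(chunk)
    if match is None:
        raise ValueError("no complete record found in the first chunk")
    # literal_eval only accepts Python literals, unlike eval
    return ast.literal_eval(match.group(0))
```

With the layout shown in the question, `first_record(path)[0]` would correspond to the value the original script prints as `myGoodFile[0][0][0]`.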

But also: I would advise against using "eval" for deserializing data. The built-in json module will give you the data in almost the same format (if you control the input format).
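As a rough illustration of the json suggestion, the sketch below writes a nested list in the same two-line layout (line 0 header, line 1 data) and reads it back with `json.loads` instead of `eval`. The helper names `save_array` and `load_array` and the header text are made up for this example. It also assumes that the script producing the files can be changed, since JSON requires double-quoted strings and the existing single-quoted files would not parse.

```python
import json

def save_array(path, data, header="info"):
    """Write a header on line 0 and the data as one JSON document on line 1."""
    with open(path, "w") as f:
        f.write(header + "\n")
        # compact separators avoid a space after every comma
        json.dump(data, f, separators=(",", ":"))
        f.write("\n")

def load_array(path):
    """Skip line 0 and parse line 1 with json.loads (no code is executed)."""
    with open(path, "r") as f:
        f.readline()                    # line 0: header, not needed
        return json.loads(f.readline())
```

The compact separators also shave a little off the file size, which touches on the question about the space taken by the punctuation, although the quotes and brackets themselves remain.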

Still, in the end: If you want to hold that much data in your program, you're looking at many GB of memory usage.

If you just want to process it, I'd take a more iterative approach and do a little at a time rather than swallowing the whole files. Especially with limited resources.
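One way to sketch the iterative approach, assuming the files can be rewritten with one group per line (a JSON Lines style layout that the question does not currently use), is a generator that yields one group at a time so that only a single group is held in memory at once. `iter_groups` is a hypothetical name.

```python
import json

def iter_groups(path, skip_header=True):
    """Yield one parsed group per line instead of loading the whole file."""
    with open(path, "r") as f:
        if skip_header:
            next(f, None)            # line 0: header, not needed
        for line in f:
            line = line.strip()
            if line:                 # ignore blank lines
                yield json.loads(line)
```

A caller would loop with `for group in iter_groups(path): ...` and keep only the values it actually needs from each group.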

Update: I see now that it's 1.3e6, not 1.3e9 entries. Big difference. :-) Then json data should be okay. On my machine a list of 1.3M ['RlFKUCUz', 'A', '4024', 'A'] entries takes about 250MB.
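A figure like this can be checked locally with the standard `tracemalloc` module. The sketch below builds a list of small four-string records and reports the peak traced allocation. The function name and the synthetic record contents are assumptions, and the result will vary with Python version and platform.

```python
import tracemalloc

def measure_list_memory(n):
    """Build n small 4-string lists and return (record count, peak bytes traced)."""
    tracemalloc.start()
    # synthetic records shaped like the ones in the question
    data = [[f"I{i:07d}", "A", "4024", "A"] for i in range(n)]
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return len(data), peak
```

Calling `measure_list_memory(1_300_000)` gives a rough estimate for the data size in the question; a smaller `n` is enough to see how memory grows per record.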