I read line 1 (not line 0) of two files as a string (the first file is ~30MB, the second ~50MB); line 0 only holds some information that I don't need at the moment. Line 1 is a string representation of an array which contains around 1.3E6 smaller arrays like ['I1000009', 'A', '4024', 'A'] as entries:
[[['I1000009', 'A', '4024', 'A'], ['I1000009', 'A', '6734', 'G'],...],[['H1000004', 'B', '4024', 'A'], ['L1000009', 'B', '6734', 'C'],...],[and so on],...]
Both files are filled the same way; that's why they are between 30 and 50MB in size. I read those files with my .py script to get access to the individual pieces of information I need:
import sys
myID = sys.argv[1]
otherID = sys.argv[2]
samePath = '/home/srv/Dokumente/srv/'
FolderName = 'checkArrays/'
finishedFolder = samePath+'finishedAnalysis/'
myNewFile = samePath+FolderName+myID[0]+'/'+myID+'.txt'
otherFile = samePath+FolderName+otherID[0]+'/'+otherID+'.txt'
nameFileOKarray = '_array_goodData.txt'
import csv
import os
import re #for regular expressions
# Text 2 - Start
import operator # for sorting the csv files
# Text 2 - End
whereIsMyArray = 1
text_file = open(finishedFolder+myID+nameFileOKarray, "r")
line = text_file.readlines()[whereIsMyArray:]
myGoodFile = eval(line[0])
text_file.close()

text_file = open(finishedFolder+otherID+nameFileOKarray, "r")
line = text_file.readlines()[whereIsMyArray:]
otherGoodFile = eval(line[0])
text_file.close()
print(str(myGoodFile[0][0][0]))
print(str(otherGoodFile[0][0][0]))
The problem I have is that if I start my .py script from the shell:
python3 checkarr_v1.py 44 39
the RAM of my 4GB Pi server climbs to the limit of RAM plus swap and the process dies. I then started the .py script on a 32GB RAM server, and there it worked, but the RAM usage is really huge. See pics:
(slack mode) overview of normal RAM and CPU usage: slackmode
(start sequence) overview at the highest RAM usage, ~6GB, and CPU: highest point
Then it goes up and down for about a minute: 1.2GB to 3.6GB, then to 1.7GB, then to 1GB, and after ~1 min the script finishes and the right output is shown.
Can you help me understand whether there is a better way to solve this on a 4GB Raspberry Pi? Is there a better way to write the two files, since the [",] symbols also take up space in the file? Is there a better solution than the eval function for turning that string into an array? Sorry for all the questions, but I can't understand why ~80MB of files drive the RAM usage up to around 6GB. It sounds like I'm doing something wrong. Best regards and thanks!
1.3E9 arrays is going to be lots and lots of bytes if you read that into your application, no matter what you do.
I don't know if your code does what you actually want to do, but you're only ever using the first data item. If that's what you want to do, then don't read the whole file, just read that first part.
But also: I would advise against using "eval" for deserializing data. The built-in json module will give you data in almost the same format (if you control the input format).
Still, in the end: If you want to hold that much data in your program, you're looking at many GB of memory usage.
If you just want to process it, I'd take a more iterative approach and do a little at a time rather than swallow the whole files. Especially with limited resources.
Update: I see now that it's 1.3e6, not 1.3e9, entries. Big difference. :-) Then json data should be okay. On my machine, a list of 1.3M ['RlFKUCUz', 'A', '4024', 'A'] entries takes about 250MB.
1) The code does what it has to do; it's nothing spectacular. 2) What do you mean by "using the first data item"? If you mean reading just the first 200 parts of that string and converting only that part to an array, then forget it! I need the whole array to compare it with a MySQL database. 3) Thanks for the json link; I think that's the reason. I thought something was wrong with that eval function. Big thanks! 4) Yes, I think I really need many GB of RAM to do this for more than 100 users at the same time. 5) I didn't understand the iterative approach. Do you have an example?
@user2 2) The myGoodFile[0][0][0] seems to only be accessing the very first data item. 5) An iterative approach means reading a little at a time rather than loading everything into memory at once. For example, if you had a different layout of the file where each "section" was one line, you could easily process one line at a time and then throw it away. That way you would not need to hold all the data in memory at once.

You mean something like each line having its own array, like [['RlFKUCUz', 'A', '4024', 'A'], ['2', 'A', '4111', 'B'], ['bla', 'X', '4024', 'C'], ...], where only the arrays for A are on one line and those for B on another, and so on? Then I would have between 50k and 70k entries on just one single line instead of all of them. It's an opportunity, but I think I will first try that JSON trick. Maybe it works, but first I have to figure out how to do that and rewrite much of my old code. Thank you and @MauriceMeyer. Now I have a second way.