Thread: [Python] Unzipping a large list
I'm working with large data files (several gigabytes) that store interleaved data from several channels: first value of channel A, first value of channel B, second value of channel A, second value of channel B, and so on. Sometimes there may also be more than 2 channels. The data is stored as unsigned 16-bit integers, with the bytes ordered like this:

Code:
[a10, a11, b10, b11, a20, a21, b20, b21...]

and so on, for millions of data values. My goal is to separate the channels and save them to individual files. I have a couple of questions.
First. I want to separate out each channel on its own, so that I get:

Code:
[a10, a11, a20, a21...]
[b10, b11, b20, b21...]

This would be easy with numpy (using the numpy.reshape() function and then slicing the result), but I think that involves an operation of interpreting the binary values.
I know I could do this with a loop, but that seems terribly inefficient given how much data I have to go through. Is there a way to pull apart an array like this?
Second. Since reading the whole file in at once is not an option, I'm going to read it in piecemeal, keep the target files open the whole time, and use the file.flush() method on each target file after each block has been processed and written. Does .flush() free the memory used by the file buffer?
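Roughly, the loop I have in mind looks like this (just a sketch; the block size is an arbitrary choice, and split_channels stands in for whatever the separation step ends up being):

Code:
import numpy as np

BLOCK = 1024**2  # values per read; arbitrary, assumed a multiple of 4 here

def split_channels(block):
    # placeholder: two channels, two consecutive values each per frame
    pairs = block.reshape((block.size // 2, 2))
    return pairs[::2, :].reshape(-1), pairs[1::2, :].reshape(-1)

src = open("input.data", "rb")
out_a = open("a.data", "wb")
out_b = open("b.data", "wb")
while True:
    block = np.fromfile(src, dtype=np.uint16, count=BLOCK)
    if block.size == 0:
        break
    a, b = split_channels(block)
    a.tofile(out_a)
    b.tofile(out_b)
    out_a.flush()  # this is the call I am asking about
    out_b.flush()
src.close()
out_a.close()
out_b.close()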
I'm working in Python 2.7 on a Windows machine.
numpy operates on views and does not copy the data: a reshape does nothing to the data itself, it only changes how it is interpreted.
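You can check the no-copy behaviour yourself; writing through a reshaped and sliced view changes the original buffer (a minimal demonstration):

Code:
import numpy as np

x = np.arange(8, dtype=np.uint16)
y = x.reshape((4, 2))  # new shape, same underlying buffer
z = y[::2, :]          # strided slice, still no copy
z[0, 0] = 99
print x[0]                       # 99 -- the write went through the view
print np.may_share_memory(x, z)  # True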
In any case, your approach looks like the most efficient way to do this in Python.
Why should any interpretation of binary values be required? numpy arrays can have pretty much any binary representation you wish (Fortran/C order, big/little endian, every basic datatype, etc.).
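For example, if the values on disk happened to be big-endian, you would just say so in the dtype; there is no separate decoding pass (a sketch; '>u2' and '<u2' are big- and little-endian unsigned 16-bit):

Code:
import numpy as np

# the dtype string chooses the on-disk interpretation at read time
data = np.fromfile("input.data", dtype=">u2")    # big-endian uint16
# data = np.fromfile("input.data", dtype="<u2")  # little-endian uint16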
Using reshape, slicing, and memory maps, this should be simple and achieve good performance (for Python).
e.g.:
This reaches about 50 MB/s write speed, which is quite OK; my disk manages 65-70 MB/s.

Code:
import numpy as np
import time

# 100M uint16 values = 200 MB
n = 100 * 1024**2

# build a test input: channel a all ones, channel b all zeros
inp = np.memmap("input.data", dtype=np.uint16, mode='w+', shape=(n,))
inp[:] = 0
inp[::4] = 1
inp[1::4] = 1
inp.flush()
del inp

s = time.time()

inp = np.memmap("input.data", dtype=np.uint16, mode='r', shape=(n,))
a = np.memmap("a.data", dtype=np.uint16, mode='w+', shape=(n // 2,))
b = np.memmap("b.data", dtype=np.uint16, mode='w+', shape=(n // 2,))

# view the stream as frames of two consecutive values;
# even frames belong to channel a, odd frames to channel b
tmp = inp.reshape((n // 2, 2))
a[:] = tmp[::2, :].reshape((n // 2,))
b[:] = tmp[1::2, :].reshape((n // 2,))
a.flush()
b.flush()

print "%g MB/s" % ((2 * n / 1024**2) / (time.time() - s))
(Note that the input gets cached in memory if it is small enough, so throughput under realistic scenarios will be lower.)
If you need to support 32-bit operating systems, you will have to use sliding memory maps or normal file I/O.
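A sliding map just means remapping one window at a time through the offset argument of np.memmap, roughly like this (a sketch; it assumes the total length is a multiple of the window size):

Code:
import numpy as np

n = 100 * 1024**2   # total uint16 values in the file
window = 1024**2    # values per mapped window, a multiple of 4
itemsize = np.dtype(np.uint16).itemsize

out_a = open("a.data", "wb")
out_b = open("b.data", "wb")
for start in xrange(0, n, window):
    m = np.memmap("input.data", dtype=np.uint16, mode="r",
                  offset=start * itemsize, shape=(window,))
    tmp = m.reshape((window // 2, 2))
    tmp[::2, :].reshape(-1).tofile(out_a)   # channel a frames
    tmp[1::2, :].reshape(-1).tofile(out_b)  # channel b frames
    del m                                   # unmap before sliding forward
out_a.close()
out_b.close()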
The file object has underlying fixed-size buffers that are flushed when they are full. This should work for any file size; you should not need to worry about it.
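If you do want explicit control, open() accepts a buffer size as its third argument (a sketch; 1 MB is an arbitrary choice):

Code:
# 1 MB buffer per target file, flushed automatically when it fills
out_a = open("a.data", "wb", 1024**2)
out_b = open("b.data", "wb", 1024**2)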