
Feature request: read headers only #37

@dojeda


I am sorry for filing so many issues at once. What I really wanted was to implement the feature explained in this issue, but while doing so I ran into the problems/discussion points raised in #34, #35 and #36.

The feature I would like to propose is an extension to load_xdf that is significantly faster but only reads the headers.

My use-case is the following: I am importing all my XDF files into another database/filesystem, and I wanted to save the information present in the headers to a database (hence my discussion on #36). The problem is that I have a lot of files, and some of them are big (about 3 GB; probably recordings that were started and then forgotten, though they could also be long recording sessions).

The way I plan to implement this is to use the tag that identifies the chunk type together with the chunk size in bytes (https://github.com/sccn/xdf/wiki/Specifications#chunk). When the tag is not FileHeader (1) or StreamHeader (2), I move the file pointer to the beginning of the next chunk.

I managed to achieve this with the following code:

def load_xdf(filename,
             ...,
             headers_only=False):

           ...

            # read [Tag]
            tag = struct.unpack('<H', f.read(2))[0]
            log_str = ' Read tag: {} at {} bytes, length={}'.format(tag, f.tell(), chunklen)
            if tag in [2, 3, 4, 6]:
                StreamId = struct.unpack('<I', f.read(4))[0]
                log_str += ', StreamId={}'.format(StreamId)

            logger.debug(log_str)
            # ^^^ Keeping this code to show a reference of where the modification goes

            # Quick read of headers only: when the chunk is not a header,
            # move the file pointer to the beginning of the next chunk
            if headers_only and tag not in (1, 2):
                offset = 2   # we already read 2 bytes for the tag
                if tag in (2, 3, 4, 6):
                    # in these cases, we also read 4 bytes for the StreamId
                    offset += 4
                # move chunklen-offset bytes forward, relative to the
                # current position (whence=1)
                f.seek(chunklen - offset, 1)
                continue

With this modification, I can cut the read time of a 3 GB file from about two and a half minutes down to 5 seconds. Considering I have several hundred files (not that big, though), I am saving quite a lot of time.
If you agree with this modification, I would gladly make a PR for it.
