Index tarball which starts at a non-zero byte offset #11

Open
@ProximaNova

Description

Feature request: make tarindexer able to index a TAR file where the TAR data does not start at the first/zeroth byte of the file.

Motivation: a multi-part tarball such as file.tar.aa, file.tar.ab, file.tar.ac, etc., where each part is 1.8 GB in size. Only .aa has TAR data starting at byte offset zero; the subsequent parts almost never do.

For Unix Standard TAR (ustar), valid TAR data can start at any 512-byte header block. To find one, look for the magic string "ustar" directly in the binary data; the header starts 257 bytes before the offset where the string "ustar" begins. Note that "ustar" can also occur inside file contents, so a match is only a candidate. If you think a header will show up in the first 9 megabytes, run "$ head -c 9000111 f.tar | grep -boa ustar" to find it.
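The search described above can also be sketched in Python. This is only an illustration and not code from tarindexer: the function names `find_first_header` and `header_checksum_ok` are made up for this example. In addition to the 257-byte rule, it checks the header checksum field (bytes 148 to 155 of the block), so that a stray "ustar" inside file contents is not mistaken for a header.

```python
BLOCK = 512          # size of one TAR header block
MAGIC_OFFSET = 257   # position of the "ustar" magic inside a header block


def header_checksum_ok(block: bytes) -> bool:
    """Check the octal checksum stored in bytes 148..155 of a header block."""
    field = block[148:156].strip(b"\0 ")
    try:
        stored = int(field, 8)
    except ValueError:
        return False
    # The checksum is computed with the checksum field itself read as 8 spaces.
    computed = sum(block[:148]) + 8 * 0x20 + sum(block[156:BLOCK])
    return stored == computed


def find_first_header(data: bytes) -> int:
    """Return the offset of the first plausible ustar header in data, or -1."""
    pos = data.find(b"ustar")
    while pos != -1:
        start = pos - MAGIC_OFFSET
        if start >= 0 and len(data) >= start + BLOCK:
            if header_checksum_ok(data[start:start + BLOCK]):
                return start
        pos = data.find(b"ustar", pos + 1)
    return -1
```

To mirror the "head -c 9000111" command, one could read the first 9000111 bytes of the part with `open(path, "rb").read(9000111)` and pass them to `find_first_header`; the returned value is the offset to give to the indexer.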

At github.com/ProximaNova/tarindexer, the code and documentation both work for this:

Usage:

create index file (set offset to 0 for normal TAR files):

tarindexer -i tarfile.tar indexfile.index 0

As of

You can set the offset to 985532 instead of 0 if that's where the TAR data starts, for example. This isn't a pull request because I feel the code isn't finished: it now requires you to specify the offset every time you run it (even for normal .tar files, which start at zero). Further programming that I don't really feel like doing now: make the offset an optional parameter that defaults to 0 (zero). I also looked at Python's tarfile module: it doesn't seem to have a way to skip arbitrary bytes at the start of the file until it finds a valid TAR (pax) header.
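For the optional offset, a minimal sketch using argparse could look like the following. This is an assumption about how it might be done, not how tarindexer actually parses its arguments: `build_parser` and the argument names are hypothetical, and only the "-i tarfile indexfile [offset]" form from the usage above is modelled.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for: tarindexer -i tarfile indexfile [offset]."""
    parser = argparse.ArgumentParser(prog="tarindexer")
    parser.add_argument("-i", dest="create_index", action="store_true",
                        help="create an index file for the given TAR file")
    parser.add_argument("tarfile", help="path to the TAR file (or part)")
    parser.add_argument("indexfile", help="path of the index file to write")
    # nargs="?" makes the positional optional; it falls back to default=0.
    parser.add_argument("offset", nargs="?", type=int, default=0,
                        help="byte offset of the first TAR header (default: 0)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.tarfile, args.indexfile, args.offset)
```

With this, "tarindexer -i tarfile.tar indexfile.index" would use offset 0, and "tarindexer -i file.tar.ab indexfile.index 985532" would use 985532.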

Edit 1: I forgot to mention that with tarindexer minus my improvements, if you try to index a TAR file like file.tar.ab (where the TAR data starts at a non-zero offset), it just fails and reports an invalid header, because it can only read the file starting at offset zero. Therefore this issue is not just a feature request but also a bug fix.

Edit 2: "minor" fixes. Edit 3: minor fix.

      Index tarball which starts at a non-zero byte offset · Issue #11 · devsnd/tarindexer