What method does unzip use to find a single file in an archive?

Let's say I create 100 files with random text data of size 30MB each. Now I create a zip archive with 0 compression i.e. zip dataset.zip -r -0 *.txt. Now I want to extract just one file from this archive.

As described here, there are two ways of unzipping/extracting files from archives:

Seek to the end of the file and lookup the central directory. Then use that for fast random access to the file to be extracted.(Amortized O(1) complexity)

Look through each local header and extract the one where theres a match.(O(n) complexity)

Which method does unzip use? From my experiments it seems like it uses method 2?

edited Jan 30 at 1:56

Olorin

3,3331417

asked Jan 29 at 17:59

tangy

1315

1

Reading the central directory is still a linear search just like going through each local header and so both are O(n). Reading the central directory will typically mean reading a lot fewer disk blocks but the number of blocks read is also still O(n).

– Ross Ridge
Jan 30 at 2:47

add a comment |

As described here, there are two ways of unzipping/extracting files from archives:

Seek to the end of the file and lookup the central directory. Then use that for fast random access to the file to be extracted.(Amortized O(1) complexity)

Look through each local header and extract the one where theres a match.(O(n) complexity)

Which method does unzip use? From my experiments it seems like it uses method 2?

edited Jan 30 at 1:56

Olorin

3,3331417

asked Jan 29 at 17:59

tangy

1315

1

Reading the central directory is still a linear search just like going through each local header and so both are O(n). Reading the central directory will typically mean reading a lot fewer disk blocks but the number of blocks read is also still O(n).

– Ross Ridge
Jan 30 at 2:47

add a comment |

As described here, there are two ways of unzipping/extracting files from archives:

Seek to the end of the file and lookup the central directory. Then use that for fast random access to the file to be extracted.(Amortized O(1) complexity)

Look through each local header and extract the one where theres a match.(O(n) complexity)

Which method does unzip use? From my experiments it seems like it uses method 2?

edited Jan 30 at 1:56

Olorin

3,3331417

asked Jan 29 at 17:59

tangy

1315

As described here, there are two ways of unzipping/extracting files from archives:

Seek to the end of the file and lookup the central directory. Then use that for fast random access to the file to be extracted.(Amortized O(1) complexity)

Look through each local header and extract the one where theres a match.(O(n) complexity)

Which method does unzip use? From my experiments it seems like it uses method 2?

zip archive

edited Jan 30 at 1:56

Olorin

3,3331417

asked Jan 29 at 17:59

tangy

1315

edited Jan 30 at 1:56

Olorin

3,3331417

asked Jan 29 at 17:59

tangy

1315

edited Jan 30 at 1:56

Olorin

3,3331417

edited Jan 30 at 1:56

Olorin

3,3331417

edited Jan 30 at 1:56

Olorin

3,3331417

asked Jan 29 at 17:59

tangy

1315

asked Jan 29 at 17:59

tangy

1315

asked Jan 29 at 17:59

tangy

1315

1

Reading the central directory is still a linear search just like going through each local header and so both are O(n). Reading the central directory will typically mean reading a lot fewer disk blocks but the number of blocks read is also still O(n).

– Ross Ridge
Jan 30 at 2:47

add a comment |

1

Reading the central directory is still a linear search just like going through each local header and so both are O(n). Reading the central directory will typically mean reading a lot fewer disk blocks but the number of blocks read is also still O(n).

– Ross Ridge
Jan 30 at 2:47

Reading the central directory is still a linear search just like going through each local header and so both are O(n). Reading the central directory will typically mean reading a lot fewer disk blocks but the number of blocks read is also still O(n).

– Ross Ridge
Jan 30 at 2:47

add a comment |

2 Answers
2

active

oldest

votes

When searching for a single file in a large archive, it uses method 1, which you can see using strace:

open("dataset.zip", O_RDONLY)           = 3

ioctl(1, TIOCGWINSZ, 0x7fff9a895920)    = -1 ENOTTY (Inappropriate ioctl for device)

write(1, "Archive:  dataset.zipn", 22Archive:  dataset.zip

) = 22

lseek(3, 943718400, SEEK_SET)           = 943718400

read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 4522) = 4522

lseek(3, 943722880, SEEK_SET)           = 943722880

read(3, "3f225P\uxv14350343503", 20) = 20

lseek(3, 943718400, SEEK_SET)           = 943718400

read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 8192) = 4522

lseek(3, 849346560, SEEK_SET)           = 849346560

read(3, "D262nv210343240C24227344367q300223231306330275266213276M7I'&352234J"..., 8192) = 8192

stat("rand-28.txt", 0x559f43e0a550)     = -1 ENOENT (No such file or directory)

lstat("rand-28.txt", 0x559f43e0a550)    = -1 ENOENT (No such file or directory)

stat("rand-28.txt", 0x559f43e0a550)     = -1 ENOENT (No such file or directory)

lstat("rand-28.txt", 0x559f43e0a550)    = -1 ENOENT (No such file or directory)

open("rand-28.txt", O_RDWR|O_CREAT|O_TRUNC, 0666) = 4

ioctl(1, TIOCGWINSZ, 0x7fff9a895790)    = -1 ENOTTY (Inappropriate ioctl for device)

write(1, " extracting: rand-28.txt        "..., 37 extracting: rand-28.txt             ) = 37

read(3, "2753279Y206223217}355W%:220YNT257260z^361T242237021336372+306310"..., 8192) = 8192

unzip opens dataset.zip, seeks to the end, then seeks to the start of the requested file in the archive (rand-28.txt, at offset 849346560) and reads from there.

The central directory is found by scanning the last 65557 bytes of the archive; see the code starting here:

/*---------------------------------------------------------------------------

    Find and process the end-of-central-directory header.  UnZip need only

    check last 65557 bytes of zipfile:  comment may be up to 65535, end-of-

    central-directory record is 18 bytes, and signature itself is 4 bytes;

    add some to allow for appended garbage.  Since ZipInfo is often used as

    a debugging tool, search the whole zipfile if zipinfo_mode is true.

  ---------------------------------------------------------------------------*/

edited Jan 29 at 21:50

answered Jan 29 at 18:05

Stephen Kitt

170k24385462

Could you add additional information if possible about how it actually finds the central directory record(which here seems to be 943718400)

– tangy
Jan 29 at 19:09

1

See my update. Let me know if you would like an interpretation of the code!

– Stephen Kitt
Jan 29 at 21:51

Thanks a ton - both for the addition of relevant sections with references and the offer to help understand! I'll try reading it on my own at first and request you if I do :)

– tangy
Jan 29 at 22:15

add a comment |

Actually it's a mixture. unzip reads some data from a known location, and then reads data blocks related to (but not identical with) the target entry in the zip-file.

The design of zip/unzip is explained in comments in the source-files. Here's the pertinent one from extract.c:

/*--------------------------------------------------------------------------- 

    The basic idea of this function is as follows.  Since the central di- 

    rectory lies at the end of the zipfile and the member files lie at the 

    beginning or middle or wherever, it is not very desirable to simply 

    read a central directory entry, jump to the member and extract it, and 

    then jump back to the central directory.  In the case of a large zipfile 

    this would lead to a whole lot of disk-grinding, especially if each mem- 

    ber file is small.  Instead, we read from the central directory the per- 

    tinent information for a block of files, then go extract/test the whole 

    block.  Thus this routine contains two small(er) loops within a very 

    large outer loop:  the first of the small ones reads a block of files 

    from the central directory; the second extracts or tests each file; and 

    the outer one loops over blocks.  There's some file-pointer positioning 

    stuff in between, but that's about it.  Btw, it's because of this jump- 

    ing around that we can afford to be lenient if an error occurs in one of 

    the member files:  we should still be able to go find the other members, 

    since we know the offset of each from the beginning of the zipfile. 

  ---------------------------------------------------------------------------*/

The format itself is mostly derived from PK-Ware's implementation, and is summarized in programming information text-files. According to that, there's more than one type of record in the central directory as well, so unzip cannot readily go to the end of the file and make an array of entries to lookup the target file.

Now... if you take the time to read the source code, you'll discover that unzip reads buffers of 8192 bytes (look for INBUFSIZ). I'd only use the single-file extract for a fairly large zip file (I had in mind the Java sources), but even for a smaller zip-file, you can see the effect of the buffer size. To see this, I zipped up the Git files for PuTTY, which gave 2727 files (counting a copy of the git log). Java was bigger than than 20 years ago, and hasn't shrunk. Extracting that log from the zip-file (chosen since it wouldn't be at the end of an alphabetically-sorted index and likely not in the first block read from the central directory) gave this from strace for the lseek calls:

lseek(3, -2252, SEEK_CUR)               = 1267

lseek(3, 120463360, SEEK_SET)           = 120463360

lseek(3, 120468731, SEEK_SET)           = 120468731

lseek(3, 120135680, SEEK_SET)           = 120135680

lseek(3, 270336, SEEK_SET)              = 270336

lseek(3, 120463360, SEEK_SET)           = 120463360

As usual, with benchmarks, ymmv.

edited Jan 30 at 21:19

answered Jan 29 at 21:43

Thomas Dickey

52.8k597170

While this makes sense if we are extracting the entire archive, for a single archive why would I need to do things this way? I think it might be using another routine in the case mentioned in the question eg. extract_or_test_member?

– tangy
Jan 29 at 22:09

typo - i meant single file within the archive. With a single file just 2 seeks would be needed right?

– tangy
Jan 29 at 22:15

More than that: it has to read the record which tells how large the central directory is, and it's not guaranteed that it'll read the whole directory while it's doing that.

– Thomas Dickey
Jan 29 at 22:21

add a comment |

Your Answer

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "106"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f497509%2fwhat-method-does-unzip-use-to-find-a-single-file-in-an-archive%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

When searching for a single file in a large archive, it uses method 1, which you can see using strace:

open("dataset.zip", O_RDONLY)           = 3

ioctl(1, TIOCGWINSZ, 0x7fff9a895920)    = -1 ENOTTY (Inappropriate ioctl for device)

write(1, "Archive:  dataset.zipn", 22Archive:  dataset.zip

) = 22

lseek(3, 943718400, SEEK_SET)           = 943718400

read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 4522) = 4522

lseek(3, 943722880, SEEK_SET)           = 943722880

read(3, "3f225P\uxv14350343503", 20) = 20

lseek(3, 943718400, SEEK_SET)           = 943718400

read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 8192) = 4522

lseek(3, 849346560, SEEK_SET)           = 849346560

read(3, "D262nv210343240C24227344367q300223231306330275266213276M7I'&352234J"..., 8192) = 8192

stat("rand-28.txt", 0x559f43e0a550)     = -1 ENOENT (No such file or directory)

lstat("rand-28.txt", 0x559f43e0a550)    = -1 ENOENT (No such file or directory)

stat("rand-28.txt", 0x559f43e0a550)     = -1 ENOENT (No such file or directory)

lstat("rand-28.txt", 0x559f43e0a550)    = -1 ENOENT (No such file or directory)

open("rand-28.txt", O_RDWR|O_CREAT|O_TRUNC, 0666) = 4

ioctl(1, TIOCGWINSZ, 0x7fff9a895790)    = -1 ENOTTY (Inappropriate ioctl for device)

write(1, " extracting: rand-28.txt        "..., 37 extracting: rand-28.txt             ) = 37

read(3, "2753279Y206223217}355W%:220YNT257260z^361T242237021336372+306310"..., 8192) = 8192

unzip opens dataset.zip, seeks to the end, then seeks to the start of the requested file in the archive (rand-28.txt, at offset 849346560) and reads from there.

The central directory is found by scanning the last 65557 bytes of the archive; see the code starting here:

/*---------------------------------------------------------------------------

    Find and process the end-of-central-directory header.  UnZip need only

    check last 65557 bytes of zipfile:  comment may be up to 65535, end-of-

    central-directory record is 18 bytes, and signature itself is 4 bytes;

    add some to allow for appended garbage.  Since ZipInfo is often used as

    a debugging tool, search the whole zipfile if zipinfo_mode is true.

  ---------------------------------------------------------------------------*/

edited Jan 29 at 21:50

answered Jan 29 at 18:05

Stephen Kitt

170k24385462

Could you add additional information if possible about how it actually finds the central directory record(which here seems to be 943718400)

– tangy
Jan 29 at 19:09

1

See my update. Let me know if you would like an interpretation of the code!

– Stephen Kitt
Jan 29 at 21:51

Thanks a ton - both for the addition of relevant sections with references and the offer to help understand! I'll try reading it on my own at first and request you if I do :)

– tangy
Jan 29 at 22:15

add a comment |

When searching for a single file in a large archive, it uses method 1, which you can see using strace:

open("dataset.zip", O_RDONLY)           = 3

ioctl(1, TIOCGWINSZ, 0x7fff9a895920)    = -1 ENOTTY (Inappropriate ioctl for device)

write(1, "Archive:  dataset.zipn", 22Archive:  dataset.zip

) = 22

lseek(3, 943718400, SEEK_SET)           = 943718400

read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 4522) = 4522

lseek(3, 943722880, SEEK_SET)           = 943722880

read(3, "3f225P\uxv14350343503", 20) = 20

lseek(3, 943718400, SEEK_SET)           = 943718400

read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 8192) = 4522

lseek(3, 849346560, SEEK_SET)           = 849346560

read(3, "D262nv210343240C24227344367q300223231306330275266213276M7I'&352234J"..., 8192) = 8192

stat("rand-28.txt", 0x559f43e0a550)     = -1 ENOENT (No such file or directory)

lstat("rand-28.txt", 0x559f43e0a550)    = -1 ENOENT (No such file or directory)

stat("rand-28.txt", 0x559f43e0a550)     = -1 ENOENT (No such file or directory)

lstat("rand-28.txt", 0x559f43e0a550)    = -1 ENOENT (No such file or directory)

open("rand-28.txt", O_RDWR|O_CREAT|O_TRUNC, 0666) = 4

ioctl(1, TIOCGWINSZ, 0x7fff9a895790)    = -1 ENOTTY (Inappropriate ioctl for device)

write(1, " extracting: rand-28.txt        "..., 37 extracting: rand-28.txt             ) = 37

read(3, "2753279Y206223217}355W%:220YNT257260z^361T242237021336372+306310"..., 8192) = 8192

unzip opens dataset.zip, seeks to the end, then seeks to the start of the requested file in the archive (rand-28.txt, at offset 849346560) and reads from there.

The central directory is found by scanning the last 65557 bytes of the archive; see the code starting here:

/*---------------------------------------------------------------------------

    Find and process the end-of-central-directory header.  UnZip need only

    check last 65557 bytes of zipfile:  comment may be up to 65535, end-of-

    central-directory record is 18 bytes, and signature itself is 4 bytes;

    add some to allow for appended garbage.  Since ZipInfo is often used as

    a debugging tool, search the whole zipfile if zipinfo_mode is true.

  ---------------------------------------------------------------------------*/

edited Jan 29 at 21:50

answered Jan 29 at 18:05

Stephen Kitt

170k24385462

Could you add additional information if possible about how it actually finds the central directory record(which here seems to be 943718400)

– tangy
Jan 29 at 19:09

1

See my update. Let me know if you would like an interpretation of the code!

– Stephen Kitt
Jan 29 at 21:51

Thanks a ton - both for the addition of relevant sections with references and the offer to help understand! I'll try reading it on my own at first and request you if I do :)

– tangy
Jan 29 at 22:15

add a comment |

When searching for a single file in a large archive, it uses method 1, which you can see using strace:

open("dataset.zip", O_RDONLY)           = 3

ioctl(1, TIOCGWINSZ, 0x7fff9a895920)    = -1 ENOTTY (Inappropriate ioctl for device)

write(1, "Archive:  dataset.zipn", 22Archive:  dataset.zip

) = 22

lseek(3, 943718400, SEEK_SET)           = 943718400

read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 4522) = 4522

lseek(3, 943722880, SEEK_SET)           = 943722880

read(3, "3f225P\uxv14350343503", 20) = 20

lseek(3, 943718400, SEEK_SET)           = 943718400

read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 8192) = 4522

lseek(3, 849346560, SEEK_SET)           = 849346560

read(3, "D262nv210343240C24227344367q300223231306330275266213276M7I'&352234J"..., 8192) = 8192

stat("rand-28.txt", 0x559f43e0a550)     = -1 ENOENT (No such file or directory)

lstat("rand-28.txt", 0x559f43e0a550)    = -1 ENOENT (No such file or directory)

stat("rand-28.txt", 0x559f43e0a550)     = -1 ENOENT (No such file or directory)

lstat("rand-28.txt", 0x559f43e0a550)    = -1 ENOENT (No such file or directory)

open("rand-28.txt", O_RDWR|O_CREAT|O_TRUNC, 0666) = 4

ioctl(1, TIOCGWINSZ, 0x7fff9a895790)    = -1 ENOTTY (Inappropriate ioctl for device)

write(1, " extracting: rand-28.txt        "..., 37 extracting: rand-28.txt             ) = 37

read(3, "2753279Y206223217}355W%:220YNT257260z^361T242237021336372+306310"..., 8192) = 8192

unzip opens dataset.zip, seeks to the end, then seeks to the start of the requested file in the archive (rand-28.txt, at offset 849346560) and reads from there.

The central directory is found by scanning the last 65557 bytes of the archive; see the code starting here:

/*---------------------------------------------------------------------------

    Find and process the end-of-central-directory header.  UnZip need only

    check last 65557 bytes of zipfile:  comment may be up to 65535, end-of-

    central-directory record is 18 bytes, and signature itself is 4 bytes;

    add some to allow for appended garbage.  Since ZipInfo is often used as

    a debugging tool, search the whole zipfile if zipinfo_mode is true.

  ---------------------------------------------------------------------------*/

edited Jan 29 at 21:50

answered Jan 29 at 18:05

Stephen Kitt

170k24385462

When searching for a single file in a large archive, it uses method 1, which you can see using strace:

open("dataset.zip", O_RDONLY)           = 3

ioctl(1, TIOCGWINSZ, 0x7fff9a895920)    = -1 ENOTTY (Inappropriate ioctl for device)

write(1, "Archive:  dataset.zipn", 22Archive:  dataset.zip

) = 22

lseek(3, 943718400, SEEK_SET)           = 943718400

read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 4522) = 4522

lseek(3, 943722880, SEEK_SET)           = 943722880

read(3, "3f225P\uxv14350343503", 20) = 20

lseek(3, 943718400, SEEK_SET)           = 943718400

read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 8192) = 4522

lseek(3, 849346560, SEEK_SET)           = 849346560

read(3, "D262nv210343240C24227344367q300223231306330275266213276M7I'&352234J"..., 8192) = 8192

stat("rand-28.txt", 0x559f43e0a550)     = -1 ENOENT (No such file or directory)

lstat("rand-28.txt", 0x559f43e0a550)    = -1 ENOENT (No such file or directory)

stat("rand-28.txt", 0x559f43e0a550)     = -1 ENOENT (No such file or directory)

lstat("rand-28.txt", 0x559f43e0a550)    = -1 ENOENT (No such file or directory)

open("rand-28.txt", O_RDWR|O_CREAT|O_TRUNC, 0666) = 4

ioctl(1, TIOCGWINSZ, 0x7fff9a895790)    = -1 ENOTTY (Inappropriate ioctl for device)

write(1, " extracting: rand-28.txt        "..., 37 extracting: rand-28.txt             ) = 37

read(3, "2753279Y206223217}355W%:220YNT257260z^361T242237021336372+306310"..., 8192) = 8192

unzip opens dataset.zip, seeks to the end, then seeks to the start of the requested file in the archive (rand-28.txt, at offset 849346560) and reads from there.

The central directory is found by scanning the last 65557 bytes of the archive; see the code starting here:

/*---------------------------------------------------------------------------

    Find and process the end-of-central-directory header.  UnZip need only

    check last 65557 bytes of zipfile:  comment may be up to 65535, end-of-

    central-directory record is 18 bytes, and signature itself is 4 bytes;

    add some to allow for appended garbage.  Since ZipInfo is often used as

    a debugging tool, search the whole zipfile if zipinfo_mode is true.

  ---------------------------------------------------------------------------*/

edited Jan 29 at 21:50

answered Jan 29 at 18:05

Stephen Kitt

170k24385462

edited Jan 29 at 21:50

answered Jan 29 at 18:05

Stephen Kitt

170k24385462

answered Jan 29 at 18:05

Stephen Kitt

170k24385462

answered Jan 29 at 18:05

Stephen Kitt

170k24385462

Could you add additional information if possible about how it actually finds the central directory record(which here seems to be 943718400)

– tangy
Jan 29 at 19:09

1

See my update. Let me know if you would like an interpretation of the code!

– Stephen Kitt
Jan 29 at 21:51

Thanks a ton - both for the addition of relevant sections with references and the offer to help understand! I'll try reading it on my own at first and request you if I do :)

– tangy
Jan 29 at 22:15

add a comment |

Could you add additional information if possible about how it actually finds the central directory record(which here seems to be 943718400)

– tangy
Jan 29 at 19:09

1

See my update. Let me know if you would like an interpretation of the code!

– Stephen Kitt
Jan 29 at 21:51

Thanks a ton - both for the addition of relevant sections with references and the offer to help understand! I'll try reading it on my own at first and request you if I do :)

– tangy
Jan 29 at 22:15

Could you add additional information if possible about how it actually finds the central directory record(which here seems to be 943718400)

– tangy
Jan 29 at 19:09

See my update. Let me know if you would like an interpretation of the code!

– Stephen Kitt
Jan 29 at 21:51

Thanks a ton - both for the addition of relevant sections with references and the offer to help understand! I'll try reading it on my own at first and request you if I do :)

– tangy
Jan 29 at 22:15

add a comment |

Actually it's a mixture. unzip reads some data from a known location, and then reads data blocks related to (but not identical with) the target entry in the zip-file.

The design of zip/unzip is explained in comments in the source-files. Here's the pertinent one from extract.c:

/*--------------------------------------------------------------------------- 

    The basic idea of this function is as follows.  Since the central di- 

    rectory lies at the end of the zipfile and the member files lie at the 

    beginning or middle or wherever, it is not very desirable to simply 

    read a central directory entry, jump to the member and extract it, and 

    then jump back to the central directory.  In the case of a large zipfile 

    this would lead to a whole lot of disk-grinding, especially if each mem- 

    ber file is small.  Instead, we read from the central directory the per- 

    tinent information for a block of files, then go extract/test the whole 

    block.  Thus this routine contains two small(er) loops within a very 

    large outer loop:  the first of the small ones reads a block of files 

    from the central directory; the second extracts or tests each file; and 

    the outer one loops over blocks.  There's some file-pointer positioning 

    stuff in between, but that's about it.  Btw, it's because of this jump- 

    ing around that we can afford to be lenient if an error occurs in one of 

    the member files:  we should still be able to go find the other members, 

    since we know the offset of each from the beginning of the zipfile. 

  ---------------------------------------------------------------------------*/

lseek(3, -2252, SEEK_CUR)               = 1267

lseek(3, 120463360, SEEK_SET)           = 120463360

lseek(3, 120468731, SEEK_SET)           = 120468731

lseek(3, 120135680, SEEK_SET)           = 120135680

lseek(3, 270336, SEEK_SET)              = 270336

lseek(3, 120463360, SEEK_SET)           = 120463360

As usual, with benchmarks, ymmv.

edited Jan 30 at 21:19

answered Jan 29 at 21:43

Thomas Dickey

52.8k597170

While this makes sense if we are extracting the entire archive, for a single archive why would I need to do things this way? I think it might be using another routine in the case mentioned in the question eg. extract_or_test_member?

– tangy
Jan 29 at 22:09

typo - i meant single file within the archive. With a single file just 2 seeks would be needed right?

– tangy
Jan 29 at 22:15

More than that: it has to read the record which tells how large the central directory is, and it's not guaranteed that it'll read the whole directory while it's doing that.

– Thomas Dickey
Jan 29 at 22:21

add a comment |

Actually it's a mixture. unzip reads some data from a known location, and then reads data blocks related to (but not identical with) the target entry in the zip-file.

The design of zip/unzip is explained in comments in the source-files. Here's the pertinent one from extract.c:

/*--------------------------------------------------------------------------- 

    The basic idea of this function is as follows.  Since the central di- 

    rectory lies at the end of the zipfile and the member files lie at the 

    beginning or middle or wherever, it is not very desirable to simply 

    read a central directory entry, jump to the member and extract it, and 

    then jump back to the central directory.  In the case of a large zipfile 

    this would lead to a whole lot of disk-grinding, especially if each mem- 

    ber file is small.  Instead, we read from the central directory the per- 

    tinent information for a block of files, then go extract/test the whole 

    block.  Thus this routine contains two small(er) loops within a very 

    large outer loop:  the first of the small ones reads a block of files 

    from the central directory; the second extracts or tests each file; and 

    the outer one loops over blocks.  There's some file-pointer positioning 

    stuff in between, but that's about it.  Btw, it's because of this jump- 

    ing around that we can afford to be lenient if an error occurs in one of 

    the member files:  we should still be able to go find the other members, 

    since we know the offset of each from the beginning of the zipfile. 

  ---------------------------------------------------------------------------*/

lseek(3, -2252, SEEK_CUR)               = 1267

lseek(3, 120463360, SEEK_SET)           = 120463360

lseek(3, 120468731, SEEK_SET)           = 120468731

lseek(3, 120135680, SEEK_SET)           = 120135680

lseek(3, 270336, SEEK_SET)              = 270336

lseek(3, 120463360, SEEK_SET)           = 120463360

As usual, with benchmarks, ymmv.

edited Jan 30 at 21:19

answered Jan 29 at 21:43

Thomas Dickey

52.8k597170

While this makes sense if we are extracting the entire archive, for a single archive why would I need to do things this way? I think it might be using another routine in the case mentioned in the question eg. extract_or_test_member?

– tangy
Jan 29 at 22:09

typo - i meant single file within the archive. With a single file just 2 seeks would be needed right?

– tangy
Jan 29 at 22:15

More than that: it has to read the record which tells how large the central directory is, and it's not guaranteed that it'll read the whole directory while it's doing that.

– Thomas Dickey
Jan 29 at 22:21

add a comment |

Actually it's a mixture. unzip reads some data from a known location, and then reads data blocks related to (but not identical with) the target entry in the zip-file.

The design of zip/unzip is explained in comments in the source-files. Here's the pertinent one from extract.c:

/*--------------------------------------------------------------------------- 

    The basic idea of this function is as follows.  Since the central di- 

    rectory lies at the end of the zipfile and the member files lie at the 

    beginning or middle or wherever, it is not very desirable to simply 

    read a central directory entry, jump to the member and extract it, and 

    then jump back to the central directory.  In the case of a large zipfile 

    this would lead to a whole lot of disk-grinding, especially if each mem- 

    ber file is small.  Instead, we read from the central directory the per- 

    tinent information for a block of files, then go extract/test the whole 

    block.  Thus this routine contains two small(er) loops within a very 

    large outer loop:  the first of the small ones reads a block of files 

    from the central directory; the second extracts or tests each file; and 

    the outer one loops over blocks.  There's some file-pointer positioning 

    stuff in between, but that's about it.  Btw, it's because of this jump- 

    ing around that we can afford to be lenient if an error occurs in one of 

    the member files:  we should still be able to go find the other members, 

    since we know the offset of each from the beginning of the zipfile. 

  ---------------------------------------------------------------------------*/

lseek(3, -2252, SEEK_CUR)               = 1267

lseek(3, 120463360, SEEK_SET)           = 120463360

lseek(3, 120468731, SEEK_SET)           = 120468731

lseek(3, 120135680, SEEK_SET)           = 120135680

lseek(3, 270336, SEEK_SET)              = 270336

lseek(3, 120463360, SEEK_SET)           = 120463360

As usual, with benchmarks, ymmv.

edited Jan 30 at 21:19

answered Jan 29 at 21:43

Thomas Dickey

52.8k597170

Actually it's a mixture. unzip reads some data from a known location, and then reads data blocks related to (but not identical with) the target entry in the zip-file.

The design of zip/unzip is explained in comments in the source-files. Here's the pertinent one from extract.c:

/*--------------------------------------------------------------------------- 

    The basic idea of this function is as follows.  Since the central di- 

    rectory lies at the end of the zipfile and the member files lie at the 

    beginning or middle or wherever, it is not very desirable to simply 

    read a central directory entry, jump to the member and extract it, and 

    then jump back to the central directory.  In the case of a large zipfile 

    this would lead to a whole lot of disk-grinding, especially if each mem- 

    ber file is small.  Instead, we read from the central directory the per- 

    tinent information for a block of files, then go extract/test the whole 

    block.  Thus this routine contains two small(er) loops within a very 

    large outer loop:  the first of the small ones reads a block of files 

    from the central directory; the second extracts or tests each file; and 

    the outer one loops over blocks.  There's some file-pointer positioning 

    stuff in between, but that's about it.  Btw, it's because of this jump- 

    ing around that we can afford to be lenient if an error occurs in one of 

    the member files:  we should still be able to go find the other members, 

    since we know the offset of each from the beginning of the zipfile. 

  ---------------------------------------------------------------------------*/

lseek(3, -2252, SEEK_CUR)               = 1267

lseek(3, 120463360, SEEK_SET)           = 120463360

lseek(3, 120468731, SEEK_SET)           = 120468731

lseek(3, 120135680, SEEK_SET)           = 120135680

lseek(3, 270336, SEEK_SET)              = 270336

lseek(3, 120463360, SEEK_SET)           = 120463360

As usual, with benchmarks, ymmv.

edited Jan 30 at 21:19

answered Jan 29 at 21:43

Thomas Dickey

52.8k597170

edited Jan 30 at 21:19

answered Jan 29 at 21:43

Thomas Dickey

52.8k597170

answered Jan 29 at 21:43

Thomas Dickey

52.8k597170

answered Jan 29 at 21:43

Thomas Dickey

52.8k597170

While this makes sense if we are extracting the entire archive, for a single archive why would I need to do things this way? I think it might be using another routine in the case mentioned in the question eg. extract_or_test_member?

– tangy
Jan 29 at 22:09

typo - i meant single file within the archive. With a single file just 2 seeks would be needed right?

– tangy
Jan 29 at 22:15

More than that: it has to read the record which tells how large the central directory is, and it's not guaranteed that it'll read the whole directory while it's doing that.

– Thomas Dickey
Jan 29 at 22:21

add a comment |

While this makes sense if we are extracting the entire archive, for a single archive why would I need to do things this way? I think it might be using another routine in the case mentioned in the question eg. extract_or_test_member?

– tangy
Jan 29 at 22:09

typo - i meant single file within the archive. With a single file just 2 seeks would be needed right?

– tangy
Jan 29 at 22:15

More than that: it has to read the record which tells how large the central directory is, and it's not guaranteed that it'll read the whole directory while it's doing that.

– Thomas Dickey
Jan 29 at 22:21

While this makes sense if we are extracting the entire archive, for a single archive why would I need to do things this way? I think it might be using another routine in the case mentioned in the question eg. extract_or_test_member?

– tangy
Jan 29 at 22:09

typo - i meant single file within the archive. With a single file just 2 seeks would be needed right?

– tangy
Jan 29 at 22:15

More than that: it has to read the record which tells how large the central directory is, and it's not guaranteed that it'll read the whole directory while it's doing that.

– Thomas Dickey
Jan 29 at 22:21

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Unix & Linux Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Cfrgtkky