What method does unzip use to find a single file in an archive?
Let's say I create 100 files with random text data of size 30MB each. Now I create a zip archive with 0 compression i.e. zip dataset.zip -r -0 *.txt
. Now I want to extract just one file from this archive.
As described here, there are two ways of unzipping/extracting files from archives:
- Seek to the end of the file and lookup the central directory. Then use that for fast random access to the file to be extracted.(Amortized
O(1)
complexity) - Look through each local header and extract the one where theres a match.(
O(n)
complexity)
Which method does unzip use? From my experiments it seems like it uses method 2?
zip archive
add a comment |
Let's say I create 100 files with random text data of size 30MB each. Now I create a zip archive with 0 compression i.e. zip dataset.zip -r -0 *.txt
. Now I want to extract just one file from this archive.
As described here, there are two ways of unzipping/extracting files from archives:
- Seek to the end of the file and lookup the central directory. Then use that for fast random access to the file to be extracted.(Amortized
O(1)
complexity) - Look through each local header and extract the one where theres a match.(
O(n)
complexity)
Which method does unzip use? From my experiments it seems like it uses method 2?
zip archive
1
Reading the central directory is still a linear search just like going through each local header and so both are O(n). Reading the central directory will typically mean reading a lot fewer disk blocks but the number of blocks read is also still O(n).
– Ross Ridge
Jan 30 at 2:47
add a comment |
Let's say I create 100 files with random text data of size 30MB each. Now I create a zip archive with 0 compression i.e. zip dataset.zip -r -0 *.txt
. Now I want to extract just one file from this archive.
As described here, there are two ways of unzipping/extracting files from archives:
- Seek to the end of the file and lookup the central directory. Then use that for fast random access to the file to be extracted.(Amortized
O(1)
complexity) - Look through each local header and extract the one where theres a match.(
O(n)
complexity)
Which method does unzip use? From my experiments it seems like it uses method 2?
zip archive
Let's say I create 100 files with random text data of size 30MB each. Now I create a zip archive with 0 compression i.e. zip dataset.zip -r -0 *.txt
. Now I want to extract just one file from this archive.
As described here, there are two ways of unzipping/extracting files from archives:
- Seek to the end of the file and lookup the central directory. Then use that for fast random access to the file to be extracted.(Amortized
O(1)
complexity) - Look through each local header and extract the one where theres a match.(
O(n)
complexity)
Which method does unzip use? From my experiments it seems like it uses method 2?
zip archive
zip archive
edited Jan 30 at 1:56
Olorin
3,3331417
3,3331417
asked Jan 29 at 17:59
tangytangy
1315
1315
1
Reading the central directory is still a linear search just like going through each local header and so both are O(n). Reading the central directory will typically mean reading a lot fewer disk blocks but the number of blocks read is also still O(n).
– Ross Ridge
Jan 30 at 2:47
add a comment |
1
Reading the central directory is still a linear search just like going through each local header and so both are O(n). Reading the central directory will typically mean reading a lot fewer disk blocks but the number of blocks read is also still O(n).
– Ross Ridge
Jan 30 at 2:47
1
1
Reading the central directory is still a linear search just like going through each local header and so both are O(n). Reading the central directory will typically mean reading a lot fewer disk blocks but the number of blocks read is also still O(n).
– Ross Ridge
Jan 30 at 2:47
Reading the central directory is still a linear search just like going through each local header and so both are O(n). Reading the central directory will typically mean reading a lot fewer disk blocks but the number of blocks read is also still O(n).
– Ross Ridge
Jan 30 at 2:47
add a comment |
2 Answers
2
active
oldest
votes
When searching for a single file in a large archive, it uses method 1, which you can see using strace
:
open("dataset.zip", O_RDONLY) = 3
ioctl(1, TIOCGWINSZ, 0x7fff9a895920) = -1 ENOTTY (Inappropriate ioctl for device)
write(1, "Archive: dataset.zipn", 22Archive: dataset.zip
) = 22
lseek(3, 943718400, SEEK_SET) = 943718400
read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 4522) = 4522
lseek(3, 943722880, SEEK_SET) = 943722880
read(3, "3f225P\uxv14350343503", 20) = 20
lseek(3, 943718400, SEEK_SET) = 943718400
read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 8192) = 4522
lseek(3, 849346560, SEEK_SET) = 849346560
read(3, "D262nv210343240C24227344367q300223231306330275266213276M7I'&352234J"..., 8192) = 8192
stat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
lstat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
stat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
lstat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
open("rand-28.txt", O_RDWR|O_CREAT|O_TRUNC, 0666) = 4
ioctl(1, TIOCGWINSZ, 0x7fff9a895790) = -1 ENOTTY (Inappropriate ioctl for device)
write(1, " extracting: rand-28.txt "..., 37 extracting: rand-28.txt ) = 37
read(3, "2753279Y206223217}355W%:220YNT257260z^361T242237021336372+306310"..., 8192) = 8192
unzip
opens dataset.zip
, seeks to the end, then seeks to the start of the requested file in the archive (rand-28.txt
, at offset 849346560) and reads from there.
The central directory is found by scanning the last 65557 bytes of the archive; see the code starting here:
/*---------------------------------------------------------------------------
Find and process the end-of-central-directory header. UnZip need only
check last 65557 bytes of zipfile: comment may be up to 65535, end-of-
central-directory record is 18 bytes, and signature itself is 4 bytes;
add some to allow for appended garbage. Since ZipInfo is often used as
a debugging tool, search the whole zipfile if zipinfo_mode is true.
---------------------------------------------------------------------------*/
Could you add additional information if possible about how it actually finds the central directory record(which here seems to be 943718400)
– tangy
Jan 29 at 19:09
1
See my update. Let me know if you would like an interpretation of the code!
– Stephen Kitt
Jan 29 at 21:51
Thanks a ton - both for the addition of relevant sections with references and the offer to help understand! I'll try reading it on my own at first and request you if I do :)
– tangy
Jan 29 at 22:15
add a comment |
Actually it's a mixture. unzip reads some data from a known location, and then reads data blocks related to (but not identical with) the target entry in the zip-file.
The design of zip/unzip is explained in comments in the source-files. Here's the pertinent one from extract.c
:
/*---------------------------------------------------------------------------
The basic idea of this function is as follows. Since the central di-
rectory lies at the end of the zipfile and the member files lie at the
beginning or middle or wherever, it is not very desirable to simply
read a central directory entry, jump to the member and extract it, and
then jump back to the central directory. In the case of a large zipfile
this would lead to a whole lot of disk-grinding, especially if each mem-
ber file is small. Instead, we read from the central directory the per-
tinent information for a block of files, then go extract/test the whole
block. Thus this routine contains two small(er) loops within a very
large outer loop: the first of the small ones reads a block of files
from the central directory; the second extracts or tests each file; and
the outer one loops over blocks. There's some file-pointer positioning
stuff in between, but that's about it. Btw, it's because of this jump-
ing around that we can afford to be lenient if an error occurs in one of
the member files: we should still be able to go find the other members,
since we know the offset of each from the beginning of the zipfile.
---------------------------------------------------------------------------*/
The format itself is mostly derived from PK-Ware's implementation, and is summarized in programming information text-files. According to that, there's more than one type of record in the central directory as well, so unzip cannot readily go to the end of the file and make an array of entries to lookup the target file.
Now... if you take the time to read the source code, you'll discover that unzip
reads buffers of 8192 bytes (look for INBUFSIZ
). I'd only use the single-file extract for a fairly large zip file (I had in mind the Java sources), but even for a smaller zip-file, you can see the effect of the buffer size. To see this, I zipped up the Git files for PuTTY, which gave 2727 files (counting a copy of the git log). Java was bigger than than 20 years ago, and hasn't shrunk. Extracting that log from the zip-file (chosen since it wouldn't be at the end of an alphabetically-sorted index and likely not in the first block read from the central directory) gave this from strace
for the lseek
calls:
lseek(3, -2252, SEEK_CUR) = 1267
lseek(3, 120463360, SEEK_SET) = 120463360
lseek(3, 120468731, SEEK_SET) = 120468731
lseek(3, 120135680, SEEK_SET) = 120135680
lseek(3, 270336, SEEK_SET) = 270336
lseek(3, 120463360, SEEK_SET) = 120463360
As usual, with benchmarks, ymmv.
While this makes sense if we are extracting the entire archive, for a single archive why would I need to do things this way? I think it might be using another routine in the case mentioned in the question eg.extract_or_test_member
?
– tangy
Jan 29 at 22:09
typo - i meant single file within the archive. With a single file just 2 seeks would be needed right?
– tangy
Jan 29 at 22:15
More than that: it has to read the record which tells how large the central directory is, and it's not guaranteed that it'll read the whole directory while it's doing that.
– Thomas Dickey
Jan 29 at 22:21
add a comment |
Your Answer
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "106"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f497509%2fwhat-method-does-unzip-use-to-find-a-single-file-in-an-archive%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
When searching for a single file in a large archive, it uses method 1, which you can see using strace
:
open("dataset.zip", O_RDONLY) = 3
ioctl(1, TIOCGWINSZ, 0x7fff9a895920) = -1 ENOTTY (Inappropriate ioctl for device)
write(1, "Archive: dataset.zipn", 22Archive: dataset.zip
) = 22
lseek(3, 943718400, SEEK_SET) = 943718400
read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 4522) = 4522
lseek(3, 943722880, SEEK_SET) = 943722880
read(3, "3f225P\uxv14350343503", 20) = 20
lseek(3, 943718400, SEEK_SET) = 943718400
read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 8192) = 4522
lseek(3, 849346560, SEEK_SET) = 849346560
read(3, "D262nv210343240C24227344367q300223231306330275266213276M7I'&352234J"..., 8192) = 8192
stat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
lstat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
stat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
lstat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
open("rand-28.txt", O_RDWR|O_CREAT|O_TRUNC, 0666) = 4
ioctl(1, TIOCGWINSZ, 0x7fff9a895790) = -1 ENOTTY (Inappropriate ioctl for device)
write(1, " extracting: rand-28.txt "..., 37 extracting: rand-28.txt ) = 37
read(3, "2753279Y206223217}355W%:220YNT257260z^361T242237021336372+306310"..., 8192) = 8192
unzip
opens dataset.zip
, seeks to the end, then seeks to the start of the requested file in the archive (rand-28.txt
, at offset 849346560) and reads from there.
The central directory is found by scanning the last 65557 bytes of the archive; see the code starting here:
/*---------------------------------------------------------------------------
Find and process the end-of-central-directory header. UnZip need only
check last 65557 bytes of zipfile: comment may be up to 65535, end-of-
central-directory record is 18 bytes, and signature itself is 4 bytes;
add some to allow for appended garbage. Since ZipInfo is often used as
a debugging tool, search the whole zipfile if zipinfo_mode is true.
---------------------------------------------------------------------------*/
Could you add additional information if possible about how it actually finds the central directory record(which here seems to be 943718400)
– tangy
Jan 29 at 19:09
1
See my update. Let me know if you would like an interpretation of the code!
– Stephen Kitt
Jan 29 at 21:51
Thanks a ton - both for the addition of relevant sections with references and the offer to help understand! I'll try reading it on my own at first and request you if I do :)
– tangy
Jan 29 at 22:15
add a comment |
When searching for a single file in a large archive, it uses method 1, which you can see using strace
:
open("dataset.zip", O_RDONLY) = 3
ioctl(1, TIOCGWINSZ, 0x7fff9a895920) = -1 ENOTTY (Inappropriate ioctl for device)
write(1, "Archive: dataset.zipn", 22Archive: dataset.zip
) = 22
lseek(3, 943718400, SEEK_SET) = 943718400
read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 4522) = 4522
lseek(3, 943722880, SEEK_SET) = 943722880
read(3, "3f225P\uxv14350343503", 20) = 20
lseek(3, 943718400, SEEK_SET) = 943718400
read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 8192) = 4522
lseek(3, 849346560, SEEK_SET) = 849346560
read(3, "D262nv210343240C24227344367q300223231306330275266213276M7I'&352234J"..., 8192) = 8192
stat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
lstat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
stat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
lstat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
open("rand-28.txt", O_RDWR|O_CREAT|O_TRUNC, 0666) = 4
ioctl(1, TIOCGWINSZ, 0x7fff9a895790) = -1 ENOTTY (Inappropriate ioctl for device)
write(1, " extracting: rand-28.txt "..., 37 extracting: rand-28.txt ) = 37
read(3, "2753279Y206223217}355W%:220YNT257260z^361T242237021336372+306310"..., 8192) = 8192
unzip
opens dataset.zip
, seeks to the end, then seeks to the start of the requested file in the archive (rand-28.txt
, at offset 849346560) and reads from there.
The central directory is found by scanning the last 65557 bytes of the archive; see the code starting here:
/*---------------------------------------------------------------------------
Find and process the end-of-central-directory header. UnZip need only
check last 65557 bytes of zipfile: comment may be up to 65535, end-of-
central-directory record is 18 bytes, and signature itself is 4 bytes;
add some to allow for appended garbage. Since ZipInfo is often used as
a debugging tool, search the whole zipfile if zipinfo_mode is true.
---------------------------------------------------------------------------*/
Could you add additional information if possible about how it actually finds the central directory record(which here seems to be 943718400)
– tangy
Jan 29 at 19:09
1
See my update. Let me know if you would like an interpretation of the code!
– Stephen Kitt
Jan 29 at 21:51
Thanks a ton - both for the addition of relevant sections with references and the offer to help understand! I'll try reading it on my own at first and request you if I do :)
– tangy
Jan 29 at 22:15
add a comment |
When searching for a single file in a large archive, it uses method 1, which you can see using strace
:
open("dataset.zip", O_RDONLY) = 3
ioctl(1, TIOCGWINSZ, 0x7fff9a895920) = -1 ENOTTY (Inappropriate ioctl for device)
write(1, "Archive: dataset.zipn", 22Archive: dataset.zip
) = 22
lseek(3, 943718400, SEEK_SET) = 943718400
read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 4522) = 4522
lseek(3, 943722880, SEEK_SET) = 943722880
read(3, "3f225P\uxv14350343503", 20) = 20
lseek(3, 943718400, SEEK_SET) = 943718400
read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 8192) = 4522
lseek(3, 849346560, SEEK_SET) = 849346560
read(3, "D262nv210343240C24227344367q300223231306330275266213276M7I'&352234J"..., 8192) = 8192
stat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
lstat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
stat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
lstat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
open("rand-28.txt", O_RDWR|O_CREAT|O_TRUNC, 0666) = 4
ioctl(1, TIOCGWINSZ, 0x7fff9a895790) = -1 ENOTTY (Inappropriate ioctl for device)
write(1, " extracting: rand-28.txt "..., 37 extracting: rand-28.txt ) = 37
read(3, "2753279Y206223217}355W%:220YNT257260z^361T242237021336372+306310"..., 8192) = 8192
unzip
opens dataset.zip
, seeks to the end, then seeks to the start of the requested file in the archive (rand-28.txt
, at offset 849346560) and reads from there.
The central directory is found by scanning the last 65557 bytes of the archive; see the code starting here:
/*---------------------------------------------------------------------------
Find and process the end-of-central-directory header. UnZip need only
check last 65557 bytes of zipfile: comment may be up to 65535, end-of-
central-directory record is 18 bytes, and signature itself is 4 bytes;
add some to allow for appended garbage. Since ZipInfo is often used as
a debugging tool, search the whole zipfile if zipinfo_mode is true.
---------------------------------------------------------------------------*/
When searching for a single file in a large archive, it uses method 1, which you can see using strace
:
open("dataset.zip", O_RDONLY) = 3
ioctl(1, TIOCGWINSZ, 0x7fff9a895920) = -1 ENOTTY (Inappropriate ioctl for device)
write(1, "Archive: dataset.zipn", 22Archive: dataset.zip
) = 22
lseek(3, 943718400, SEEK_SET) = 943718400
read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 4522) = 4522
lseek(3, 943722880, SEEK_SET) = 943722880
read(3, "3f225P\uxv14350343503", 20) = 20
lseek(3, 943718400, SEEK_SET) = 943718400
read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 8192) = 4522
lseek(3, 849346560, SEEK_SET) = 849346560
read(3, "D262nv210343240C24227344367q300223231306330275266213276M7I'&352234J"..., 8192) = 8192
stat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
lstat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
stat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
lstat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
open("rand-28.txt", O_RDWR|O_CREAT|O_TRUNC, 0666) = 4
ioctl(1, TIOCGWINSZ, 0x7fff9a895790) = -1 ENOTTY (Inappropriate ioctl for device)
write(1, " extracting: rand-28.txt "..., 37 extracting: rand-28.txt ) = 37
read(3, "2753279Y206223217}355W%:220YNT257260z^361T242237021336372+306310"..., 8192) = 8192
unzip
opens dataset.zip
, seeks to the end, then seeks to the start of the requested file in the archive (rand-28.txt
, at offset 849346560) and reads from there.
The central directory is found by scanning the last 65557 bytes of the archive; see the code starting here:
/*---------------------------------------------------------------------------
Find and process the end-of-central-directory header. UnZip need only
check last 65557 bytes of zipfile: comment may be up to 65535, end-of-
central-directory record is 18 bytes, and signature itself is 4 bytes;
add some to allow for appended garbage. Since ZipInfo is often used as
a debugging tool, search the whole zipfile if zipinfo_mode is true.
---------------------------------------------------------------------------*/
edited Jan 29 at 21:50
answered Jan 29 at 18:05
Stephen KittStephen Kitt
170k24385462
170k24385462
Could you add additional information if possible about how it actually finds the central directory record(which here seems to be 943718400)
– tangy
Jan 29 at 19:09
1
See my update. Let me know if you would like an interpretation of the code!
– Stephen Kitt
Jan 29 at 21:51
Thanks a ton - both for the addition of relevant sections with references and the offer to help understand! I'll try reading it on my own at first and request you if I do :)
– tangy
Jan 29 at 22:15
add a comment |
Could you add additional information if possible about how it actually finds the central directory record(which here seems to be 943718400)
– tangy
Jan 29 at 19:09
1
See my update. Let me know if you would like an interpretation of the code!
– Stephen Kitt
Jan 29 at 21:51
Thanks a ton - both for the addition of relevant sections with references and the offer to help understand! I'll try reading it on my own at first and request you if I do :)
– tangy
Jan 29 at 22:15
Could you add additional information if possible about how it actually finds the central directory record(which here seems to be 943718400)
– tangy
Jan 29 at 19:09
Could you add additional information if possible about how it actually finds the central directory record(which here seems to be 943718400)
– tangy
Jan 29 at 19:09
1
1
See my update. Let me know if you would like an interpretation of the code!
– Stephen Kitt
Jan 29 at 21:51
See my update. Let me know if you would like an interpretation of the code!
– Stephen Kitt
Jan 29 at 21:51
Thanks a ton - both for the addition of relevant sections with references and the offer to help understand! I'll try reading it on my own at first and request you if I do :)
– tangy
Jan 29 at 22:15
Thanks a ton - both for the addition of relevant sections with references and the offer to help understand! I'll try reading it on my own at first and request you if I do :)
– tangy
Jan 29 at 22:15
add a comment |
Actually it's a mixture. unzip reads some data from a known location, and then reads data blocks related to (but not identical with) the target entry in the zip-file.
The design of zip/unzip is explained in comments in the source-files. Here's the pertinent one from extract.c
:
/*---------------------------------------------------------------------------
The basic idea of this function is as follows. Since the central di-
rectory lies at the end of the zipfile and the member files lie at the
beginning or middle or wherever, it is not very desirable to simply
read a central directory entry, jump to the member and extract it, and
then jump back to the central directory. In the case of a large zipfile
this would lead to a whole lot of disk-grinding, especially if each mem-
ber file is small. Instead, we read from the central directory the per-
tinent information for a block of files, then go extract/test the whole
block. Thus this routine contains two small(er) loops within a very
large outer loop: the first of the small ones reads a block of files
from the central directory; the second extracts or tests each file; and
the outer one loops over blocks. There's some file-pointer positioning
stuff in between, but that's about it. Btw, it's because of this jump-
ing around that we can afford to be lenient if an error occurs in one of
the member files: we should still be able to go find the other members,
since we know the offset of each from the beginning of the zipfile.
---------------------------------------------------------------------------*/
The format itself is mostly derived from PK-Ware's implementation, and is summarized in programming information text-files. According to that, there's more than one type of record in the central directory as well, so unzip cannot readily go to the end of the file and make an array of entries to lookup the target file.
Now... if you take the time to read the source code, you'll discover that unzip
reads buffers of 8192 bytes (look for INBUFSIZ
). I'd only use the single-file extract for a fairly large zip file (I had in mind the Java sources), but even for a smaller zip-file, you can see the effect of the buffer size. To see this, I zipped up the Git files for PuTTY, which gave 2727 files (counting a copy of the git log). Java was bigger than than 20 years ago, and hasn't shrunk. Extracting that log from the zip-file (chosen since it wouldn't be at the end of an alphabetically-sorted index and likely not in the first block read from the central directory) gave this from strace
for the lseek
calls:
lseek(3, -2252, SEEK_CUR) = 1267
lseek(3, 120463360, SEEK_SET) = 120463360
lseek(3, 120468731, SEEK_SET) = 120468731
lseek(3, 120135680, SEEK_SET) = 120135680
lseek(3, 270336, SEEK_SET) = 270336
lseek(3, 120463360, SEEK_SET) = 120463360
As usual, with benchmarks, ymmv.
While this makes sense if we are extracting the entire archive, for a single archive why would I need to do things this way? I think it might be using another routine in the case mentioned in the question eg.extract_or_test_member
?
– tangy
Jan 29 at 22:09
typo - i meant single file within the archive. With a single file just 2 seeks would be needed right?
– tangy
Jan 29 at 22:15
More than that: it has to read the record which tells how large the central directory is, and it's not guaranteed that it'll read the whole directory while it's doing that.
– Thomas Dickey
Jan 29 at 22:21
add a comment |
Actually it's a mixture. unzip reads some data from a known location, and then reads data blocks related to (but not identical with) the target entry in the zip-file.
The design of zip/unzip is explained in comments in the source-files. Here's the pertinent one from extract.c
:
/*---------------------------------------------------------------------------
The basic idea of this function is as follows. Since the central di-
rectory lies at the end of the zipfile and the member files lie at the
beginning or middle or wherever, it is not very desirable to simply
read a central directory entry, jump to the member and extract it, and
then jump back to the central directory. In the case of a large zipfile
this would lead to a whole lot of disk-grinding, especially if each mem-
ber file is small. Instead, we read from the central directory the per-
tinent information for a block of files, then go extract/test the whole
block. Thus this routine contains two small(er) loops within a very
large outer loop: the first of the small ones reads a block of files
from the central directory; the second extracts or tests each file; and
the outer one loops over blocks. There's some file-pointer positioning
stuff in between, but that's about it. Btw, it's because of this jump-
ing around that we can afford to be lenient if an error occurs in one of
the member files: we should still be able to go find the other members,
since we know the offset of each from the beginning of the zipfile.
---------------------------------------------------------------------------*/
The format itself is mostly derived from PK-Ware's implementation, and is summarized in programming information text-files. According to that, there's more than one type of record in the central directory as well, so unzip cannot readily go to the end of the file and make an array of entries to lookup the target file.
Now... if you take the time to read the source code, you'll discover that unzip
reads buffers of 8192 bytes (look for INBUFSIZ
). I'd only use the single-file extract for a fairly large zip file (I had in mind the Java sources), but even for a smaller zip-file, you can see the effect of the buffer size. To see this, I zipped up the Git files for PuTTY, which gave 2727 files (counting a copy of the git log). Java was bigger than than 20 years ago, and hasn't shrunk. Extracting that log from the zip-file (chosen since it wouldn't be at the end of an alphabetically-sorted index and likely not in the first block read from the central directory) gave this from strace
for the lseek
calls:
lseek(3, -2252, SEEK_CUR) = 1267
lseek(3, 120463360, SEEK_SET) = 120463360
lseek(3, 120468731, SEEK_SET) = 120468731
lseek(3, 120135680, SEEK_SET) = 120135680
lseek(3, 270336, SEEK_SET) = 270336
lseek(3, 120463360, SEEK_SET) = 120463360
As usual, with benchmarks, ymmv.
While this makes sense if we are extracting the entire archive, for a single archive why would I need to do things this way? I think it might be using another routine in the case mentioned in the question eg.extract_or_test_member
?
– tangy
Jan 29 at 22:09
typo - i meant single file within the archive. With a single file just 2 seeks would be needed right?
– tangy
Jan 29 at 22:15
More than that: it has to read the record which tells how large the central directory is, and it's not guaranteed that it'll read the whole directory while it's doing that.
– Thomas Dickey
Jan 29 at 22:21
add a comment |
Actually it's a mixture. unzip reads some data from a known location, and then reads data blocks related to (but not identical with) the target entry in the zip-file.
The design of zip/unzip is explained in comments in the source-files. Here's the pertinent one from extract.c
:
/*---------------------------------------------------------------------------
The basic idea of this function is as follows. Since the central di-
rectory lies at the end of the zipfile and the member files lie at the
beginning or middle or wherever, it is not very desirable to simply
read a central directory entry, jump to the member and extract it, and
then jump back to the central directory. In the case of a large zipfile
this would lead to a whole lot of disk-grinding, especially if each mem-
ber file is small. Instead, we read from the central directory the per-
tinent information for a block of files, then go extract/test the whole
block. Thus this routine contains two small(er) loops within a very
large outer loop: the first of the small ones reads a block of files
from the central directory; the second extracts or tests each file; and
the outer one loops over blocks. There's some file-pointer positioning
stuff in between, but that's about it. Btw, it's because of this jump-
ing around that we can afford to be lenient if an error occurs in one of
the member files: we should still be able to go find the other members,
since we know the offset of each from the beginning of the zipfile.
---------------------------------------------------------------------------*/
The format itself is mostly derived from PK-Ware's implementation, and is summarized in programming information text-files. According to that, there's more than one type of record in the central directory as well, so unzip cannot readily go to the end of the file and make an array of entries to lookup the target file.
Now... if you take the time to read the source code, you'll discover that unzip
reads buffers of 8192 bytes (look for INBUFSIZ
). I'd only use the single-file extract for a fairly large zip file (I had in mind the Java sources), but even for a smaller zip-file, you can see the effect of the buffer size. To see this, I zipped up the Git files for PuTTY, which gave 2727 files (counting a copy of the git log). Java was bigger than than 20 years ago, and hasn't shrunk. Extracting that log from the zip-file (chosen since it wouldn't be at the end of an alphabetically-sorted index and likely not in the first block read from the central directory) gave this from strace
for the lseek
calls:
lseek(3, -2252, SEEK_CUR) = 1267
lseek(3, 120463360, SEEK_SET) = 120463360
lseek(3, 120468731, SEEK_SET) = 120468731
lseek(3, 120135680, SEEK_SET) = 120135680
lseek(3, 270336, SEEK_SET) = 270336
lseek(3, 120463360, SEEK_SET) = 120463360
As usual, with benchmarks, ymmv.
Actually it's a mixture. unzip reads some data from a known location, and then reads data blocks related to (but not identical with) the target entry in the zip-file.
The design of zip/unzip is explained in comments in the source-files. Here's the pertinent one from extract.c
:
/*---------------------------------------------------------------------------
The basic idea of this function is as follows. Since the central di-
rectory lies at the end of the zipfile and the member files lie at the
beginning or middle or wherever, it is not very desirable to simply
read a central directory entry, jump to the member and extract it, and
then jump back to the central directory. In the case of a large zipfile
this would lead to a whole lot of disk-grinding, especially if each mem-
ber file is small. Instead, we read from the central directory the per-
tinent information for a block of files, then go extract/test the whole
block. Thus this routine contains two small(er) loops within a very
large outer loop: the first of the small ones reads a block of files
from the central directory; the second extracts or tests each file; and
the outer one loops over blocks. There's some file-pointer positioning
stuff in between, but that's about it. Btw, it's because of this jump-
ing around that we can afford to be lenient if an error occurs in one of
the member files: we should still be able to go find the other members,
since we know the offset of each from the beginning of the zipfile.
---------------------------------------------------------------------------*/
The format itself is mostly derived from PK-Ware's implementation, and is summarized in programming information text-files. According to that, there's more than one type of record in the central directory as well, so unzip cannot readily go to the end of the file and make an array of entries to lookup the target file.
Now... if you take the time to read the source code, you'll discover that unzip
reads buffers of 8192 bytes (look for INBUFSIZ
). I'd only use the single-file extract for a fairly large zip file (I had in mind the Java sources), but even for a smaller zip-file, you can see the effect of the buffer size. To see this, I zipped up the Git files for PuTTY, which gave 2727 files (counting a copy of the git log). Java was bigger than than 20 years ago, and hasn't shrunk. Extracting that log from the zip-file (chosen since it wouldn't be at the end of an alphabetically-sorted index and likely not in the first block read from the central directory) gave this from strace
for the lseek
calls:
lseek(3, -2252, SEEK_CUR) = 1267
lseek(3, 120463360, SEEK_SET) = 120463360
lseek(3, 120468731, SEEK_SET) = 120468731
lseek(3, 120135680, SEEK_SET) = 120135680
lseek(3, 270336, SEEK_SET) = 270336
lseek(3, 120463360, SEEK_SET) = 120463360
As usual, with benchmarks, ymmv.
edited Jan 30 at 21:19
answered Jan 29 at 21:43
Thomas DickeyThomas Dickey
52.8k597170
52.8k597170
While this makes sense if we are extracting the entire archive, for a single archive why would I need to do things this way? I think it might be using another routine in the case mentioned in the question eg.extract_or_test_member
?
– tangy
Jan 29 at 22:09
typo - i meant single file within the archive. With a single file just 2 seeks would be needed right?
– tangy
Jan 29 at 22:15
More than that: it has to read the record which tells how large the central directory is, and it's not guaranteed that it'll read the whole directory while it's doing that.
– Thomas Dickey
Jan 29 at 22:21
add a comment |
While this makes sense if we are extracting the entire archive, for a single archive why would I need to do things this way? I think it might be using another routine in the case mentioned in the question eg.extract_or_test_member
?
– tangy
Jan 29 at 22:09
typo - i meant single file within the archive. With a single file just 2 seeks would be needed right?
– tangy
Jan 29 at 22:15
More than that: it has to read the record which tells how large the central directory is, and it's not guaranteed that it'll read the whole directory while it's doing that.
– Thomas Dickey
Jan 29 at 22:21
While this makes sense if we are extracting the entire archive, for a single archive why would I need to do things this way? I think it might be using another routine in the case mentioned in the question eg.
extract_or_test_member
?– tangy
Jan 29 at 22:09
While this makes sense if we are extracting the entire archive, for a single archive why would I need to do things this way? I think it might be using another routine in the case mentioned in the question eg.
extract_or_test_member
?– tangy
Jan 29 at 22:09
typo - i meant single file within the archive. With a single file just 2 seeks would be needed right?
– tangy
Jan 29 at 22:15
typo - i meant single file within the archive. With a single file just 2 seeks would be needed right?
– tangy
Jan 29 at 22:15
More than that: it has to read the record which tells how large the central directory is, and it's not guaranteed that it'll read the whole directory while it's doing that.
– Thomas Dickey
Jan 29 at 22:21
More than that: it has to read the record which tells how large the central directory is, and it's not guaranteed that it'll read the whole directory while it's doing that.
– Thomas Dickey
Jan 29 at 22:21
add a comment |
Thanks for contributing an answer to Unix & Linux Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f497509%2fwhat-method-does-unzip-use-to-find-a-single-file-in-an-archive%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
1
Reading the central directory is still a linear search just like going through each local header and so both are O(n). Reading the central directory will typically mean reading a lot fewer disk blocks but the number of blocks read is also still O(n).
– Ross Ridge
Jan 30 at 2:47