What method does unzip use to find a single file in an archive?












5















Let's say I create 100 files with random text data of size 30MB each. Now I create a zip archive with 0 compression i.e. zip dataset.zip -r -0 *.txt. Now I want to extract just one file from this archive.



As described here, there are two ways of unzipping/extracting files from archives:




  1. Seek to the end of the file and lookup the central directory. Then use that for fast random access to the file to be extracted.(Amortized O(1) complexity)

  2. Look through each local header and extract the one where theres a match.(O(n) complexity)


Which method does unzip use? From my experiments it seems like it uses method 2?










share|improve this question




















  • 1





    Reading the central directory is still a linear search just like going through each local header and so both are O(n). Reading the central directory will typically mean reading a lot fewer disk blocks but the number of blocks read is also still O(n).

    – Ross Ridge
    Jan 30 at 2:47
















5















Let's say I create 100 files with random text data of size 30MB each. Now I create a zip archive with 0 compression i.e. zip dataset.zip -r -0 *.txt. Now I want to extract just one file from this archive.



As described here, there are two ways of unzipping/extracting files from archives:




  1. Seek to the end of the file and lookup the central directory. Then use that for fast random access to the file to be extracted.(Amortized O(1) complexity)

  2. Look through each local header and extract the one where theres a match.(O(n) complexity)


Which method does unzip use? From my experiments it seems like it uses method 2?










share|improve this question




















  • 1





    Reading the central directory is still a linear search just like going through each local header and so both are O(n). Reading the central directory will typically mean reading a lot fewer disk blocks but the number of blocks read is also still O(n).

    – Ross Ridge
    Jan 30 at 2:47














5












5








5


1






Let's say I create 100 files with random text data of size 30MB each. Now I create a zip archive with 0 compression i.e. zip dataset.zip -r -0 *.txt. Now I want to extract just one file from this archive.



As described here, there are two ways of unzipping/extracting files from archives:




  1. Seek to the end of the file and lookup the central directory. Then use that for fast random access to the file to be extracted.(Amortized O(1) complexity)

  2. Look through each local header and extract the one where theres a match.(O(n) complexity)


Which method does unzip use? From my experiments it seems like it uses method 2?










share|improve this question
















Let's say I create 100 files with random text data of size 30MB each. Now I create a zip archive with 0 compression i.e. zip dataset.zip -r -0 *.txt. Now I want to extract just one file from this archive.



As described here, there are two ways of unzipping/extracting files from archives:




  1. Seek to the end of the file and lookup the central directory. Then use that for fast random access to the file to be extracted.(Amortized O(1) complexity)

  2. Look through each local header and extract the one where theres a match.(O(n) complexity)


Which method does unzip use? From my experiments it seems like it uses method 2?







zip archive






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Jan 30 at 1:56









Olorin

3,3331417




3,3331417










asked Jan 29 at 17:59









tangytangy

1315




1315








  • 1





    Reading the central directory is still a linear search just like going through each local header and so both are O(n). Reading the central directory will typically mean reading a lot fewer disk blocks but the number of blocks read is also still O(n).

    – Ross Ridge
    Jan 30 at 2:47














  • 1





    Reading the central directory is still a linear search just like going through each local header and so both are O(n). Reading the central directory will typically mean reading a lot fewer disk blocks but the number of blocks read is also still O(n).

    – Ross Ridge
    Jan 30 at 2:47








1




1





Reading the central directory is still a linear search just like going through each local header and so both are O(n). Reading the central directory will typically mean reading a lot fewer disk blocks but the number of blocks read is also still O(n).

– Ross Ridge
Jan 30 at 2:47





Reading the central directory is still a linear search just like going through each local header and so both are O(n). Reading the central directory will typically mean reading a lot fewer disk blocks but the number of blocks read is also still O(n).

– Ross Ridge
Jan 30 at 2:47










2 Answers
2






active

oldest

votes


















11














When searching for a single file in a large archive, it uses method 1, which you can see using strace:



open("dataset.zip", O_RDONLY)           = 3
ioctl(1, TIOCGWINSZ, 0x7fff9a895920) = -1 ENOTTY (Inappropriate ioctl for device)
write(1, "Archive: dataset.zipn", 22Archive: dataset.zip
) = 22
lseek(3, 943718400, SEEK_SET) = 943718400
read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 4522) = 4522
lseek(3, 943722880, SEEK_SET) = 943722880
read(3, "3f225P\uxv14350343503", 20) = 20
lseek(3, 943718400, SEEK_SET) = 943718400
read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 8192) = 4522
lseek(3, 849346560, SEEK_SET) = 849346560
read(3, "D262nv210343240C24227344367q300223231306330275266213276M7I'&352234J"..., 8192) = 8192
stat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
lstat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
stat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
lstat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
open("rand-28.txt", O_RDWR|O_CREAT|O_TRUNC, 0666) = 4
ioctl(1, TIOCGWINSZ, 0x7fff9a895790) = -1 ENOTTY (Inappropriate ioctl for device)
write(1, " extracting: rand-28.txt "..., 37 extracting: rand-28.txt ) = 37
read(3, "2753279Y206223217}355W%:220YNT257260z^361T242237021336372+306310"..., 8192) = 8192


unzip opens dataset.zip, seeks to the end, then seeks to the start of the requested file in the archive (rand-28.txt, at offset 849346560) and reads from there.



The central directory is found by scanning the last 65557 bytes of the archive; see the code starting here:



/*---------------------------------------------------------------------------
Find and process the end-of-central-directory header. UnZip need only
check last 65557 bytes of zipfile: comment may be up to 65535, end-of-
central-directory record is 18 bytes, and signature itself is 4 bytes;
add some to allow for appended garbage. Since ZipInfo is often used as
a debugging tool, search the whole zipfile if zipinfo_mode is true.
---------------------------------------------------------------------------*/





share|improve this answer


























  • Could you add additional information if possible about how it actually finds the central directory record(which here seems to be 943718400)

    – tangy
    Jan 29 at 19:09






  • 1





    See my update. Let me know if you would like an interpretation of the code!

    – Stephen Kitt
    Jan 29 at 21:51











  • Thanks a ton - both for the addition of relevant sections with references and the offer to help understand! I'll try reading it on my own at first and request you if I do :)

    – tangy
    Jan 29 at 22:15



















3














Actually it's a mixture. unzip reads some data from a known location, and then reads data blocks related to (but not identical with) the target entry in the zip-file.



The design of zip/unzip is explained in comments in the source-files. Here's the pertinent one from extract.c:



/*--------------------------------------------------------------------------- 
The basic idea of this function is as follows. Since the central di-
rectory lies at the end of the zipfile and the member files lie at the
beginning or middle or wherever, it is not very desirable to simply
read a central directory entry, jump to the member and extract it, and
then jump back to the central directory. In the case of a large zipfile
this would lead to a whole lot of disk-grinding, especially if each mem-
ber file is small. Instead, we read from the central directory the per-
tinent information for a block of files, then go extract/test the whole
block. Thus this routine contains two small(er) loops within a very
large outer loop: the first of the small ones reads a block of files
from the central directory; the second extracts or tests each file; and
the outer one loops over blocks. There's some file-pointer positioning
stuff in between, but that's about it. Btw, it's because of this jump-
ing around that we can afford to be lenient if an error occurs in one of
the member files: we should still be able to go find the other members,
since we know the offset of each from the beginning of the zipfile.
---------------------------------------------------------------------------*/


The format itself is mostly derived from PK-Ware's implementation, and is summarized in programming information text-files. According to that, there's more than one type of record in the central directory as well, so unzip cannot readily go to the end of the file and make an array of entries to lookup the target file.



Now... if you take the time to read the source code, you'll discover that unzip reads buffers of 8192 bytes (look for INBUFSIZ). I'd only use the single-file extract for a fairly large zip file (I had in mind the Java sources), but even for a smaller zip-file, you can see the effect of the buffer size. To see this, I zipped up the Git files for PuTTY, which gave 2727 files (counting a copy of the git log). Java was bigger than than 20 years ago, and hasn't shrunk. Extracting that log from the zip-file (chosen since it wouldn't be at the end of an alphabetically-sorted index and likely not in the first block read from the central directory) gave this from strace for the lseek calls:



lseek(3, -2252, SEEK_CUR)               = 1267
lseek(3, 120463360, SEEK_SET) = 120463360
lseek(3, 120468731, SEEK_SET) = 120468731
lseek(3, 120135680, SEEK_SET) = 120135680
lseek(3, 270336, SEEK_SET) = 270336
lseek(3, 120463360, SEEK_SET) = 120463360


As usual, with benchmarks, ymmv.






share|improve this answer


























  • While this makes sense if we are extracting the entire archive, for a single archive why would I need to do things this way? I think it might be using another routine in the case mentioned in the question eg. extract_or_test_member?

    – tangy
    Jan 29 at 22:09











  • typo - i meant single file within the archive. With a single file just 2 seeks would be needed right?

    – tangy
    Jan 29 at 22:15













  • More than that: it has to read the record which tells how large the central directory is, and it's not guaranteed that it'll read the whole directory while it's doing that.

    – Thomas Dickey
    Jan 29 at 22:21











Your Answer








StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "106"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f497509%2fwhat-method-does-unzip-use-to-find-a-single-file-in-an-archive%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes









11














When searching for a single file in a large archive, it uses method 1, which you can see using strace:



open("dataset.zip", O_RDONLY)           = 3
ioctl(1, TIOCGWINSZ, 0x7fff9a895920) = -1 ENOTTY (Inappropriate ioctl for device)
write(1, "Archive: dataset.zipn", 22Archive: dataset.zip
) = 22
lseek(3, 943718400, SEEK_SET) = 943718400
read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 4522) = 4522
lseek(3, 943722880, SEEK_SET) = 943722880
read(3, "3f225P\uxv14350343503", 20) = 20
lseek(3, 943718400, SEEK_SET) = 943718400
read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 8192) = 4522
lseek(3, 849346560, SEEK_SET) = 849346560
read(3, "D262nv210343240C24227344367q300223231306330275266213276M7I'&352234J"..., 8192) = 8192
stat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
lstat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
stat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
lstat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
open("rand-28.txt", O_RDWR|O_CREAT|O_TRUNC, 0666) = 4
ioctl(1, TIOCGWINSZ, 0x7fff9a895790) = -1 ENOTTY (Inappropriate ioctl for device)
write(1, " extracting: rand-28.txt "..., 37 extracting: rand-28.txt ) = 37
read(3, "2753279Y206223217}355W%:220YNT257260z^361T242237021336372+306310"..., 8192) = 8192


unzip opens dataset.zip, seeks to the end, then seeks to the start of the requested file in the archive (rand-28.txt, at offset 849346560) and reads from there.



The central directory is found by scanning the last 65557 bytes of the archive; see the code starting here:



/*---------------------------------------------------------------------------
Find and process the end-of-central-directory header. UnZip need only
check last 65557 bytes of zipfile: comment may be up to 65535, end-of-
central-directory record is 18 bytes, and signature itself is 4 bytes;
add some to allow for appended garbage. Since ZipInfo is often used as
a debugging tool, search the whole zipfile if zipinfo_mode is true.
---------------------------------------------------------------------------*/





share|improve this answer


























  • Could you add additional information if possible about how it actually finds the central directory record(which here seems to be 943718400)

    – tangy
    Jan 29 at 19:09






  • 1





    See my update. Let me know if you would like an interpretation of the code!

    – Stephen Kitt
    Jan 29 at 21:51











  • Thanks a ton - both for the addition of relevant sections with references and the offer to help understand! I'll try reading it on my own at first and request you if I do :)

    – tangy
    Jan 29 at 22:15
















11














When searching for a single file in a large archive, it uses method 1, which you can see using strace:



open("dataset.zip", O_RDONLY)           = 3
ioctl(1, TIOCGWINSZ, 0x7fff9a895920) = -1 ENOTTY (Inappropriate ioctl for device)
write(1, "Archive: dataset.zipn", 22Archive: dataset.zip
) = 22
lseek(3, 943718400, SEEK_SET) = 943718400
read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 4522) = 4522
lseek(3, 943722880, SEEK_SET) = 943722880
read(3, "3f225P\uxv14350343503", 20) = 20
lseek(3, 943718400, SEEK_SET) = 943718400
read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 8192) = 4522
lseek(3, 849346560, SEEK_SET) = 849346560
read(3, "D262nv210343240C24227344367q300223231306330275266213276M7I'&352234J"..., 8192) = 8192
stat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
lstat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
stat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
lstat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
open("rand-28.txt", O_RDWR|O_CREAT|O_TRUNC, 0666) = 4
ioctl(1, TIOCGWINSZ, 0x7fff9a895790) = -1 ENOTTY (Inappropriate ioctl for device)
write(1, " extracting: rand-28.txt "..., 37 extracting: rand-28.txt ) = 37
read(3, "2753279Y206223217}355W%:220YNT257260z^361T242237021336372+306310"..., 8192) = 8192


unzip opens dataset.zip, seeks to the end, then seeks to the start of the requested file in the archive (rand-28.txt, at offset 849346560) and reads from there.



The central directory is found by scanning the last 65557 bytes of the archive; see the code starting here:



/*---------------------------------------------------------------------------
Find and process the end-of-central-directory header. UnZip need only
check last 65557 bytes of zipfile: comment may be up to 65535, end-of-
central-directory record is 18 bytes, and signature itself is 4 bytes;
add some to allow for appended garbage. Since ZipInfo is often used as
a debugging tool, search the whole zipfile if zipinfo_mode is true.
---------------------------------------------------------------------------*/





share|improve this answer


























  • Could you add additional information if possible about how it actually finds the central directory record(which here seems to be 943718400)

    – tangy
    Jan 29 at 19:09






  • 1





    See my update. Let me know if you would like an interpretation of the code!

    – Stephen Kitt
    Jan 29 at 21:51











  • Thanks a ton - both for the addition of relevant sections with references and the offer to help understand! I'll try reading it on my own at first and request you if I do :)

    – tangy
    Jan 29 at 22:15














11












11








11







When searching for a single file in a large archive, it uses method 1, which you can see using strace:



open("dataset.zip", O_RDONLY)           = 3
ioctl(1, TIOCGWINSZ, 0x7fff9a895920) = -1 ENOTTY (Inappropriate ioctl for device)
write(1, "Archive: dataset.zipn", 22Archive: dataset.zip
) = 22
lseek(3, 943718400, SEEK_SET) = 943718400
read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 4522) = 4522
lseek(3, 943722880, SEEK_SET) = 943722880
read(3, "3f225P\uxv14350343503", 20) = 20
lseek(3, 943718400, SEEK_SET) = 943718400
read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 8192) = 4522
lseek(3, 849346560, SEEK_SET) = 849346560
read(3, "D262nv210343240C24227344367q300223231306330275266213276M7I'&352234J"..., 8192) = 8192
stat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
lstat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
stat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
lstat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
open("rand-28.txt", O_RDWR|O_CREAT|O_TRUNC, 0666) = 4
ioctl(1, TIOCGWINSZ, 0x7fff9a895790) = -1 ENOTTY (Inappropriate ioctl for device)
write(1, " extracting: rand-28.txt "..., 37 extracting: rand-28.txt ) = 37
read(3, "2753279Y206223217}355W%:220YNT257260z^361T242237021336372+306310"..., 8192) = 8192


unzip opens dataset.zip, seeks to the end, then seeks to the start of the requested file in the archive (rand-28.txt, at offset 849346560) and reads from there.



The central directory is found by scanning the last 65557 bytes of the archive; see the code starting here:



/*---------------------------------------------------------------------------
Find and process the end-of-central-directory header. UnZip need only
check last 65557 bytes of zipfile: comment may be up to 65535, end-of-
central-directory record is 18 bytes, and signature itself is 4 bytes;
add some to allow for appended garbage. Since ZipInfo is often used as
a debugging tool, search the whole zipfile if zipinfo_mode is true.
---------------------------------------------------------------------------*/





share|improve this answer















When searching for a single file in a large archive, it uses method 1, which you can see using strace:



open("dataset.zip", O_RDONLY)           = 3
ioctl(1, TIOCGWINSZ, 0x7fff9a895920) = -1 ENOTTY (Inappropriate ioctl for device)
write(1, "Archive: dataset.zipn", 22Archive: dataset.zip
) = 22
lseek(3, 943718400, SEEK_SET) = 943718400
read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 4522) = 4522
lseek(3, 943722880, SEEK_SET) = 943722880
read(3, "3f225P\uxv14350343503", 20) = 20
lseek(3, 943718400, SEEK_SET) = 943718400
read(3, "340P356(s34230620520127360U[250/2207346<252+u2342251[<2310E342274"..., 8192) = 4522
lseek(3, 849346560, SEEK_SET) = 849346560
read(3, "D262nv210343240C24227344367q300223231306330275266213276M7I'&352234J"..., 8192) = 8192
stat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
lstat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
stat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
lstat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
open("rand-28.txt", O_RDWR|O_CREAT|O_TRUNC, 0666) = 4
ioctl(1, TIOCGWINSZ, 0x7fff9a895790) = -1 ENOTTY (Inappropriate ioctl for device)
write(1, " extracting: rand-28.txt "..., 37 extracting: rand-28.txt ) = 37
read(3, "2753279Y206223217}355W%:220YNT257260z^361T242237021336372+306310"..., 8192) = 8192


unzip opens dataset.zip, seeks to the end, then seeks to the start of the requested file in the archive (rand-28.txt, at offset 849346560) and reads from there.



The central directory is found by scanning the last 65557 bytes of the archive; see the code starting here:



/*---------------------------------------------------------------------------
Find and process the end-of-central-directory header. UnZip need only
check last 65557 bytes of zipfile: comment may be up to 65535, end-of-
central-directory record is 18 bytes, and signature itself is 4 bytes;
add some to allow for appended garbage. Since ZipInfo is often used as
a debugging tool, search the whole zipfile if zipinfo_mode is true.
---------------------------------------------------------------------------*/






share|improve this answer














share|improve this answer



share|improve this answer








edited Jan 29 at 21:50

























answered Jan 29 at 18:05









Stephen KittStephen Kitt

170k24385462




170k24385462













  • Could you add additional information if possible about how it actually finds the central directory record(which here seems to be 943718400)

    – tangy
    Jan 29 at 19:09






  • 1





    See my update. Let me know if you would like an interpretation of the code!

    – Stephen Kitt
    Jan 29 at 21:51











  • Thanks a ton - both for the addition of relevant sections with references and the offer to help understand! I'll try reading it on my own at first and request you if I do :)

    – tangy
    Jan 29 at 22:15



















  • Could you add additional information if possible about how it actually finds the central directory record(which here seems to be 943718400)

    – tangy
    Jan 29 at 19:09






  • 1





    See my update. Let me know if you would like an interpretation of the code!

    – Stephen Kitt
    Jan 29 at 21:51











  • Thanks a ton - both for the addition of relevant sections with references and the offer to help understand! I'll try reading it on my own at first and request you if I do :)

    – tangy
    Jan 29 at 22:15

















Could you add additional information if possible about how it actually finds the central directory record(which here seems to be 943718400)

– tangy
Jan 29 at 19:09





Could you add additional information if possible about how it actually finds the central directory record(which here seems to be 943718400)

– tangy
Jan 29 at 19:09




1




1





See my update. Let me know if you would like an interpretation of the code!

– Stephen Kitt
Jan 29 at 21:51





See my update. Let me know if you would like an interpretation of the code!

– Stephen Kitt
Jan 29 at 21:51













Thanks a ton - both for the addition of relevant sections with references and the offer to help understand! I'll try reading it on my own at first and request you if I do :)

– tangy
Jan 29 at 22:15





Thanks a ton - both for the addition of relevant sections with references and the offer to help understand! I'll try reading it on my own at first and request you if I do :)

– tangy
Jan 29 at 22:15













3














Actually it's a mixture. unzip reads some data from a known location, and then reads data blocks related to (but not identical with) the target entry in the zip-file.



The design of zip/unzip is explained in comments in the source-files. Here's the pertinent one from extract.c:



/*--------------------------------------------------------------------------- 
The basic idea of this function is as follows. Since the central di-
rectory lies at the end of the zipfile and the member files lie at the
beginning or middle or wherever, it is not very desirable to simply
read a central directory entry, jump to the member and extract it, and
then jump back to the central directory. In the case of a large zipfile
this would lead to a whole lot of disk-grinding, especially if each mem-
ber file is small. Instead, we read from the central directory the per-
tinent information for a block of files, then go extract/test the whole
block. Thus this routine contains two small(er) loops within a very
large outer loop: the first of the small ones reads a block of files
from the central directory; the second extracts or tests each file; and
the outer one loops over blocks. There's some file-pointer positioning
stuff in between, but that's about it. Btw, it's because of this jump-
ing around that we can afford to be lenient if an error occurs in one of
the member files: we should still be able to go find the other members,
since we know the offset of each from the beginning of the zipfile.
---------------------------------------------------------------------------*/


The format itself is mostly derived from PK-Ware's implementation, and is summarized in programming information text-files. According to that, there's more than one type of record in the central directory as well, so unzip cannot readily go to the end of the file and make an array of entries to lookup the target file.



Now... if you take the time to read the source code, you'll discover that unzip reads buffers of 8192 bytes (look for INBUFSIZ). I'd only use the single-file extract for a fairly large zip file (I had in mind the Java sources), but even for a smaller zip-file, you can see the effect of the buffer size. To see this, I zipped up the Git files for PuTTY, which gave 2727 files (counting a copy of the git log). Java was bigger than than 20 years ago, and hasn't shrunk. Extracting that log from the zip-file (chosen since it wouldn't be at the end of an alphabetically-sorted index and likely not in the first block read from the central directory) gave this from strace for the lseek calls:



lseek(3, -2252, SEEK_CUR)               = 1267
lseek(3, 120463360, SEEK_SET) = 120463360
lseek(3, 120468731, SEEK_SET) = 120468731
lseek(3, 120135680, SEEK_SET) = 120135680
lseek(3, 270336, SEEK_SET) = 270336
lseek(3, 120463360, SEEK_SET) = 120463360


As usual, with benchmarks, ymmv.






share|improve this answer


























  • While this makes sense if we are extracting the entire archive, for a single archive why would I need to do things this way? I think it might be using another routine in the case mentioned in the question eg. extract_or_test_member?

    – tangy
    Jan 29 at 22:09











  • typo - i meant single file within the archive. With a single file just 2 seeks would be needed right?

    – tangy
    Jan 29 at 22:15













  • More than that: it has to read the record which tells how large the central directory is, and it's not guaranteed that it'll read the whole directory while it's doing that.

    – Thomas Dickey
    Jan 29 at 22:21
















3














Actually it's a mixture. unzip reads some data from a known location, and then reads data blocks related to (but not identical with) the target entry in the zip-file.



The design of zip/unzip is explained in comments in the source-files. Here's the pertinent one from extract.c:



/*--------------------------------------------------------------------------- 
The basic idea of this function is as follows. Since the central di-
rectory lies at the end of the zipfile and the member files lie at the
beginning or middle or wherever, it is not very desirable to simply
read a central directory entry, jump to the member and extract it, and
then jump back to the central directory. In the case of a large zipfile
this would lead to a whole lot of disk-grinding, especially if each mem-
ber file is small. Instead, we read from the central directory the per-
tinent information for a block of files, then go extract/test the whole
block. Thus this routine contains two small(er) loops within a very
large outer loop: the first of the small ones reads a block of files
from the central directory; the second extracts or tests each file; and
the outer one loops over blocks. There's some file-pointer positioning
stuff in between, but that's about it. Btw, it's because of this jump-
ing around that we can afford to be lenient if an error occurs in one of
the member files: we should still be able to go find the other members,
since we know the offset of each from the beginning of the zipfile.
---------------------------------------------------------------------------*/


The format itself is mostly derived from PK-Ware's implementation, and is summarized in programming information text-files. According to that, there's more than one type of record in the central directory as well, so unzip cannot readily go to the end of the file and make an array of entries to lookup the target file.



Now... if you take the time to read the source code, you'll discover that unzip reads buffers of 8192 bytes (look for INBUFSIZ). I'd only use the single-file extract for a fairly large zip file (I had in mind the Java sources), but even for a smaller zip-file, you can see the effect of the buffer size. To see this, I zipped up the Git files for PuTTY, which gave 2727 files (counting a copy of the git log). Java was bigger than than 20 years ago, and hasn't shrunk. Extracting that log from the zip-file (chosen since it wouldn't be at the end of an alphabetically-sorted index and likely not in the first block read from the central directory) gave this from strace for the lseek calls:



lseek(3, -2252, SEEK_CUR)               = 1267
lseek(3, 120463360, SEEK_SET) = 120463360
lseek(3, 120468731, SEEK_SET) = 120468731
lseek(3, 120135680, SEEK_SET) = 120135680
lseek(3, 270336, SEEK_SET) = 270336
lseek(3, 120463360, SEEK_SET) = 120463360


As usual, with benchmarks, ymmv.






share|improve this answer


























  • While this makes sense if we are extracting the entire archive, for a single archive why would I need to do things this way? I think it might be using another routine in the case mentioned in the question eg. extract_or_test_member?

    – tangy
    Jan 29 at 22:09











  • typo - i meant single file within the archive. With a single file just 2 seeks would be needed right?

    – tangy
    Jan 29 at 22:15













  • More than that: it has to read the record which tells how large the central directory is, and it's not guaranteed that it'll read the whole directory while it's doing that.

    – Thomas Dickey
    Jan 29 at 22:21














3












3








3







Actually it's a mixture. unzip reads some data from a known location, and then reads data blocks related to (but not identical with) the target entry in the zip-file.



The design of zip/unzip is explained in comments in the source-files. Here's the pertinent one from extract.c:



/*--------------------------------------------------------------------------- 
The basic idea of this function is as follows. Since the central di-
rectory lies at the end of the zipfile and the member files lie at the
beginning or middle or wherever, it is not very desirable to simply
read a central directory entry, jump to the member and extract it, and
then jump back to the central directory. In the case of a large zipfile
this would lead to a whole lot of disk-grinding, especially if each mem-
ber file is small. Instead, we read from the central directory the per-
tinent information for a block of files, then go extract/test the whole
block. Thus this routine contains two small(er) loops within a very
large outer loop: the first of the small ones reads a block of files
from the central directory; the second extracts or tests each file; and
the outer one loops over blocks. There's some file-pointer positioning
stuff in between, but that's about it. Btw, it's because of this jump-
ing around that we can afford to be lenient if an error occurs in one of
the member files: we should still be able to go find the other members,
since we know the offset of each from the beginning of the zipfile.
---------------------------------------------------------------------------*/


The format itself is mostly derived from PK-Ware's implementation, and is summarized in programming information text-files. According to that, there's more than one type of record in the central directory as well, so unzip cannot readily go to the end of the file and make an array of entries to lookup the target file.



Now... if you take the time to read the source code, you'll discover that unzip reads buffers of 8192 bytes (look for INBUFSIZ). I'd only use the single-file extract for a fairly large zip file (I had in mind the Java sources), but even for a smaller zip-file, you can see the effect of the buffer size. To see this, I zipped up the Git files for PuTTY, which gave 2727 files (counting a copy of the git log). Java was bigger than than 20 years ago, and hasn't shrunk. Extracting that log from the zip-file (chosen since it wouldn't be at the end of an alphabetically-sorted index and likely not in the first block read from the central directory) gave this from strace for the lseek calls:



lseek(3, -2252, SEEK_CUR)               = 1267
lseek(3, 120463360, SEEK_SET) = 120463360
lseek(3, 120468731, SEEK_SET) = 120468731
lseek(3, 120135680, SEEK_SET) = 120135680
lseek(3, 270336, SEEK_SET) = 270336
lseek(3, 120463360, SEEK_SET) = 120463360


As usual, with benchmarks, ymmv.






share|improve this answer















Actually it's a mixture. unzip reads some data from a known location, and then reads data blocks related to (but not identical with) the target entry in the zip-file.



The design of zip/unzip is explained in comments in the source-files. Here's the pertinent one from extract.c:



/*--------------------------------------------------------------------------- 
The basic idea of this function is as follows. Since the central di-
rectory lies at the end of the zipfile and the member files lie at the
beginning or middle or wherever, it is not very desirable to simply
read a central directory entry, jump to the member and extract it, and
then jump back to the central directory. In the case of a large zipfile
this would lead to a whole lot of disk-grinding, especially if each mem-
ber file is small. Instead, we read from the central directory the per-
tinent information for a block of files, then go extract/test the whole
block. Thus this routine contains two small(er) loops within a very
large outer loop: the first of the small ones reads a block of files
from the central directory; the second extracts or tests each file; and
the outer one loops over blocks. There's some file-pointer positioning
stuff in between, but that's about it. Btw, it's because of this jump-
ing around that we can afford to be lenient if an error occurs in one of
the member files: we should still be able to go find the other members,
since we know the offset of each from the beginning of the zipfile.
---------------------------------------------------------------------------*/


The format itself is mostly derived from PK-Ware's implementation, and is summarized in programming information text-files. According to that, there's more than one type of record in the central directory as well, so unzip cannot readily go to the end of the file and make an array of entries to lookup the target file.



Now... if you take the time to read the source code, you'll discover that unzip reads buffers of 8192 bytes (look for INBUFSIZ). I'd only use the single-file extract for a fairly large zip file (I had in mind the Java sources), but even for a smaller zip-file, you can see the effect of the buffer size. To see this, I zipped up the Git files for PuTTY, which gave 2727 files (counting a copy of the git log). Java was bigger than than 20 years ago, and hasn't shrunk. Extracting that log from the zip-file (chosen since it wouldn't be at the end of an alphabetically-sorted index and likely not in the first block read from the central directory) gave this from strace for the lseek calls:



lseek(3, -2252, SEEK_CUR)               = 1267
lseek(3, 120463360, SEEK_SET) = 120463360
lseek(3, 120468731, SEEK_SET) = 120468731
lseek(3, 120135680, SEEK_SET) = 120135680
lseek(3, 270336, SEEK_SET) = 270336
lseek(3, 120463360, SEEK_SET) = 120463360


As usual, with benchmarks, ymmv.







share|improve this answer














share|improve this answer



share|improve this answer








edited Jan 30 at 21:19

























answered Jan 29 at 21:43









Thomas DickeyThomas Dickey

52.8k597170




52.8k597170













  • While this makes sense if we are extracting the entire archive, for a single archive why would I need to do things this way? I think it might be using another routine in the case mentioned in the question eg. extract_or_test_member?

    – tangy
    Jan 29 at 22:09











  • typo - i meant single file within the archive. With a single file just 2 seeks would be needed right?

    – tangy
    Jan 29 at 22:15













  • More than that: it has to read the record which tells how large the central directory is, and it's not guaranteed that it'll read the whole directory while it's doing that.

    – Thomas Dickey
    Jan 29 at 22:21



















  • While this makes sense if we are extracting the entire archive, for a single archive why would I need to do things this way? I think it might be using another routine in the case mentioned in the question eg. extract_or_test_member?

    – tangy
    Jan 29 at 22:09











  • typo - i meant single file within the archive. With a single file just 2 seeks would be needed right?

    – tangy
    Jan 29 at 22:15













  • More than that: it has to read the record which tells how large the central directory is, and it's not guaranteed that it'll read the whole directory while it's doing that.

    – Thomas Dickey
    Jan 29 at 22:21

















While this makes sense if we are extracting the entire archive, for a single archive why would I need to do things this way? I think it might be using another routine in the case mentioned in the question eg. extract_or_test_member?

– tangy
Jan 29 at 22:09





While this makes sense if we are extracting the entire archive, for a single archive why would I need to do things this way? I think it might be using another routine in the case mentioned in the question eg. extract_or_test_member?

– tangy
Jan 29 at 22:09













typo - i meant single file within the archive. With a single file just 2 seeks would be needed right?

– tangy
Jan 29 at 22:15







typo - i meant single file within the archive. With a single file just 2 seeks would be needed right?

– tangy
Jan 29 at 22:15















More than that: it has to read the record which tells how large the central directory is, and it's not guaranteed that it'll read the whole directory while it's doing that.

– Thomas Dickey
Jan 29 at 22:21





More than that: it has to read the record which tells how large the central directory is, and it's not guaranteed that it'll read the whole directory while it's doing that.

– Thomas Dickey
Jan 29 at 22:21


















draft saved

draft discarded




















































Thanks for contributing an answer to Unix & Linux Stack Exchange!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f497509%2fwhat-method-does-unzip-use-to-find-a-single-file-in-an-archive%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

How to change which sound is reproduced for terminal bell?

Title Spacing in Bjornstrup Chapter, Removing Chapter Number From Contents

Can I use Tabulator js library in my java Spring + Thymeleaf project?