Why is the size of npy bigger than csv?
I converted a csv file to an npy file. After the conversion, the csv file is 5GB, but the npy file is 13GB.
I thought an npy file would be more efficient than csv.
Am I misunderstanding something? Why is the npy file bigger than the csv?



I just used this code:

import numpy as np
import pandas as pd

full = pd.read_csv('data/RGB.csv', header=None).values
np.save('data/RGB.npy', full, allow_pickle=False, fix_imports=False)


and the data is structured like this:



R, G, B, is_skin
2, 5, 1, 0
10, 52, 242, 1
52, 240, 42, 0
... (420,711,257 rows)
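For scale, the expected npy size can be estimated from the row count above (a quick sketch, assuming pandas' default int64 dtype, i.e. 8 bytes per element):

```python
# Expected .npy payload: rows * columns * bytes-per-element (plus a small header).
rows, cols, bytes_per_int64 = 420_711_257, 4, 8
expected_bytes = rows * cols * bytes_per_int64
print(f"{expected_bytes / 1e9:.1f} GB")  # ~13.5 GB, matching the observed npy size
```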









  • Maybe you can share relevant parts of the code that you used to convert your csv to numpy? – Fredz0r, Nov 21 '18 at 8:10

  • To me it looks more like 5GB vs 13GB. Can you also post a typical line from RGB.csv? Depending on how your data looks, one format can lead to smaller file sizes than another. – ead, Nov 21 '18 at 8:11

  • @Fredz0r I updated the code! – YeongHwa Jin, Nov 21 '18 at 8:15

  • @ead yes, 5GB and 13GB. I'm confused :( . The data is just a 2D array of RGB values. I posted it. – YeongHwa Jin, Nov 21 '18 at 8:29
















python-3.x csv numpy






edited Nov 21 '18 at 9:23 by ead
asked Nov 21 '18 at 7:55 by YeongHwa Jin




1 Answer
In your case each element is an integer between 0 and 255, inclusive. That means, saved as ASCII, each element needs at most:

  • 3 chars for the number

  • 1 char for the ,

  • 1 char for the whitespace

which comes to at most 5 bytes (somewhat less on average) per element on disk.



By default, pandas reads/interprets this as an int64 array (see full.dtype), which needs 8 bytes per element. That leads to a bigger npy file, most of whose bytes are zeros!
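The default dtype is easy to verify on a small sample (a sketch using hypothetical in-memory data shaped like the question's csv):

```python
import io

import pandas as pd

# Minimal reproduction: three rows shaped like the question's RGB.csv.
csv = io.StringIO("R,G,B,is_skin\n2,5,1,0\n10,52,242,1\n52,240,42,0\n")
full = pd.read_csv(csv).values

print(full.dtype)           # int64: pandas' default for integer columns
print(full.dtype.itemsize)  # 8 bytes per element
```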



To store an integer between 0 and 255 we need only one byte, so the size of the npy file can be reduced by a factor of 8 without losing any information; just tell pandas to interpret the data as unsigned 8-bit integers:



import numpy as np
import pandas as pd

full = pd.read_csv(r'e:\data.csv', dtype=np.uint8).values
# or, to get rid of the pandas dependency:
# full = np.genfromtxt(r'e:\data.csv', delimiter=',', dtype=np.uint8, skip_header=1)
np.save(r'e:/RGB.npy', full, allow_pickle=False, fix_imports=False)
# an 8 times smaller npy file




Most of the time the npy format needs less space; however, there are situations where the ASCII format results in smaller files.



For example, if the data consists mostly of very small one-digit numbers and a few very big numbers that really do need 8 bytes:




  • in ASCII format you pay on average 2 bytes per element (there is no need to write whitespace; , alone as delimiter is good enough).

  • in npy format you pay 8 bytes per element.
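The size difference is easy to see on disk with a small array (a sketch; the file names and temporary directory are arbitrary):

```python
import os
import tempfile

import numpy as np

# 1000 x 4 array of values in 0..255, stored once as int64 and once as uint8.
a = np.arange(4000, dtype=np.int64).reshape(1000, 4) % 256

with tempfile.TemporaryDirectory() as d:
    np.save(os.path.join(d, "as_int64.npy"), a)
    np.save(os.path.join(d, "as_uint8.npy"), a.astype(np.uint8))
    int64_size = os.path.getsize(os.path.join(d, "as_int64.npy"))
    uint8_size = os.path.getsize(os.path.join(d, "as_uint8.npy"))

print(int64_size, uint8_size)  # 8 bytes vs 1 byte per element, plus a small header
```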






  • Wow, beautiful answer...! (๑°ㅁ°๑)‼ thanks ead!! – YeongHwa Jin, Nov 21 '18 at 11:08













edited Nov 21 '18 at 9:36
answered Nov 21 '18 at 9:21 by ead












