Why is the size of npy bigger than csv?
I converted a csv file to an npy file. After the conversion, the csv file is 5GB, but the npy file is 13GB.
I thought an npy file would be more efficient than csv.
Am I misunderstanding something? Why is the npy file bigger than the csv?



I just used this code:

import numpy as np
import pandas as pd

full = pd.read_csv('data/RGB.csv', header=None).values
np.save('data/RGB.npy', full, allow_pickle=False, fix_imports=False)


and the data is structured like this:



R, G, B, is_skin
2, 5, 1, 0
10, 52, 242, 1
52, 240, 42, 0
... (420,711,257 rows)
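For scale, the expected npy size can be estimated from the row count above (a quick sketch, assuming pandas' default int64 dtype, i.e. 8 bytes per element):

```python
# Expected .npy payload: rows * columns * bytes-per-element (plus a small header).
rows, cols, bytes_per_int64 = 420_711_257, 4, 8
expected_bytes = rows * cols * bytes_per_int64
print(f"{expected_bytes / 1e9:.1f} GB")  # ~13.5 GB, matching the observed npy size
```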









  • Maybe you can share relevant parts of the code that you used to convert your csv to numpy? – Fredz0r, Nov 21 '18 at 8:10

  • To me it looks more like 5GB vs 13GB. Can you also post a typical line from RGB.csv? Depending on how your data looks, one format can lead to smaller file sizes than another. – ead, Nov 21 '18 at 8:11

  • @Fredz0r I updated the code! – YeongHwa Jin, Nov 21 '18 at 8:15

  • @ead yes, 5GB and 13GB. I'm confused :( . The data is just a 2D array of RGB values. I posted it. – YeongHwa Jin, Nov 21 '18 at 8:29
















python-3.x csv numpy






edited Nov 21 '18 at 9:23 by ead
asked Nov 21 '18 at 7:55 by YeongHwa Jin




1 Answer
In your case each element is an integer between 0 and 255, inclusive. That means, saved as ASCII, each element needs at most:

  • 3 chars for the number

  • 1 char for the ,

  • 1 char for the whitespace

which comes to at most 5 bytes (somewhat less on average) per element on disk.



By default, pandas reads/interprets this as an int64 array (see full.dtype), which needs 8 bytes per element. That leads to a bigger npy file, most of whose bytes are zeros!
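The default dtype is easy to verify on a small sample (a sketch using hypothetical in-memory data shaped like the question's csv):

```python
import io

import pandas as pd

# Minimal reproduction: three rows shaped like the question's RGB.csv.
csv = io.StringIO("R,G,B,is_skin\n2,5,1,0\n10,52,242,1\n52,240,42,0\n")
full = pd.read_csv(csv).values

print(full.dtype)           # int64: pandas' default for integer columns
print(full.dtype.itemsize)  # 8 bytes per element
```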



To store an integer between 0 and 255 we need only one byte, so the size of the npy file can be reduced by a factor of 8 without losing any information; just tell pandas to interpret the data as unsigned 8-bit integers:



import numpy as np
import pandas as pd

full = pd.read_csv(r'e:\data.csv', dtype=np.uint8).values
# or, to get rid of the pandas dependency:
# full = np.genfromtxt(r'e:\data.csv', delimiter=',', dtype=np.uint8, skip_header=1)
np.save(r'e:/RGB.npy', full, allow_pickle=False, fix_imports=False)
# an 8 times smaller npy file




Most of the time the npy format needs less space; however, there are situations where the ASCII format results in smaller files.



For example, if the data consists mostly of very small one-digit numbers and a few very big numbers that really do need 8 bytes:




  • in ASCII format you pay on average 2 bytes per element (there is no need to write whitespace; , alone as delimiter is good enough).

  • in npy format you pay 8 bytes per element.
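The size difference is easy to see on disk with a small array (a sketch; the file names and temporary directory are arbitrary):

```python
import os
import tempfile

import numpy as np

# 1000 x 4 array of values in 0..255, stored once as int64 and once as uint8.
a = np.arange(4000, dtype=np.int64).reshape(1000, 4) % 256

with tempfile.TemporaryDirectory() as d:
    np.save(os.path.join(d, "as_int64.npy"), a)
    np.save(os.path.join(d, "as_uint8.npy"), a.astype(np.uint8))
    int64_size = os.path.getsize(os.path.join(d, "as_int64.npy"))
    uint8_size = os.path.getsize(os.path.join(d, "as_uint8.npy"))

print(int64_size, uint8_size)  # 8 bytes vs 1 byte per element, plus a small header
```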






  • Wow, beautiful answer...! (๑°ㅁ°๑)‼ thanks ead!! – YeongHwa Jin, Nov 21 '18 at 11:08













edited Nov 21 '18 at 9:36
answered Nov 21 '18 at 9:21 by ead












