efficiently growing a large dataframe vertically
I have the following code, which recursively iterates over a directory containing thousands of CSVs and attempts to read and add them all to one DataFrame:



df = pd.DataFrame()
symbol = symbol.upper()

for filepath in glob.iglob(r'W:\data\{0}\option\**\**\**.csv'.format(188), recursive=True):
    optionNameCSI = filepath.split("\\")[-1].split('.')[0]
    try:
        tmp = pd.read_csv(filepath, engine='c')
        strike = tmp['Strike'].iloc[-1]
        expiry = pd.to_datetime(tmp['Option Expiration Date'].iloc[-1])
        m = expiry.month
        y = expiry.year
        PutCall = tmp['PutCall'].iloc[-1]
        future = symbol + numToLetter[m] + str(y)
    except (IndexError, KeyError) as e:
        continue

    if df.empty:
        df = tmp
    else:
        df = df.append(tmp)

    print(optionNameCSI, 'loaded')


However, this code starts off iterating very quickly, then slows down exponentially and never completes. Is there something I'm doing wrong? I know that the file paths are all acquired correctly, so it's the growing of the DataFrame that is the issue.

  • You are appending to a DataFrame within a loop, which needlessly copies data and is extremely inefficient (which is why it starts out fine, but then slows to a halt). Append to a list within the loop and concatenate once after.

    – ALollz
    Nov 19 '18 at 14:40
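
    To illustrate the pattern ALollz describes, a minimal sketch (the 'pattern' variable and the EmptyDataError handling are assumptions standing in for the question's details, not a tested drop-in):

    import glob
    import pandas as pd

    pattern = r'W:\data\188\option\**\**\**.csv'  # hypothetical; adjust to your layout

    frames = []  # accumulate per-file DataFrames instead of growing df

    for filepath in glob.iglob(pattern, recursive=True):
        try:
            tmp = pd.read_csv(filepath, engine='c')
        except pd.errors.EmptyDataError:
            continue  # skip files with no data
        frames.append(tmp)  # O(1) list append; earlier data is never copied

    # a single concatenation at the end copies each row exactly once
    df = pd.concat(frames, ignore_index=True)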






  • Alexander's Solution illustrates this.

    – ALollz
    Nov 19 '18 at 14:42






  • Another illustration from @unutbu with wise words: Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.

    – Parfait
    Nov 19 '18 at 14:48
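
    The quadratic copying is easy to see with a standalone timing sketch (illustrative only, not from the thread; note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, which only strengthens the advice):

    import timeit
    import pandas as pd

    chunk = pd.DataFrame({'x': range(100)})

    def append_in_loop(n):
        df = pd.DataFrame()
        for _ in range(n):
            df = df.append(chunk)  # re-copies every previously appended row
        return df

    def concat_once(n):
        return pd.concat([chunk] * n)  # each row is copied exactly once

    for n in (100, 200, 400):
        t_append = timeit.timeit(lambda: append_in_loop(n), number=1)
        t_concat = timeit.timeit(lambda: concat_once(n), number=1)
        # doubling n roughly quadruples t_append but only doubles t_concat
        print(n, round(t_append, 3), round(t_concat, 3))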











  • Nice! I was not aware of this. If we're trying to be as efficient as possible, is there any significant difference in the performance of Alexander's solution compared to the concatenation?

    – Évariste Galois
    Nov 19 '18 at 14:53






  • I'd just use the csv module tbh. Every attempt I've made at growing a DF like this has been crippling in speed and memory. I have not found a hack to get around it. The odd join or concat, maybe, but the overhead is gross once you put it in a loop.

    – roganjosh
    Nov 19 '18 at 14:55
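
    For reference, the csv-module route roganjosh mentions might look like this sketch (it assumes every file shares one header row; csv returns strings, so dtype conversion happens after the build):

    import csv
    import glob
    import pandas as pd

    pattern = r'W:\data\188\option\**\**\**.csv'  # hypothetical; adjust to your layout

    rows = []
    header = None

    for filepath in glob.iglob(pattern, recursive=True):
        with open(filepath, newline='') as f:
            reader = csv.reader(f)
            file_header = next(reader, None)  # consume the header row
            if file_header is None:
                continue  # empty file
            if header is None:
                header = file_header  # remember the first header seen
            rows.extend(reader)  # plain Python lists are cheap to grow

    # build the DataFrame once; convert columns afterwards, e.g. with pd.to_numeric
    df = pd.DataFrame(rows, columns=header)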
python pandas
asked Nov 19 '18 at 14:37 by Évariste Galois
1 Answer
Consider splitting your code into separate functions, like so:



def get_data_from_csv(filepath):
    optionNameCSI = filepath.split("\\")[-1].split('.')[0]
    try:
        df = pd.read_csv(filepath, engine='c')
        # do stuff ...
        return df
    except (IndexError, KeyError) as e:
        return


Then you can use a list comprehension to gather all the data in a list, as people above have suggested:



filepaths = glob.iglob(r'W:\data\{0}\option\**\**\**.csv'.format(188), recursive=True)
result = [get_data_from_csv(filepath) for filepath in filepaths]
result = [r for r in result if r is not None]  # remove 'None' values


Then join the data using pd.concat:



df = pd.concat(result)
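

A small optional refinement (my suggestion, not part of the answer above): the two list steps can be fused into a single pass, and ignore_index=True gives the combined frame a clean 0..n-1 index:

df = pd.concat(
    (d for d in (get_data_from_csv(fp) for fp in filepaths) if d is not None),
    ignore_index=True,
)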





answered Nov 19 '18 at 15:38 by Robert, edited Nov 19 '18 at 15:43