Efficiently growing a large DataFrame vertically
I have the following code, which recursively iterates over a directory containing thousands of CSVs and attempts to read and add them all to one DataFrame:
df = pd.DataFrame()
symbol = symbol.upper()

for filepath in glob.iglob(r'W:\data\{0}\option*\**\**\*.csv'.format(188), recursive=True):
    optionNameCSI = filepath.split("\\")[-1].split('.')[0]
    try:
        tmp = pd.read_csv(filepath, engine='c')
        strike = tmp['Strike'].iloc[-1]
        expiry = pd.to_datetime(tmp['Option Expiration Date'].iloc[-1])
        m = expiry.month
        y = expiry.year
        PutCall = tmp['PutCall'].iloc[-1]
        future = symbol + numToLetter[m] + str(y)
    except (IndexError, KeyError) as e:
        continue
    if df.empty:
        df = tmp
    else:
        df = df.append(tmp)
    print(optionNameCSI, 'loaded')
However, this code starts off iterating quickly, then slows down drastically and never completes. Is there something I'm doing wrong? I know the file paths are all acquired correctly, so it is the growing of the DataFrame that is the issue.
python pandas
asked Nov 19 '18 at 14:37
Évariste Galois
You are appending to a DataFrame within a loop, which needlessly copies data and is extremely inefficient (which is why it starts out fine, but then slows to a halt). Append to a list within the loop and concatenate once after.
– ALollz
Nov 19 '18 at 14:40
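A minimal sketch of the list-then-concat pattern this comment describes (the glob path and error handling here are illustrative assumptions, not taken from the question):

import glob
import pandas as pd

frames = []  # accumulate per-file DataFrames in a plain Python list
for filepath in glob.iglob(r'W:\data\**\*.csv', recursive=True):  # hypothetical path
    try:
        frames.append(pd.read_csv(filepath, engine='c'))
    except pd.errors.EmptyDataError:
        continue  # skip files with no parseable content

# a single concatenation at the end copies each row once,
# instead of recopying the whole accumulated frame every iteration
df = pd.concat(frames, ignore_index=True)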
Alexander's solution illustrates this.
– ALollz
Nov 19 '18 at 14:42
Another illustration from @unutbu, with wise words: never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.
– Parfait
Nov 19 '18 at 14:48
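A rough timing sketch of that quadratic copying (sizes are arbitrary; pd.concat is used for both variants so the snippet runs on current pandas, where DataFrame.append has been removed):

import time
import pandas as pd

chunk = pd.DataFrame({'x': range(1_000)})

# growing inside the loop: each iteration recopies everything accumulated so far
start = time.perf_counter()
df = chunk.copy()
for _ in range(499):
    df = pd.concat([df, chunk])
print('concat in loop:', time.perf_counter() - start)

# collecting first, concatenating once: each row is copied a bounded number of times
start = time.perf_counter()
df = pd.concat([chunk] * 500, ignore_index=True)
print('single concat: ', time.perf_counter() - start)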
Nice! I was not aware of this. If we're trying to be as efficient as possible, is there any significant difference in the performance of Alexander's solution compared to the concatenation?
– Évariste Galois
Nov 19 '18 at 14:53
I'd just use the csv module, tbh. Every attempt I've made at growing a DataFrame like this has been crippling in speed and memory. I have not found a hack to get around it. The odd join or concat, maybe, but the overhead is gross once you put it in a loop.
– roganjosh
Nov 19 '18 at 14:55
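A sketch of the csv-module route this comment describes (it assumes every file shares the same header; the path is hypothetical):

import csv
import glob
import pandas as pd

rows, header = [], None
for filepath in glob.iglob(r'W:\data\**\*.csv', recursive=True):  # hypothetical path
    with open(filepath, newline='') as f:
        reader = csv.reader(f)
        file_header = next(reader, None)
        if file_header is None:
            continue  # empty file
        if header is None:
            header = file_header  # assume all files share one header
        rows.extend(reader)  # growing a plain list is cheap

# values arrive as strings; convert dtypes afterwards if needed
df = pd.DataFrame(rows, columns=header)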
1 Answer
Consider separating your code into separate functions, like so:
def get_data_from_csv(filepath):
    optionNameCSI = filepath.split("\\")[-1].split('.')[0]
    try:
        df = pd.read_csv(filepath, engine='c')
        # do stuff ...
        return df
    except (IndexError, KeyError) as e:
        return None
Then you can use a list comprehension to gather all the data in a list, as the commenters above suggested:
filepaths = glob.iglob(r'W:\data\{0}\option*\**\**\*.csv'.format(188), recursive=True)
result = [get_data_from_csv(filepath) for filepath in filepaths]
result = [r for r in result if r is not None]  # remove 'None' values
Then join the data using pd.concat:
df = pd.concat(result)
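One caveat, added here rather than taken from the original answer: if every file is skipped, result is empty and pd.concat raises a ValueError, so a small guard avoids that:

# fall back to an empty frame when no file parsed successfully
df = pd.concat(result, ignore_index=True) if result else pd.DataFrame()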
edited Nov 19 '18 at 15:43
answered Nov 19 '18 at 15:38
Robert