efficiently growing a large dataframe vertically
I have the following code, which recursively iterates over a directory containing thousands of CSVs and attempts to read and add them all to one DataFrame:



df = pd.DataFrame()
symbol = symbol.upper()

for filepath in glob.iglob(r'W:\data\{0}\option\**\**\**.csv'.format(188), recursive=True):
    optionNameCSI = filepath.split("\\")[-1].split('.')[0]
    try:
        tmp = pd.read_csv(filepath, engine='c')
        strike = tmp['Strike'].iloc[-1]
        expiry = pd.to_datetime(tmp['Option Expiration Date'].iloc[-1])
        m = expiry.month
        y = expiry.year
        PutCall = tmp['PutCall'].iloc[-1]
        future = symbol + numToLetter[m] + str(y)
    except (IndexError, KeyError) as e:
        continue

    if df.empty:
        df = tmp
    else:
        df = df.append(tmp)

    print(optionNameCSI, 'loaded')


However, this code starts off iterating very quickly, then slows down exponentially and never completes. Is there something I'm doing wrong? I know that the file paths are all acquired correctly, so it's the growing of the DataFrame that is the issue.

  • You are appending to a DataFrame within a loop, which needlessly copies data and is extremely inefficient (which is why it starts out fine, but then slows to a halt). Append to a list within the loop and concatenate once after.

    – ALollz
    Nov 19 '18 at 14:40
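
    To illustrate the pattern ALollz describes, a minimal sketch (the 'pattern' variable and the EmptyDataError handling are assumptions standing in for the question's details, not a tested drop-in):

    import glob
    import pandas as pd

    pattern = r'W:\data\188\option\**\**\**.csv'  # hypothetical; adjust to your layout

    frames = []  # accumulate per-file DataFrames instead of growing df

    for filepath in glob.iglob(pattern, recursive=True):
        try:
            tmp = pd.read_csv(filepath, engine='c')
        except pd.errors.EmptyDataError:
            continue  # skip files with no data
        frames.append(tmp)  # O(1) list append; earlier data is never copied

    # a single concatenation at the end copies each row exactly once
    df = pd.concat(frames, ignore_index=True)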






  • Alexander's Solution illustrates this.

    – ALollz
    Nov 19 '18 at 14:42






  • Another illustration from @unutbu with wise words: Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.

    – Parfait
    Nov 19 '18 at 14:48
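
    The quadratic copying is easy to see with a standalone timing sketch (illustrative only, not from the thread; note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, which only strengthens the advice):

    import timeit
    import pandas as pd

    chunk = pd.DataFrame({'x': range(100)})

    def append_in_loop(n):
        df = pd.DataFrame()
        for _ in range(n):
            df = df.append(chunk)  # re-copies every previously appended row
        return df

    def concat_once(n):
        return pd.concat([chunk] * n)  # each row is copied exactly once

    for n in (100, 200, 400):
        t_append = timeit.timeit(lambda: append_in_loop(n), number=1)
        t_concat = timeit.timeit(lambda: concat_once(n), number=1)
        # doubling n roughly quadruples t_append but only doubles t_concat
        print(n, round(t_append, 3), round(t_concat, 3))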











  • Nice! I was not aware of this. If we're trying to be as efficient as possible, is there any significant difference in the performance of Alexander's solution compared to the concatenation?

    – Évariste Galois
    Nov 19 '18 at 14:53






  • I'd just use the csv module tbh. Every attempt I've made at growing a DF like this has been crippling in speed and memory. I have not found a hack to get around it. The odd join or concat, maybe, but the overhead is gross once you put it in a loop.

    – roganjosh
    Nov 19 '18 at 14:55
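
    For reference, the csv-module route roganjosh mentions might look like this sketch (it assumes every file shares one header row; csv returns strings, so dtype conversion happens after the build):

    import csv
    import glob
    import pandas as pd

    pattern = r'W:\data\188\option\**\**\**.csv'  # hypothetical; adjust to your layout

    rows = []
    header = None

    for filepath in glob.iglob(pattern, recursive=True):
        with open(filepath, newline='') as f:
            reader = csv.reader(f)
            file_header = next(reader, None)  # consume the header row
            if file_header is None:
                continue  # empty file
            if header is None:
                header = file_header  # remember the first header seen
            rows.extend(reader)  # plain Python lists are cheap to grow

    # build the DataFrame once; convert columns afterwards, e.g. with pd.to_numeric
    df = pd.DataFrame(rows, columns=header)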
python pandas
asked Nov 19 '18 at 14:37 by Évariste Galois
1 Answer
Consider splitting your code into separate functions, like so:



def get_data_from_csv(filepath):
    optionNameCSI = filepath.split("\\")[-1].split('.')[0]
    try:
        df = pd.read_csv(filepath, engine='c')
        # do stuff ...
        return df
    except (IndexError, KeyError) as e:
        return


Then you can use a list comprehension to gather all the data in a list, as people above have suggested:



filepaths = glob.iglob(r'W:\data\{0}\option\**\**\**.csv'.format(188), recursive=True)
result = [get_data_from_csv(filepath) for filepath in filepaths]
result = [r for r in result if r is not None]  # remove 'None' values


Then join the data using pd.concat:



df = pd.concat(result)
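

A small optional refinement (my suggestion, not part of the answer above): the two list steps can be fused into a single pass, and ignore_index=True gives the combined frame a clean 0..n-1 index:

df = pd.concat(
    (d for d in (get_data_from_csv(fp) for fp in filepaths) if d is not None),
    ignore_index=True,
)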





answered Nov 19 '18 at 15:38 by Robert, edited Nov 19 '18 at 15:43