Parsing big XML files efficiently





I'm dealing with XML files that are 4 GB+ in size and wondering how best to parse them. Right now I run into memory issues, and I'm looking for a way to avoid loading the whole file into memory, perhaps by going through it in batches?



The current code uses lxml and iterates over the repeating elements; namespaces are cleared up front:



from lxml import etree, objectify
import pandas as pd

file = 'some_huge_file.xml'
if file.lower().endswith('.xml'):
    parser = etree.XMLParser(remove_blank_text=True)
    tree = etree.parse(file, parser)
    root = tree.getroot()

    # Strip the namespace prefix from every tag so plain tag names can be used below
    for elem in root.getiterator():
        if not hasattr(elem.tag, 'find'):
            continue  # skip comments and processing instructions
        i = elem.tag.find('}')
        if i >= 0:
            elem.tag = elem.tag[i + 1:]
    objectify.deannotate(root, cleanup_namespaces=True)

    # Collect one row per subelement
    data = [{
        'Element1': tp.findtext('element1'),
        'Element2': tp.findtext('element2'),
        'Element3': tp.findtext('element3'),
    } for tp in tree.xpath('//mainelement/subelement')]

    df = pd.DataFrame(data)
    print(df)


Furthermore, I need to split the values of the elements, as they are space-separated. However, I only need specific values, so I'm wondering whether I can do this somehow within the parsing instead of splitting the columns on whitespace afterwards?



XML example:



<mainelement>
<subelement tc="00:00:00:000" ms="0">
<element1>0.861668 0.496888 0.000000 0.000000 0.867815</element1>
<element2>0.043423 0.509801 -0.111990 -0.070212 0.126711</element2>
<element3>-0.001501 0.008416 0.000098 0.005241 0.005301</element3>
</subelement>
<subelement tc="00:00:00:001" ms="1">
<element1>-0.503814 0.005664 -0.070326 -0.860926 -0.503720</element1>
<element2>-0.044658 0.046381 0.909291 -0.033390 0.049348</element2>
<element3>-0.000000 -0.000000 -0.000000 -0.005217 0.007849</element3>
</subelement>
<subelement tc="00:00:00:002" ms="2">
<element1> -0.861173 0.503578 -0.007163 0.056031 0.862101</element1>
<element2>0.371398 1.325794 -0.030966 0.059466 1.388910</element2>
<element3>-0.010139 0.001048 0.026847 -0.010139 0.001048</element3>
</subelement>
<subelement tc="00:00:00:003" ms="3">
<element1>0.856813 0.494664 0.003921 0.023356 0.868762</element1>
<element2>-0.030966 0.059466 1.388910 -0.152636 -0.008650</element2>
<element3>0.001048 0.026847 -0.010139 0.001048 0.035846</element3>
</subelement>
</mainelement>









python lxml






asked Nov 22 '18 at 12:13









Chrisvdberge

  • I don't know about Python XML handling, but in the Java world there are two types: SAX and DOM. DOM means "Document Object Model": the whole XML file gets loaded into memory so that queries can be run afterwards. Once loaded, it is very quick, but it might consume huge amounts of memory. SAX, on the other hand, streams over your XML and fires an event whenever a certain tag, attribute, or piece of content is reached. This might take quite a while, but it uses almost no memory. So, when you say "efficient", do you mean speed-related or memory-related efficiency? (A Python sketch of this SAX approach follows after these comments.)

    – Dominique
    Nov 22 '18 at 12:29











  • speed at this point is not important, I'd say. I just need to get the data into a database for now, and if I try to read all the elements and values I need, Python will just crash or error out (and macOS starts force-quitting applications ;) ).

    – Chrisvdberge
    Nov 22 '18 at 12:37











  • Did you check this?

    – Andersson
    Nov 22 '18 at 12:41











  • Possible duplicate of xml parsing in python for big data

    – stovfl
    Nov 22 '18 at 13:31











  • those links were quite helpful. The possible duplicate pointed in the right direction but didn't provide a clear, concrete answer, so I added the code I came up with as an answer to this question for clarity

    – Chrisvdberge
    Nov 23 '18 at 12:05
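
In Python, the SAX-style streaming described in the first comment looks roughly like this. It is a minimal sketch using the standard library's xml.sax; the tag names are taken from the XML example above, and only element1 is collected for brevity:

import xml.sax

class SubelementHandler(xml.sax.ContentHandler):
    """Collects the 'tc' attribute and the element1 values of every <subelement>."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._current_tag = None
        self._tc = None
        self._text = []

    def startElement(self, name, attrs):
        self._current_tag = name
        if name == 'subelement':
            self._tc = attrs.get('tc')

    def characters(self, content):
        # may be called several times for one text node, so accumulate pieces
        if self._current_tag == 'element1':
            self._text.append(content)

    def endElement(self, name):
        if name == 'element1':
            self.rows.append((self._tc, ''.join(self._text).split()))
            self._text = []
        self._current_tag = None

handler = SubelementHandler()
xml.sax.parse('some_huge_file.xml', handler)   # streams the file; no full tree is built
print(handler.rows[:2])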



















1 Answer
Based on the links posted in the comments, I came up with the following, which iterates and splits more efficiently and works fine:



from lxml import etree
import pandas as pd

file = 'some_huge_file.xml'

time = []
data1_element1_x = []
data1_element1_y = []
data1_element2_x = []
data1_element2_y = []
data2_element1_x = []
data2_element1_y = []
data2_element2_x = []
data2_element2_y = []

if file.lower().endswith('.xml'):
    # iterparse streams the file and yields elements as their closing tags are read,
    # so the whole document never has to be held in memory
    for event, elem in etree.iterparse(file):
        if elem.tag == "subelement":
            time.append(elem.get('tc'))
            for child in elem:
                # split() without an argument also copes with stray leading/extra spaces
                if child.tag == "element1":
                    split_data = child.text.split()
                    data1_element1_x.append(float(split_data[0]))
                    data1_element1_y.append(float(split_data[1]))
                    data2_element1_x.append(float(split_data[2]))
                    data2_element1_y.append(float(split_data[3]))
                elif child.tag == "element2":
                    split_data = child.text.split()
                    data1_element2_x.append(float(split_data[0]))
                    data1_element2_y.append(float(split_data[1]))
                    data2_element2_x.append(float(split_data[2]))
                    data2_element2_y.append(float(split_data[3]))
            # free the processed element so memory stays flat
            elem.clear()

df = pd.DataFrame({
    'Time': time,
    'Data1_element1_x': data1_element1_x,
    'Data1_element1_y': data1_element1_y,
    'Data1_element2_x': data1_element2_x,
    'Data1_element2_y': data1_element2_y,
    'Data2_element1_x': data2_element1_x,
    'Data2_element1_y': data2_element1_y,
    'Data2_element2_x': data2_element2_x,
    'Data2_element2_y': data2_element2_y
})

print(df)
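
Note that the lists above still keep every row in memory until the DataFrame is built at the end. If the goal is only to get the data into a database, the rows can instead be flushed in batches while parsing; a minimal sketch, assuming a local SQLite file results.db and a table named frames (both names are illustrative), keeping only the first two values of element1 for brevity:

import sqlite3

from lxml import etree
import pandas as pd

file = 'some_huge_file.xml'
conn = sqlite3.connect('results.db')   # illustrative database file
batch, batch_size = [], 10000

# tag='subelement' makes iterparse yield only the elements of interest
for event, elem in etree.iterparse(file, tag='subelement'):
    values = elem.findtext('element1').split()   # element1 is present in every subelement of the example
    batch.append({
        'Time': elem.get('tc'),
        'Data1_element1_x': float(values[0]),
        'Data1_element1_y': float(values[1]),
    })
    # free the element and drop already-processed siblings so the tree does not grow
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]
    if len(batch) >= batch_size:
        pd.DataFrame(batch).to_sql('frames', conn, if_exists='append', index=False)
        batch = []

if batch:   # flush the remainder
    pd.DataFrame(batch).to_sql('frames', conn, if_exists='append', index=False)
conn.close()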





        answered Nov 23 '18 at 12:04









Chrisvdberge
