Nested XML with all data as attributes to pandas dataframe, only final entry is being imported












0















Edit: Issue resolved, updated code below.



I've spent several days scouring the internet trying to figure this out and I can't seem to get anywhere.



Basically I have a heavily nested XML file with all the info I want to extract included as attributes, example below:



<TotalAllocationResultDocument xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" DtdRelease="0" DtdVersion="4" xsi:noNamespaceSchemaLocation="total-allocation-result-document.xsd">
<DocumentIdentification v="A_R-GB-NI-Q-BASE-------190101-01"/>
<DocumentVersion v="1"/>
<DocumentType v="A25"/>
<SenderIdentification codingScheme="A01" v="10X1001A1001A57U"/>
<SenderRole v="A07"/>
<ReceiverIdentification codingScheme="A01" v="10XNITSO-12345-O"/>
<ReceiverRole v="A04"/>
<CreationDateTime v="2018-11-13T12:55:08Z"/>
<BidTimeInterval v="2018-12-31T23:00Z/2019-03-31T22:00Z"/>
<Domain codingScheme="A01" v="10YDOM-1010A024Y"/>
<AllocationTimeSeries>
<TimeSeriesIdentification v="TotalAllocationResults_TS_4888599"/>
<BidDocumentIdentification v="GB-NI-Q-BASE-------190101-01"/>
<BidDocumentVersion v="1"/>
<BidIdentification v="BID00001-GB-NI"/>
<BiddingParty codingScheme="A01" v="xxxxxxxxxx"/>
<AuctionIdentification v="GB-NI-Q-BASE-------190101-01"/>
<BusinessType v="A03"/>
<InArea codingScheme="A01" v="10Y1001A1001A016"/>
<OutArea codingScheme="A01" v="10YGB----------A"/>
<ContractType v="A08"/>
<ContractIdentification v="311218D3-GB-NI-FK01"/>
<MeasureUnitQuantity v="MAW"/>
<Currency v="EUR"/>
<MeasureUnitPrice v="MWH"/>
<Period>
<TimeInterval v="2018-12-31T23:00Z/2019-03-31T22:00Z"/>
<Resolution v="P3M"/>
<Interval>
<Pos v="1"/>
<Qty v="0"/>
<PriceAmount v="4.02"/>
<BidQty v="3.0"/>
<BidPriceAmount v="2.51"/>
</Interval>
</Period>
</AllocationTimeSeries>
<AllocationTimeSeries>
<TimeSeriesIdentification v="TotalAllocationResults_TS_4888602"/>
<BidDocumentIdentification v="GB-NI-Q-BASE-------190101-01"/>
<BidDocumentVersion v="1"/>
<BidIdentification v="BID00004-GB-NI"/>
<BiddingParty codingScheme="A01" v="yyyyyyyyyy"/>
<AuctionIdentification v="GB-NI-Q-BASE-------190101-01"/>
<BusinessType v="A03"/>
<InArea codingScheme="A01" v="10Y1001A1001A016"/>
<OutArea codingScheme="A01" v="10YGB----------A"/>
<ContractType v="A08"/>
<ContractIdentification v="311218D3-GB-NI-FK01"/>
<MeasureUnitQuantity v="MAW"/>
<Currency v="EUR"/>
<MeasureUnitPrice v="MWH"/>
<Period>
<TimeInterval v="2018-12-31T23:00Z/2019-03-31T22:00Z"/>
<Resolution v="P3M"/>
<Interval>
<Pos v="1"/>
<Qty v="0"/>
<PriceAmount v="4.02"/>
<BidQty v="3.0"/>
<BidPriceAmount v="1.51"/>
</Interval>
</Period>
</AllocationTimeSeries>
</TotalAllocationResultDocument>


I'm trying to get this data into a pandas dataframe, with columns something like:



SenderIdentification | DocumentIdentification | InArea | BidID | AllocatedQty



The code I've got is based on a load of different sources, but looks like:



import pandas as pd
import xml.etree.cElementTree as et

def getvalueofnode(node):
""" return node text or None """
return node.text if node is not None else None

def main():
""" main """
parsed_xml =et.parse("example.xml")
dfcols = ['SenderIdentification', 'DocumentIdentification', 'InArea', 'BidID', 'AllocatedQty']
df_xml = pd.DataFrame(columns=dfcols)

for node in parsed_xml.getroot():

for SenderIdentificationFind in parsed_xml.findall('./SenderIdentification'):
SenderIdentification = SenderIdentificationFind.get('v')

for DocumentIdentificationFind in parsed_xml.findall('./DocumentIdentification'):
DocumentIdentification = DocumentIdentificationFind.get('v')

for InAreaFind in parsed_xml.findall('./AllocationTimeSeries/InArea'):
InArea = InAreaFind.get('v')

for BidIDFind in parsed_xml.findall('./AllocationTimeSeries/BidIdentification'):
BidID = BidIDFind.get('v')

for BidIDFind in parsed_xml.findall('./AllocationTimeSeries/Period/Interval/Qty'):
AllocatedQty = BidIDFind.get('v')

df_xml = df_xml.append(
pd.Series([SenderIdentification, DocumentIdentification, InArea, BidID, AllocatedQty], index=dfcols),
ignore_index=True)

return df_xml

main()

df_xml = main()


This is creating a dataframe, but only seems to be pulling data from the last "AllocationTimeSeries" entry in the XML, and seems to have far more entries than the total number of "AllocationTimeSeries" entries in the document (eg. I would expect two unique rows based on this example data, but I'm actually seeing 12 identical rows).



I'm quite confused so any help as to how to get to the dataframe I'm after would be appreciated.



I've also tried the approach of using an XLST file to simplify the XLS file to an easier format to parse, but I don't really have the knowledge of XLS to get that to work either.



Thanks!



Code that works:



import pandas as pd
import lxml.etree as et

def getvalueofnode(node):
""" return node text or None """
return node.text if node is not None else None

def main():
""" main """
parsed_xml = et.parse("example.xml")
dfcols = ['DocumentIdentification', 'DocumentVersion', 'CreationDateTime', 'BidTimeInterval', 'BidIdentification', 'BiddingParty', 'OutArea', 'InArea', 'ContractIdentification', 'BidPrice', ' BidQuantity', 'ClearingPrice', 'AllocatedQuantity']
df_xml = pd.DataFrame(columns=dfcols)

root = parsed_xml.getroot()
result = len(root.findall("./AllocationTimeSeries"))+1

for DocumentIdentificationFind in parsed_xml.findall('./DocumentIdentification'):
DocumentIdentification = DocumentIdentificationFind.get('v')

for DocumentVersionFind in parsed_xml.findall('./DocumentVersion'):
DocumentVersion = DocumentVersionFind.get('v')

for CreationDateTimeFind in parsed_xml.findall('./CreationDateTime'):
CreationDateTime = CreationDateTimeFind.get('v')

for BidTimeIntervalFind in parsed_xml.findall('./BidTimeInterval'):
BidTimeInterval = BidTimeIntervalFind.get('v')

for i in range(1,result):

BidIdentificationString = './AllocationTimeSeries[%d]/BidIdentification'%(i)
BiddingPartyString = './AllocationTimeSeries[%d]/BiddingParty'%(i)
InAreaString = './AllocationTimeSeries[%d]/InArea'%(i)
OutAreaString = './AllocationTimeSeries[%d]/OutArea'%(i)
ContractIdentificationString = './AllocationTimeSeries[%d]/ContractIdentification'%(i)
BidPriceString = './AllocationTimeSeries[%d]/Period/Interval/BidPriceAmount'%(i)
BidQuantityString = './AllocationTimeSeries[%d]/Period/Interval/BidQty'%(i)
ClearingPriceString = './AllocationTimeSeries[%d]/Period/Interval/PriceAmount'%(i)
AllocatedQuantityString = './AllocationTimeSeries[%d]/Period/Interval/Qty'%(i)

for BidIdentificationFind in parsed_xml.findall(BidIdentificationString):
BidIdentification = BidIdentificationFind.get('v')

for BiddingPartyFind in parsed_xml.findall(BiddingPartyString):
BiddingParty = BiddingPartyFind.get('v')

for InAreaFind in parsed_xml.findall(InAreaString):
InArea = InAreaFind.get('v')

for OutAreaFind in parsed_xml.findall(OutAreaString):
OutArea = OutAreaFind.get('v')

for ContractIdentificationFind in parsed_xml.findall(ContractIdentificationString):
ContractIdentification = ContractIdentificationFind.get('v')

for BidPriceFind in parsed_xml.findall(BidPriceString):
BidPrice = BidPriceFind.get('v')

for BidQuantityFind in parsed_xml.findall(BidQuantityString):
BidQuantity = BidQuantityFind.get('v')

for ClearingPriceFind in parsed_xml.findall(ClearingPriceString):
ClearingPrice = ClearingPriceFind.get('v')

for AllocatedQuantityFind in parsed_xml.findall(AllocatedQuantityString):
AllocatedQuantity = AllocatedQuantityFind.get('v')

df_xml = df_xml.append(
pd.Series([DocumentIdentification, DocumentVersion, CreationDateTime, BidTimeInterval, BidIdentification, BiddingParty, OutArea, InArea, ContractIdentification, BidPrice, BidQuantity, ClearingPrice, AllocatedQuantity], index=dfcols),
ignore_index=True)

return df_xml

main()

df_xml = main()

df_xml









share|improve this question





























    0















    Edit: Issue resolved, updated code below.



    I've spent several days scouring the internet trying to figure this out and I can't seem to get anywhere.



    Basically I have a heavily nested XML file with all the info I want to extract included as attributes, example below:



    <TotalAllocationResultDocument xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" DtdRelease="0" DtdVersion="4" xsi:noNamespaceSchemaLocation="total-allocation-result-document.xsd">
    <DocumentIdentification v="A_R-GB-NI-Q-BASE-------190101-01"/>
    <DocumentVersion v="1"/>
    <DocumentType v="A25"/>
    <SenderIdentification codingScheme="A01" v="10X1001A1001A57U"/>
    <SenderRole v="A07"/>
    <ReceiverIdentification codingScheme="A01" v="10XNITSO-12345-O"/>
    <ReceiverRole v="A04"/>
    <CreationDateTime v="2018-11-13T12:55:08Z"/>
    <BidTimeInterval v="2018-12-31T23:00Z/2019-03-31T22:00Z"/>
    <Domain codingScheme="A01" v="10YDOM-1010A024Y"/>
    <AllocationTimeSeries>
    <TimeSeriesIdentification v="TotalAllocationResults_TS_4888599"/>
    <BidDocumentIdentification v="GB-NI-Q-BASE-------190101-01"/>
    <BidDocumentVersion v="1"/>
    <BidIdentification v="BID00001-GB-NI"/>
    <BiddingParty codingScheme="A01" v="xxxxxxxxxx"/>
    <AuctionIdentification v="GB-NI-Q-BASE-------190101-01"/>
    <BusinessType v="A03"/>
    <InArea codingScheme="A01" v="10Y1001A1001A016"/>
    <OutArea codingScheme="A01" v="10YGB----------A"/>
    <ContractType v="A08"/>
    <ContractIdentification v="311218D3-GB-NI-FK01"/>
    <MeasureUnitQuantity v="MAW"/>
    <Currency v="EUR"/>
    <MeasureUnitPrice v="MWH"/>
    <Period>
    <TimeInterval v="2018-12-31T23:00Z/2019-03-31T22:00Z"/>
    <Resolution v="P3M"/>
    <Interval>
    <Pos v="1"/>
    <Qty v="0"/>
    <PriceAmount v="4.02"/>
    <BidQty v="3.0"/>
    <BidPriceAmount v="2.51"/>
    </Interval>
    </Period>
    </AllocationTimeSeries>
    <AllocationTimeSeries>
    <TimeSeriesIdentification v="TotalAllocationResults_TS_4888602"/>
    <BidDocumentIdentification v="GB-NI-Q-BASE-------190101-01"/>
    <BidDocumentVersion v="1"/>
    <BidIdentification v="BID00004-GB-NI"/>
    <BiddingParty codingScheme="A01" v="yyyyyyyyyy"/>
    <AuctionIdentification v="GB-NI-Q-BASE-------190101-01"/>
    <BusinessType v="A03"/>
    <InArea codingScheme="A01" v="10Y1001A1001A016"/>
    <OutArea codingScheme="A01" v="10YGB----------A"/>
    <ContractType v="A08"/>
    <ContractIdentification v="311218D3-GB-NI-FK01"/>
    <MeasureUnitQuantity v="MAW"/>
    <Currency v="EUR"/>
    <MeasureUnitPrice v="MWH"/>
    <Period>
    <TimeInterval v="2018-12-31T23:00Z/2019-03-31T22:00Z"/>
    <Resolution v="P3M"/>
    <Interval>
    <Pos v="1"/>
    <Qty v="0"/>
    <PriceAmount v="4.02"/>
    <BidQty v="3.0"/>
    <BidPriceAmount v="1.51"/>
    </Interval>
    </Period>
    </AllocationTimeSeries>
    </TotalAllocationResultDocument>


    I'm trying to get this data into a pandas dataframe, with columns something like:



    SenderIdentification | DocumentIdentification | InArea | BidID | AllocatedQty



    The code I've got is based on a load of different sources, but looks like:



    import pandas as pd
    import xml.etree.cElementTree as et

    def getvalueofnode(node):
    """ return node text or None """
    return node.text if node is not None else None

    def main():
    """ main """
    parsed_xml =et.parse("example.xml")
    dfcols = ['SenderIdentification', 'DocumentIdentification', 'InArea', 'BidID', 'AllocatedQty']
    df_xml = pd.DataFrame(columns=dfcols)

    for node in parsed_xml.getroot():

    for SenderIdentificationFind in parsed_xml.findall('./SenderIdentification'):
    SenderIdentification = SenderIdentificationFind.get('v')

    for DocumentIdentificationFind in parsed_xml.findall('./DocumentIdentification'):
    DocumentIdentification = DocumentIdentificationFind.get('v')

    for InAreaFind in parsed_xml.findall('./AllocationTimeSeries/InArea'):
    InArea = InAreaFind.get('v')

    for BidIDFind in parsed_xml.findall('./AllocationTimeSeries/BidIdentification'):
    BidID = BidIDFind.get('v')

    for BidIDFind in parsed_xml.findall('./AllocationTimeSeries/Period/Interval/Qty'):
    AllocatedQty = BidIDFind.get('v')

    df_xml = df_xml.append(
    pd.Series([SenderIdentification, DocumentIdentification, InArea, BidID, AllocatedQty], index=dfcols),
    ignore_index=True)

    return df_xml

    main()

    df_xml = main()


    This is creating a dataframe, but only seems to be pulling data from the last "AllocationTimeSeries" entry in the XML, and seems to have far more entries than the total number of "AllocationTimeSeries" entries in the document (eg. I would expect two unique rows based on this example data, but I'm actually seeing 12 identical rows).



    I'm quite confused so any help as to how to get to the dataframe I'm after would be appreciated.



    I've also tried the approach of using an XLST file to simplify the XLS file to an easier format to parse, but I don't really have the knowledge of XLS to get that to work either.



    Thanks!



    Code that works:



    import pandas as pd
    import lxml.etree as et

    def getvalueofnode(node):
    """ return node text or None """
    return node.text if node is not None else None

    def main():
    """ main """
    parsed_xml = et.parse("example.xml")
    dfcols = ['DocumentIdentification', 'DocumentVersion', 'CreationDateTime', 'BidTimeInterval', 'BidIdentification', 'BiddingParty', 'OutArea', 'InArea', 'ContractIdentification', 'BidPrice', ' BidQuantity', 'ClearingPrice', 'AllocatedQuantity']
    df_xml = pd.DataFrame(columns=dfcols)

    root = parsed_xml.getroot()
    result = len(root.findall("./AllocationTimeSeries"))+1

    for DocumentIdentificationFind in parsed_xml.findall('./DocumentIdentification'):
    DocumentIdentification = DocumentIdentificationFind.get('v')

    for DocumentVersionFind in parsed_xml.findall('./DocumentVersion'):
    DocumentVersion = DocumentVersionFind.get('v')

    for CreationDateTimeFind in parsed_xml.findall('./CreationDateTime'):
    CreationDateTime = CreationDateTimeFind.get('v')

    for BidTimeIntervalFind in parsed_xml.findall('./BidTimeInterval'):
    BidTimeInterval = BidTimeIntervalFind.get('v')

    for i in range(1,result):

    BidIdentificationString = './AllocationTimeSeries[%d]/BidIdentification'%(i)
    BiddingPartyString = './AllocationTimeSeries[%d]/BiddingParty'%(i)
    InAreaString = './AllocationTimeSeries[%d]/InArea'%(i)
    OutAreaString = './AllocationTimeSeries[%d]/OutArea'%(i)
    ContractIdentificationString = './AllocationTimeSeries[%d]/ContractIdentification'%(i)
    BidPriceString = './AllocationTimeSeries[%d]/Period/Interval/BidPriceAmount'%(i)
    BidQuantityString = './AllocationTimeSeries[%d]/Period/Interval/BidQty'%(i)
    ClearingPriceString = './AllocationTimeSeries[%d]/Period/Interval/PriceAmount'%(i)
    AllocatedQuantityString = './AllocationTimeSeries[%d]/Period/Interval/Qty'%(i)

    for BidIdentificationFind in parsed_xml.findall(BidIdentificationString):
    BidIdentification = BidIdentificationFind.get('v')

    for BiddingPartyFind in parsed_xml.findall(BiddingPartyString):
    BiddingParty = BiddingPartyFind.get('v')

    for InAreaFind in parsed_xml.findall(InAreaString):
    InArea = InAreaFind.get('v')

    for OutAreaFind in parsed_xml.findall(OutAreaString):
    OutArea = OutAreaFind.get('v')

    for ContractIdentificationFind in parsed_xml.findall(ContractIdentificationString):
    ContractIdentification = ContractIdentificationFind.get('v')

    for BidPriceFind in parsed_xml.findall(BidPriceString):
    BidPrice = BidPriceFind.get('v')

    for BidQuantityFind in parsed_xml.findall(BidQuantityString):
    BidQuantity = BidQuantityFind.get('v')

    for ClearingPriceFind in parsed_xml.findall(ClearingPriceString):
    ClearingPrice = ClearingPriceFind.get('v')

    for AllocatedQuantityFind in parsed_xml.findall(AllocatedQuantityString):
    AllocatedQuantity = AllocatedQuantityFind.get('v')

    df_xml = df_xml.append(
    pd.Series([DocumentIdentification, DocumentVersion, CreationDateTime, BidTimeInterval, BidIdentification, BiddingParty, OutArea, InArea, ContractIdentification, BidPrice, BidQuantity, ClearingPrice, AllocatedQuantity], index=dfcols),
    ignore_index=True)

    return df_xml

    main()

    df_xml = main()

    df_xml









    share|improve this question



























      0












      0








      0


      1






      Edit: Issue resolved, updated code below.



      I've spent several days scouring the internet trying to figure this out and I can't seem to get anywhere.



      Basically I have a heavily nested XML file with all the info I want to extract included as attributes, example below:



      <TotalAllocationResultDocument xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" DtdRelease="0" DtdVersion="4" xsi:noNamespaceSchemaLocation="total-allocation-result-document.xsd">
      <DocumentIdentification v="A_R-GB-NI-Q-BASE-------190101-01"/>
      <DocumentVersion v="1"/>
      <DocumentType v="A25"/>
      <SenderIdentification codingScheme="A01" v="10X1001A1001A57U"/>
      <SenderRole v="A07"/>
      <ReceiverIdentification codingScheme="A01" v="10XNITSO-12345-O"/>
      <ReceiverRole v="A04"/>
      <CreationDateTime v="2018-11-13T12:55:08Z"/>
      <BidTimeInterval v="2018-12-31T23:00Z/2019-03-31T22:00Z"/>
      <Domain codingScheme="A01" v="10YDOM-1010A024Y"/>
      <AllocationTimeSeries>
      <TimeSeriesIdentification v="TotalAllocationResults_TS_4888599"/>
      <BidDocumentIdentification v="GB-NI-Q-BASE-------190101-01"/>
      <BidDocumentVersion v="1"/>
      <BidIdentification v="BID00001-GB-NI"/>
      <BiddingParty codingScheme="A01" v="xxxxxxxxxx"/>
      <AuctionIdentification v="GB-NI-Q-BASE-------190101-01"/>
      <BusinessType v="A03"/>
      <InArea codingScheme="A01" v="10Y1001A1001A016"/>
      <OutArea codingScheme="A01" v="10YGB----------A"/>
      <ContractType v="A08"/>
      <ContractIdentification v="311218D3-GB-NI-FK01"/>
      <MeasureUnitQuantity v="MAW"/>
      <Currency v="EUR"/>
      <MeasureUnitPrice v="MWH"/>
      <Period>
      <TimeInterval v="2018-12-31T23:00Z/2019-03-31T22:00Z"/>
      <Resolution v="P3M"/>
      <Interval>
      <Pos v="1"/>
      <Qty v="0"/>
      <PriceAmount v="4.02"/>
      <BidQty v="3.0"/>
      <BidPriceAmount v="2.51"/>
      </Interval>
      </Period>
      </AllocationTimeSeries>
      <AllocationTimeSeries>
      <TimeSeriesIdentification v="TotalAllocationResults_TS_4888602"/>
      <BidDocumentIdentification v="GB-NI-Q-BASE-------190101-01"/>
      <BidDocumentVersion v="1"/>
      <BidIdentification v="BID00004-GB-NI"/>
      <BiddingParty codingScheme="A01" v="yyyyyyyyyy"/>
      <AuctionIdentification v="GB-NI-Q-BASE-------190101-01"/>
      <BusinessType v="A03"/>
      <InArea codingScheme="A01" v="10Y1001A1001A016"/>
      <OutArea codingScheme="A01" v="10YGB----------A"/>
      <ContractType v="A08"/>
      <ContractIdentification v="311218D3-GB-NI-FK01"/>
      <MeasureUnitQuantity v="MAW"/>
      <Currency v="EUR"/>
      <MeasureUnitPrice v="MWH"/>
      <Period>
      <TimeInterval v="2018-12-31T23:00Z/2019-03-31T22:00Z"/>
      <Resolution v="P3M"/>
      <Interval>
      <Pos v="1"/>
      <Qty v="0"/>
      <PriceAmount v="4.02"/>
      <BidQty v="3.0"/>
      <BidPriceAmount v="1.51"/>
      </Interval>
      </Period>
      </AllocationTimeSeries>
      </TotalAllocationResultDocument>


      I'm trying to get this data into a pandas dataframe, with columns something like:



      SenderIdentification | DocumentIdentification | InArea | BidID | AllocatedQty



      The code I've got is based on a load of different sources, but looks like:



      import pandas as pd
      import xml.etree.cElementTree as et

      def getvalueofnode(node):
      """ return node text or None """
      return node.text if node is not None else None

      def main():
      """ main """
      parsed_xml =et.parse("example.xml")
      dfcols = ['SenderIdentification', 'DocumentIdentification', 'InArea', 'BidID', 'AllocatedQty']
      df_xml = pd.DataFrame(columns=dfcols)

      for node in parsed_xml.getroot():

      for SenderIdentificationFind in parsed_xml.findall('./SenderIdentification'):
      SenderIdentification = SenderIdentificationFind.get('v')

      for DocumentIdentificationFind in parsed_xml.findall('./DocumentIdentification'):
      DocumentIdentification = DocumentIdentificationFind.get('v')

      for InAreaFind in parsed_xml.findall('./AllocationTimeSeries/InArea'):
      InArea = InAreaFind.get('v')

      for BidIDFind in parsed_xml.findall('./AllocationTimeSeries/BidIdentification'):
      BidID = BidIDFind.get('v')

      for BidIDFind in parsed_xml.findall('./AllocationTimeSeries/Period/Interval/Qty'):
      AllocatedQty = BidIDFind.get('v')

      df_xml = df_xml.append(
      pd.Series([SenderIdentification, DocumentIdentification, InArea, BidID, AllocatedQty], index=dfcols),
      ignore_index=True)

      return df_xml

      main()

      df_xml = main()


      This is creating a dataframe, but only seems to be pulling data from the last "AllocationTimeSeries" entry in the XML, and seems to have far more entries than the total number of "AllocationTimeSeries" entries in the document (eg. I would expect two unique rows based on this example data, but I'm actually seeing 12 identical rows).



      I'm quite confused so any help as to how to get to the dataframe I'm after would be appreciated.



      I've also tried the approach of using an XLST file to simplify the XLS file to an easier format to parse, but I don't really have the knowledge of XLS to get that to work either.



      Thanks!



      Code that works:



      import pandas as pd
      import lxml.etree as et

      def getvalueofnode(node):
      """ return node text or None """
      return node.text if node is not None else None

      def main():
      """ main """
      parsed_xml = et.parse("example.xml")
      dfcols = ['DocumentIdentification', 'DocumentVersion', 'CreationDateTime', 'BidTimeInterval', 'BidIdentification', 'BiddingParty', 'OutArea', 'InArea', 'ContractIdentification', 'BidPrice', ' BidQuantity', 'ClearingPrice', 'AllocatedQuantity']
      df_xml = pd.DataFrame(columns=dfcols)

      root = parsed_xml.getroot()
      result = len(root.findall("./AllocationTimeSeries"))+1

      for DocumentIdentificationFind in parsed_xml.findall('./DocumentIdentification'):
      DocumentIdentification = DocumentIdentificationFind.get('v')

      for DocumentVersionFind in parsed_xml.findall('./DocumentVersion'):
      DocumentVersion = DocumentVersionFind.get('v')

      for CreationDateTimeFind in parsed_xml.findall('./CreationDateTime'):
      CreationDateTime = CreationDateTimeFind.get('v')

      for BidTimeIntervalFind in parsed_xml.findall('./BidTimeInterval'):
      BidTimeInterval = BidTimeIntervalFind.get('v')

      for i in range(1,result):

      BidIdentificationString = './AllocationTimeSeries[%d]/BidIdentification'%(i)
      BiddingPartyString = './AllocationTimeSeries[%d]/BiddingParty'%(i)
      InAreaString = './AllocationTimeSeries[%d]/InArea'%(i)
      OutAreaString = './AllocationTimeSeries[%d]/OutArea'%(i)
      ContractIdentificationString = './AllocationTimeSeries[%d]/ContractIdentification'%(i)
      BidPriceString = './AllocationTimeSeries[%d]/Period/Interval/BidPriceAmount'%(i)
      BidQuantityString = './AllocationTimeSeries[%d]/Period/Interval/BidQty'%(i)
      ClearingPriceString = './AllocationTimeSeries[%d]/Period/Interval/PriceAmount'%(i)
      AllocatedQuantityString = './AllocationTimeSeries[%d]/Period/Interval/Qty'%(i)

      for BidIdentificationFind in parsed_xml.findall(BidIdentificationString):
      BidIdentification = BidIdentificationFind.get('v')

      for BiddingPartyFind in parsed_xml.findall(BiddingPartyString):
      BiddingParty = BiddingPartyFind.get('v')

      for InAreaFind in parsed_xml.findall(InAreaString):
      InArea = InAreaFind.get('v')

      for OutAreaFind in parsed_xml.findall(OutAreaString):
      OutArea = OutAreaFind.get('v')

      for ContractIdentificationFind in parsed_xml.findall(ContractIdentificationString):
      ContractIdentification = ContractIdentificationFind.get('v')

      for BidPriceFind in parsed_xml.findall(BidPriceString):
      BidPrice = BidPriceFind.get('v')

      for BidQuantityFind in parsed_xml.findall(BidQuantityString):
      BidQuantity = BidQuantityFind.get('v')

      for ClearingPriceFind in parsed_xml.findall(ClearingPriceString):
      ClearingPrice = ClearingPriceFind.get('v')

      for AllocatedQuantityFind in parsed_xml.findall(AllocatedQuantityString):
      AllocatedQuantity = AllocatedQuantityFind.get('v')

      df_xml = df_xml.append(
      pd.Series([DocumentIdentification, DocumentVersion, CreationDateTime, BidTimeInterval, BidIdentification, BiddingParty, OutArea, InArea, ContractIdentification, BidPrice, BidQuantity, ClearingPrice, AllocatedQuantity], index=dfcols),
      ignore_index=True)

      return df_xml

      main()

      df_xml = main()

      df_xml









      share|improve this question
















      Edit: Issue resolved, updated code below.



      I've spent several days scouring the internet trying to figure this out and I can't seem to get anywhere.



      Basically I have a heavily nested XML file with all the info I want to extract included as attributes, example below:



      <TotalAllocationResultDocument xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" DtdRelease="0" DtdVersion="4" xsi:noNamespaceSchemaLocation="total-allocation-result-document.xsd">
      <DocumentIdentification v="A_R-GB-NI-Q-BASE-------190101-01"/>
      <DocumentVersion v="1"/>
      <DocumentType v="A25"/>
      <SenderIdentification codingScheme="A01" v="10X1001A1001A57U"/>
      <SenderRole v="A07"/>
      <ReceiverIdentification codingScheme="A01" v="10XNITSO-12345-O"/>
      <ReceiverRole v="A04"/>
      <CreationDateTime v="2018-11-13T12:55:08Z"/>
      <BidTimeInterval v="2018-12-31T23:00Z/2019-03-31T22:00Z"/>
      <Domain codingScheme="A01" v="10YDOM-1010A024Y"/>
      <AllocationTimeSeries>
      <TimeSeriesIdentification v="TotalAllocationResults_TS_4888599"/>
      <BidDocumentIdentification v="GB-NI-Q-BASE-------190101-01"/>
      <BidDocumentVersion v="1"/>
      <BidIdentification v="BID00001-GB-NI"/>
      <BiddingParty codingScheme="A01" v="xxxxxxxxxx"/>
      <AuctionIdentification v="GB-NI-Q-BASE-------190101-01"/>
      <BusinessType v="A03"/>
      <InArea codingScheme="A01" v="10Y1001A1001A016"/>
      <OutArea codingScheme="A01" v="10YGB----------A"/>
      <ContractType v="A08"/>
      <ContractIdentification v="311218D3-GB-NI-FK01"/>
      <MeasureUnitQuantity v="MAW"/>
      <Currency v="EUR"/>
      <MeasureUnitPrice v="MWH"/>
      <Period>
      <TimeInterval v="2018-12-31T23:00Z/2019-03-31T22:00Z"/>
      <Resolution v="P3M"/>
      <Interval>
      <Pos v="1"/>
      <Qty v="0"/>
      <PriceAmount v="4.02"/>
      <BidQty v="3.0"/>
      <BidPriceAmount v="2.51"/>
      </Interval>
      </Period>
      </AllocationTimeSeries>
      <AllocationTimeSeries>
      <TimeSeriesIdentification v="TotalAllocationResults_TS_4888602"/>
      <BidDocumentIdentification v="GB-NI-Q-BASE-------190101-01"/>
      <BidDocumentVersion v="1"/>
      <BidIdentification v="BID00004-GB-NI"/>
      <BiddingParty codingScheme="A01" v="yyyyyyyyyy"/>
      <AuctionIdentification v="GB-NI-Q-BASE-------190101-01"/>
      <BusinessType v="A03"/>
      <InArea codingScheme="A01" v="10Y1001A1001A016"/>
      <OutArea codingScheme="A01" v="10YGB----------A"/>
      <ContractType v="A08"/>
      <ContractIdentification v="311218D3-GB-NI-FK01"/>
      <MeasureUnitQuantity v="MAW"/>
      <Currency v="EUR"/>
      <MeasureUnitPrice v="MWH"/>
      <Period>
      <TimeInterval v="2018-12-31T23:00Z/2019-03-31T22:00Z"/>
      <Resolution v="P3M"/>
      <Interval>
      <Pos v="1"/>
      <Qty v="0"/>
      <PriceAmount v="4.02"/>
      <BidQty v="3.0"/>
      <BidPriceAmount v="1.51"/>
      </Interval>
      </Period>
      </AllocationTimeSeries>
      </TotalAllocationResultDocument>


      I'm trying to get this data into a pandas dataframe, with columns something like:



      SenderIdentification | DocumentIdentification | InArea | BidID | AllocatedQty



      The code I've got is based on a load of different sources, but looks like:



      import pandas as pd
      import xml.etree.cElementTree as et

      def getvalueofnode(node):
      """ return node text or None """
      return node.text if node is not None else None

      def main():
      """ main """
      parsed_xml =et.parse("example.xml")
      dfcols = ['SenderIdentification', 'DocumentIdentification', 'InArea', 'BidID', 'AllocatedQty']
      df_xml = pd.DataFrame(columns=dfcols)

      for node in parsed_xml.getroot():

      for SenderIdentificationFind in parsed_xml.findall('./SenderIdentification'):
      SenderIdentification = SenderIdentificationFind.get('v')

      for DocumentIdentificationFind in parsed_xml.findall('./DocumentIdentification'):
      DocumentIdentification = DocumentIdentificationFind.get('v')

      for InAreaFind in parsed_xml.findall('./AllocationTimeSeries/InArea'):
      InArea = InAreaFind.get('v')

      for BidIDFind in parsed_xml.findall('./AllocationTimeSeries/BidIdentification'):
      BidID = BidIDFind.get('v')

      for BidIDFind in parsed_xml.findall('./AllocationTimeSeries/Period/Interval/Qty'):
      AllocatedQty = BidIDFind.get('v')

      df_xml = df_xml.append(
      pd.Series([SenderIdentification, DocumentIdentification, InArea, BidID, AllocatedQty], index=dfcols),
      ignore_index=True)

      return df_xml

      main()

      df_xml = main()


      This is creating a dataframe, but only seems to be pulling data from the last "AllocationTimeSeries" entry in the XML, and seems to have far more entries than the total number of "AllocationTimeSeries" entries in the document (eg. I would expect two unique rows based on this example data, but I'm actually seeing 12 identical rows).



      I'm quite confused so any help as to how to get to the dataframe I'm after would be appreciated.



      I've also tried the approach of using an XLST file to simplify the XLS file to an easier format to parse, but I don't really have the knowledge of XLS to get that to work either.



      Thanks!



      Code that works:



      import pandas as pd
      import lxml.etree as et

      def getvalueofnode(node):
      """ return node text or None """
      return node.text if node is not None else None

      def main():
      """ main """
      parsed_xml = et.parse("example.xml")
      dfcols = ['DocumentIdentification', 'DocumentVersion', 'CreationDateTime', 'BidTimeInterval', 'BidIdentification', 'BiddingParty', 'OutArea', 'InArea', 'ContractIdentification', 'BidPrice', ' BidQuantity', 'ClearingPrice', 'AllocatedQuantity']
      df_xml = pd.DataFrame(columns=dfcols)

      root = parsed_xml.getroot()
      result = len(root.findall("./AllocationTimeSeries"))+1

      for DocumentIdentificationFind in parsed_xml.findall('./DocumentIdentification'):
      DocumentIdentification = DocumentIdentificationFind.get('v')

      for DocumentVersionFind in parsed_xml.findall('./DocumentVersion'):
      DocumentVersion = DocumentVersionFind.get('v')

      for CreationDateTimeFind in parsed_xml.findall('./CreationDateTime'):
      CreationDateTime = CreationDateTimeFind.get('v')

      for BidTimeIntervalFind in parsed_xml.findall('./BidTimeInterval'):
      BidTimeInterval = BidTimeIntervalFind.get('v')

      for i in range(1,result):

      BidIdentificationString = './AllocationTimeSeries[%d]/BidIdentification'%(i)
      BiddingPartyString = './AllocationTimeSeries[%d]/BiddingParty'%(i)
      InAreaString = './AllocationTimeSeries[%d]/InArea'%(i)
      OutAreaString = './AllocationTimeSeries[%d]/OutArea'%(i)
      ContractIdentificationString = './AllocationTimeSeries[%d]/ContractIdentification'%(i)
      BidPriceString = './AllocationTimeSeries[%d]/Period/Interval/BidPriceAmount'%(i)
      BidQuantityString = './AllocationTimeSeries[%d]/Period/Interval/BidQty'%(i)
      ClearingPriceString = './AllocationTimeSeries[%d]/Period/Interval/PriceAmount'%(i)
      AllocatedQuantityString = './AllocationTimeSeries[%d]/Period/Interval/Qty'%(i)

      for BidIdentificationFind in parsed_xml.findall(BidIdentificationString):
      BidIdentification = BidIdentificationFind.get('v')

      for BiddingPartyFind in parsed_xml.findall(BiddingPartyString):
      BiddingParty = BiddingPartyFind.get('v')

      for InAreaFind in parsed_xml.findall(InAreaString):
      InArea = InAreaFind.get('v')

      for OutAreaFind in parsed_xml.findall(OutAreaString):
      OutArea = OutAreaFind.get('v')

      for ContractIdentificationFind in parsed_xml.findall(ContractIdentificationString):
      ContractIdentification = ContractIdentificationFind.get('v')

      for BidPriceFind in parsed_xml.findall(BidPriceString):
      BidPrice = BidPriceFind.get('v')

      for BidQuantityFind in parsed_xml.findall(BidQuantityString):
      BidQuantity = BidQuantityFind.get('v')

      for ClearingPriceFind in parsed_xml.findall(ClearingPriceString):
      ClearingPrice = ClearingPriceFind.get('v')

      for AllocatedQuantityFind in parsed_xml.findall(AllocatedQuantityString):
      AllocatedQuantity = AllocatedQuantityFind.get('v')

      df_xml = df_xml.append(
      pd.Series([DocumentIdentification, DocumentVersion, CreationDateTime, BidTimeInterval, BidIdentification, BiddingParty, OutArea, InArea, ContractIdentification, BidPrice, BidQuantity, ClearingPrice, AllocatedQuantity], index=dfcols),
      ignore_index=True)

      return df_xml

      main()

      df_xml = main()

      df_xml






      python xml pandas






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 20 '18 at 16:36







      rtunnicliffe

















      asked Nov 19 '18 at 21:22









      rtunnicliffertunnicliffe

      13




      13
























          0






          active

          oldest

          votes











          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53382820%2fnested-xml-with-all-data-as-attributes-to-pandas-dataframe-only-final-entry-is%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          0






          active

          oldest

          votes








          0






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes
















          draft saved

          draft discarded




















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53382820%2fnested-xml-with-all-data-as-attributes-to-pandas-dataframe-only-final-entry-is%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          Biblatex bibliography style without URLs when DOI exists (in Overleaf with Zotero bibliography)

          ComboBox Display Member on multiple fields

          Is it possible to collect Nectar points via Trainline?