Extract data from XML files

Lei New Altair Community Member
edited November 2024 in Community Q&A
I have many XML files. They share a similar structure but differ in some details.

The XML structure looks like this:

<article>
    <art-front>
        <titlegrp>
            <title>Integrated phytoremediation</title>
        </titlegrp>
        <abstract>
            <p>Phytoremediation is green rehabilitation technology.</p>
        </abstract>
    </art-front>
    <art-body>
        <section>
            <title>One thing</title>
            <p>the main technologies 1...</p>
            <p>the main technologies 2...</p>
        </section>
        <section>
            <title>Others</title>
            <subsect1>
                <p>the main technologies 3...</p>
                <p>the main technologies 4...</p>
                <p>the main technologies 5...</p>
            </subsect1>
        </section>
    </art-body>
    <art-back>
        <biblist title="References">
            <citauth>
                <fname>H.</fname>
                <surname>Ali</surname>
            </citauth>
        </biblist>
    </art-back>
</article>

The files differ in what appears between <art-body> and </art-body>. Some files have four <section> elements, some have five, and so on. The number of <p> elements inside a <section> can also vary. In addition, some files have no <subsect1> elements at all, only multiple <section> elements.

I want to extract the contents of <art-front> and <art-body>, but not the contents of <art-back>.

I know that the Read XML operator can extract content from an XML file, and that the Read Document operator can do it as well. Because my XML files are not all identical, I don't know how to handle them. Is there a way to do this?

Thanks

Best Answer

  • BalazsBaranyRM
    BalazsBaranyRM New Altair Community Member
    Answer ✓
    Hi!

    In these cases I usually build the process with multiple Read XML operators.

    One extracts the common information, e.g. from the constant header. Another extracts the variable information, such as the repeating entries. I then join the results, e.g. on the file name or some other common attribute.

    Use the most specific XPath you can to select what you need in each Read XML operator, and work out which join type fits the task best.

    Regards,
    Balázs
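
The approach in the answer above can also be sketched outside RapidMiner, to show why the varying number of <section> and <p> elements is not a problem. This is only an illustration under stated assumptions, not the Read XML operator itself: it uses Python's standard xml.etree.ElementTree module, the function names and the file name "sample.xml" are made up for the example, and the sample document is a shortened copy of the one in the question. One path reads the constant header, a second path reads every <p> at any depth inside each <section>, and the two results are joined on the file name. <art-back> is never selected, so it is left out.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the structure from the question (hypothetical test data).
SAMPLE = """<article>
  <art-front>
    <titlegrp><title>Integrated phytoremediation</title></titlegrp>
    <abstract><p>Phytoremediation is green rehabilitation technology.</p></abstract>
  </art-front>
  <art-body>
    <section>
      <title>One thing</title>
      <p>the main technologies 1...</p>
      <p>the main technologies 2...</p>
    </section>
    <section>
      <title>Others</title>
      <subsect1>
        <p>the main technologies 3...</p>
      </subsect1>
    </section>
  </art-body>
  <art-back>
    <biblist title="References">
      <citauth><fname>H.</fname><surname>Ali</surname></citauth>
    </biblist>
  </art-back>
</article>"""


def extract_front(root, file_name):
    """Constant header: exactly one record per file."""
    abstract_paragraphs = root.findall("./art-front/abstract/p")
    return {
        "file": file_name,
        "title": root.findtext("./art-front/titlegrp/title", default=""),
        "abstract": " ".join((p.text or "").strip() for p in abstract_paragraphs),
    }


def extract_body(root, file_name):
    """Variable part: one record per <p>, however deeply it is nested."""
    records = []
    for section in root.findall("./art-body/section"):
        heading = section.findtext("title", default="")
        # ".//p" matches <p> directly under <section> and under <subsect1>.
        for p in section.findall(".//p"):
            records.append(
                {"file": file_name, "section": heading, "text": (p.text or "").strip()}
            )
    return records


def extract_article(xml_text, file_name):
    """Run both extractions and join them on the file name."""
    root = ET.fromstring(xml_text)
    front = extract_front(root, file_name)
    body = extract_body(root, file_name)
    return [{**front, **record} for record in body if record["file"] == front["file"]]


rows = extract_article(SAMPLE, "sample.xml")
for row in rows:
    print(row["file"], "|", row["section"], "|", row["text"])
```

The descendant path .//p is what absorbs the differences between files: it does not matter how many sections a file has or whether the paragraphs sit inside a <subsect1>. In RapidMiner the same idea would be expressed as the XPath given to each Read XML operator, followed by a Join on the file name attribute.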
