How to extract a specific part (section) from a large text (txt format)?

Enthusiast21
Enthusiast21 New Altair Community Member
edited November 5 in Community Q&A
Dear RM Friends,

I have 500 txt files containing large reports, and I need to extract only one section from each report. As the reports all differ slightly, the only common pattern I can recognise is that the section headlines all start with the same three words; the rest of each headline differs, and the following section is not the same either. My question is how, in general, I can extract part of a large text in RapidMiner. I think I need to use regular expressions, but so far I could not find anything suitable for my task.

Thank you very much for your support in advance! :smile:

Best Answer

  • kayman
    kayman New Altair Community Member
    Answer ✓
    Hi @Enthusiast21, as discussed, please find attached an alternative approach to your problem: first split by page (double-sided), then filter on the pages containing your term (REPORT ON THE ANNUAL), and then use a looser way to figure out what is left-page or right-page content. It seems to work relatively well this way, and maybe you can take it further from there.
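
The attached RapidMiner process is not reproduced in this thread, so the following is only a rough Python sketch of the same three steps, not the process itself. It assumes that pages are separated by a form feed character and that the left and right columns are separated by a run of at least three spaces; both are guesses about the files, and the function names are hypothetical.

```python
import re

TERM = "REPORT ON THE ANNUAL"  # search term quoted in the answer above


def pages_with_term(text, term=TERM, page_break="\f"):
    """Split the text into pages and keep only those containing the term.

    Assumption: pages are separated by a form feed character.
    """
    return [page for page in text.split(page_break) if term in page]


def split_left_right(page, gutter=r" {3,}"):
    """Loosely split each line at the first run of 3+ spaces.

    Assumption: such a run marks the gap between the two printed pages.
    Indented lines or tables would be misclassified by this heuristic.
    """
    left, right = [], []
    for line in page.splitlines():
        parts = re.split(gutter, line.rstrip(), maxsplit=1)
        left.append(parts[0])
        right.append(parts[1] if len(parts) > 1 else "")
    return "\n".join(left).strip(), "\n".join(right).strip()


# Made-up sample with three double-sided pages.
sample = (
    "Cover page     Contents\n"
    "Introduction     Index\f"
    "REPORT ON THE ANNUAL ACCOUNTS     Basis for opinion\n"
    "We have audited the accounts.     Our audit followed the standards.\f"
    "Notes     Appendix"
)

hits = pages_with_term(sample)
left_text, right_text = split_left_right(hits[0])
```

Splitting on a gap rather than on a fixed width is what makes this variant looser, at the cost of misreading lines that contain wide gaps for other reasons.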


Answers

  • kayman
    kayman New Altair Community Member
    Regular expressions are indeed probably what you need. You already know where the section starts, so the question is where it ends. You don't need to limit yourself to words; whitespace can also be a good candidate.

    Are your sections bound by line breaks, or does your next section start with something that resembles a pattern?
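
As an illustration of using whitespace as the end marker, here is a minimal Python sketch. It assumes the section starts on a line beginning with a fixed phrase and ends at the first blank line; the sample text and the heading wording are made up, and the same regular expression idea would need adapting to the real reports.

```python
import re

# Made-up sample: the heading starts with fixed words, the rest of it varies.
text = """Some earlier section
ends here.

Independent Auditor's Report to the members
We have audited the accounts.
Signed on 12 March 2018

Directors' statement
Unrelated content.
"""

# Start: a line that begins with the fixed words.
# End: lazily up to the first blank line, or the end of the text.
pattern = re.compile(
    r"^Independent Auditor's Report[^\n]*\n.*?(?=\n[ \t]*\n|\Z)",
    re.MULTILINE | re.DOTALL,
)

sections = [match.group(0) for match in pattern.finditer(text)]
```

If the section itself contains blank lines, this end marker stops too early and a different boundary is needed.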
  • Enthusiast21
    Enthusiast21 New Altair Community Member
    Attached is part of one report containing two of the sections I need to extract (Independent Auditor's Report), which is another issue: some reports contain two parts I need to extract. In the attached file I also copied the end of the previous section and the beginning of the next one. The next section is always different across the reports, so I can't find a pattern. Each section I need ends with a date, which unfortunately is common to them but not unique, as there are other dates elsewhere in the report.
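
One possible way to cope with an end marker that is common but not unique, such as the closing date described above, is a lazy match: start at the known heading and stop at the first date that follows it. The Python sketch below assumes dates written like "12 March 2018" and uses made-up text; it also picks up two sections in one report.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
# Assumed date format, e.g. "12 March 2018".
DATE = rf"\d{{1,2}} (?:{MONTHS}) \d{{4}}"

# Lazy ".*?" stops at the first date after the heading, so dates
# before the heading or after the section end are ignored.
pattern = re.compile(
    rf"^Independent Auditor's Report.*?{DATE}",
    re.MULTILINE | re.DOTALL,
)

# Made-up report fragment containing two wanted sections.
text = (
    "Approved by the board on 1 February 2018\n"
    "Independent Auditor's Report to the shareholders\n"
    "Opinion text.\n"
    "London, 12 March 2018\n"
    "Notes to the accounts, dated 30 April 2018\n"
    "Independent Auditor's Report on the group accounts\n"
    "Second opinion text.\n"
    "15 March 2018\n"
    "Appendix\n"
)

sections = [match.group(0) for match in pattern.finditer(text)]
```

The limitation is that a date inside the body of the section would end the match too early, so this only helps when the closing date is the first one after the heading.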
  • kayman
    kayman New Altair Community Member
    edited December 2019
    Nice challenge :-)
    So the idea is to first split the content into left and right pages, and then get the section?

    Splitting the page in two is something you can achieve by splitting on string length: basically, the first 70 characters of each line belong to the first page, and characters 70 to 140 belong to the second page. Splitting and then merging gives you both pages in one flow.

    A quick and dirty approach can be found in the attachment.
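
The attachment is not shown here, so as a rough illustration of the split-by-length idea only, here is a small Python sketch. The width of 70 comes from the post above; the function name and sample lines are hypothetical.

```python
def split_double_page(text, width=70):
    """Cut every line at a fixed column: the first `width` characters go
    to the left page, the next `width` characters to the right page."""
    left, right = [], []
    for line in text.splitlines():
        left.append(line[:width].rstrip())
        right.append(line[width:2 * width].rstrip())
    return "\n".join(left), "\n".join(right)


# Made-up two-column sample, padded so the right page starts at column 70.
sample = "\n".join([
    "Left page line one".ljust(70) + "Right page line one",
    "Left page line two".ljust(70) + "Right page line two",
    "Left only",
])

left_page, right_page = split_double_page(sample)

# Merging: left page first, then right page, as one continuous text.
combined = left_page + "\n" + right_page
```

This only works when the column position is stable across the file; if the gap between pages shifts, a fixed width will cut words in half.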
  • Enthusiast21
    Enthusiast21 New Altair Community Member
    Thank you for the solution to the first part of my problem. I'm sorry for the question, but as I am relatively new, may I ask where I should enter the XML code you sent me? I tried the XML panel, but after that I don't know how to make the process appear and then run in RapidMiner.

    About the pattern: I have the beginning, which is Independent Auditor's Report, but I don't know about the end. It's a date, but how do I avoid taking everything that ends somewhere with a date? What other types of pattern can I look for besides words?

    Thank you so much for the support! 
  • kayman
    kayman New Altair Community Member
    Views -> XML -> paste, then click the green tick before saving
  • Enthusiast21
    Enthusiast21 New Altair Community Member
    What could I do to remove the error?
  • kayman
    kayman New Altair Community Member
    Install the Toolbox extension from the Marketplace; alternatively, you can replace this operator with the standard Append operator.
  • Enthusiast21
    Enthusiast21 New Altair Community Member
    Thank you! I did it, but now I have a new problem. Could you help me with it too?
  • kayman
    kayman New Altair Community Member
    Hmm, there might be more issues with your original file. Could you first verify that it works with the 'for the forum' txt file you provided? That way we can be sure we are working under the same conditions.
    Then try again on your data after changing the encoding used by the Decode URLs operator to UTF-8; this could also solve some encoding problems with your original text.
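
Outside RapidMiner, one way to check whether encoding is the culprit is to decode the raw bytes yourself. The Python sketch below tries UTF-8 first and falls back to Windows-1252; the fallback encoding is purely an assumption about how the reports might have been saved, and the function name is hypothetical.

```python
def decode_report(raw: bytes) -> str:
    """Decode report bytes as UTF-8, falling back to Windows-1252.

    Assumption: files that are not valid UTF-8 were saved as cp1252.
    Undecodable bytes are replaced rather than raising an error.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("cp1252", errors="replace")


# The curly apostrophe is a typical character that differs between encodings.
heading = "Auditor\u2019s Report"
utf8_bytes = heading.encode("utf-8")
legacy_bytes = heading.encode("cp1252")

from_utf8 = decode_report(utf8_bytes)
from_legacy = decode_report(legacy_bytes)
```

If the heading in the original file uses a curly apostrophe while the search pattern uses a straight one, the match would fail even after the encoding is fixed, so that is worth checking too.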


  • Enthusiast21
    Enthusiast21 New Altair Community Member
    With the 'for the forum' file it works perfectly. I don't understand why the original one doesn't, as I only copied part of its text into the new txt file that I uploaded here. I tried an online tool to convert it to UTF-8, but the resulting file didn't give any better results. Is there another way to decode the file?
  • kayman
    kayman New Altair Community Member
    Would you mind sharing the full text? You can send it by PM if that's OK for you.