"Processing multiple xml files for tf-idf"

Question

Hi all, I have an issue with processing several news articles that are spread across multiple XML files. The XML files have the following structure: article_set1.xml

...

This means that each XML file contains several different articles. Each article must be treated as a separate document by the TF-IDF computation.

My first attempt was to use the "Read XML" operator connected to "Process Documents from Data". It works fine, but it can only process one XML file.

My second attempt was to put a "Loop Files" operator at the beginning of the process. With this approach, a separate TF-IDF vector is created for each XML file processed.

My third attempt used only "Process Documents from Files" and processed the XML files internally. This approach treats each XML file as a single document.

My goal is for each article_id to be treated as a separate document, even when multiple XML files need to be processed. Any guidance on this issue is more than welcome. Thank you for your support.

Regards,
Ruca

Ruca · Answer

Hi Marius,

Thank you very much for your help. I used the "Loop Files" operator, appended the data, and it works fine!

My problem now is how to store the results in a MySQL database. Since the number of columns in MySQL is limited, I had to perform a transpose operation, which turns the terms into IDs.
I'm getting two different terms, "el-nino" and "el niño", which should be distinct terms under the UTF-8 character set. Since the terms are now IDs, I'm not able to store these rows in a table, because MySQL treats them as the same term.
I had to change the role of the ID column to regular. It works, but I guess it is not the right way to do it.
Does anyone have another approach for doing this?
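
One possible explanation, stated as an assumption rather than a diagnosis: if the table uses an accent-insensitive, case-insensitive collation, terms that differ only by an accent compare as equal in a unique or primary key. The Python sketch below imitates that kind of folding so that colliding terms can be found before export. The helper name strip_accents is made up for illustration, and the folding covers accents and case only, which may not match what a particular MySQL collation does.

```python
import unicodedata

def strip_accents(term):
    """Drop combining marks, similar in spirit to an accent-insensitive comparison."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Example terms; "el nino" and "el niño" differ only by the tilde.
terms = ["el nino", "el niño", "el-nino"]

# Group the terms by their folded form (accents removed, lower-cased).
folded = {}
for term in terms:
    folded.setdefault(strip_accents(term).lower(), []).append(term)

# Keep only the folded keys shared by more than one original term.
collisions = {key: group for key, group in folded.items() if len(group) > 1}
print(collisions)
```

If collisions like these turn out to be the cause, declaring the term column with a binary collation (for example utf8mb4_bin) may let both spellings coexist as keys. That is worth testing on a copy of the table first.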

Thank you for your support!

Regards,

Ruca

MariusHelf · Answer

Hi Ruca,

If you use Process Documents from Files, you can split a file into its subdocuments via Split Document.
If you use the Loop Files operator, you can use Read XML, append the data, and use Process Documents from Data after the loop, not inside it.
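
Outside RapidMiner, the same "append in the loop, vectorize after the loop" idea can be sketched in plain Python. This is only an illustration: the XML layout (an article element with an id attribute) is a guess, since the real structure is not shown in the thread, and read_articles and tf_idf are hypothetical helper names. The weighting used is a basic term frequency times log(N / document frequency), which may differ from what the RapidMiner operator computes.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

def read_articles(xml_text):
    """Return {article_id: text} for the content of one XML file.

    The <article id="..."> layout is an assumption; adapt it to the real files.
    """
    root = ET.fromstring(xml_text)
    return {a.get("id"): "".join(a.itertext()) for a in root.iter("article")}

def tf_idf(docs):
    """Compute TF-IDF over all documents at once, with one shared vocabulary."""
    tokens = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for words in tokens.values() for term in set(words))
    n = len(docs)
    weights = {}
    for doc_id, words in tokens.items():
        counts = Counter(words)
        weights[doc_id] = {
            term: (count / len(words)) * math.log(n / df[term])
            for term, count in counts.items()
        }
    return weights

# Two in-memory strings stand in for two XML files.
files = [
    '<articles><article id="a1">storm hits coast</article>'
    '<article id="a2">storm warning issued</article></articles>',
    '<articles><article id="b1">market rally continues</article></articles>',
]

# Inside the loop: read and append only.
all_docs = {}
for xml_text in files:
    all_docs.update(read_articles(xml_text))

# After the loop: a single TF-IDF pass over every article.
weights = tf_idf(all_docs)
```

Because the weights are computed once over the combined set, every article shares the same vocabulary and document frequencies, regardless of which file it came from. Note that this sketch assumes article ids are unique across files; duplicate ids would overwrite each other.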

Does that help?

Best regards,
Marius