"Best practice in dealing with malformed XML"

robin
edited November 5 in Community Q&A

I am dealing with malformed XML that uses illegal characters and has stray quotation marks all over the place. For the first file from the client, I went to the effort of correcting the XML issues after importing the file and before sending it through to the Read XML operator.

 

For the next file from the client, it will be impossible to apply the same fixes, as the file is really broken. I was wondering what the best practice is for reading such a file into a MySQL database: should I be using RapidMiner, or should I be calling an external program to fix the issues before processing the data through the operator?

 

If you recommend staying inside RapidMiner, what would your approach be to tackling the problem?

 

Kind regards

 

Robin


Best Answer

  • land
    Answer ✓

    Hi,

    I could not agree more with Balazs. However, sometimes you cannot simply hit your customer when they deliver bad-quality data. They might not be happy about that :)

    Hence I would like to add two more things.
    If these are random, probably manually created problems, then neither RapidMiner nor any external program can safely import that data. In that case Balazs's answer applies.

    First, check whether you are importing the file with the right encoding. The wrong one can create really funny effects!
    If that's not the problem, check whether there's a common rule behind all the problems. If there is, you can use the Text Processing extension to load the file as a plain text document and apply search and replace, possibly with regular expressions or something more elaborate, to compensate for it.
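
To make the two checks above concrete outside of RapidMiner, here is a minimal Python sketch. It is an illustration, not a RapidMiner process: the candidate encodings, the two repair rules (dropping control characters and escaping bare ampersands), and all function names are assumptions picked for the example. The rules that fit your file depend on the pattern you actually find in it.

```python
import re
import xml.etree.ElementTree as ET

# Control characters that XML 1.0 forbids (everything below 0x20 except tab, LF, CR).
ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

# An ampersand that does not start an entity or character reference.
BARE_AMPERSAND = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def decode_bytes(raw, encodings=("utf-8", "cp1252")):
    """Return the text from the first candidate encoding that decodes cleanly."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit the data")


def repair(text):
    """Apply the (assumed) common rules: drop illegal chars, escape bare '&'."""
    text = ILLEGAL_XML_CHARS.sub("", text)
    return BARE_AMPERSAND.sub("&amp;", text)


def parse_repaired(raw):
    """Decode, repair, then parse; still raises ET.ParseError if the rules were not enough."""
    return ET.fromstring(repair(decode_bytes(raw)))
```

If the parser still raises an error after the repair step, the problems are probably not following a common rule, and the data is better sent back than guessed at.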

     

    I would not recommend treating the file as HTML, as this will probably remove a lot of the content. The result may be valid XML, but it will probably contain only a blank HTML page and lose all the information :)

     

    Greetings,

    Sebastian

     

     

Answers

  • BalazsBarany

    Hi!

     

    There is only one best practice for dealing with invalid XML input: rejecting it.

    A system that produces invalid XML proves itself to be untrustworthy. Input from such a system should be discarded.
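
As a rough illustration of rejecting bad input up front, here is a small Python sketch that refuses anything the standard library parser cannot read. The function name and return shape are made up for this example, and it only tests well-formedness, not validity against a schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, "") if the text parses as XML, else (False, parser message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The message includes the line and column of the first problem,
        # which is useful to send back to whoever produced the file.
        return False, str(err)
    return True, ""
```

A gate like this in front of the import lets you hand the sender the exact position of the first error instead of silently patching their data.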

     

    That said, if the input is not too broken, you might be able to use a tool like http://www.html-tidy.org/ to clean it up and format it according to the standards.

     

    Regards,

    Balázs

  • kayman

    You might get lucky with the html to xml operator: if your XML is not too dirty, it will try its best to create XML by treating your malformed XML as a form of HTML.

    The results may not be as good as expected, but at least the output can be parsed, and that could be enough to get you started.
