How to read documents using file names stored in an example set

Lei
Lei New Altair Community Member
edited November 5 in Community Q&A
I would like to read some document files from a folder (not all files in the folder). The names of the files to read are saved in an Excel file.

The Read Document operator can read a file when given its file name. I can use the Read Excel operator to load the list of file names into an example set and get each file name. My question is how to pass a file name obtained from the example set to the Read Document operator.

Can anyone help me with this question?

Thank you very much.

Answers

  • BalazsBaranyRM
    BalazsBaranyRM New Altair Community Member
    Answer ✓
    Hi!

    I hope I understand your description correctly.

    You have file names in an Excel file. You can load them with Read Excel.
    Then you could use Loop Values on the file name attribute, with Read Document inside the loop.

    The "iteration macro" (loop_value by default) contains the current file name. You can include the contents of a macro, for example in Generate Attributes, by using the macro syntax %{loop_value}.

    Regards,
    Balázs
  • Lei
    Lei New Altair Community Member
    Hi, Balazs,

    Your answer is very helpful.
    I followed your suggestion, but got a problem alert: "Your connection is producing the wrong type of data. Try changing the starting point of the connection". I think there is some other problem in my .rmp file.

    I have uploaded my .rmp file here. Could you help me find the mistake I have made?
  • BalazsBaranyRM
    BalazsBaranyRM New Altair Community Member
    Answer ✓
    Hi!

    RapidMiner has different kinds of objects passed around in the process, marked by the color of the connection and the connection ports.

    You connected the incoming input of Loop Values - an example set (data table) - to the file input of Read Document. This won't work. You have the loop_value macro inside the loop so you can use that as the file name. Just enter %{loop_value} as the file parameter of Read Document.

    The output of Read Document is a document object, not an example set. If you want to add an attribute (like the file path with Generate Attributes), you will need to convert the document to an example set. How to do this depends on your use case. For example, you could use one of the Extract operators.

    Regards,
    Balázs
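
For readers who find the loop easier to follow in code, here is a rough Python sketch of the same idea. It is not RapidMiner code and does not use any RapidMiner API. The names expand_macros, read_listed_documents and demo are made up for illustration, a plain Python list stands in for the file name column loaded by Read Excel, and the regular expression only loosely imitates how %{loop_value} is replaced by the current value in each iteration. The last step in the loop mirrors the point that a document has to be turned into a table row before attributes such as the file path can be attached to it.

```python
import re
import tempfile
from pathlib import Path

# Matches macro references of the form %{name} (assumed syntax for this sketch).
MACRO = re.compile(r"%\{(\w+)\}")


def expand_macros(template, macros):
    """Replace every %{name} in template with macros[name]."""
    return MACRO.sub(lambda m: macros[m.group(1)], template)


def read_listed_documents(folder, names):
    """Read only the listed files from folder and return one row per document."""
    rows = []
    for name in names:  # plays the role of Loop Values over the file name attribute
        macros = {"loop_value": name}  # the iteration macro holds the current value
        path = Path(folder) / expand_macros("%{loop_value}", macros)
        text = path.read_text(encoding="utf-8")  # plays the role of Read Document
        # A document is not a table: convert it to a row before adding attributes.
        rows.append({"file": path.name, "text": text})
    return rows


def demo():
    """Create three files in a temporary folder and read only two of them."""
    with tempfile.TemporaryDirectory() as tmp:
        for name, body in [("a.txt", "alpha"), ("b.txt", "beta"), ("c.txt", "gamma")]:
            (Path(tmp) / name).write_text(body, encoding="utf-8")
        return read_listed_documents(tmp, ["a.txt", "c.txt"])


if __name__ == "__main__":
    for row in demo():
        print(row["file"], row["text"])
```

In the actual process, the equivalent of the expand_macros call is simply typing %{loop_value} into the file parameter of Read Document, as described in the answer above.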