

How can I use text mining to extract a number and a word from a document and put both into a dataset?

Gui

How can I use text mining to extract a number and a word from a document and put both into a dataset, each as its own attribute?

The idea is to take a document (similar to a "bill of sale"), read it, and process it so that I end up with a simple ExampleSet like this:

Access key | Product code | Product name
xxxxx | yyyyy | XPTO

Do you have any ideas, or do you know of a solution in another topic that I haven't found? It would help a lot.

Thanks. Best,

G.


Accepted answers

kayman

I think Generate Attributes in combination with regex is a good candidate, as long as your content is clearly distinguishable.

If, for instance, your access key is always 8 digits long, you could create a function that checks whether your 'base attribute' contains an isolated 8-digit pattern and, if so, stores the match in your new access key attribute. If there is no match, nothing is added.

Repeat this for each of your new attributes.

You will always need recognizable patterns; otherwise it will never work.

If you need some support with the regex, you can always share some examples. Happy to help with that.
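
To make the "isolated 8-digit pattern" idea concrete, here is a minimal sketch in plain Python rather than a RapidMiner process. It assumes the access key is exactly 8 digits and never part of a longer number; the function name `extract_access_key` and the sample texts are made up for illustration. The same regular expression should carry over to a regex-based function inside the process, but that is not tested here.

```python
import re

# Assumption: the access key is exactly 8 digits.
# The lookarounds reject 8-digit slices of longer numbers.
ACCESS_KEY = re.compile(r"(?<!\d)\d{8}(?!\d)")


def extract_access_key(text):
    """Return the first isolated 8-digit run in text, or None if there is none."""
    match = ACCESS_KEY.search(text)
    return match.group(0) if match else None


# Made-up 'base attribute' values, one per document.
documents = [
    "Bill of sale, access key 12345678, item XPTO",
    "Reference 123456789012 only, no key here",
    "No digits at all",
]

# One row per document; the new attribute stays empty (None) when nothing matches.
rows = [{"text": doc, "access_key": extract_access_key(doc)} for doc in documents]
```

The lookbehind and lookahead are what make the match "isolated": without them, the 12-digit reference in the second document would yield a false access key.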


All comments

Gui

Hi Kayman, thanks for your time and help. I am really suffering with this problem.

I would really appreciate your support with the regex. I am sharing a process together with the document I need to process as described.
About the patterns: I have three kinds of documents: PDFs (with one pattern), scanned documents (images on which I need to do the same thing: read, identify, split into an ExampleSet, etc., with another pattern), and a second kind of scanned document. I will need to build a separate process for each kind because the patterns differ.

Attached are the process and a Notepad file with the XML.

Thanks again.

text to get info.rmp

XML text to get info.txt
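
The "one process per document type" plan above can be sketched as one set of patterns per layout, with a shared extraction step. This is only an illustration in plain Python: the type names (`pdf`, `scan_a`), the labels in the patterns, and the sample text are all hypothetical, since the real layouts are only in the attachments. It also assumes the text has already been obtained from each document, including the scanned ones.

```python
import re

# Hypothetical pattern sets, one per document layout.
# The real labels and formats depend on the actual documents.
PATTERNS = {
    "pdf": {
        "access_key": re.compile(r"Access key:\s*(\d{8})"),
        "product_code": re.compile(r"Product code:\s*(\w+)"),
        "product_name": re.compile(r"Product name:\s*(.+)"),
    },
    "scan_a": {
        "access_key": re.compile(r"KEY\s+(\d{8})"),
        "product_code": re.compile(r"CODE\s+(\w+)"),
        "product_name": re.compile(r"NAME\s+(.+)"),
    },
}


def extract_row(text, doc_type):
    """Build one row (attribute -> value) using the patterns for doc_type."""
    row = {}
    for attribute, pattern in PATTERNS[doc_type].items():
        match = pattern.search(text)
        # Leave the attribute empty (None) when the pattern is not found.
        row[attribute] = match.group(1).strip() if match else None
    return row


# Made-up sample text standing in for a converted PDF.
sample = "Access key: 12345678\nProduct code: A100\nProduct name: XPTO\n"
row = extract_row(sample, "pdf")
```

Keeping the patterns in one table per layout means a third document type only needs a new entry, not a new extraction routine.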

kayman

Getting the access key is not a real issue, but as I'm not familiar with the rest of your structure, it's hard for me to know what is needed and which patterns it can have.

It looks as if you start with a PDF that you convert to a text file, so it might be better to start with the PDF table extractor extension (available on the Marketplace: https://marketplace.rapidminer.com/UpdateServer/faces/product_details.xhtml?productId=rmx_pdf_table_extraction).

This may reduce the complexity a lot, as you originally seem to have quite a few columns. Combining a few techniques may work out better.

Attached is an example that extracts the access key and stores it as a new attribute.

text to get info.rmp

Gui

Kayman,

Can I send you a private message? Then I could share an image of the document's structure with you. If you have time, of course, that would be wonderful. Let me know if this is feasible.

kayman

Sure, no promises but happy to help