getting distinct tokens

New Altair Community Member

Jun 11, 2016

Updated Nov 5, 2024 by Jocelyn

I'm using the text processing modules to extract unique keywords per document: for every document, I want an attribute containing the keywords from its full-text field.

My workflow is fairly straightforward: I loop through all the examples, convert data to document, filter on some relevant POS tags, remove all stopwords etc., convert back to data, and append them all together. This works fine, but the result still contains duplicates, as in the example below:

original : this is just a test sentence to do a test to check the process

keywords : test sentence test check process

wanted result : test sentence check process

How can I get rid of the duplicate tokens? I could probably do it with some monster regex, but I suspect that would be fairly expensive. Are there better ways to achieve this?


bhupendra_patil

New Altair Community Member

Jun 12, 2016

Hello @kayman,

Maybe you already know this, but just to confirm:

I think that should happen automatically. What are your settings for the tokenize step?

Also, is there a difference in case in the output? Unless you use the "Transform Cases" operator, "text" and "Text" are considered different tokens.

kayman

New Altair Community Member

Jun 12, 2016

Hi, I'm aware of the case difference. My workflow converts everything to lowercase, tokenizes on spaces, and moves on to the next step. The output will, however, contain all tokens (which is probably logical, as the document processor needs to be able to count how many times a given word is used in a given document).

In the meantime I found a workaround myself, using the Wordlist to Data operator. First I loop through all the examples, convert each example to a document, clean the data, generate a word list, convert the word list to data, and loop through this (word list) example set. This gives me the unique values, which I convert back to an example. In the end I get my original example set with my keyword attribute.

It does the trick but is fairly slow (20K documents per hour), so if anybody has a more efficient way to do this I'd be happy to learn about it.

land

New Altair Community Member

Jun 13, 2016

Hi,

Perhaps this will help you speed up the concatenation:

Instead of looping over the word list, use the Aggregation operator with the concat aggregation function after transforming the word list into data. This should be way faster.

Greetings,

Sebastian
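The word-list route described above (build a word list, turn it into rows, then concatenate the rows in one aggregation rather than a loop) can be illustrated outside RapidMiner with a small Python sketch. This is only a stand-in under stated assumptions: `Counter` plays the role of the word list, the join plays the role of the concat aggregation, and `keywords_via_wordlist` is a hypothetical name, not a RapidMiner operator or API.

```python
from collections import Counter


def keywords_via_wordlist(tokens):
    """Hypothetical helper: emulate 'word list -> data -> concat aggregation'."""
    # One entry per distinct token with its count, similar in spirit to a word list.
    wordlist = Counter(tokens)
    # Iterating a Counter yields its keys in first-insertion order (Python 3.7+),
    # so the concatenation keeps each token at its first position.
    return " ".join(wordlist)


tokens = "test sentence test check process".split()
print(keywords_via_wordlist(tokens))  # test sentence check process
```

The counts are computed but not used here; they are kept only to mirror the fact that a word list carries frequencies as well as the distinct words.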

kayman

New Altair Community Member

Nov 3, 2016

Actually, I found a much better and faster solution, in case anybody ever has the same issue as me:

Use a regular expression like this one: \b(\w+)\s(?=.*\b\1:?) and replace each match with a space. This keeps only the unique (distinct) words in any given string. Note that only the last occurrence of a given word is kept, so if the order is important you need to handle this with care.
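To make the behaviour of that expression concrete, here is a minimal Python sketch that applies it unchanged with the standard `re` module. The function name `drop_earlier_duplicates` and the whitespace clean-up step are assumptions added for illustration; RapidMiner's own regex engine may differ in details.

```python
import re

# The expression from the post, copied unchanged.
DUPLICATE_WORD = re.compile(r"\b(\w+)\s(?=.*\b\1:?)")


def drop_earlier_duplicates(text):
    """Hypothetical helper: blank out every word that occurs again later."""
    # Each earlier occurrence (plus its trailing whitespace) becomes a space.
    replaced = DUPLICATE_WORD.sub(" ", text)
    # Replacing with a space leaves runs of spaces behind, so collapse them.
    return " ".join(replaced.split())


# Only the LAST occurrence survives, so the order differs from the input order.
print(drop_earlier_duplicates("test sentence test check process"))
# sentence test check process

# The lookahead has no trailing word boundary, so a word that is a
# prefix of a later word is removed as well.
print(drop_earlier_duplicates("test testing"))
# testing
```

Two things are worth noting in this sketch: the result for the thread's example is "sentence test check process" rather than "test sentence check process", and in Python `.` does not match newlines by default, so text spanning several lines would need the `re.DOTALL` flag for the lookahead to see later lines.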

land

New Altair Community Member

Nov 3, 2016

Wow, that's an impressive expression...

Doesn't that suffer heavily from long texts in terms of runtime? I would envision a special operator being several thousand percent faster.

Greetings,

Sebastian
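For comparison, the kind of single pass a dedicated operator might perform can be sketched in a few lines of Python. This is an assumption about what such an operator could do, not an existing RapidMiner component, and `distinct_tokens` is a hypothetical name. Each token is looked up once in a hash-based dictionary, so the work grows roughly linearly with text length, whereas the lookahead in the regex rescans the rest of the string for every word.

```python
def distinct_tokens(text):
    """Hypothetical helper: keep the first occurrence of each token, in order."""
    # dict.fromkeys drops repeated keys and preserves insertion order (Python 3.7+).
    return " ".join(dict.fromkeys(text.split()))


print(distinct_tokens("test sentence test check process"))
# test sentence check process
```

Unlike the regex, this keeps the first occurrence of each word, which matches the "wanted result" in the original question.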

kayman

New Altair Community Member

Nov 3, 2016

Yeah, I'd love to have a dedicated component too. It shouldn't be too hard to add a 'remove duplicates' block in the next text analysis update (hint hint).

Or maybe I just didn't grasp your previous solution correctly; I was struggling with it and in the end I gave up, until now, when I needed it again :-)

For me it works fairly well, as we have relatively small sets and strings (a couple of thousand records with a few sentences each). It is better and faster than my original approach, at least. And if anybody has a better approach, I'm always interested.
