byte address / word location for Textual ETL

Hi,

I'm doing fine with the text processing operators currently provided in RM 5.0 (great work, guys :-*)

However, there is one thing I would like to see during the creation of word vectors from documents: the byte address of each word occurrence, as a key to distinguish one occurrence of a word from another.

This would require a whole new representation of the word list, in which every occurrence is listed with its byte address/word location instead of the aggregated number of occurrences per word per document.
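To make the idea concrete, here is a minimal Python sketch of such a per-occurrence word list. It is written outside RapidMiner and is not based on any existing operator; the names `Occurrence` and `positional_wordlist`, the simple `\w+` tokenization, and the choice of UTF-8 for the byte offsets are all assumptions made for illustration.

```python
import re
from typing import List, NamedTuple


class Occurrence(NamedTuple):
    """One row per word occurrence (hypothetical layout, for illustration only)."""
    word: str
    doc_id: str
    byte_offset: int  # offset of the word's first byte in the UTF-8 encoded text
    word_index: int   # 0-based position of the word in the token sequence


def positional_wordlist(doc_id: str, text: str) -> List[Occurrence]:
    """Return every word occurrence with its location instead of a count per word."""
    rows = []
    for word_index, match in enumerate(re.finditer(r"\w+", text)):
        # match.start() is a character offset; encode the prefix to get a byte offset.
        byte_offset = len(text[:match.start()].encode("utf-8"))
        rows.append(Occurrence(match.group().lower(), doc_id, byte_offset, word_index))
    return rows


if __name__ == "__main__":
    for row in positional_wordlist("doc1", "Café au lait, café"):
        print(row)
```

The pair (document, byte offset) or (document, word index) would then serve as the key for a single occurrence, and the usual aggregated counts can still be derived by grouping on the word.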

This would open up a new range of possibilities, such as determining which other words or terms are found in proximity to a certain word/term. This would be of great value in determining the context of documents.
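As a rough sketch of the proximity idea, again in plain Python and not tied to any RapidMiner operator: once word positions are known, the words within a fixed window around each occurrence of a target word can be collected and counted. The function name `words_near` and the `window` parameter are hypothetical.

```python
import re
from collections import Counter


def words_near(text: str, target: str, window: int = 3) -> Counter:
    """Count the words found within `window` positions of each occurrence of `target`."""
    tokens = [m.group().lower() for m in re.finditer(r"\w+", text)]
    target = target.lower()
    neighbours = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # Take the words on both sides, leaving out the target occurrence itself.
            neighbours.update(tokens[lo:i] + tokens[i + 1:hi])
    return neighbours


if __name__ == "__main__":
    sentence = "the quick brown fox jumps over the lazy dog"
    print(words_near(sentence, "fox", window=2))
```

The window here is measured in word positions; with byte addresses the same approach would work with a distance in bytes instead.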

Of course, I would be glad to know if this is already possible with some combination of the current operators ::)

fischer

Hi,

By coincidence, this is exactly what we are currently working on. Stay tuned :-)

Cheers,
Simon

Wanttoknow

Simon,

Great. Looking forward to it.

Thanks for your reply

TobiasMalbrecht

Hi,

Wanttoknow wrote:

However, there is one thing I would like to see during the creation of word vectors from documents: the byte address of each word occurrence, as a key to distinguish one occurrence of a word from another.

This would require a whole new representation of the word list, in which every occurrence is listed with its byte address/word location instead of the aggregated number of occurrences per word per document.

This would open up a new range of possibilities, such as determining which other words or terms are found in proximity to a certain word/term. This would be of great value in determining the context of documents.

Of course, I would be glad to know if this is already possible with some combination of the current operators ::)

No, this is not yet possible, but we are indeed in the initial phase of refactoring the text processing extension. As part of this, the locations of tokens in a document will be kept within the tokens, so that 1) the visualization of documents and token sequences will be improved, and 2) filtering as well as token and attribute construction based on token locations, co-occurrences within document regions, etc. will become possible.
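Purely as an illustration of the idea, and not the extension's actual design or API, tokens that carry their own locations could look like the following Python sketch. The `Token` class, the helper names, and the naive sentence pattern used as a "document region" are assumptions.

```python
import re
from collections import Counter
from dataclasses import dataclass
from itertools import combinations
from typing import List


@dataclass(frozen=True)
class Token:
    """A token that keeps its location in the document (hypothetical structure)."""
    text: str
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


def tokenize(text: str) -> List[Token]:
    return [Token(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]


def tokens_in_region(tokens: List[Token], start: int, end: int) -> List[Token]:
    """Location-based filter: keep tokens lying entirely inside [start, end)."""
    return [t for t in tokens if t.start >= start and t.end <= end]


def cooccurrences(text: str, region_pattern: str = r"[^.!?]+") -> Counter:
    """Count unordered word pairs sharing a region (here: a naively split sentence)."""
    tokens = tokenize(text)
    pairs = Counter()
    for region in re.finditer(region_pattern, text):
        in_region = tokens_in_region(tokens, region.start(), region.end())
        words = sorted({t.text for t in in_region})
        pairs.update(combinations(words, 2))
    return pairs


if __name__ == "__main__":
    print(cooccurrences("Cats chase mice. Mice fear cats."))
```

Because every token knows its start and end, both the region filter and the co-occurrence count reduce to simple comparisons on offsets.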

Apart from that, we have a lot of other ideas concerning the text processing extension, so it will probably take a while until the restructuring is finished. Stay tuned ;-)

Kind regards,
Tobias