Define established terms

Limegreenman900_1
Limegreenman900_1 New Altair Community Member
edited November 2024 in Community Q&A
Hi everyone,

Does anybody know whether RM has an operator, or a setting inside an operator, that lets me define established terms? I am currently extracting text from HTML files with the "Cut Document" operator, and inside it I use the "Extract Content" operator from the Web Mining extension. After that I run some routine steps such as "Replace Tokens", "Tokenize" and "Extract Token Number". Some expressions in my text are normally treated as a single established term, so I wondered whether RM can handle them that way.

Example:
Generally Accepted Accounting Practice
International Standards on Auditing
....

So far, tokenization turns every word into its own token, but it would be great to have each of these expressions treated as one token.
I know I could use the "Replace Tokens" operator and replace every term with an abbreviation, like "International Standards on Auditing" = "ISA", but that is not what I want.

Any help appreciated!

Answers

  • JEdward
    JEdward New Altair Community Member
    Why not use the "Replace Tokens" operator, but instead of replacing each term with an abbreviation, join its words with underscores?

    So:
    Generally Accepted Accounting Practice = Generally_Accepted_Accounting_Practice
    International Standards on Auditing = International_Standards_on_Auditing

    At the end of your processing you can then run "Replace Tokens" again and swap each '_' back to a ' ', which restores the established term.
  • Limegreenman900_1
    Limegreenman900_1 New Altair Community Member
    You are right, I totally overlooked the option of using underscores to connect the words :)
    Thanks for your hint on that!
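
For readers who want to see the underscore trick outside RM, here is a minimal Python sketch of the same idea: join the words of each established term before tokenizing, then swap the underscores back afterwards. The function names and the regex tokenizer are assumptions made for illustration; they are not RM operators, and the term list only holds the two examples from the question.

```python
import re

# Illustrative term list; only the two examples from the question are included.
ESTABLISHED_TERMS = [
    "Generally Accepted Accounting Practice",
    "International Standards on Auditing",
]

def protect_terms(text, terms):
    """Join the words of each established term with underscores."""
    # Handle longer terms first so a shorter term cannot split a longer one.
    for term in sorted(terms, key=len, reverse=True):
        # Allow any run of whitespace between the words of the term.
        pattern = re.compile(r"\s+".join(re.escape(word) for word in term.split()))
        text = pattern.sub(term.replace(" ", "_"), text)
    return text

def tokenize(text):
    """Stand-in tokenizer: \\w+ keeps underscores inside a token."""
    return re.findall(r"\w+", text)

def restore_terms(tokens):
    """Swap underscores back to spaces (also affects any other '_' in a token)."""
    return [token.replace("_", " ") for token in tokens]

text = "The audit follows International Standards on Auditing."
tokens = restore_terms(tokenize(protect_terms(text, ESTABLISHED_TERMS)))
print(tokens)
```

The approach only works if the tokenizer does not split on underscores. The regex above keeps them on purpose, so it is worth checking how your own tokenization step treats '_' before relying on it.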
