How to split a text into several pieces?

fangkuoyu New Altair Community Member
edited November 5 in Community Q&A
I want to split a text into several pieces for retrieval-augmented generation with the Generative Models Extension. I have checked the documentation at https://docs.rapidminer.com/latest/studio/generative-ai/#retrieval-augmented-generation but I don't know how to reproduce the process. Can someone provide the process? I have also tried the Text Processing extension with "Create Document" and "Window Document", but I get "no elements in this collection" from "Window Document". Any help? Thanks.

Regards
Frank

Answers

  • RolandJones
    Altair Employee
    Hi @fangkuoyu,

    I'd recommend looking into the Text Analysis course on the RapidMiner Academy, as it gives a nice overview of how you can load and manipulate text data.

    To split up text, I would generally start with the Tokenize operator inside a Process Documents operator. This splits each document according to some regular pattern, which for me usually ends up being whitespace. Beforehand, make sure you set the column data type to Text and use a Data to Documents operator.

    Hope this makes sense.

    Best,
    Roland
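
For readers who want to see the idea outside of RapidMiner, the tokenize-then-window approach described in the answer can be sketched in plain Python. This is only an illustration under stated assumptions, not the RapidMiner process itself: the function names `tokenize` and `window_chunks`, the window size, and the overlap are all hypothetical choices, and whitespace is assumed to be the splitting pattern.

```python
import re


def tokenize(text):
    """Split text on runs of whitespace and drop empty tokens."""
    return [tok for tok in re.split(r"\s+", text) if tok]


def window_chunks(tokens, size, overlap=0):
    """Group tokens into windows of `size` tokens.

    Consecutive windows share `overlap` tokens. The last window may be
    shorter than `size`. Both parameters are illustrative assumptions.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        # Stop once the current window has reached the end of the text.
        if start + size >= len(tokens):
            break
    return chunks


text = (
    "Retrieval augmented generation works best when long documents "
    "are cut into short overlapping pieces"
)
chunks = window_chunks(tokenize(text), size=6, overlap=2)
for chunk in chunks:
    print(chunk)
```

Note that an empty token list produces no windows at all, so the text has to be tokenized before it can be windowed in this sketch.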