"Split text into paragraphs"

hbuggled
New Altair Community Member
Updated by Jocelyn

Hi guys,

I have an Excel file that contains an article from Wikipedia. I want to split the text into paragraphs. I tried the Tokenize operator, but there is no option to tokenize my text into paragraphs. I also tried the Cut Document operator with the XPath query type and the query expression //h:p, but it doesn't work. Is there any way to tokenize/split my text into paragraphs?

 

Thank you in advance.

    sgenzer
    Altair Employee
    Accepted Answer

    Hello @hbuggled, welcome to the community. I think you were on the right track with Tokenize, but I would choose the regex option in the Parameters pane and try using \n as the expression.
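
For illustration, that setting might look roughly like this in a process XML. This is only a sketch: it assumes the Tokenize operator from the Text Processing extension, and the compatibility value is an assumption that depends on the installed version.

```xml
<!-- Sketch: Tokenize in regular expression mode, splitting at line breaks -->
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" name="Tokenize">
  <!-- switch from the default mode to a user-supplied regular expression -->
  <parameter key="mode" value="regular expression"/>
  <!-- every line break ends a token, so each paragraph becomes one token -->
  <parameter key="expression" value="\n"/>
</operator>
```

If the paragraphs are separated by blank lines, \n+ may be the better expression, since it treats a run of consecutive line breaks as a single separator.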

     

    Scott

     

    sgenzer
    Altair Employee
    Accepted Answer

    Hello @hbuggled, OK, I understand. This is likely not the most elegant solution, but it will do what you're looking for.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="text:create_document" compatibility="7.5.000" expanded="true" height="68" name="Create Document" width="90" x="45" y="34">
    <parameter key="text" value="RapidMiner uses a client/server model with the server offered as either on-premise, or in public or private cloud infrastructures.&#10;According to Bloor Research, RapidMiner provides 99% of an advanced analytical solution through template-based frameworks that speed delivery and reduce errors by nearly eliminating the need to write code. RapidMiner provides data mining and machine learning procedures including: data loading and transformation (Extract, transform, load (ETL)), data preprocessing and visualization, predictive analytics and statistical modeling, evaluation, and deployment. RapidMiner is written in the Java programming language. RapidMiner provides a GUI to design and execute analytical workflows. Those workflows are called “Processes” in RapidMiner and they consist of multiple “Operators”. Each operator performs a single task within the process, and the output of each operator forms the input of the next one. Alternatively, the engine can be called from other programs or used as an API. Individual functions can be called from the command line. RapidMiner provides learning schemes, models and algorithms and can be extended using R and Python scripts.&#10;RapidMiner functionality can be extended with additional plugins which are made available via RapidMiner Marketplace. The RapidMiner Marketplace provides a platform for developers to create data analysis algorithms and publish them to the community. With version 7.0, RapidMiner included updates to its getting started materials, an updated user interface, and improvements to its data preparation capabilities."/>
    </operator>
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34">
    <parameter key="mode" value="regular expression"/>
    <parameter key="expression" value="\n+"/>
    </operator>
    <operator activated="true" class="text:extract_token_number" compatibility="7.5.000" expanded="true" height="68" name="Extract Token Number" width="90" x="313" y="34"/>
    <operator activated="true" class="text:documents_to_data" compatibility="7.5.000" expanded="true" height="82" name="Documents to Data" width="90" x="447" y="34">
    <parameter key="text_attribute" value="text"/>
    </operator>
    <operator activated="true" class="split" compatibility="7.6.001" expanded="true" height="82" name="Split" width="90" x="581" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="text"/>
    <parameter key="split_pattern" value="\n"/>
    </operator>
    <operator activated="true" class="extract_macro" compatibility="7.6.001" expanded="true" height="68" name="Extract Macro" width="90" x="715" y="34">
    <parameter key="macro" value="tokenNumber"/>
    <parameter key="macro_type" value="data_value"/>
    <parameter key="attribute_name" value="token_number"/>
    <parameter key="example_index" value="1"/>
    <list key="additional_macros"/>
    </operator>
    <operator activated="true" class="generate_macro" compatibility="7.6.001" expanded="true" height="82" name="Generate Macro" width="90" x="849" y="34">
    <list key="function_descriptions">
    <parameter key="att" value="concat(&quot;text_&quot;,%{tokenNumber})"/>
    </list>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes" width="90" x="983" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="%{att}"/>
    </operator>
    <operator activated="true" class="text:data_to_documents" compatibility="7.5.000" expanded="true" height="68" name="Data to Documents" width="90" x="1117" y="34">
    <parameter key="select_attributes_and_weights" value="true"/>
    <list key="specify_weights">
    <parameter key="%{att}" value="1.0"/>
    </list>
    </operator>
    <operator activated="true" class="text:combine_documents" compatibility="7.5.000" expanded="true" height="82" name="Combine Documents" width="90" x="1251" y="34"/>
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="1385" y="34">
    <parameter key="expression" value="\n+"/>
    </operator>
    <connect from_op="Create Document" from_port="output" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Extract Token Number" to_port="document"/>
    <connect from_op="Extract Token Number" from_port="document" to_op="Documents to Data" to_port="documents 1"/>
    <connect from_op="Documents to Data" from_port="example set" to_op="Split" to_port="example set input"/>
    <connect from_op="Split" from_port="example set output" to_op="Extract Macro" to_port="example set"/>
    <connect from_op="Extract Macro" from_port="example set" to_op="Generate Macro" to_port="through 1"/>
    <connect from_op="Generate Macro" from_port="through 1" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
    <connect from_op="Data to Documents" from_port="documents" to_op="Combine Documents" to_port="documents 1"/>
    <connect from_op="Combine Documents" from_port="document" to_op="Tokenize (2)" to_port="document"/>
    <connect from_op="Tokenize (2)" from_port="document" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    Scott