Execute Python breaks column if text has commas
Marco_Barradas
Altair Employee
Hi, I need some help. I'm doing some crawling with Python (I already tried with RM, but I couldn't get what I wanted in an easy way).
The last column of the DF returns a big chunk of text that describes the product. For some reason, when Execute Python creates the DataSet, it creates new lines and erases the data that was sent in the DF. I tried writing the info to a file from inside Execute Python, and the outcome is a file with 1 row and 5 columns, as expected.
Here is the process I'm using.
<?xml version="1.0" encoding="UTF-8"?><process version="9.1.000"> <context> <input/> <output/> <macros> <macro> <key>url</key> <value>https://www.liverpool.com.mx/tienda/pdp/consola-playstation-4-pro-1-tb/1059665339?s=play+station&skuId=1059665339</value> </macro> </macros> </context> <operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="python_scripting:execute_python" compatibility="9.1.000" expanded="true" height="82" name="Execute Python" width="90" x="179" y="34"> <parameter key="script" value="import requests from bs4 import BeautifulSoup import pandas as pd def rm_main(): headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'} columnas=['id','precio_n','precio_d','nombre','descripcion'] productos=pd.DataFrame(columns=columnas) session = requests.Session() url='%{url}' session.post(url,headers=headers) content=session.get(url) soup = BeautifulSoup(content.text,'html.parser') precio_normal=soup.find("input",id="listPrice") tipo=soup.find("a",_class="actual") llave=soup.find("input",id="productId") #productId #gtmPrice #productDisplayName precio_descuento=soup.find("input",id="gtmPrice") producto=soup.find("input",id="productDisplayName") descripcion=soup.find("div",id="intro").find('p').get_text() descripcion=descripcion.replace(',', '') descripcion=descripcion.replace('', '') #print(descripcion) fila=[llave['value'], precio_normal['value'], precio_descuento['value'], producto['value'], descripcion ] productos.loc[len(productos)]=fila return productos"/> <parameter key="use_default_python" 
value="true"/> <parameter key="package_manager" value="conda (anaconda)"/> </operator> <operator activated="true" class="generate_attributes" compatibility="9.1.000" expanded="true" height="82" name="Generate Attributes" width="90" x="313" y="34"> <list key="function_descriptions"> <parameter key="Fecha" value="date_now()"/> </list> <parameter key="keep_all" value="true"/> </operator> <operator activated="true" class="date_to_nominal" compatibility="9.1.000" expanded="true" height="82" name="Date to Nominal" width="90" x="514" y="34"> <parameter key="attribute_name" value="Fecha"/> <parameter key="date_format" value="yyyy/MM/dd hh:mm:ss"/> <parameter key="time_zone" value="SYSTEM"/> <parameter key="locale" value="English (United States)"/> <parameter key="keep_old_attribute" value="false"/> </operator> <connect from_op="Execute Python" from_port="output 1" to_op="Generate Attributes" to_port="example set input"/> <connect from_op="Generate Attributes" from_port="example set output" to_op="Date to Nominal" to_port="example set input"/> <connect from_op="Date to Nominal" from_port="example set output" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator> </process>
Best Answers
-
Hi @MarcoBarradas,
Very interesting problem!
To sum up: there is effectively a bug in RapidMiner, but there is a workaround (see the process at the end of this post).
In more detail:
I say there is a bug in RapidMiner because when the code is executed in a Python Jupyter Notebook, it works fine.
Maybe it is linked to the text attribute?
The (far-fetched) workaround:
1. I modified the Python script so that it builds the DF directly from the list of values (productos = pd.DataFrame(data = fila)).
2. Then I used the Transpose operator.
3. I used the Generate Aggregation operator to concatenate the attributes associated with "description", which had been split for an unknown reason.
4. Finally, I renamed the relevant attributes and removed the useless ones to obtain the final ExampleSet.
5. The process :<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000"> <context> <input/> <output/> <macros> <macro> <key>url</key> <value>https://www.liverpool.com.mx/tienda/pdp/consola-playstation-4-pro-1-tb/1059665339?s=play+station&skuId=1059665339</value> </macro> </macros> </context> <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="python_scripting:execute_python" compatibility="9.2.000" expanded="true" height="82" name="Execute Python" width="90" x="45" y="34"> <parameter key="script" value="import requests from bs4 import BeautifulSoup import pandas as pd def rm_main(): headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'} columnas=['id','precio_n','precio_d','nombre','descripcion'] productos=pd.DataFrame(columns=columnas) session = requests.Session() url='%{url}' session.post(url,headers=headers) content=session.get(url) soup = BeautifulSoup(content.text,'html.parser') precio_normal=soup.find("input",id="listPrice") tipo=soup.find("a",_class="actual") llave=soup.find("input",id="productId") #productId #gtmPrice #productDisplayName precio_descuento=soup.find("input",id="gtmPrice") producto=soup.find("input",id="productDisplayName") descripcion=soup.find("div",id="intro").find('p').get_text() descripcion=descripcion.replace(',', '') descripcion=descripcion.replace('', '') #print(descripcion) fila=[llave['value'], precio_normal['value'], precio_descuento['value'], producto['value'], descripcion ] productos = pd.DataFrame(data = fila) return productos"/> <parameter 
key="use_default_python" value="true"/> <parameter key="package_manager" value="conda (anaconda)"/> </operator> <operator activated="true" class="transpose" compatibility="9.2.000" expanded="true" height="82" name="Transpose" width="90" x="179" y="34"/> <operator activated="true" class="generate_aggregation" compatibility="9.2.000" expanded="true" height="82" name="Generate Aggregation" width="90" x="313" y="34"> <parameter key="attribute_name" value="description"/> <parameter key="attribute_filter_type" value="subset"/> <parameter key="attribute" value=""/> <parameter key="attributes" value="att_5|att_6|att_7|att_8"/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="attribute_value"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="time"/> <parameter key="block_type" value="attribute_block"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_matrix_row_start"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> <parameter key="aggregation_function" value="concatenation"/> <parameter key="concatenation_separator" value="" ""/> <parameter key="keep_all" value="true"/> <parameter key="ignore_missings" value="true"/> <parameter key="ignore_missing_attributes" value="false"/> </operator> <operator activated="true" class="rename" compatibility="9.2.000" expanded="true" height="82" name="Rename" width="90" x="447" y="34"> <parameter key="old_name" value="att_1"/> <parameter key="new_name" value="Id"/> <list key="rename_additional_attributes"> <parameter key="att_2" value="precio_n"/> <parameter key="att_3" value="precio_d"/> <parameter key="att_4" value="nombre"/> </list> </operator> <operator activated="true" class="select_attributes" compatibility="9.2.000" expanded="true" height="82" name="Select Attributes" width="90" x="581" y="34"> <parameter 
key="attribute_filter_type" value="regular_expression"/> <parameter key="attribute" value=""/> <parameter key="attributes" value=""/> <parameter key="regular_expression" value="att_.*"/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="attribute_value"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="time"/> <parameter key="block_type" value="attribute_block"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_matrix_row_start"/> <parameter key="invert_selection" value="true"/> <parameter key="include_special_attributes" value="false"/> </operator> <operator activated="true" class="generate_attributes" compatibility="9.2.000" expanded="true" height="82" name="Generate Attributes" width="90" x="715" y="34"> <list key="function_descriptions"> <parameter key="Fecha" value="date_now()"/> </list> <parameter key="keep_all" value="true"/> </operator> <operator activated="true" class="date_to_nominal" compatibility="9.2.000" expanded="true" height="82" name="Date to Nominal" width="90" x="849" y="34"> <parameter key="attribute_name" value="Fecha"/> <parameter key="date_format" value="yyyy/MM/dd hh:mm:ss"/> <parameter key="time_zone" value="SYSTEM"/> <parameter key="locale" value="English (United States)"/> <parameter key="keep_old_attribute" value="false"/> </operator> <connect from_op="Execute Python" from_port="output 1" to_op="Transpose" to_port="example set input"/> <connect from_op="Transpose" from_port="example set output" to_op="Generate Aggregation" to_port="example set input"/> <connect from_op="Generate Aggregation" from_port="example set output" to_op="Rename" to_port="example set input"/> <connect from_op="Rename" from_port="example set output" to_op="Select Attributes" to_port="example set input"/> <connect from_op="Select Attributes" from_port="example set output" to_op="Generate Attributes" to_port="example set 
input"/> <connect from_op="Generate Attributes" from_port="example set output" to_op="Date to Nominal" to_port="example set input"/> <connect from_op="Date to Nominal" from_port="example set output" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator> </process>
6. Have fun with your future PlayStation 4!
Hope this helps,
Regards,
Lionel
-
Hi @MarcoBarradas, I can confirm that this is a bug in the operator code. I will create a ticket.
However, it is not the length of the description that causes the issue, but the newline characters in it. Thus, a workaround might be to remove all line breaks from the text.
My understanding is that you are already trying that, but your script only looks for Windows-style line breaks (\r\n) and misses Unix-style line breaks (\n), which are more common on the web.
For me, changing the following lines did the trick:
# descripcion=descripcion.replace('\r\n', '')
descripcion=descripcion.replace('\r', '')
descripcion=descripcion.replace('\n', '')
See Wikipedia for more info on the different line break styles.
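As a more general variant of the fix above, here is a minimal sketch that removes every kind of line break from all text cells of a DataFrame before it is returned from rm_main. The helper name strip_line_breaks and the sample row are illustrative assumptions, not part of the original script. The sketch replaces line breaks with a space instead of an empty string so that words on adjacent lines are not glued together; pass replacement="" to mirror the lines above exactly.

```python
import pandas as pd


def strip_line_breaks(df: pd.DataFrame, replacement: str = " ") -> pd.DataFrame:
    """Return a copy of df with Windows, old Mac and Unix line breaks replaced."""
    cleaned = df.copy()
    for column in cleaned.columns:
        # Only touch cells that are actual strings; leave numbers and NaN alone.
        cleaned[column] = cleaned[column].map(
            lambda value: value.replace("\r\n", replacement)
                               .replace("\r", replacement)
                               .replace("\n", replacement)
            if isinstance(value, str) else value
        )
    return cleaned


# Made-up sample row standing in for the scraped product data.
productos = pd.DataFrame(
    [["1059665339", "Line one\r\nLine two\nLine three\rLine four"]],
    columns=["id", "descripcion"],
)
limpio = strip_line_breaks(productos)
print(limpio.loc[0, "descripcion"])  # Line one Line two Line three Line four
```

Inside rm_main, the last statement would then be something like return strip_line_breaks(productos).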
Answers
-
Great!!! It works and yes it seems to be a bug.
I'll need to make some changes, since sometimes the crawling may not find that attribute and the number of rows may be dynamic, but your workaround works like a charm.
-
@MarcoBarradas can you please be more specific about the bug? I'd like to push it internally but need more detail. I'm not a Python coder...
Scott
-
Hi @sgenzer, the bug is that RM changes the DataFrame when it converts it to an RM DataSet. This happens when one of the attributes has a lot of text. In my example, the DataFrame has a shape of 1 example with 5 attributes. But once Execute Python ends, it returns 3 examples with 5 attributes, and it only returns information for the last attribute, the one that had a lot of text.
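For readers who are not Python coders, the symptom described above can be reproduced outside RapidMiner with a small sketch. It assumes, and this is only an assumption since the thread does not say how Execute Python hands data over, that the DataFrame passes through a line-oriented text format such as CSV. The row values are placeholders, not real scraped data.

```python
import io
import pandas as pd

columnas = ["id", "precio_n", "precio_d", "nombre", "descripcion"]
# Placeholder values; the last cell contains two embedded line breaks.
fila = ["123", "100", "90", "Consola", "First line\nSecond line\nThird line"]
productos = pd.DataFrame([fila], columns=columnas)

csv_text = productos.to_csv(index=False)

# A reader that treats every physical line as one record sees too many rows.
naive_rows = csv_text.splitlines()[1:]  # skip the header line

# A quote-aware CSV parser keeps the multi-line cell together.
parsed = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(productos.shape)  # (1, 5)
print(len(naive_rows))  # 3
print(parsed.shape)     # (1, 5)
```

One row with a three-line description turns into three records for the naive reader, which matches the "1 example becomes 3 examples" behaviour reported here.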