Approach to standardize merchant names - Tagging

msacs09
New Altair Community Member
Experts,
I'm in the process of standardizing our transaction types and bucketing them into the correct categories.
For example, we have companies like the ones below. The biggest challenge is tagging them and putting them in the appropriate bucket, because there are a lot of variations in the transaction entries. What machine learning model can we use to tackle this monstrous tagging work? Are there any sample models built to address such use cases? Any reference is greatly appreciated.
Category | Matched Name | Actual Entry |
HR | ADP | Adp |
Travel | Airbnb | Airbnb |
Travel | Alaska Air | AlaskaAirlinesInc |
HR | Allied Delta | Allied Delta |
G&A | Amazon | Amazon |
Server | AWS | Amazon Web Services |
Credit Crd | American Express | American Express |
Travel | American Air | AmericanAirlines |
Credit crd | American Express | Amex Epayment |
Insurance | Anthem | Anthem Bc |
Answers
-
Hmm, I have an idea, but I'd like to test it out. Do you have a larger data set you can share?
-
Sir, I sent the larger data set to your inbox. Thank you for all the support.
-
Just wrote you back.
-
Thank you, sir. What I was thinking is to get the industry/business category by scraping it from the Google search results page; for example, Toyota would be "automotive".
Is there an example of how to scrape a Google web page to achieve this? Attached is what I wanted to extract.
-
I think we have two things to do for this use case. Again, experts, please correct me if I'm off track.
First, name matching: group the different spellings of a company under one name, e.g. AWS, Amazon Web Services, Amazon Web Services Inc, Amazon Web Services Llc, etc. should all map to the same company.
Second, pass the company names to Google Search or the Wikipedia API (which isn't as consistent as Google) and scrape the result. In the example below, the result should be "courier delivery services company".
https://en.wikipedia.org/w/api.php?action=opensearch&search=FEDEX&limit=1&format=json
https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&titles=FEDEX
So I think I've got the theory part, but how to do this in RM is where I have a BIG GAP. Any sample process to get me started is greatly appreciated.
-
Hmm, I don't really understand your theory here, but if you want to grab those Wikipedia JSONs in RapidMiner, that's not hard to do.
<?xml version="1.0" encoding="UTF-8"?><process version="9.3.000"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.3.000" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="-1"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="retrieve" compatibility="9.3.000" expanded="true" height="68" name="Retrieve Transaction Category Tagging" width="90" x="45" y="85"> <parameter key="repository_entry" value="Transaction Category Tagging"/> </operator> <operator activated="true" class="filter_example_range" compatibility="9.3.000" expanded="true" height="82" name="Filter Example Range" width="90" x="179" y="85"> <parameter key="first_example" value="1"/> <parameter key="last_example" value="4"/> <parameter key="invert_filter" value="false"/> </operator> <operator activated="true" class="concurrency:loop_values" compatibility="9.3.000" expanded="true" height="82" name="Loop Values" width="90" x="313" y="85"> <parameter key="attribute" value="Row Labels"/> <parameter key="iteration_macro" value="loop_value"/> <parameter key="reuse_results" value="false"/> <parameter key="enable_parallel_execution" value="false"/> <process expanded="true"> <operator activated="true" class="web:get_webpage" compatibility="9.0.000" expanded="true" height="68" name="Get Page" width="90" x="45" y="34"> <parameter key="url" value="https://en.wikipedia.org/w/api.php?action=opensearch&amp;search=%{loop_value}&amp;limit=1&amp;format=json"/> <parameter key="random_user_agent" value="false"/> <parameter key="connection_timeout" value="10000"/> <parameter key="read_timeout" value="10000"/> <parameter key="follow_redirects" value="true"/> <parameter key="accept_cookies" value="all"/> 
<parameter key="cookie_scope" value="global"/> <parameter key="request_method" value="GET"/> <list key="query_parameters"/> <list key="request_properties"/> <parameter key="override_encoding" value="false"/> <parameter key="encoding" value="SYSTEM"/> </operator> <operator activated="true" class="delay" compatibility="9.3.000" expanded="true" height="82" name="Delay" width="90" x="179" y="34"> <parameter key="delay" value="fixed"/> <parameter key="delay_amount" value="1000"/> <parameter key="min_delay_amount" value="0"/> <parameter key="max_delay_amount" value="1000"/> </operator> <operator activated="true" class="text:json_to_data" compatibility="8.1.000" expanded="true" height="82" name="JSON To Data" width="90" x="313" y="34"> <parameter key="ignore_arrays" value="false"/> <parameter key="limit_attributes" value="false"/> <parameter key="skip_invalid_documents" value="false"/> <parameter key="guess_data_types" value="true"/> <parameter key="keep_missing_attributes" value="false"/> <parameter key="missing_values_aliases" value=", null, NaN, missing"/> </operator> <connect from_op="Get Page" from_port="output" to_op="Delay" to_port="through 1"/> <connect from_op="Delay" from_port="through 1" to_op="JSON To Data" to_port="documents 1"/> <connect from_op="JSON To Data" from_port="example set" to_port="output 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="source_input 2" spacing="0"/> <portSpacing port="sink_output 1" spacing="0"/> <portSpacing port="sink_output 2" spacing="0"/> </process> </operator> <operator activated="true" class="subprocess" compatibility="9.3.000" expanded="true" height="82" name="Union Append" width="90" x="447" y="85"> <process expanded="true"> <operator activated="true" class="loop_collection" compatibility="9.3.000" expanded="true" height="82" name="Output (4)" width="90" x="45" y="34"> <parameter key="set_iteration_macro" value="true"/> <parameter key="macro_name" value="iteration"/> <parameter 
key="macro_start_value" value="1"/> <parameter key="unfold" value="false"/> <process expanded="true"> <operator activated="false" breakpoints="after" class="select" compatibility="9.3.000" expanded="true" height="68" name="Select (5)" width="90" x="112" y="34"> <parameter key="index" value="%{iteration}"/> <parameter key="unfold" value="false"/> </operator> <operator activated="true" class="branch" compatibility="9.3.000" expanded="true" height="82" name="Branch (2)" width="90" x="313" y="34"> <parameter key="condition_type" value="expression"/> <parameter key="expression" value="%{iteration}==1"/> <parameter key="io_object" value="ANOVAMatrix"/> <parameter key="return_inner_output" value="true"/> <process expanded="true"> <connect from_port="condition" to_port="input 1"/> <portSpacing port="source_condition" spacing="0"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_input 1" spacing="0"/> <portSpacing port="sink_input 2" spacing="0"/> </process> <process expanded="true"> <operator activated="true" class="recall" compatibility="9.3.000" expanded="true" height="68" name="Recall (5)" width="90" x="45" y="187"> <parameter key="name" value="LoopData"/> <parameter key="io_object" value="ExampleSet"/> <parameter key="remove_from_store" value="true"/> </operator> <operator activated="true" class="union" compatibility="9.3.000" expanded="true" height="82" name="Union (2)" width="90" x="179" y="34"/> <connect from_port="condition" to_op="Union (2)" to_port="example set 1"/> <connect from_op="Recall (5)" from_port="result" to_op="Union (2)" to_port="example set 2"/> <connect from_op="Union (2)" from_port="union" to_port="input 1"/> <portSpacing port="source_condition" spacing="0"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_input 1" spacing="0"/> <portSpacing port="sink_input 2" spacing="0"/> </process> </operator> <operator activated="true" class="remember" compatibility="9.3.000" expanded="true" height="68" 
name="Remember (5)" width="90" x="581" y="34"> <parameter key="name" value="LoopData"/> <parameter key="io_object" value="ExampleSet"/> <parameter key="store_which" value="1"/> <parameter key="remove_from_process" value="true"/> </operator> <connect from_port="single" to_op="Branch (2)" to_port="condition"/> <connect from_op="Branch (2)" from_port="input 1" to_op="Remember (5)" to_port="store"/> <connect from_op="Remember (5)" from_port="stored" to_port="output 1"/> <portSpacing port="source_single" spacing="0"/> <portSpacing port="sink_output 1" spacing="0"/> <portSpacing port="sink_output 2" spacing="0"/> </process> </operator> <operator activated="true" class="select" compatibility="9.3.000" expanded="true" height="68" name="Select (6)" width="90" x="179" y="34"> <parameter key="index" value="%{iteration}"/> <parameter key="unfold" value="false"/> </operator> <connect from_port="in 1" to_op="Output (4)" to_port="collection"/> <connect from_op="Output (4)" from_port="output 1" to_op="Select (6)" to_port="collection"/> <connect from_op="Select (6)" from_port="selected" to_port="out 1"/> <portSpacing port="source_in 1" spacing="0"/> <portSpacing port="source_in 2" spacing="0"/> <portSpacing port="sink_out 1" spacing="0"/> <portSpacing port="sink_out 2" spacing="0"/> </process> </operator> <connect from_op="Retrieve Transaction Category Tagging" from_port="output" to_op="Filter Example Range" to_port="example set input"/> <connect from_op="Filter Example Range" from_port="example set output" to_op="Loop Values" to_port="input 1"/> <connect from_op="Loop Values" from_port="output 1" to_op="Union Append" to_port="in 1"/> <connect from_op="Union Append" from_port="out 1" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator> </process>
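For anyone who wants to try the same lookup outside RapidMiner, here is a minimal Python sketch of what the process above does: build the opensearch URL per company name, wait a second between calls, and pull the first hit out of the JSON. The function names and the User-Agent string are my own placeholders, not part of the process, and the parsing assumes the usual opensearch reply layout of `[search term, [titles], [descriptions], [urls]]`.

```python
import json
import time
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"


def opensearch_url(name, limit=1):
    """Build the opensearch URL from the thread for one company name."""
    query = urllib.parse.urlencode(
        {"action": "opensearch", "search": name, "limit": limit, "format": "json"}
    )
    return f"{API}?{query}"


def parse_opensearch(payload):
    """Return (title, url) of the first hit, or None when nothing was found.

    Assumes the reply is a 4-element JSON array:
    [search term, [titles], [descriptions], [urls]].
    """
    _term, titles, _descriptions, urls = json.loads(payload)
    if not titles:
        return None
    return titles[0], urls[0]


def lookup_all(names, delay_seconds=1.0):
    """Fetch the first hit for each name, pausing between requests."""
    results = {}
    for name in names:
        request = urllib.request.Request(
            opensearch_url(name),
            # Placeholder User-Agent; replace with something identifying you.
            headers={"User-Agent": "merchant-tagging-sketch/0.1"},
        )
        with urllib.request.urlopen(request, timeout=10) as response:
            results[name] = parse_opensearch(response.read().decode("utf-8"))
        time.sleep(delay_seconds)  # same idea as the Delay operator
    return results
```

Calling `lookup_all(["FEDEX", "Toyota"])` needs network access; `opensearch_url` and `parse_opensearch` can be tried offline against a saved reply.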
-
These are both fairly complex tasks (whether in RapidMiner or any other platform). You can do some text string matching (using similarity measures) to try to combine instances where one string is a subset or close match to another, but many of the examples you provide (such as AWS and Amazon matching) are going to be very difficult to accomplish programmatically. You may want to look at adding a manual dictionary of token replacement for commonly used abbreviations and acronyms.
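To make that concrete, here is a rough Python sketch of the combination: a small manual dictionary expands abbreviations first, then a string similarity score picks the closest canonical name. The alias dictionary, the suffix list, and the 0.6 threshold are illustrative assumptions and would need tuning on real data. The canonical names come from the "Matched" column in the question, and `difflib.SequenceMatcher` is just one possible similarity measure.

```python
import re
from difflib import SequenceMatcher

# Assumed manual dictionary for abbreviations that similarity cannot catch.
ALIASES = {"aws": "amazon web services", "amex": "american express"}
# Assumed list of legal suffixes to ignore.
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "corp", "co"}
# Canonical names from the "Matched" column of the sample table.
CANONICAL = [
    "ADP", "Airbnb", "Alaska Air", "Allied Delta", "Amazon",
    "AWS", "American Express", "American Air", "Anthem",
]


def normalize(name):
    """Split CamelCase, lowercase, drop legal suffixes, expand aliases."""
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)
    tokens = re.findall(r"[a-z0-9]+", spaced.lower())
    tokens = [ALIASES.get(t, t) for t in tokens if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)


def best_match(entry, canonical=CANONICAL, threshold=0.6):
    """Return the most similar canonical name, or None below the threshold."""
    cleaned = normalize(entry)
    scored = [
        (SequenceMatcher(None, cleaned, normalize(c)).ratio(), c)
        for c in canonical
    ]
    score, name = max(scored)
    return name if score >= threshold else None


for raw in ["AlaskaAirlinesInc", "Amazon Web Services", "Amex Epayment"]:
    print(raw, "->", best_match(raw))
```

Entries that score below the threshold come back as `None`, which gives you a natural queue for manual review and for growing the dictionary.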
-
Yeah, the joy of entity recognition. All nice but you need to get the training data...
I'd follow the advice above and work in 2 steps. First have a kind of 'translation list' where I'd use regex to convert most known variations to a common label. So (AWS|Amazon.*web.*services) becomes AWS or so. Dirty job but someone has to do it.
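As a rough illustration of such a translation list outside RapidMiner, a Python sketch could look like this. Only the AWS pattern comes from the example above; the other patterns are made up from the sample table, and the word boundaries are my addition so that "AWS" does not fire inside a longer word.

```python
import re

# Ordered translation list: (pattern, common label). First match wins.
# Only the AWS pattern is from the post; the rest are illustrative.
TRANSLATIONS = [
    (re.compile(r"\bAWS\b|Amazon.*web.*services", re.IGNORECASE), "AWS"),
    (re.compile(r"\bAmex\b|American\s*Express", re.IGNORECASE), "American Express"),
    (re.compile(r"American\s*Air", re.IGNORECASE), "American Air"),
    (re.compile(r"Alaska\s*Air", re.IGNORECASE), "Alaska Air"),
]


def to_common_label(entry):
    """Map a raw entry to its common label; unknown entries pass through."""
    for pattern, label in TRANSLATIONS:
        if pattern.search(entry):
            return label
    return entry.strip()


for raw in ["Amazon Web Services Llc", "Amex Epayment", "Amazon"]:
    print(raw, "->", to_common_label(raw))
```

Order matters here: the more specific AWS pattern sits before anything that would match plain "Amazon", and entries with no pattern are returned unchanged so they can be reviewed later.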
Next, I'd do something like the attached example, where you use a simple list of all the entities you'd like to find (I've made something similar to look for brands etc. in reviews) and the process will 'tag' these in the text. This can be converted relatively easily to more official tagging, so you can, for instance, create your own entity recognition model in spaCy and integrate it using Python.
<?xml version="1.0" encoding="UTF-8"?><process version="9.2.001"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="utility:create_exampleset" compatibility="9.2.001" expanded="true" height="68" name="Brands" width="90" x="246" y="136"> <parameter key="generator_type" value="comma separated text"/> <parameter key="number_of_examples" value="100"/> <parameter key="use_stepsize" value="false"/> <list key="function_descriptions"/> <parameter key="add_id_attribute" value="false"/> <list key="numeric_series_configuration"/> <list key="date_series_configuration"/> <list key="date_series_configuration (interval)"/> <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/> <parameter key="time_zone" value="SYSTEM"/> <parameter key="input_csv_text" value="brand&#10;canon&#10;nikon&#10;panasonic&#10;samsung&#10;sony&#10;jbl&#10;sonos&#10;bose"/> <parameter key="column_separator" value="\t"/> <parameter key="parse_all_as_nominal" value="false"/> <parameter key="decimal_point_character" value="."/> <parameter key="trim_attribute_names" value="true"/> </operator> <operator activated="true" class="subprocess" compatibility="9.2.001" expanded="true" height="82" name="Subprocess 
(6)" width="90" x="380" y="136"> <process expanded="true"> <operator activated="true" class="generate_attributes" compatibility="9.2.001" expanded="true" height="82" name="Generate Attributes (3)" width="90" x="45" y="34"> <list key="function_descriptions"> <parameter key="brand" value="trim(lower(brand))"/> <parameter key="first" value="prefix([brand],1)"/> <parameter key="remain" value="suffix([brand],length([brand])-1)"/> </list> <parameter key="keep_all" value="true"/> </operator> <operator activated="true" class="remove_duplicates" compatibility="9.2.001" expanded="true" height="103" name="Remove Duplicates (2)" width="90" x="179" y="34"> <parameter key="attribute_filter_type" value="all"/> <parameter key="attribute" value=""/> <parameter key="attributes" value=""/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="attribute_value"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="time"/> <parameter key="block_type" value="attribute_block"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_matrix_row_start"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> <parameter key="treat_missing_values_as_duplicates" value="false"/> </operator> <operator activated="true" class="aggregate" compatibility="9.2.001" expanded="true" height="82" name="Aggregate (2)" width="90" x="313" y="34"> <parameter key="use_default_aggregation" value="false"/> <parameter key="attribute_filter_type" value="all"/> <parameter key="attribute" value=""/> <parameter key="attributes" value=""/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="attribute_value"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="time"/> <parameter key="block_type" value="attribute_block"/> <parameter 
key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_matrix_row_start"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> <parameter key="default_aggregation_function" value="average"/> <list key="aggregation_attributes"> <parameter key="remain" value="concatenation"/> </list> <parameter key="group_by_attributes" value="first"/> <parameter key="count_all_combinations" value="false"/> <parameter key="only_distinct" value="false"/> <parameter key="ignore_missings" value="true"/> </operator> <operator activated="true" class="generate_attributes" compatibility="9.2.001" expanded="true" height="82" name="Generate Attributes (4)" width="90" x="447" y="34"> <list key="function_descriptions"> <parameter key="from" value="concat(&quot;(?i)\\b(&quot;,[first],&quot;(?:&quot;,[concat(remain)],&quot;))\\b&quot;)"/> <parameter key="to" value="&quot;&lt;:tag:brand:XTAG$1:&gt;&quot;"/> </list> <parameter key="keep_all" value="true"/> </operator> <connect from_port="in 1" to_op="Generate Attributes (3)" to_port="example set input"/> <connect from_op="Generate Attributes (3)" from_port="example set output" to_op="Remove Duplicates (2)" to_port="example set input"/> <connect from_op="Remove Duplicates (2)" from_port="example set output" to_op="Aggregate (2)" to_port="example set input"/> <connect from_op="Aggregate (2)" from_port="example set output" to_op="Generate Attributes (4)" to_port="example set input"/> <connect from_op="Generate Attributes (4)" from_port="example set output" to_port="out 1"/> <portSpacing port="source_in 1" spacing="0"/> <portSpacing port="source_in 2" spacing="0"/> <portSpacing port="sink_out 1" spacing="0"/> <portSpacing port="sink_out 2" spacing="0"/> </process> </operator> <operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document" width="90" x="246" y="34"> <parameter key="text" value="This is a string that includes some 
brands, like Sony, samsung and Panasonic"/> <parameter key="add label" value="false"/> <parameter key="label_type" value="nominal"/> </operator> <operator activated="true" class="text:documents_to_data" compatibility="8.1.000" expanded="true" height="82" name="Documents to Data" width="90" x="380" y="34"> <parameter key="text_attribute" value="strings"/> <parameter key="add_meta_information" value="true"/> <parameter key="datamanagement" value="double_sparse_array"/> <parameter key="data_management" value="auto"/> </operator> <operator activated="true" class="replace_dictionary" compatibility="9.2.001" expanded="true" height="103" name="Replace (2)" width="90" x="782" y="34"> <parameter key="return_preprocessing_model" value="false"/> <parameter key="create_view" value="false"/> <parameter key="attribute_filter_type" value="single"/> <parameter key="attribute" value="strings"/> <parameter key="attributes" value=""/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="attribute_value"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="time"/> <parameter key="block_type" value="attribute_block"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_matrix_row_start"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> <parameter key="from_attribute" value="from"/> <parameter key="to_attribute" value="to"/> <parameter key="use_regular_expressions" value="true"/> <parameter key="convert_to_lowercase" value="false"/> <parameter key="first_match_only" value="false"/> </operator> <connect from_op="Brands" from_port="output" to_op="Subprocess (6)" to_port="in 1"/> <connect from_op="Subprocess (6)" from_port="out 1" to_op="Replace (2)" to_port="dictionary"/> <connect from_op="Create Document" from_port="output" to_op="Documents to Data" to_port="documents 1"/> <connect 
from_op="Documents to Data" from_port="example set" to_op="Replace (2)" to_port="example set input"/> <connect from_op="Replace (2)" from_port="example set output" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator> </process>
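If it helps to see the idea without opening the process, here is a small Python sketch of the same dictionary-based tagging: it builds one case-insensitive, whole-word regex from the entity list and wraps every hit in the `<:tag:brand:XTAG...:>` marker used above. It is a simplification, not a translation of the process: it uses a single alternation instead of grouping by first letter, and `build_tagger` is a name I made up for the sketch.

```python
import re

# Same brand list as in the "Brands" operator of the process.
BRANDS = ["canon", "nikon", "panasonic", "samsung", "sony", "jbl", "sonos", "bose"]


def build_tagger(entities, label="brand"):
    """Return a function that wraps each whole-word entity hit in a tag."""
    # Longest names first, so a longer entity wins over a shorter prefix.
    ordered = sorted(set(entities), key=len, reverse=True)
    alternatives = "|".join(re.escape(e) for e in ordered)
    pattern = re.compile(rf"\b({alternatives})\b", re.IGNORECASE)
    replacement = rf"<:tag:{label}:XTAG\g<1>:>"

    def tag(text):
        return pattern.sub(replacement, text)

    return tag


tag_brands = build_tagger(BRANDS)
print(tag_brands("This is a string that includes some brands, like Sony, samsung and Panasonic"))
```

The tagged text keeps the original casing of each hit, so the markers can later be converted to character offsets for training an entity recognition model.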