Data to Similarity - how to define the control group
Hi everyone,
I have a large number of documents in two folders, "auditor report" and "audit committee report" (AC), and want to compare them. The "Data to Similarity" operator compares every file with every other file. I only want to compare the files with matching names.
The documents in folder 1, "auditor report", are named: year_company name
and the documents in folder 2, "audit committee report", are named: AC_year_company name
So instead of comparing each document with each document from the other folder, I just want to compare the matching documents (= same year and company name in the document name).
Many thanks in advance!!!
Christina
Answers
-
Assuming the time stamps match (i.e. yyyy in one file and yyyy in the other file), just use a Join operator first to join the two files together, matching on your timestamp. Then use the similarity measures.
-
Hi Thomas,
thanks for the quick reply. I tried the "Join" operator before testing for similarity. I chose join type "inner" and used "metadata_file" as both the left and the right key attribute. But somehow it didn't work out as I was expecting.
For example:
AC_2015_A.G.Barr PLC,GB00B6XZKY75
should match before i use the similarity operator with
2015_A.G.Barr PLC,GB00B6XZKY75. That way the similarity test would run only between those two files (almost the same name, once with and once without AC in the document name) instead of comparing each document with every other one.
This is what I've got:
Thanks a lot in advance
Christina
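An inner join on the raw file names cannot pair these documents, because the two names differ by the "AC_" prefix. The matching logic can be sketched outside RapidMiner in Python: derive a shared key by stripping the prefix, then keep only names whose keys appear on both sides. The function names `pair_key` and `match_reports` and the second file name in each list are hypothetical, chosen only for illustration.

```python
def pair_key(file_name: str, prefix: str = "AC_") -> str:
    """Return the file name without the leading prefix, if it has one."""
    return file_name[len(prefix):] if file_name.startswith(prefix) else file_name


def match_reports(auditor_files, committee_files):
    """Inner join of the two file lists on the prefix-free name."""
    committee_by_key = {pair_key(name): name for name in committee_files}
    return [
        (name, committee_by_key[pair_key(name)])
        for name in auditor_files
        if pair_key(name) in committee_by_key
    ]


# The first name in each list is from the post; the second is made up.
auditor = ["2015_A.G.Barr PLC,GB00B6XZKY75", "2014_Example Co"]
committee = ["AC_2015_A.G.Barr PLC,GB00B6XZKY75", "AC_2013_Other Co"]
pairs = match_reports(auditor, committee)
print(pairs)
```

Only the A.G.Barr pair survives; the two unmatched names are dropped, which is the behaviour expected from an inner join on the derived key.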
-
Dear Christina,
I do not think there is any way to do this without a loop. Probably something like Loop Values, Filter Examples for the value, a left join with the other table, and then Data to Similarity.
In RM 7 we added a Group into Collection operator in the operator toolbox extension. That would make it a bit nicer.
Best,
Martin
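The loop idea can be illustrated in Python as a rough sketch: iterate over the key values present in both tables and compute one similarity per key, so each document is compared only with its counterpart. Cosine similarity on word counts is an assumption made for this example, not necessarily the measure configured in Data to Similarity, and the keys and texts below are made up.

```python
import math
import re
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity of the word-count vectors of two texts."""
    counts_a = Counter(re.findall(r"\w+", text_a.lower()))
    counts_b = Counter(re.findall(r"\w+", text_b.lower()))
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_per_key(auditor_docs: dict, committee_docs: dict) -> dict:
    """Loop over the keys found in both tables; one comparison per key."""
    shared = sorted(auditor_docs.keys() & committee_docs.keys())
    return {
        key: cosine_similarity(auditor_docs[key], committee_docs[key])
        for key in shared
    }


# Hypothetical keys and texts, for illustration only.
auditor_docs = {"2015_A": "risk of material misstatement", "2014_B": "going concern"}
committee_docs = {"2015_A": "risk of material misstatement", "2013_C": "internal controls"}
scores = similarity_per_key(auditor_docs, committee_docs)
print(scores)
```

The number of comparisons equals the number of matched pairs rather than the product of the two table sizes.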
-
Dear Martin,
I installed the new version of RM. I still get the same results and don't see how to solve my problem of matching samples. I have two folders, and the program should read the name of each document and check only the matching ones for similarity. It's still comparing all documents with each other. As I have over 400 documents in total, the program does not run with so many.
Thanks in advance
Christina
Here you can see which match I want to have. So my question is: which operator do I have to use? In Excel it would work with =A2="AC_"&B2
-
Dear Christina,
I thought about something along the lines of the attached process. Not too handsome, but working. 7.5 has a slightly different loop interface, but its loops are parallelized and therefore much faster.
Best,
Martin
<?xml version="1.0" encoding="UTF-8"?><process version="7.3.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="generate_data" compatibility="7.3.001" expanded="true" height="68" name="Generate Data" width="90" x="112" y="34"/>
<operator activated="true" class="generate_id" compatibility="7.3.001" expanded="true" height="82" name="Generate ID" width="90" x="246" y="34"/>
<operator activated="true" class="extract_macro" compatibility="7.3.001" expanded="true" height="68" name="Extract Macro" width="90" x="380" y="34">
<parameter key="macro" value="exa"/>
<list key="additional_macros"/>
</operator>
<operator activated="true" class="generate_data" compatibility="7.3.001" expanded="true" height="68" name="Generate Data (2)" width="90" x="112" y="136"/>
<operator activated="true" class="generate_id" compatibility="7.3.001" expanded="true" height="82" name="Generate ID (2)" width="90" x="246" y="136"/>
<operator activated="true" class="loop" compatibility="7.3.001" expanded="true" height="103" name="Loop" width="90" x="514" y="85">
<parameter key="set_iteration_macro" value="true"/>
<parameter key="iterations" value="%{exa}"/>
<process expanded="true">
<operator activated="true" class="filter_example_range" compatibility="7.3.001" expanded="true" height="82" name="Filter Example Range" width="90" x="179" y="34">
<parameter key="first_example" value="%{iteration}"/>
<parameter key="last_example" value="%{iteration}"/>
</operator>
<operator activated="true" class="filter_example_range" compatibility="7.3.001" expanded="true" height="82" name="Filter Example Range (2)" width="90" x="179" y="187">
<parameter key="first_example" value="%{iteration}"/>
<parameter key="last_example" value="%{iteration}"/>
</operator>
<operator activated="true" class="append" compatibility="7.3.001" expanded="true" height="103" name="Append" width="90" x="313" y="85"/>
<operator activated="true" class="data_to_similarity" compatibility="7.3.001" expanded="true" height="82" name="Data to Similarity" width="90" x="447" y="85"/>
<operator activated="true" class="similarity_to_data" compatibility="7.3.001" expanded="true" height="82" name="Similarity to Data" width="90" x="581" y="85"/>
<connect from_port="input 1" to_op="Filter Example Range" to_port="example set input"/>
<connect from_port="input 2" to_op="Filter Example Range (2)" to_port="example set input"/>
<connect from_op="Filter Example Range" from_port="example set output" to_op="Append" to_port="example set 1"/>
<connect from_op="Filter Example Range (2)" from_port="example set output" to_op="Append" to_port="example set 2"/>
<connect from_op="Append" from_port="merged set" to_op="Data to Similarity" to_port="example set"/>
<connect from_op="Data to Similarity" from_port="similarity" to_op="Similarity to Data" to_port="similarity"/>
<connect from_op="Data to Similarity" from_port="example set" to_op="Similarity to Data" to_port="exampleSet"/>
<connect from_op="Similarity to Data" from_port="exampleSet" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="126"/>
<portSpacing port="source_input 3" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="append" compatibility="7.3.001" expanded="true" height="82" name="Append (2)" width="90" x="648" y="85"/>
<connect from_op="Generate Data" from_port="output" to_op="Generate ID" to_port="example set input"/>
<connect from_op="Generate ID" from_port="example set output" to_op="Extract Macro" to_port="example set"/>
<connect from_op="Extract Macro" from_port="example set" to_op="Loop" to_port="input 1"/>
<connect from_op="Generate Data (2)" from_port="output" to_op="Generate ID (2)" to_port="example set input"/>
<connect from_op="Generate ID (2)" from_port="example set output" to_op="Loop" to_port="input 2"/>
<connect from_op="Loop" from_port="output 1" to_op="Append (2)" to_port="example set 1"/>
<connect from_op="Append (2)" from_port="merged set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
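The process above runs on generated data and pairs example i of the first set with example i of the second (both Filter Example Range operators use `%{iteration}`), so the pairing is positional. With real documents, both sets would therefore need the same keys in the same order. The Python sketch below shows that idea under stated assumptions: the `(key, vector)` row layout, the function names, and the use of Euclidean distance are all illustrative and not taken from the process.

```python
import math


def euclidean_distance(row_a, row_b):
    """Euclidean distance between two equally long numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(row_a, row_b)))


def rowwise_distances(table_a, table_b):
    """Compare row i of table_a only with row i of table_b, after sorting by key."""
    rows_a = sorted(table_a, key=lambda row: row[0])
    rows_b = sorted(table_b, key=lambda row: row[0])
    if [key for key, _ in rows_a] != [key for key, _ in rows_b]:
        raise ValueError("both tables need the same keys to be paired by position")
    return [
        (key, euclidean_distance(vec_a, vec_b))
        for (key, vec_a), (_, vec_b) in zip(rows_a, rows_b)
    ]


# Hypothetical rows: (key, feature vector). The tables are deliberately unsorted.
table_a = [("2015_B", [1.0, 2.0]), ("2015_A", [0.0, 0.0])]
table_b = [("2015_A", [3.0, 4.0]), ("2015_B", [1.0, 2.0])]
result = rowwise_distances(table_a, table_b)
print(result)
```

Sorting by key before pairing guards against the two sets arriving in different orders; the explicit key check fails early if a document has no counterpart.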