How to "unjoin" a dataset?
Hi,
I have a dataset that is the result of a join and I want to break up, back into the 2 source tables with a foreign key id in one of the tables.
For example the table I have looks a bit like this:
Invoice_id | Product_weight | Product_Color | Product_size
1 | 25 | blue | Large
2 | 5 | red | Small
3 | 17 | green | Large
4 | 15 | blue | Medium
5 | 25 | blue | Large
6 | 25 | blue | Large
I want to break it back in to 2 tables like this:
Invoice_id | Product_id
1 |1
2 |2
3 |3
4 |4
5 |1
6 |1
and
Product_id | Product_weight | Product_Color | Product_size
1 | 25 | blue | Large
2 | 5 | red | Small
3 | 17 | green | Large
4 | 15 | blue | Medium
How do I do this?
Answers
-
hi @mdale9 - so I assume you created this via a left-hand-join like this, joining on "Product_id" as your key:
Hence the result is what you say:
So if this is the case, to "unjoin" is to simply select the first two attributes (Invoice_id and Product_id) from the result as this is exactly what join did to make them in the first place (that is left join):
As for the other side, when you join you create duplicates. So removing duplicates should do the trick:
Here's the process from soup to nuts:
<?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" breakpoints="after" class="retrieve" compatibility="9.0.003" expanded="true" height="68" name="Retrieve table1" width="90" x="45" y="34">
<parameter key="repository_entry" value="//Local Repository/table1"/>
</operator>
<operator activated="true" breakpoints="after" class="retrieve" compatibility="9.0.003" expanded="true" height="68" name="Retrieve table2" width="90" x="45" y="136">
<parameter key="repository_entry" value="//Local Repository/table2"/>
</operator>
<operator activated="true" class="concurrency:join" compatibility="9.0.003" expanded="true" height="82" name="Join" width="90" x="179" y="34">
<parameter key="join_type" value="left"/>
<list key="key_attributes">
<parameter key="Product_id" value="Product_id"/>
</list>
</operator>
<operator activated="true" class="multiply" compatibility="9.0.003" expanded="true" height="103" name="Multiply" width="90" x="313" y="85"/>
<operator activated="true" class="select_attributes" compatibility="9.0.003" expanded="true" height="82" name="Select Attributes (2)" width="90" x="447" y="136">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Product_id"/>
<parameter key="attributes" value="Invoice_id|Product_id"/>
<parameter key="invert_selection" value="true"/>
</operator>
<operator activated="true" class="remove_duplicates" compatibility="9.0.003" expanded="true" height="103" name="Remove Duplicates" width="90" x="581" y="136"/>
<operator activated="true" class="select_attributes" compatibility="9.0.003" expanded="true" height="82" name="Select Attributes" width="90" x="447" y="34">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Invoice_id|Product_id"/>
</operator>
<connect from_op="Retrieve table1" from_port="output" to_op="Join" to_port="left"/>
<connect from_op="Retrieve table2" from_port="output" to_op="Join" to_port="right"/>
<connect from_op="Join" from_port="join" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Multiply" from_port="output 2" to_op="Select Attributes (2)" to_port="example set input"/>
<connect from_op="Select Attributes (2)" from_port="example set output" to_op="Remove Duplicates" to_port="example set input"/>
<connect from_op="Remove Duplicates" from_port="example set output" to_port="result 2"/>
<connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>Example sets are attached.
Scott0 -
Hi Scott,
Not quite. If you look at my original dataset, I don't have a product_id in the joined dataset and hence can't extract it, unlike in your Exampleset(Join). I need to remove duplicates, generate a new product_id and then figure out a way of getting that original link between invoice and product.
thanks for taking the time to look at this though.
Martin
0