Workflow: Using the Deduplicate block to remove duplicate rows from a dataset


The Deduplicate block enables you to remove duplicate rows from an input dataset.

The following demonstrates how the Deduplicate block is used to remove a duplicate observation from an input dataset, iris.csv, which contains observations that describe measurements of iris flowers.

  1. Import the iris.csv dataset onto a Workflow canvas using the Text File Import block.
  2. Expand the Data Preparation group in the Workflow palette, then click and drag a Deduplicate block onto the Workflow canvas.
  3. Click the Output port of the iris dataset block and drag a connection towards the Input port of the Deduplicate block.
  4. Double-click the Deduplicate block to display the Configure Deduplicate dialog box.
    1. In the Configure Deduplicate dialog box, ensure that Basic - Make all observations unique is selected, then click OK.

A green execution status is displayed in the Output ports of the Deduplicate block and the new Working Dataset. The Deduplicate block output dataset contains the input dataset with one duplicate row removed.
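
For readers who want to see the effect of the steps above outside the Workflow canvas, the following is a minimal Python sketch of row-level deduplication. It is not how the Deduplicate block is implemented; it only illustrates the idea of keeping the first occurrence of each identical row. The sample data, the column names, and the deduplicate function are hypothetical and chosen for illustration, and the sketch assumes a row counts as a duplicate only when every column value matches as text.

```python
import csv
import io

# Hypothetical sample in the style of iris.csv; the column names and
# values are assumptions for illustration, not the real file contents.
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
5.8,2.7,5.1,1.9,virginica
5.8,2.7,5.1,1.9,virginica
"""


def deduplicate(rows):
    """Return rows with exact duplicates removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # lists are not hashable, so compare rows as tuples
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


reader = csv.reader(io.StringIO(SAMPLE_CSV))
header = next(reader)  # keep the header row out of the comparison
rows = list(reader)

unique_rows = deduplicate(rows)
removed = len(rows) - len(unique_rows)

print(f"{len(rows)} rows in, {len(unique_rows)} rows out, {removed} removed")
```

Because the sketch compares values as text, rows that differ only in formatting (for example 3.0 and 3) are treated as distinct. Row order is preserved, and the first copy of each duplicated row is the one retained.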