Workflow: Picking predictive variables with the Top Select block


The Top Select block enables you to retain selected variables in a dataset and discard the rest. In this example, the block is used to keep certain columns from the loan_data.csv dataset, in which each observation describes a loan and the person taking out the loan.

The following steps demonstrate how to use the Top Select block to sort the variables in the loan_data.csv dataset by their entropy variance with respect to the Default variable. The variables with the highest entropy variance are then selected and output:

  1. Import the loan_data.csv dataset onto a Workflow canvas using the Text File Import block.
  2. Expand the Data Preparation group in the Workflow palette, then click and drag a Top Select block onto the Workflow canvas.
  3. Click the Output port of the loan_data dataset block and drag a connection towards the Input port of the Top Select block.
  4. Double-click the Top Select block to display the Configure Top Select dialog box.
  5. In the Configure Top Select dialog box:
    1. In the Dependent variable drop-down list, select Default.
    2. At the top of the Unselected Independent Variables list, click Entropy Variance to sort the list in descending order of entropy variance.
    3. In the Unselected Independent Variables list, click Loan_Period.
    4. Hold down the Shift key and click the sixth variable in the list, Loan_Amount, to highlight every variable from Loan_Period to Loan_Amount.
    5. Click Select to move the highlighted variables to the Selected Independent Variables list.
  6. Click OK to save the configuration and close the Configure Top Select dialog box.

A green execution status is displayed in the Output port of the Top Select block and of the new Working Dataset. The output dataset of the Top Select block contains only the selected variables from the input loan_data dataset.
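
The rank-then-keep logic of the steps above can be sketched outside the Workflow canvas. The following Python is a minimal illustration only, not the block's implementation. The scoring function is a stand-in (the fraction of the dependent variable's Shannon entropy removed by knowing the independent variable) and is not claimed to match the block's entropy variance formula. The function names, the Region column and the four toy rows are hypothetical; only Default, Loan_Period and Loan_Amount come from the example above. Values are treated as discrete categories, so continuous variables would need binning first.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def explained_entropy(rows, independent, dependent):
    """Stand-in score: share of the dependent variable's entropy that is
    removed once the independent variable is known (0.0 to 1.0).
    ASSUMPTION: this is not necessarily the block's entropy variance."""
    base = entropy([row[dependent] for row in rows])
    if base == 0:
        return 0.0
    groups = defaultdict(list)
    for row in rows:
        groups[row[independent]].append(row[dependent])
    conditional = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return (base - conditional) / base


def top_select(rows, dependent, keep):
    """Sort independent variables by score, descending, and keep the first `keep`.
    Returns the kept names and a dataset holding only those columns."""
    candidates = [name for name in rows[0] if name != dependent]
    ranked = sorted(
        candidates,
        key=lambda name: explained_entropy(rows, name, dependent),
        reverse=True,
    )
    selected = ranked[:keep]
    return selected, [{name: row[name] for name in selected} for row in rows]


# Hypothetical toy rows, not taken from loan_data.csv.
rows = [
    {"Region": "a", "Loan_Amount": "low", "Loan_Period": "short", "Default": "no"},
    {"Region": "b", "Loan_Amount": "low", "Loan_Period": "short", "Default": "no"},
    {"Region": "a", "Loan_Amount": "low", "Loan_Period": "long", "Default": "yes"},
    {"Region": "b", "Loan_Amount": "high", "Loan_Period": "long", "Default": "yes"},
]

selected, output = top_select(rows, dependent="Default", keep=2)
print(selected)  # ['Loan_Period', 'Loan_Amount']
```

In this toy data Loan_Period separates the two Default values completely, Loan_Amount does so only partly, and Region not at all, so the first two are kept and Region is dropped from the output rows.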