Introduction to the DATA step statements

Ian Balanzá-Davis
Altair Employee
edited September 2022 in Altair RapidMiner

The Introduction to the DATA step article describes the general structure of the DATA step. Various statements can be used in the DATA step that enable you to read and write data, format data, manipulate data, control the order of execution of statements in the DATA step, and so on.

The format of the DATA step is as follows:

DATA options;
  data_step_statements;
RUN;

The DATA and RUN statements mark the start and end of a DATA step, and options specify one or more datasets to which data is written and how the data is written. Between them, DATA step statements read and write data, manipulate the data in various ways, and control the operation of the DATA step. Statements can also call functions, which provide much of the data manipulation functionality; both statements and functions are described in the Altair SLC Reference for Language Elements. For example, the following DATA step contains two statements:

DATA out;
  LENGTH var1 $ 10;
  var1 = "Smith";
RUN;

This contains:

  • The LENGTH statement, which creates the variable var1 and defines it as a character variable with a length of 10 bytes.
  • An assignment statement, which, in this case, assigns the value Smith to the variable var1. Because Smith is shorter than 10 bytes, the value is padded with space characters so that it is 10 bytes long. Character variables in the SAS language have exactly the specified length and are padded or truncated if necessary.
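The padding and truncation behaviour can be observed with a small sketch like the one below. The variable names name10, name3 and marked are hypothetical, and the use of DATA _NULL_ with a PUT statement to write values to the log is an assumption about how you might choose to inspect them; neither appears in the example above.

```sas
/* Sketch: character variables keep exactly their declared length. */
DATA _NULL_;                   /* _NULL_: run the step without creating a dataset */
  LENGTH name10 $ 10 name3 $ 3;
  name10 = "Smith";            /* stored as "Smith" followed by 5 spaces */
  name3  = "Smith";            /* truncated to "Smi" */
  marked = name10 || "|";      /* the "|" lands after the padding */
  PUT name3= marked=;          /* write both values to the log */
RUN;
```

Concatenating a visible marker after name10 makes the trailing spaces visible in the log, which would otherwise be hard to see.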

The DATA step in this example iterates once and then ends, because there is no input file. See Introduction to the DATA step for more information on DATA step iteration.

In the following DATA step, a dataset is read and the value of one variable is used to specify the value for another variable. The dataset contains the following values:

33 Smith
32 Jones
56 Brown

Each line in a dataset is called an observation. Each observation contains an age and a surname. Each value in the observation (for example, 33 and Smith) is assigned to a variable. In this dataset the variables are named Age and Surname.
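One way such a dataset could be created is from in-stream data, as sketched below. INPUT and DATALINES are standard SAS language statements, but the Sex column and its values are assumptions added purely for illustration (the listing above shows only Age and Surname), as are the lengths given to Surname and Sex.

```sas
/* Sketch: build the input dataset ds_in from in-stream data. */
/* The Sex values are hypothetical; they are not in the original listing. */
DATA ds_in;
  LENGTH Surname $ 10 Sex $ 1;   /* assumed lengths */
  INPUT Age Surname $ Sex $;     /* list input: values separated by spaces */
  DATALINES;
33 Smith F
32 Jones M
56 Brown F
;
RUN;
```

Each line after DATALINES becomes one observation, and each space-separated value is assigned to the variable in the matching position of the INPUT statement.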

DATA out;
  SET ds_in;
  LENGTH fs $ 8;
  IF Sex EQ 'F' THEN fs = 'Female';
  IF Sex EQ 'M' THEN fs = 'Male';
  DROP Sex;
  OUTPUT;
RUN;

In this DATA step:

    • The SET statement specifies the name of the input dataset and reads the next observation from it.
    • The LENGTH statement specifies the length of the variable fs; this is a new variable that is not in the input dataset and is created in the DATA step to hold a new value.
    • The IF-THEN statements are used to assign a value to the variable fs.
    • The DROP statement removes the specified variable Sex from the output; the output dataset therefore only contains values for the variables Age and fs.
    • The OUTPUT statement writes the current values of all variables in the DATA step to the output dataset out. The OUTPUT statement is optional; if it is omitted, an output is performed by default at the end of each iteration of the DATA step.

      Note: You might not want the statement executed at the end of the DATA step; if you need to control at what point data is written to the dataset, you must explicitly specify the OUTPUT statement.
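As an illustration of controlling when and where data is written, the following sketch names two output datasets on the DATA statement and uses explicit OUTPUT statements to route each observation to one of them. The dataset names under40 and over40 and the threshold of 40 are hypothetical choices made for this example.

```sas
/* Sketch: route each observation to one of two datasets. */
DATA under40 over40;
  SET ds_in;
  IF Age LT 40 THEN OUTPUT under40;   /* written only to under40 */
  ELSE OUTPUT over40;                 /* written only to over40 */
RUN;
```

Because the step contains explicit OUTPUT statements, no default output is performed at the end of each iteration, so every observation is written exactly once.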

The dataset that results from this DATA step contains the observations:

33 Smith
32 Jones
56 Brown

In the following example, the out dataset created in the previous example is used as input. The DATA step function UPCASE is used to change the case of the letters in the fs variable so that the resulting value is all uppercase.

DATA uc_out;
  SET out;
  ucfs = UPCASE(fs);
  DROP fs;
RUN;

Because no OUTPUT statement is specified in this DATA step, an output is performed by default at the end of each iteration, writing each observation to the dataset.

The dataset uc_out now contains two variables, Age and ucfs. The dataset contains the values:

33 SMITH
32 JONES
56 BROWN

The UPCASE function is just one of many functions available that enable you to perform operations on data.
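As a brief sketch of a few other character functions, the step below applies LOWCASE, SUBSTR and LENGTH to the fs variable in the out dataset. The output dataset name fn_demo and the new variable names are hypothetical; check the Altair SLC Reference for Language Elements for the full details of each function.

```sas
/* Sketch: a few other functions applied to the fs variable. */
DATA fn_demo;
  SET out;
  lcfs       = LOWCASE(fs);        /* all letters in lowercase */
  first_char = SUBSTR(fs, 1, 1);   /* the first character of fs */
  n_chars    = LENGTH(fs);         /* length, not counting trailing spaces */
RUN;
```

Each function is called in an assignment statement in the same way as UPCASE, and the default output at the end of each iteration writes the new variables to fn_demo.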