Introduction to the DATA step

IanBD
Altair Employee
edited September 2022 in Altair RapidMiner

The DATA step is one element of a computer program written in the SAS language; other elements include global statements, procedure steps and macros.

A SAS language program does not require a DATA step – you can create a valid program using only other elements of the language, such as macro statements or procedures. The DATA step is, however, ideal for preparing data for further use. You can use the DATA step to clean data, re-organise it, and perform various operations on it. The DATA step can use a wide variety of functions for mathematical operations, string manipulation, financial calculations, and so on.
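As a small illustration of the functions mentioned above, here is a minimal sketch that builds a one-observation dataset. The dataset name demo and all variable names are hypothetical; ROUND, UPCASE and CATX are standard SAS language functions, but check the Altair SLC Reference for Language Elements for the full list that is supported.

```sas
/* Hypothetical example: dataset and variable names are illustrative only */
DATA demo;
  price    = 19.987;
  rounded  = ROUND(price, 0.01);        /* mathematical: round to two decimal places */
  surname  = UPCASE('smith');           /* string: convert to upper case */
  fullname = CATX(' ', 'Mr', surname);  /* string: join values, separated by a space */
RUN;
```

Because this DATA step has no input dataset, it executes once and writes a single observation to demo.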

The DATA and RUN statements mark the start and end of a DATA step:

DATA options;
  data_step_statements;
RUN;

The options you can specify to the DATA statement, and the data_step_statements you can use in the DATA step, are defined in the Altair SLC Reference for Language Elements.

One very important option to the DATA statement is the output dataset name. This enables you to specify one or more datasets to which data is written. In the following DATA step, any data assigned to variables in the statements specified in data_step_statements is written to the dataset out:

DATA out;
  data_step_statements;
RUN;

Note: In this example, the dataset out is created in the work library, which is the default library. To create the dataset in another library, use the LIBNAME global statement; see Introduction to the LIBNAME statement: SAS language libraries.
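As a rough sketch of what writing to another library might look like, the following assigns a library reference and uses a two-level name on the DATA statement. The library reference mylib, the folder path and the variable x are all assumptions made for illustration; substitute a folder that exists on your system.

```sas
/* Hypothetical library reference and path: replace with a real folder */
LIBNAME mylib 'C:\data';

/* The two-level name mylib.out writes the dataset to mylib, not work */
DATA mylib.out;
  x = 1;
RUN;
```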

Another useful option to the DATA statement is _NULL_. This prevents data being written to a dataset. You might want to do this if you want to write output to the log, or to an external file.
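For example, the following minimal sketch writes a value to the log without creating a dataset. The variable name message is hypothetical; PUT is the standard DATA step statement for writing to the log.

```sas
/* No dataset is created because the output dataset name is _NULL_ */
DATA _NULL_;
  message = 'Hello from the DATA step';
  PUT message=;   /* writes message=Hello from the DATA step to the log */
RUN;
```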

An important DATA step statement is SET, which specifies the name of a dataset to read. For example:

DATA out;
  SET ds_in;
RUN;

If a DATA step reads data from an input dataset or file, the step iterates automatically; that is, when the DATA step reaches the RUN statement, it starts again from the beginning. The statements in the step are executed once for each observation or record in turn, until the end of the dataset or file is reached. Every time the SET statement is executed, the next observation is read from the input dataset.

Note: A DATA step can have more than one SET statement, enabling it to read from multiple datasets.
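To sketch how more than one SET statement behaves, the following creates two small datasets and then reads from both in a single DATA step. The dataset names names, ages and combined are hypothetical. In the SAS language, each SET statement reads one observation per iteration from its own dataset, so the two datasets are read side by side, and the step stops as soon as either dataset runs out of observations.

```sas
/* Hypothetical input datasets, created inline so the example is self-contained */
DATA names;
  INPUT Surname $;
  DATALINES;
Smith
Jones
Brown
;
RUN;

DATA ages;
  INPUT Age;
  DATALINES;
33
32
56
;
RUN;

/* Each iteration reads one observation from names and one from ages */
DATA combined;
  SET names;
  SET ages;
RUN;
```

Under these assumptions, combined contains three observations, each with a Surname and an Age. This is different from naming both datasets on one SET statement (SET names ages;), which reads all of the first dataset and then all of the second.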

Any variable created in the DATA step is written, with its value, to the output dataset. The DATA step shown above, which contains only a SET statement, creates a dataset out in the work library and reads data from the dataset ds_in (also in the work library).
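A minimal sketch of a new variable appearing in the output dataset is shown here. The contents of ds_in and the variable name AgeNextYear are assumptions made for illustration.

```sas
/* Hypothetical input dataset, created inline so the example is self-contained */
DATA ds_in;
  INPUT Age Surname $;
  DATALINES;
33 Smith
32 Jones
56 Brown
;
RUN;

DATA out;
  SET ds_in;
  AgeNextYear = Age + 1;   /* new variable, written to out alongside Age and Surname */
RUN;
```

Each observation in out then holds three variables: Age and Surname, read from ds_in, and AgeNextYear, calculated in the DATA step.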

When an input dataset is specified in the DATA step, the DATA step iterates (loops) for each line of data in the dataset. Suppose the input dataset contains these lines:

33 Smith
32 Jones
56 Brown

Each line in a dataset is called an observation. Each value in an observation (for example, 33 and Smith) is assigned to a variable. In this input dataset the variables are named Age and Surname. A dataset contains data, and information about that data, such as how each observation is split into variables, the name and size of each variable, and so on.

The DATA step reads the value of each variable in the first observation (33, Smith) from the input dataset, and then writes those values to the output dataset. When the RUN statement is executed, the DATA step returns to the start of the DATA step and executes each statement again. The SET statement therefore reads the values in the next observation in the input dataset, and then writes them to the output dataset. The DATA step continues to iterate until it has reached the end of the dataset.
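The iteration described above can be observed in the log with a sketch such as the following. The inline creation of ds_in is an assumption made so that the example is self-contained; _N_ is the standard SAS language automatic variable that counts DATA step iterations, and is not written to the output dataset.

```sas
/* Hypothetical input dataset holding the three observations used in this article */
DATA ds_in;
  INPUT Age Surname $;
  DATALINES;
33 Smith
32 Jones
56 Brown
;
RUN;

DATA out;
  SET ds_in;                  /* reads the next observation on each iteration */
  PUT _N_= Age= Surname=;     /* writes the iteration number and values to the log */
RUN;                          /* the observation is written to out, then the step restarts */
```

With this input, the log should show one line per iteration:

_N_=1 Age=33 Surname=Smith
_N_=2 Age=32 Surname=Jones
_N_=3 Age=56 Surname=Brown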