Introduction to the DATA step statements


The Introduction to the DATA step article describes the general structure of the DATA step. Various statements can be used in the DATA step that enable you to read and write data, format data, manipulate data, control the order of execution of statements in the DATA step, and so on.

The format of the DATA step is as follows:

DATA options;
  data_step_statements;
RUN;

The DATA and RUN statements mark the start and end of a DATA step. The options specify one or more datasets to which data is written, and how the data is written. Between DATA and RUN, the DATA step statements read and write data, manipulate the data in various ways, and control the operation of the DATA step. Statements can also call functions, which provide much of the data manipulation functionality; both statements and functions are described in the Altair SLC Reference for Language Elements. For example, the following DATA step contains two statements:

DATA out;
  LENGTH var1 $ 10;
  var1 = "Smith";
RUN;

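This DATA step contains a LENGTH statement, which declares var1 as a character variable with a length of 10, and an assignment statement, which sets var1 to the string Smith.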

The DATA step in this example iterates once and then ends, because there is no input data to read. See Introduction to the DATA step for more information on DATA step iteration.
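
One way to observe the single iteration is to write the automatic variable _N_ to the log. The following is a minimal sketch, assuming that the PUT statement and _N_ behave as they do in the SAS language; the PUT line is an addition for illustration and is not part of the example above.

```sas
DATA out;
  LENGTH var1 $ 10;
  var1 = "Smith";
  /* _N_ counts DATA step iterations; it is not written to the dataset */
  PUT _N_= var1=;
RUN;
```

If the step iterates once, the log should show a single line for this PUT statement, and the out dataset should contain one observation.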

In the following DATA step, a dataset is read and the value of one variable is used to specify the value for another variable. The dataset contains the following values:

33 Smith
32 Jones
56 Brown

Each line in a dataset is called an observation. In the listing above, each observation contains an age and a surname, and each value (for example, 33 and Smith) is assigned to a variable; these variables are named Age and Surname. The DATA step that follows also tests a variable named Sex, which is expected to hold the value F or M.
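
To make the relationship between values and variables concrete, the following sketch builds a dataset from the three lines listed above by using in-stream data. It assumes that the INPUT and DATALINES statements are available as in the SAS language. The dataset name people is hypothetical, and the sketch covers only the Age and Surname variables shown in the listing.

```sas
DATA people;
  /* Age is read as a number; the $ marks Surname as a character variable */
  INPUT Age Surname $;
  DATALINES;
33 Smith
32 Jones
56 Brown
;
RUN;
```

Each data line becomes one observation, with the first value assigned to Age and the second to Surname. The DATA step that reads the existing ds_in dataset is as follows: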

DATA out;
  SET ds_in;
  LENGTH fs $ 8;
  IF Sex EQ 'F' THEN fs = 'Female';
  IF Sex EQ 'M' THEN fs = 'Male';
  DROP Sex;
  OUTPUT;
RUN;

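In this DATA step, the SET statement reads each observation from the ds_in dataset. The LENGTH statement declares fs as a character variable with a length of 8. The two IF statements set fs to Female or Male according to the value of Sex. The DROP statement excludes Sex from the output dataset, and the OUTPUT statement writes the current observation to the out dataset.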

The dataset that results from this DATA step contains the observations:

33 Smith
32 Jones
56 Brown
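
The two IF statements in the DATA step above are each evaluated for every observation. As an alternative sketch, the second test can be attached to the first with ELSE, so that it is evaluated only when the first condition is false. This assumes the IF-THEN/ELSE form of the SAS language; the dataset name out_else is hypothetical.

```sas
DATA out_else;
  SET ds_in;
  LENGTH fs $ 8;
  /* The ELSE branch is skipped when Sex is 'F' */
  IF Sex EQ 'F' THEN fs = 'Female';
  ELSE IF Sex EQ 'M' THEN fs = 'Male';
  DROP Sex;
  OUTPUT;
RUN;
```

For input values of F and M the result should match the out dataset; any other value of Sex leaves fs blank in both versions.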

In the following example, the out dataset created in the previous example is used as input. The DATA step function UPCASE is used to change the case of the letters in the fs variable so that the resulting value is all uppercase.

DATA uc_out;
  SET out;
  ucfs = UPCASE(fs);
  DROP fs;
RUN;

Because this DATA step does not contain an OUTPUT statement, an implicit output is performed by default at the end of each iteration, writing each observation to the dataset.
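
As a sketch of that default behaviour, the following step adds an explicit OUTPUT statement as its last statement. It should write the same observations as the step above. The dataset name uc_out_explicit is hypothetical.

```sas
DATA uc_out_explicit;
  SET out;
  ucfs = UPCASE(fs);
  DROP fs;
  /* Explicit equivalent of the output performed by default */
  OUTPUT;
RUN;
```

When an explicit OUTPUT statement is present, as here, observations are written only where that statement executes, so its position in the step matters.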

The dataset uc_out now contains two variables Age and ucfs. The dataset contains the values:

33 SMITH
32 JONES
56 BROWN

The UPCASE function is just one of many functions available that enable you to perform operations on data.
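
As an illustration of other functions, the following sketch applies LOWCASE and SUBSTR to the out dataset. It assumes that both functions are available with the same arguments as in the SAS language, and that out still contains the Surname variable. The names fn_out, lcfs and initial are hypothetical.

```sas
DATA fn_out;
  SET out;
  /* LOWCASE returns its argument in lowercase */
  lcfs = LOWCASE(fs);
  /* SUBSTR(string, position, length): first character of Surname */
  initial = SUBSTR(Surname, 1, 1);
RUN;
```

Refer to the Altair SLC Reference for Language Elements to confirm the functions that are supported and their exact arguments.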