Lesson 7: Data Set Options, Set and Merge Statements
Data Set Options
"Options" are used in SAS in various places, and with various different kinds of syntax. For this reason, the concept of options can be a bit confusing. An earlier lesson introduced the idea of a global system options statement, where we can set things like linesize, page numbering, and so forth. Data set options are something different. They are commands added to a basic data step statement to refine or modify the work of the data step, or they can be used to modify how a data set is used in a proc step. (The term, "options," is also used for any optional command in any SAS statement. This is why we have to distinguish between system options, data set options, and other options.)
We begin with the drop and keep options. The examples below show how these two options can be used to accomplish the same thing. The choice between them is purely a matter of convenience. The drop option lists the variables to be omitted from the data set, while the keep option lists those which are to remain in the data set. There could be many reasons why you might not want all of the original variables in your data set. In the next example, we use the original variables for calculations, but do not include them in the data set. In other cases, we may create variables that are only used in the program and do not need to be saved.
Take note of the syntax used here, as it is unique for data set options. All the options to be applied to a particular data set are enclosed in parentheses, following the name of the data set to which they apply. An equal sign after each option is followed by the list of items involved. Here they are part of the data step, but they are NOT called DATA STEP options, but rather DATA SET options, and they can be used any time a data set is referenced, such as in this proc print statement:
(A var statement under proc print would be more appropriate, but this works. Var statements are covered later.)
The Set Statement
The set statement is used to create a new data set from an existing one. In the simplest example, it merely copies a data set. The following program will create a new data set called two that is an exact copy of one, providing one exists.
Now suppose that a data set called one has been created, as shown in the data step below, and that you later wanted to create a new data set called two, with some new variables, but leaving the old one unchanged. The set statement can be used to copy the original data set and add new variables.
SAS will now go through all the observations in one, and place in the new data set two the original variables together with the new ones that are defined.
There are two data sets now.
Suppose we want to print both of them. Most procs, including proc print, will use the
most recently created data set by default. Thus the second statement below would print only two.
To specify another data set, use the syntax shown in the first statement
below.
Each “proc print” statement prints one data set.
Using Data Set Options with Set Statements
Next we show how data set options can be applied to either input or output data sets when using the set statement to read another data set. In the example below, data set one is first created to serve as the input data set. The data sets two and three will be exactly the same, but the processes by which they are produced differ. When two is created, all four variables from one are read into memory, then when the data are written to two the variables x and y are eliminated. When three is created, only sumxy and prodxy are read into memory from one, and subsequently saved to three. If there are many variables and observations involved, the second method is preferred because it is more efficient (saves system resources). However, you cannot drop variables from an input data set if you intend to use them in calculations!
The firstobs and obs options specify the starting and stopping observation number to read. They do not apply to output data sets, but are used with set statements and in procs where data sets are referenced. In the first example, a second data set is created using these options to make a subset of the original data. Note the observation numbers in the output.
In the next example, the option is given in the proc print statement. Compare the observation numbers. In both cases, the observation numbers correspond to the observations actually stored in the data. They are not renumbered by proc print.
Finally, we discuss the rename option. At times it will be necessary to change the names of variables. As with the other options, this can be done when reading or writing the data. When used with an output data set, it renames the variables when the data set is saved, and does not affect the names used within the data step, as shown below. Also, the rename option requires another set of parentheses for the list of variables to be renamed. The name change is specified as "old name=new name."
When the rename statement is used with an input data set, the new names are in effect during the data step, as in this example:
Using Firstobs and Obs in the Infile Statement
Firstobs and obs can also be used in the infile statement when reading from an external file or cards. Here, the syntax is different, since they are not connected with a data set, and the options are simply typed on the line following the filename or cards keyword. There is one other difference--in this case, the numbers refer to the starting and ending line in the data, which is not always the same as the observation (e.g., observations might take up two lines each).
Concatenating Data Sets Using the Set Statement
The set statement can also be used to combine data sets in various ways. The first way is called concatenation, and is simply combining them in order, one after the other. The example below uses two data sets, but there can be more. When using concatenation, all of the observations of all of the listed data sets are included in the result. All variables from all the sets are included, with missing values assigned in cases where a variable does not occur in the original.
Suppose the data sets have different variable names.
You could use data set options to re-align the variables in a case like this.
Combining Observations with the Set Statement
The second method of combining data sets is called "one-to-one reading" and may be thought of as a side-by-side version of concatenation. The programming difference is that while in concatenation the source data sets are listed in one set statement, in one-to-one reading each source data set is given in a separate set statement. If the same variable names occur in more than one source data set, the values of the later sets overwrite the earlier ones. If the data sets do not have the same number of observations, the result is cut off at the length of the shortest set. The example below shows two data sets of different lengths and the same variables. The x variable is changed using a rename option during input. Thus the source variable is overwritten with the second data set values, but because of the name change, we now have x and y coming from the original x variables.
Interleaving Data with the Set Statement
The third method is called interleaving. This is like concatenation, except that the observations are combined so that they are sorted, instead of having one data set placed after the other. The variable(s) on which the sort is done are called by variables, and the interleave is done by adding a by statement to the concatenation program. The by variables must be sorted before the interleave is done, so if they are not in order to begin with, we use proc sort to do the job. Proc sort will also have a by statement, listing the by variable(s), just like the data step that does the interleave.
Merging Data Sets
Now, if you're thinking ahead, you must realize that we should be able to take this to another level--how about putting the data side-by-side while matching up variables? That's called a merge. And that's where we're going next. But before getting to the programming, there are some things that need to be explained. We need to be very careful about how things match up. For convenience, we'll visualize the data sets laid out side by side, so that we have a right and a left data set. The simplest case occurs when there is exactly one item on the left to match exactly one item on the right. This is called a one-to-one merge. If there is exactly one on the left to match more than one on the right, we call it a one-to-many merge, and if the roles are reversed, a many-to-one merge. Finally, if there are multiple instances of variable values on both sides, it is called a many-to-many merge. The latter should usually be avoided, but an example will be provided to show what happens.
Our first example will be one-to-one reading again, but this time using a merge statement. The difference is that the data set continues building until the end of the longest data set is reached. (In the following examples the variables have been given unique names to avoid needing rename options.)
Next, we do a one-to-one merge. The data sets will be matched by the name variable, therefore this variable may be called the match key. The match key is identified using a by statement, and it must be the same in both source data sets. Just as in interleaving, the data sets must be sorted. To simplify the example, the data are entered in sorted order. Note that if something doesn't have a match, it is included, with missing values where the matching data should be. Also, this is a good time to mention that SAS character variables are case sensitive, so that "john" and "John" are two different values. If your data might have mixed case, the upcase function can be used to convert everything to upper case to ensure it matches.
Now consider a one-to-many merge. We will have one instance of each name on the left, and multiple instances of some names on the right. There is no change to the program (only the data). The result is that one observation is produced for each observation on the "many" side, but if there is no match, again the relevant observations on either side are included with missing values.
Here is the result of a many-to-many merge. The technical details of what happens are a bit complicated, but we can say, in a simplified way, that SAS matches lines side-to-side until one side runs out of the matching observations, then it repeats the last observation from the short side until the long side runs out of the matching value. (If you go back to the previous examples, you can see that they are special cases of this more general result.) There are very few situations where this behavior is desirable. Most merging is done with one-to-one or one-to-many relationships.
As a final comment, note that while there is a useful purpose for having two or more set statements, there is nothing similar for merge statements. You can put two or more merge statements in a data step, with or without a by statement, and not get an error message, but the resulting data values may be intermingled in "unexpected" ways.
Exercises
1. Refer to the data in Exercise #1 of Lesson 6. For each of the following, include a proc print to display the results. Use appropriate titles. This can all be done in one program.
2. Refer to Exercise #2 of Lesson 5.
3. Download this data. Do not change the file or copy it into your editor. Look at it with a text editor, such as Notepad. There are two sections, with headings that say "1." and "2." (These lines are not to be considered observations).
4. Copy the SAS code below into the editor to start with. Assume that these data sets represent an inventory list that is being revised at each step. The prices change each time, but the "itemno" is revised between new1 and new2 only. Write a program that does the following, and print each of the data sets you create.
data new1; input itemno name $ price; cards; 325 PrintCrd 211 276 KeyPad 37 842 PnclHldr 8 422 PaprShrd 132 523 Basket 29 ; data new2; input itemno name $ price; cards; 333 PrintCrd 399 277 KeyPad 25 802 PnclHldr 12 417 PaprShrd 122 515 Basket 17 ; data new3; input itemno name $ price; cards; 333 PrintCrd 386 277 KeyPad 25 802 PnclHldr 11 417 PaprShrd 135 515 Basket 15 ;
Copyright reserved by Dr. Dwight Galster, 2006. Please request permission for reprints (other than for personal use) from dwight.galster@sdstate.edu . "SAS" is a registered trade name of SAS Institute, Cary North Carolina. All other trade names mentioned are the property of their respective owners. This document is a work in progress and comments are welcome. Please send an email if you find it useful or if your site links to it.