Lesson 15: Tabs and Other Delimiters, Controlling Observations, Variables, and Output Data Sets
There are essentially three ways that data can be identified in a raw data file. Either 1) The position of each value in the file must be known, either by columns or some other scheme, 2) The values must be identified by a name or other symbol, or 3) The file is "delimited" by a symbol that tells where one variable stops and the next begins. The latter case is the subject of this lesson. We also discuss some issues about missing values in list input.
We have already dealt with delimited files where the delimiter is a space. This is what we were doing whenever we worked with the list input style. Now consider the program below. Notice that there are two missing values, indicated by a dot (period). Although all the data values are numbers, the second variable has been defined as character in the input statement. In the output, the missing character value is blank, but the missing numeric value is represented by a dot (i.e., the decimal point of a number that isn't there). If column input is used, blanks in the data will be correctly interpreted as missing values, but with list input that would cause a problem.
The data in the above program was entered using tabs between the values. The SAS editor does a funny thing with tabs--although you can move the cursor back and forth over them, and the cursor will "jump" just as you expect, when SAS executes the program, it interprets the tabs as a series of spaces. If you try reading the data above with column input, you will find that the data are located in columns 1, 5, and 9. However, when the data are read from an external file, we encounter a much different situation.
Fortunately, there is an option for the infile statement, called expandtabs, to fix this. It still requires that the "dots" be in place for the missing values, though.
Of course, our delimiters need not be tabs or spaces--they can be almost any character. Commas are probably the most common alternative, though sometimes we see hyphens or slashes used. In such cases, we can use the "dlm=" option, or spell it out, "delimiter=". Here we have replaced the tabs with commas in the external file.
There are special delimited files that have more complex rules. You may have seen "csv" files, for example, that can be produced and read by Excel. That's an abbreviation for "comma separated values," which implies that they use a comma as a delimiter. But there is more to it that that. Missing values are not represented by dots or spaces, just two delimiters in a row. There is also a feature that text with embedded commas will be saved inside quote marks, so that the imbedded commas are not interpreted as delimiters. To read this kind of file, use the "dsd" option in the infile statement. The "dlm=" option can be used with it, in the rare event that the delimiter is something other than a comma. Here is a csv file, created in Excel, with the same data and empty cells for missing values.
The data import wizard will also import csv and tab-delimited files, as well as a number of other types, so you may want to check that out as well.
In a previous lesson, the obs and firstobs options were used to control which observations were included in a data set In this lesson some ways to fine-tune the selection of observations and variables are introduced. A text file containing the data below can be downloaded here.
You can decide whether or not an observation is included based on the value of a variable.
Or you can go the other way.
You can use comparison operators, like <, >, <=, >=, and their corresponding character abbreviations, lt, gt, le, and ge. If you use a character variable for the test, enclose the value in quotes and remember that case matters.
You can use logical operators like "and" and "or," and control order of operations using parentheses.
When reading from another SAS data set, such as when using a set statement, the same syntax can be used, or, you can use "where" in place of "if."
The use of if and where can be summarized as follows:
IF: Use in any data step, not in proc steps
WHERE: Use only when input is a SAS data set, in data or proc steps.
A data set can create more than one data set at a time. Perhaps you want to create a temporary data set for immediate use in the work directory, but also want to save it to a permanent library, as shown in the example below. Note that the data set names were placed on separate lines and indented in order to make it more clear to the reader what is being done.
Suppose, on the other hand, that you want to create different data sets. The following program will create three data sets, each containing different variables, but the same number of observations as the complete data.
You can also send different observations to different data sets. To do this, construct a logical test involving "if-then" clauses together with the "output" keyword to specify the target data set.
Notice that the three data sets being created are still named in the data statement. Data set options, such as keep and drop, can be used here just as in the last example. The input statement comes next, then the output instructions. As each observation is read from the cards (or infile, or source data set), the three if statements are processed, and whenever the logical condition is met, the observation is output to that data set. Some observations are output to more than one data set, because more than one condition is met.
If the intention is to split the data into mutually exclusive and exhaustive sets, then "if-then-else" clauses can be used to good effect. An "else statement" is only executed if the previous "if statement" is not (the logical test fails).
Use the example data given at the beginning of the lesson to do the following. Print all data sets.
1. In one data step,
create three data sets containing the following combinations of variables:
a) Name and age,
b) Height and weight, and
c) Name and height.
2. In one data step, create
three data sets containing the following groups of observations with all
a) All observations for names that come after "E" in alphabetical order,
b) All observations with ages of 9 or 10, and
c) All observations where the height is less than or equal to 48 and the weight is greater than or equal to 90.
In one data step, create three
data sets containing mutually exclusive observations (use if-then-else):
a) All observations where the weights are less than 75,
b) All observations where the weights are at least 75 but less than 100, and
c) All observations where the weights are at least 100.
Copyright reserved by Dr. Dwight Galster, 2006. Please request permission for reprints (other than for personal use) from firstname.lastname@example.org . "SAS" is a registered trade name of SAS Institute, Cary North Carolina. All other trade names mentioned are the property of their respective owners. This document is a work in progress and comments are welcome. Please send an email if you find it useful or if your site links to it.