Lesson 6: Creating Variables
Creating Variables with Assignment Statements
You can create variables that are not in the data that you are reading. In the following program, avgrowth is first calculated from the other variables in the data. Second, dummy is assigned a constant value. Since the value assigned to it is a character string, indicated by the quotes around it, dummy will be a character variable of length 3, because that is the length of the first value assigned to it. Unless a length statement is used to set the length of a character variable before it is used, the length will be determined by the first value assigned to it. Numeric constants can also be assigned, simply by putting a number on the right side of the equal sign (no quotes).
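The original program is not reproduced here, so the following is only a minimal sketch of the idea. The data set name, the input variables (plot, growth1 to growth3), the data values, the value 'abc', and the variable count are all assumptions made for illustration; only avgrowth and dummy come from the text.

```sas
data growth;
   input plot $ growth1 growth2 growth3;          /* hypothetical input variables */
   avgrowth = (growth1 + growth2 + growth3) / 3;  /* calculated from other variables */
   dummy = 'abc';   /* character constant (hypothetical value): dummy gets length 3 */
   count = 1;       /* numeric constant: no quotes */
   datalines;
A 2.1 3.4 2.9
B 1.8 2.2 2.6
;
run;

proc print data=growth;
run;
```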
Obviously, variables are created by the input statement, but they are also created if they are specified in a length, attrib, format, or informat statement (see below). They can also be created by array definitions (a later topic), or by assignment statements, such as those in the example above. An assignment statement is made up of a variable name, an equal sign, and an expression representing the value to be assigned to the variable. The variable can appear in its own assigned expression, such as x=x+1, or x=log(x). A very special form of assignment statement, called a sum statement, or an accumulator, is an exception to this syntax: it has no equal sign. In the example below, p and q are accumulators. Their values are incremented, starting from zero, by the amount specified, for each succeeding observation. The accumulator p+1, below, is essentially the same as p=p+1, except that p is initialized to zero and its value is carried over from one observation to the next. Neither of these happens automatically if you use p=p+1, so the result of that assignment would be missing. (See also the retain statement.)
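A small sketch of accumulators follows. The names p and q come from the text; the data set name, the variable amount, the data values, and the comparison variable r are assumptions added for illustration.

```sas
data totals;
   input amount;      /* hypothetical input variable */
   p + 1;             /* sum statement: p counts observations 1, 2, 3 */
   q + amount;        /* sum statement: q is a running total 10, 30, 60 */
   r = r + 1;         /* ordinary assignment: r starts as missing each time, so it stays missing */
   datalines;
10
20
30
;
run;

proc print data=totals;
run;
```

Comparing the p and r columns in the printed output shows why the sum statement is treated as a special case.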
The arithmetic operations and mathematical functions used in assignment statements for numeric values are quite intuitive. The syntax is similar to that used for formulas on a graphing calculator or spreadsheet. The arithmetic operations are "+" (add), "-" (subtract or negative), "*" (multiply), and "/" (divide). Exponents are given with a double asterisk, such as "3**2" (three to the second power). Parentheses are used in the usual manner for controlling order of operations. Many functions are available, and their names can often be guessed because of their similarity to standard mathematical notation. Function arguments are enclosed in parentheses after the function name. Some examples are sqrt(x) for square root of x (where x can be a number, variable name, or other expression that evaluates to a non-negative number), log(x) for natural log, and exp(x) for the exponential function ("e to the x"). There are also some constants, such as pi, given by the function constant('pi'); note that the name of the constant is a character string in quotes. For more detailed information about functions, see the SAS Documentation under "Base SAS/SAS Language Reference: Dictionary/Dictionary of Language Elements/Functions and CALL Routines." (Note: Some of the documented functions may not work in The Learning Edition.) Here are a few more examples:
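The following sketch illustrates the operators and functions named above. The data set and variable names are assumptions chosen for this illustration.

```sas
data formulas;
   x = 9;
   a = 3**2;                        /* exponent: 9 */
   b = (x + 1) / 2;                 /* parentheses control order of operations: 5 */
   c = sqrt(x);                     /* square root: 3 */
   d = log(exp(1));                 /* natural log of e: 1 */
   area = constant('pi') * x**2;    /* pi times x squared; ** is applied before * */
run;

proc print data=formulas;
run;
```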
Since dates are numbers, you can do simple things like subtract two dates to find the number of days between them, without any problem. However, for more complicated tasks, SAS has quite a few date-related functions. For example day(x) returns the day of the month, month(x) returns the month number for a date, and qtr(x) returns the quarter number. If you have to do any serious computations with dates, check the SAS documentation for available tools. Remember, SAS also has date-time values, and functions to go along with them, as well.
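As a brief sketch of these ideas, the step below subtracts two dates and applies the three functions mentioned. The data set name, variable names, and the two dates are assumptions for illustration; the dates are written as SAS date constants.

```sas
data datecalc;
   start   = '15MAR2006'd;       /* date constant */
   finish  = '04JUL2006'd;
   elapsed = finish - start;     /* number of days between the dates: 111 */
   d = day(start);               /* day of the month: 15 */
   m = month(start);             /* month number: 3 */
   q = qtr(start);               /* quarter number: 1 */
   format start finish date9.;   /* display the stored numbers as dates */
run;

proc print data=datecalc;
run;
```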
For character variables, there is an operation called concatenation, indicated by "||" (two vertical bars), that puts two character strings together. There are many, many functions for character variables. We will just look at a few: substr(source, position, length) which extracts a substring, trim(source) which eliminates trailing blanks, length(source) which calculates the length of the value excluding trailing blanks, and upcase(source) which changes all the letters to upper case.
In the program below, a length statement has been used for the city variable, to allow up to 15 letters. Note that this method would not work for city names that have more than one word, like "New York City." The st (state) variable has been given an informat for two characters, but a colon modifier is used so that the pointer will move on to the beginning of the zip code. Zip codes should always be character variables; otherwise, those that start with zero will lose the leading zero.
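The original program is not reproduced here. The following is a reconstruction sketch based on the description in this section, not the author's exact code. The names city, st, zip, and addr1 through addr4 come from the text; the data set name, the data lines, and the names abbr and midchar are assumptions.

```sas
data address;
   length city $ 15;                 /* allow city names of up to 15 letters */
   input city $ st : $2. zip $;      /* colon modifier on st; zip read as character */
   addr1 = city || st || zip;                          /* plain concatenation */
   addr2 = trim(city) || st || zip;                    /* trailing blanks of city removed */
   addr3 = trim(city) || ', ' || st || '  ' || zip;    /* punctuation and spaces added */
   addr4 = upcase(addr3);                              /* all uppercase */
   abbr  = substr(city, 1, 2) || st;                   /* first two letters of city plus state */
   midchar = substr(city, ceil(length(city) / 2), 1);  /* middle character of city */
   datalines;
Brookings SD 57007
Boston MA 02134
;
run;

proc print data=address;
run;
```

For the first observation in this sketch, length(city) is 9, so the starting position is ceil(4.5) = 5 and midchar is "k".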
The first assigned variable, addr1, is created by simply concatenating all three variables. Note the (possibly undesirable) result, with the "extra" spaces between city and state, and the lack of spaces between state and zip code. The spaces are there because the variable length is, in fact, 15, and the unused positions are filled with spaces. Concatenation uses the whole variable, including spaces.
In addr2 we have removed the trailing spaces from city by using a trim function. Now there are no spaces between any of the combined variables.
In addr3, we have included punctuation and spaces between the variables. Notice that the concatenation operation works with constant expressions enclosed in quotes, as well as variables. Spaces are preserved just as written between the quotes, including the one space after the comma and the two spaces in front of the zip code.
The upcase function is demonstrated in addr4, which converts addr3 to all uppercase characters. Following that, the substring function is used to create a four-letter abbreviation, by extracting the first two letters of city and combining them with the state code. Note the order of the three arguments, first the source variable, then the starting position, then the number of characters to extract.
The last assignment statement shows how we can combine various functions to perform a specialized task. The idea here was to find the middle character of the city variable, defined to be the actual middle character for odd lengths and the letter immediately prior to the middle for even lengths. The substring function is used to extract the character, but the starting position must be calculated. The length function divided by two would be almost right, as it works fine for even lengths, but for odd lengths it gives a half, like 4.5 for "Brookings". Since the middle character is at the next higher whole number, we can use the ceiling function, ceil, one of several rounding functions available, this being the one that always rounds any decimal value up to the next integer.
Length, informat, and attrib statements
An alternative to specifying informats in the input statement is to use an informat statement. The informat statement has the same syntax as the format statement. It doesn't do anything that can't be done in the input statement, but it might be convenient to keep things organized, as in this example:
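A minimal sketch of an informat statement follows. The data set name, variable names, informats, and data lines are all assumptions chosen for illustration.

```sas
data sales;
   informat saledate mmddyy10. amount comma10. region $12.;   /* informats kept in one place */
   format saledate date9. amount dollar12.2;                  /* same syntax as the format statement */
   input saledate amount region;                              /* no informats needed here */
   datalines;
03/15/2006 1,234.50 Midwest
11/02/2006 987.25 Northeast
;
run;

proc print data=sales;
run;
```

Because the informats are declared separately, the input statement is reduced to a simple list of variable names.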
Numeric variables have a default length of 8 bytes in SAS. As we have seen, character variables also have a default length of 8 if they are read using a plain $ in the input statement. If an informat of the form "$n." is used, the length is n. Character variables created using data step programming statements get their lengths from the first value assigned to them. The length statement can be used to override the default lengths for both character and numeric variables.
It's not often we want to change the length of a numeric variable. Sometimes space can be saved when the values are integers. The allowed lengths are from 3 to 8 for PC SAS. A length of 3 will accommodate accurate integer values from -8192 to 8192. A length of 4 works to slightly over 2 million. It is not recommended to use shortened numeric values when fractions (decimals) are involved.
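Here is a small sketch of a length statement that sets a numeric length and a character length. The data set name, variable names, lengths, and data lines are assumptions for illustration.

```sas
data inventory;
   length qty 4 item $ 12.;    /* numeric length without the dot, character length with it */
   input item qty;             /* item is already character because of the length statement */
   datalines;
Stapler 150
Envelopes 20000
;
run;

proc contents data=inventory;  /* lists the length of each variable */
run;
```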
In the above example, you can see that the length statement has syntax similar to the format or informat statements. However, the "dot" is not required. Here it has been left out for the numeric length and included for the character length, just for an example. The dollar sign, however, is required for character variables. The length statement must occur before the first use of the variable in the program, or it will not have any effect.
Another way to use the length statement is shown below. This example sets the default numeric length to 3. Unless you specify other lengths, all numeric variables in this data set will have length 3. (This only works for numeric variables.) The character variables will have length 8.
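A sketch of the default= form is shown here. The data set name, variable names, and data values are assumptions; the values are small integers so that a length of 3 can store them accurately.

```sas
data counts;
   length default=3;            /* all numeric variables in this data set get length 3 */
   input site $ visits returns; /* site is character with the default length of 8 */
   datalines;
North 120 15
South 310 42
;
run;

proc contents data=counts;      /* confirms the lengths of visits, returns, and site */
run;
```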
Another way to do this is with the attrib statement, which is more complicated and allows you to set the lengths, formats, informats, and labels all in one command:
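A minimal sketch of an attrib statement follows. The data set name, variable names, attribute values, and data lines are assumptions chosen for illustration.

```sas
data employees;
   attrib name   length=$20 label='Employee name'
          hired  length=4 informat=mmddyy10. format=date9. label='Date hired'
          salary length=8 informat=comma10. format=dollar12.2 label='Annual salary';
   input name hired salary;     /* attributes were already set above */
   datalines;
Anderson 06/01/1999 45,000
Lee 11/15/2003 52,500
;
run;

proc print data=employees label;   /* the label option prints labels as column headings */
run;
```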
For each of the following exercises, copy and paste the data given in the problem into the SAS editor. Write a data step to read the data and create the new variables described, then print the results using proc print, using appropriate titles.
1. These numbers represent dimensions of cardboard boxes, length, width, and height, in inches.
32 18 12
16 15 24
48 12 32
15 30 45
20 30 36
2. This problem will provide a little practice in writing complicated formulas in SAS, paying attention to order of operations. Use the data below, with variables a, b, and c, and apply the following formulas to create two new variables called root and trunk. The first observation's results are -1 and 2.094, respectively.
1 6 5
4 -20 2
12 22 -11
3 -15 -9
3. Read the following data into three variables, making sure to get complete names. Use the character functions and operators to extract initials from the following names so that they look like "J.F.K." Then create an abbreviation for each name that looks like "J-n F-d K-y".
John Fitzgerald Kennedy
Martha Helena Goetz
Frederich Anthony Sailer
Albert Blake Codwell
Copyright reserved by Dr. Dwight Galster, 2006. Please request permission for reprints (other than for personal use) from email@example.com. "SAS" is a registered trade name of SAS Institute, Cary, North Carolina. All other trade names mentioned are the property of their respective owners. This document is a work in progress and comments are welcome. Please send an email if you find it useful or if your site links to it.