Lesson 4: Numeric Formats and Informats
SAS stores all data with only two variable types, character and numeric. We have seen that character variables can have different lengths. In contrast, numeric variables (including integers and dates) are almost always stored in 8-byte floating-point form. These numbers have a precision of about 16 significant digits. Some other languages refer to this as a "real" data type (which is not mathematically correct, of course). Therefore, we do not have to be concerned about how numbers will be stored. We do, however, have to think about how to read and write them.
Let's begin with a simple example. Here we see a data set with three numbers. There are no complications in reading these numbers; the variable x in the input statement is a numeric variable by default (no informat). Observe that in the output, the first two observations were written in the same form they appeared in the data, but the third was not. Because it is such a large number, SAS defaulted to printing this number in scientific notation. The "E18" is interpreted as "times 10 to the 18th power."
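The program for this example is not reproduced here. A minimal sketch of such a program might look like the following; the data set name `numbers` and the three values are assumptions chosen for illustration, with the third value long enough (19 digits) to trigger scientific notation in the default display.

```sas
/* Three hypothetical values; the third has 19 digits */
data numbers;
  input x;          /* no informat: x is numeric by default */
  datalines;
42
1234567
1234567890123456789
;
run;

/* No format statement, so SAS chooses the display form */
proc print data=numbers;
run;
```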
The appearance of observation 3 in the output can be changed by adding a format statement to proc print, as shown below. The actual format code is "19.". The period or decimal point is part of the code and is standard syntax for all formats and informats. Character formats and informats start with "$". In most formats and informats there is a number right before the period that indicates the field width. For numeric variables only, a number after the period indicates the width of the decimal portion of the number.
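As a sketch of the format statement just described, the program below prints a small data set with the 19. format. The data set name `numbers` and the values are assumptions for illustration.

```sas
/* Hypothetical data; the third value has 19 digits */
data numbers;
  input x;
  datalines;
42
1234567
1234567890123456789
;
run;

proc print data=numbers;
  format x 19.;     /* field width 19, no decimal places */
run;
```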
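Notice that 19 digits are now displayed; however, the last two are not the same as in the original data. This is rounding error: a 19-digit integer has more digits than the roughly 16 significant digits an 8-byte floating-point number can hold.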
Commas can be included in the output. Note that the field width had to be increased to accommodate the commas.
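A sketch of the comma version, again with assumed data: a 19-digit number needs six commas, so the width grows from 19 to 25.

```sas
/* Hypothetical data; the third value has 19 digits */
data numbers;
  input x;
  datalines;
42
1234567
1234567890123456789
;
run;

proc print data=numbers;
  format x comma25.;   /* 19 digits + 6 commas = 25 columns */
run;
```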
Or perhaps you want dollars:
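A sketch using the dollar format, with the same assumed data; the width is one more than in the comma case to leave room for the dollar sign.

```sas
/* Hypothetical data; the third value has 19 digits */
data numbers;
  input x;
  datalines;
42
1234567
1234567890123456789
;
run;

proc print data=numbers;
  format x dollar26.;  /* 19 digits + 6 commas + dollar sign = 26 */
run;
```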
Let's turn to reading numbers in various formats. First, we should note that we cannot read numbers that are not in "standard format" without an informat, a code that tells SAS how to interpret the number it is reading. See what happens in this example:
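A minimal sketch of the problem, using made-up values and an assumed data set name `sales`: only the first value is in standard form, so the other two cannot be read by a plain input statement.

```sas
data sales;
  input amount;     /* no informat: only standard numbers can be read */
  datalines;
1234
1,234
$1,234
;
run;

/* Observations 2 and 3 come out missing, and the log */
/* reports invalid data for amount on those lines     */
proc print data=sales;
run;
```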
The commaw. informat (dollarw. is the same) reads numbers with commas, as well as dollar signs and some other embedded symbols. The w stands for the field width, so be sure to count the commas (and dollar signs) when determining the number of columns needed. If the field widths vary, use the colon modifier, just as with character informats. (If you are working with other currencies or European-style numbers, check the documentation for alternative informats.)
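As a sketch, the comma informat with the colon modifier could be used as follows. The data set name `sales`, the values, and the width of 10 (the longest field, counting the dollar sign and commas) are assumptions for illustration.

```sas
data sales;
  input amount : comma10.;   /* colon modifier: fields vary in width */
  datalines;
1234
1,234
$1,234,567
;
run;

proc print data=sales;
  format amount comma10.;    /* display with commas as well */
run;
```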
Handling decimal places correctly can sometimes be tricky. Here we see a basic example with no informat. In addition, no format has been specified in proc print, so SAS chooses a format that it "thinks" is "best." Note that the third observation is rounded, but this is only because of the printing format that was used, and does not mean the number is rounded off in the data set. If you add a format statement with a 9.4 format to the proc print step, all the original digits will be displayed.
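A sketch of this situation, with assumed values and an assumed data set name `decimals`. Whether the default display rounds a particular value depends on the data, so the second step adds the 9.4 format to show up to four decimal places explicitly.

```sas
data decimals;
  input x;          /* no informat */
  datalines;
1122
11.22
1.234
;
run;

/* Default display: SAS picks the form it considers best */
proc print data=decimals;
run;

/* Same data with an explicit format: width 9, 4 decimal places */
proc print data=decimals;
  format x 9.4;
run;
```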
Here is an example that is incorrect. Numeric formats and informats can specify a number of decimal places by putting a number after the period. The informat below, 4.2, is saying that SAS should read a field of width 4 with two decimal places.
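A sketch of the incorrect program, using assumed values. The comments note what the 4.2 informat does with each line, given that it reads only columns 1 through 4.

```sas
data decimals;
  input x 4.2;      /* INCORRECT: field width 4 is too short */
  datalines;
1122
11.22
1.234
;
run;

/* Line 1: "1122" has no decimal point, so it is read as 11.22 */
/* Line 2: only "11.2" is read, giving 11.2                    */
/* Line 3: only "1.23" is read, giving 1.23                    */
proc print data=decimals;
run;
```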
One problem is that the decimal point is part of the field and needs to be counted, which caused the second and third observations to be cut short. We should have 5 instead of 4 for the field width. Secondly, when a width for the decimal portion of the number is specified in an informat, it means that those decimal places are to be assumed whenever no decimal point appears in the number. It does not override an existing decimal point. The first observation is thus interpreted as 11.22, that is, assuming that the last two digits should be in the decimal portion. This may or may not be correct, so great care must be taken when using this method. Normally, it would only be used when the data are known to be recorded with an implied decimal. Usually such data are not mixed (with and without decimals). The most common situation in which mixing might occur is when a variable is a percent or proportion and has been recorded inconsistently, using both notations. In that case, you might want 22 and .22 to mean the same thing. A 3.2 informat would accomplish the correct result.
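The percent-or-proportion case can be sketched as follows, reading the values with a 3.2 informat in the input statement. The data set name `rates` and the variable name `p` are assumptions for illustration.

```sas
data rates;
  input p 3.2;      /* width 3, two implied decimal places */
  datalines;
22
.22
;
run;

/* "22" has no decimal point, so it is read as 0.22;  */
/* ".22" already has one, so it is read as 0.22 as is */
proc print data=rates;
run;
```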
Here the field width is corrected, the decimal is left off of the informat, and a format is included in proc print to display all the decimals.
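A sketch of the corrected program, using assumed values that are at most five characters wide:

```sas
data decimals;
  input x 5.;       /* width 5 counts the decimal point; no implied decimals */
  datalines;
1122
11.22
1.234
;
run;

proc print data=decimals;
  format x 9.4;     /* display four decimal places */
run;
```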
Formats can be permanently stored with the data set. If this is done, the formats are then available to any proc that can use them, without writing another format statement. The program shown below uses the "5." informat, but also stores the "9.4" format in the data set. There is no need for another format statement in proc print. However, if you want to use a format other than the one stored with the data, you can still specify it in proc print (perhaps "format x 7.2;"). It should also be mentioned here that SAS syntax allows specifying a format for several variables at once. For example, "format x y z 7.2;" would apply the 7.2 format to all three variables, x, y, and z. Or, you can use "format x 7.2 y z 5.;" which will apply the 7.2 format to x and the 5. format to y and z.
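A sketch of storing a format with the data set, using assumed values. The first proc print relies on the stored format, and the second overrides it for that step only.

```sas
data decimals;
  input x 5.;
  format x 9.4;     /* stored with the data set */
  datalines;
1122
11.22
1.234
;
run;

/* Uses the stored 9.4 format */
proc print data=decimals;
run;

/* Overrides the stored format for this step only */
proc print data=decimals;
  format x 7.2;
run;
```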
While this covers the most frequently needed informats for numbers, there are many other special cases. Check the SAS documentation (under Base SAS, Language Reference: Dictionary) for other formats and informats.
Date Formats and Informats
Since SAS has only two data types, you may wonder what we do with dates. While it is possible to store dates in character form, doing so would make calculations with dates very difficult. Instead, dates are stored as numbers: specifically, the number of days since (or before) January 1, 1960, which is "day zero."
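The program for this example is not reproduced here. A minimal sketch, with assumed dates and an assumed data set name `dates`, reads month/day/year dates with the mmddyy10. informat and prints the stored day counts with no format.

```sas
data dates;
  input d mmddyy10.;   /* month/day/year, 10 columns wide */
  datalines;
01/01/1960
06/17/2005
12/31/1959
;
run;

/* With no format, the stored day counts are printed: */
/* 0, 16604, and -1                                   */
proc print data=dates;
run;
```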
You cannot read a date without an informat, except perhaps in the rare event that it is already coded as the number of days since 1/1/1960. Dates can be written in many ways, and SAS can read almost any of them with the right instructions. In the example above, the dates are given in the most common American format, month/day/year. The informat uses the codes mm for month, dd for day, and yy for year. In some other countries, dates are written in the form day/month/year, so in SAS we simply switch the order accordingly, to "ddmmyyw." Once again, w stands for the width of the field that is being read, including the delimiters. SAS does not require a specific delimiter when interpreting the date, so it does not matter if it says 1/1/1960 or 1-1-1960 or 1.1.1960. The width is usually 10 to accommodate two digits for month, two for day, and four for year, plus two delimiters. However, it could be 8 if only two year digits are used, and it could be 6 or 8 if no delimiters are used (010160 or 01011960). SAS will interpret these variations correctly as long as they are not ambiguous (1160 for January 1, 1960 would not work), as in this example:
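A sketch of such a program, with assumed data lines that all write January 1, 1960 in a different way. The colon modifier is an assumption added here so that fields of different lengths can share one informat, and the two-digit years rely on the YearCutOff option placing 60 in 1960.

```sas
data dates;
  input d : mmddyy10.;   /* colon modifier: fields vary in width */
  format d mmddyy10.;    /* display as mm/dd/yyyy for comparison */
  datalines;
1/1/1960
1-1-1960
1.1.1960
01/01/60
010160
01011960
;
run;

/* Each line is intended to be read as the same date */
proc print data=dates;
run;
```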
Of course, we don't want our printout to give dates like "0" or "16604," since they are not very meaningful to human readers! Therefore, we should include a nice format to make it readable.
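A sketch of adding a date format to proc print, using assumed dates and an assumed data set name `dates`:

```sas
data dates;
  input d mmddyy10.;
  datalines;
01/01/1960
06/17/2005
;
run;

proc print data=dates;
  format d mmddyy10.;   /* prints 01/01/1960 rather than 0 */
run;
```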
There are many formats to choose from. See the SAS documentation for more. Here are some more examples. In the program below, the dates appear in the data in three different formats. The first one is like those discussed above, the second is what SAS considers a "standard" date, and the third is a Julian date (used in many businesses--it's the year followed by the number of the day in the year). Three proc print steps follow, to demonstrate the dates with no format, and with six different formats. As with character and numeric formats, date formats can also be stored with the data set.
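The program itself is not reproduced here. A sketch along the lines described might look like this; the data set name `dates3`, the variable names, the dates, and the particular choice of six formats are all assumptions for illustration.

```sas
data dates3;
  /* d1: month/day/year, d2: ddMONyyyy, d3: Julian (yyyyddd) */
  input d1 : mmddyy10. d2 : date9. d3 : julian7.;
  datalines;
06/17/2005 17JUN2005 2005168
01/01/1960 01JAN1960 1960001
;
run;

title 'Dates with no format';
proc print data=dates3;
run;

title 'Dates in the formats used in the raw data';
proc print data=dates3;
  format d1 mmddyy10. d2 date9. d3 julian7.;
run;

title 'Dates in three other formats';
proc print data=dates3;
  format d1 worddate. d2 weekdate. d3 yymmdd10.;
run;
```

In each observation the three fields represent the same day, so the unformatted printout should show the same number in all three columns.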
Note that SAS also has informats and formats for time values and date-time combination values. If you have need of these, or want to explore the many other possibilities, check out the SAS documentation.
Interpreting Two-digit Years
In the years leading up to 2000, the government and businesses became very concerned about the "Y2K" problem. Many programs stored years using two digits, because most business software did not need to deal with dates outside of the twentieth century. In 2000, this all changed. SAS never had a "Y2K" problem for storing dates, since a date is just the number of days before or after January 1, 1960. However, there can still be a problem when reading dates with two-digit years from a file or instream data. SAS interprets two-digit years as belonging in a specific 100-year interval. The first year of this interval is called the Year Cut Off, and the SAS system option "YearCutOff=n" is used to set it in an options statement. The default value of YearCutOff can be changed by the administrator when SAS is installed; it is typically 1920 in older releases and 1926 beginning with SAS 9.4.
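A sketch of the option in use. The cutoff of 1950, the data set name `years`, and the dates are assumptions for illustration. With this setting, two-digit years fall in the interval 1950 through 2049.

```sas
options yearcutoff=1950;   /* two-digit years map to 1950-2049 */

data years;
  input d : mmddyy8.;
  format d mmddyy10.;      /* show the full four-digit year */
  datalines;
01/01/49
01/01/50
;
run;

/* 49 is read as 2049 and 50 is read as 1950 */
proc print data=years;
run;
```

Note that an options statement stays in effect for the rest of the SAS session unless it is changed again.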
1. Copy the raw data below into a SAS program. a) Write a data step to read these data into three variables: Invoice, Amount, and Quantity. Using proc print, display the data so that all the Amount values are formatted like the Amount value in the first observation of the raw data. b) Then, revise the data step so that a format for Amount is stored with the data set, and show the results in proc print, without using any format statement in proc print. Use appropriate titles that identify which part of the exercise the output comes from.
12244 $1,499.99 144
32189 $20,000 1
92314 49.28 3
2. Copy the data below into a SAS program. Write a data step to read them into a SAS data set. The variables are capital, state, capital population, and state population. Store labels with the data set. Print the data with proc print, displaying labels, using appropriate formats and a title.
Bismarck ND 56,344 633,837
Pierre SD 13,939 764,309
Helena MT 26,718 917,621
Madison WI 218,432 5,472,299
3. Download this file. The data contain the names of some of the past presidents of the United States together with their birth and death dates. The data are aligned in columns, as shown in the example below (the longest name). Save the file to your computer and use an infile statement to read the data from the file into a SAS data set. Read the entire name into one variable, using a character variable length of 23. Store formats for the dates with the data set. Write three proc print steps that result in three different sets of date formats. Use appropriate titles.
William Henry Harrison 02/09/1773 04/04/1841
Copyright reserved by Dr. Dwight Galster, 2006. Please request permission for reprints (other than for personal use) from firstname.lastname@example.org . "SAS" is a registered trade name of SAS Institute, Cary North Carolina. All other trade names mentioned are the property of their respective owners. This document is a work in progress and comments are welcome. Please send an email if you find it useful or if your site links to it.