Lesson 10: Basic Statistics with Proc Univariate, More ODS
Summarizing Data with Proc Univariate
Proc Univariate and Proc Means are procedures in Base SAS that calculate statistics one variable at a time (they do not explore relationships between variables). The two procedures have quite different listing output but many similar capabilities; proc univariate is the more extensive of the two. To demonstrate these procedures in a meaningful way, we need a larger data set than those we have seen previously. The data set we will use is shown below. A text file containing the data is here. The data set contains three variables: a group variable with values 1, 2, and 3; a discrete variable x with values 1-6 (a die toss); and a continuous variable y.
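As a rough sketch of how such a text file might be read in, the data step below uses list input for the three variables. The file path and the data set name lesson10 are assumptions made for illustration, as is the layout of the file (space-separated values, no header line); adjust them to match your copy of the data.

```sas
/* Sketch only: the path and the data set name lesson10 are hypothetical. */
data lesson10;
    infile 'c:\sasdata\lesson10.txt';   /* assumed location of the text file */
    input group x y;                    /* list input: three numeric fields per line */
run;

/* Print the first few observations to check that the data were read correctly */
proc print data=lesson10 (obs=10);
run;
```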
First, we look at a very simple proc univariate step. The "var" statement lists the variables to be analyzed. If "var" is omitted, univariate will analyze all numeric variables in the data set, including any for which the analysis is silly, such as the "group" variable in this example. Thus, specifying the variables to be analyzed is a good idea.
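A minimal step of this kind might look like the following. The data set name lesson10 is an assumed name for the data described above; the variable y comes from the lesson.

```sas
/* Basic analysis of one variable; lesson10 is an assumed data set name */
proc univariate data=lesson10;
    var y;      /* analyze only y, not group or x */
run;
```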
The results include a fairly detailed summary with all kinds of statistics for the variable, spread over two pages.
Proc univariate has many options and optional statements. We will explore a few of the more common ones. For more, see the documentation under Base SAS/Base SAS Procedures Guide: Statistical Procedures.
In the middle of the first page of output, above, note the section titled "Tests for Location: Mu0=0." These are statistical hypothesis tests where the null hypothesis is that the mean of the random variable is equal to zero. The small p-values indicate that the null hypothesis should be rejected, and the conclusion drawn that the mean is not zero. Perhaps you would like to test whether the mean is some other value, say, 100. To do so, add the mu0= option (here, mu0=100) to the proc univariate statement. With this change, the higher p-values indicate that the new null hypothesis cannot be rejected.
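A sketch of the step with the changed null value is shown here, again assuming the data are in a data set named lesson10. The mu0= option goes in the proc statement itself, not in the var statement.

```sas
/* Test the null hypothesis that the mean of y is 100 instead of 0 */
proc univariate data=lesson10 mu0=100;
    var y;
run;
```

The heading of the location tests in the output should then read "Mu0=100" rather than "Mu0=0".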
Two other options in the proc univariate statement are normal and plot. The normal option produces the section on tests for normality, and the plot option gives the stem and leaf, box and whisker, and normal probability plots below.
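Both options can be requested together in the proc statement, as in this sketch (lesson10 is an assumed data set name). Depending on the SAS release and on whether ODS Graphics is enabled, the plots may appear as graphics rather than as the text-based plots described in this lesson.

```sas
/* Request normality tests and the stem and leaf, box, and normal probability plots */
proc univariate data=lesson10 normal plot;
    var y;
run;
```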
The notation below the stem and leaf plot, "Multiply Stem.Leaf by 10**+2", means that a value read from the plot, such as 6.9 for the first one, should be multiplied by 10 to the second power, so it is really 690. These data are badly skewed, so the box plot is not at all symmetrical. The box plot usually has a dashed line through the middle of the box for the median, and the "+" represents the mean. To interpret the normal probability plot, look at the band made up of "+" signs. The asterisks are the data, and if they mostly fall within the band, the data may be considered normal. In this case, the data are not normal, as the normality tests also show: their low p-values indicate that the assumption of normality should be rejected.
Optional Statements in Proc Univariate
Like proc print, proc univariate has a by statement, which produces a separate analysis for each value of the variable specified. As with any by statement, the data must first be sorted by that variable. In this case, the result is three sets of output, one for each value of "group" (results not shown).
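The by-group analysis could be sketched as follows, assuming the data set is named lesson10. The proc sort step is included because by-group processing expects the observations to be in order of the by variable.

```sas
/* Sort by the grouping variable first */
proc sort data=lesson10;
    by group;
run;

/* One full set of univariate output for each value of group */
proc univariate data=lesson10;
    var y;
    by group;
run;
```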
The graphics shown above are somewhat rough, but proc univariate can also produce high resolution graphs, such as a histogram, which is displayed in a graph window. If a "var" statement is used, the histogram variable must be included in the listed variables. The "/normal" portion is an option to the histogram statement that superimposes a normal curve on the histogram. This demonstrates again that the normal distribution is not a good fit to this data. (Other distributions can be specified for the curve.)
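A histogram with a fitted normal curve might be requested like this (lesson10 is an assumed data set name). Note that y appears in both the var statement and the histogram statement.

```sas
/* High resolution histogram of y with a normal curve superimposed */
proc univariate data=lesson10;
    var y;
    histogram y / normal;
run;
```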
The "qqplot" statement produces a high resolution version of the qqplot. Here is an example with the exponential distribution. A qqplot should fall in a nice straight line if the distribution is a good fit. Obviously, we are not having too much luck fitting a distribution for y!
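A minimal sketch of an exponential quantile plot follows, assuming a data set named lesson10. The distribution is chosen with an option after the slash, in the same way as for the histogram statement.

```sas
/* Quantile-quantile plot of y against the exponential distribution */
proc univariate data=lesson10;
    var y;
    qqplot y / exponential;
run;
```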
Producing an Output Data Set
Univariate can also produce a data set containing the statistics seen in the output. If this is the only goal, a "noprint" option in the proc statement is a good idea. This suppresses the usual listing output in the output window. A "var" statement must be used with the output statement to determine which variables will be used for the output data set. The desired statistics must also be specified. There is a long list of these; again, see the help or documentation for details. In the example here, standard deviation and mean have been requested.
Note that the syntax of the output statement requires a keyword for each requested statistic, followed by an equals sign, followed by a list of variable names for the statistics, one for each variable in the "var" statement. There will be only one observation, unless a "by" statement is also given, in which case there will be one for each value of the "by" variable, as shown in this example.
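The pieces described above might fit together as in this sketch. The data set names lesson10 and stats and the output variable names xmean, ymean, xstd, and ystd are all illustrative choices, not names required by SAS.

```sas
/* By-group processing needs sorted data */
proc sort data=lesson10;
    by group;
run;

/* noprint suppresses the listing; only the output data set is produced */
proc univariate data=lesson10 noprint;
    var x y;
    by group;
    /* one name per var variable after each keyword: first for x, second for y */
    output out=stats mean=xmean ymean std=xstd ystd;
run;

/* One observation per value of group */
proc print data=stats;
run;
```

Because the names after each keyword are matched to the var statement by position, changing the order of x and y in the var statement changes which statistic each name holds.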
Using ODS to Control Output
We saw that proc univariate creates several sections, each with its own heading and a table of information (except the graphs). In ODS, each of these sections is an output object. An output object generally has two parts, the data component, and the table definition. The data component is obviously the data that will be displayed in the table, and the table definition is a set of instructions that describes how to format the data. Each output object has a name and can be accessed separately through ODS. To see information about the output objects your procedure is producing, you can issue the following ODS commands and look at the results in the log (only part shown here):
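One way to get this information is the ODS trace facility, sketched here for a data set assumed to be named lesson10. The trace records are written to the log, one per output object, and include the object's name and label.

```sas
/* Turn tracing on, run the procedure, then turn tracing off again */
ods trace on;

proc univariate data=lesson10;
    var y;
run;

ods trace off;
```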
Notice that the name of each object corresponds roughly to the label in the output. In most cases, just the spaces are eliminated. This makes it fairly easy to identify the object name. Sometimes in the output there will be more specifics, like the section on "Tests For Location" which also gives the null hypothesis value in the output, but that is not part of the name or label. In any case, having the name, we can now use the ODS select statement to choose which objects to print or send to any ODS destination. Alternatively, an ODS exclude statement, which has similar syntax, can be used to eliminate unwanted objects.
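As an illustration, the two steps below show one use of each statement, assuming a data set named lesson10. The object names used here (TestsForLocation, Quantiles, ExtremeObs) are the names proc univariate normally reports for its default tables; it is worth confirming them in your own trace output.

```sas
/* Print only the tests for location */
ods select TestsForLocation;
proc univariate data=lesson10 mu0=100;
    var y;
run;

/* Print everything except the quantiles and extreme observations */
ods exclude Quantiles ExtremeObs;
proc univariate data=lesson10;
    var y;
run;
```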
ODS can also be used to save objects to SAS data sets. They are then available for use in data steps or other procedures.
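A sketch of saving one object with the ODS output statement is given here. The data set names lesson10 and moments_ds are hypothetical; Moments is the object name proc univariate normally uses for its first table.

```sas
/* Save the Moments object to a data set named moments_ds */
ods output Moments=moments_ds;

proc univariate data=lesson10;
    var y;
run;

/* The saved object is an ordinary SAS data set */
proc print data=moments_ds;
run;
```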
Use the data set of used cars inventory from previous lessons for the following problems:
1) Use proc univariate to analyze the price variable in the used cars data. In addition to the default output, produce tests of normality, and low-resolution plots (box plot, stem & leaf, and qqplot).
2) Use proc univariate to analyze the miles variable, change the null hypothesis value for the tests of location to 50,000, and use ODS commands to display only the tests of location in the output window.
3) Use proc univariate to analyze the price variable. Use ODS commands to print only the "Moments" object in the output window and to save the "Moments" object to a SAS data set and print it.
Copyright reserved by Dr. Dwight Galster, 2006. Please request permission for reprints (other than for personal use) from email@example.com. "SAS" is a registered trade name of SAS Institute, Cary, North Carolina. All other trade names mentioned are the property of their respective owners. This document is a work in progress and comments are welcome. Please send an email if you find it useful or if your site links to it.