Lesson 13: Proc Reg
No doubt one of the most widely used (and therefore abused) statistical procedures is regression. We are not going to learn how to do regression here. However, because of its popularity, we will use proc reg to demonstrate some of the typical syntax of statistical programs. Proc reg is not part of Base SAS; it is part of the statistical package called SAS/STAT. Therefore the documentation is found under "SAS/STAT," then "SAS/STAT User's Guide," then "The REG Procedure."
The data used for these examples is here. The file contains the variables x, z, and y and has two header lines. The idea is to use proc reg to derive an equation for the prediction of y based on x and z. This is often called the "model" or "prediction equation." The model statement in proc reg is used to define the form of the equation. The dependent variable (the one to be predicted) is given first, followed by an equal sign, and then the independent variables (those on which the prediction is based) are listed after the equal sign. Technically, the statement "model y=x z;" specifies an equation of the form
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 z$$
where y-hat is the predicted value of y, and the beta-hats are the estimated coefficients of the equation, with beta-hat-naught being the intercept. In the SAS output the beta-hats are called "parameter estimates." Here is the program and output for this data:
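The original listing is not reproduced here, so what follows is only a minimal sketch of what such a program might look like. The file path, the data set name "regdata," and the column order (x, then z, then y) are assumptions; adjust them to match the downloaded file.

```sas
/* Assumed path and column order; change these to match your copy of the data. */
data regdata;
  infile 'c:\temp\regdata.txt' firstobs=3;  /* firstobs=3 skips the two header lines */
  input x z y;
run;

/* Fit y-hat = b0 + b1*x + b2*z; the intercept is included automatically. */
proc reg data=regdata;
  model y = x z;
run;
quit;
```

The "firstobs=3" option tells SAS to start reading at the third line, which is one simple way to handle the two header lines.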
We are not going to go into the interpretation of these results here. What we want to do is study how additional statements and options are used to customize the results of the regression procedure. While the options, statements, and syntax vary for different statistical procedures, learning about proc reg will give you some of the general ideas, and thus give you some background for learning other procedures when you need them.
Note that the model statement specifies the form of the equation that will be fit to the data. There can be more than one model statement. If that is the case, it might be helpful to give an identifying label to each model. This label will be printed at the top of the output for each model. By default these are numbered, as in the above example, where the model is called "MODEL1." Labels can be up to 32 characters, with no spaces, followed by a colon, and placed in front of the model statement, as follows:
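A sketch of labeled models might look like the following. The label "Full" comes from the output heading described in this lesson; the second label, "XOnly," and its choice of predictor are made up for illustration, as is the data set name "regdata."

```sas
proc reg data=regdata;           /* regdata is an assumed data set name */
  Full:  model y = x z;          /* heading will read "Model: Full" */
  XOnly: model y = x;            /* hypothetical second model and label */
run;
quit;
```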
This produces one section of output for each model. The first is identical to that shown above, except for the heading that says "Model: Full" instead of "Model: MODEL1." The second section looks like this:
Proc reg, like proc plot, does not automatically quit running when it encounters a run statement. Unless another proc follows, it will wait for more statements to be submitted. For example, if you added the following lines to the program above, left them selected as shown, and clicked submit, SAS would produce the output for the next model, without re-running the rest of the program. (Any selected text in the Enhanced Editor is submitted without the rest of the program, a source of great irritation when done accidentally!) If you want to make proc reg quit, issue a "quit;" statement at the end of the program. One of the minor benefits of this is that it leaves the Output window on top, rather than bringing the Editor back up. Experiment with this a bit and you'll see.
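To make the idea concrete, here is a rough sketch of the sequence. The data set name "regdata" and the label "ZOnly" are assumptions for illustration. The whole thing also runs if submitted at once, but the point is that the middle group could be selected and submitted by itself while proc reg is still active.

```sas
proc reg data=regdata;           /* regdata is an assumed data set name */
  Full: model y = x z;
run;                             /* output appears, but proc reg keeps running */

  ZOnly: model y = z;            /* hypothetical model submitted later on its own */
run;

quit;                            /* proc reg ends here */
```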
Sometimes it is convenient to have the results of the regression, such as the parameter estimates and other statistics, in a data set. Proc reg uses an option in the proc statement, "outest=", to do this. Other options can be added to control what statistics are included. (You might notice that the editor has some trouble with the color coding on these, but even if they aren't blue, they still work.)
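As a sketch, the step below writes one observation per model to a data set. The names "regdata" and "est" and the model labels are assumptions; "rsquare" is one of the proc statement options that adds statistics (here, R-square and related values) to the "outest=" data set.

```sas
proc reg data=regdata outest=est rsquare;   /* est is an assumed output name */
  Full:  model y = x z;
  XOnly: model y = x;                       /* hypothetical second model */
run;
quit;

/* Look at what was saved: one row per model. */
proc print data=est;
run;
```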
Notice the naming of some of the variables, how they begin and end with an underscore. It is important to include these underscores when referencing the variables, since they are part of the name.
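For instance, a var statement that picks out a few of these variables has to spell them with the underscores. This sketch assumes a data set named "regdata" and uses "noprint" only to suppress the regular regression output.

```sas
proc reg data=regdata outest=est rsquare noprint;  /* assumed data set names */
  Full: model y = x z;
run;
quit;

/* The underscores are part of each variable name. */
proc print data=est;
  var _MODEL_ _DEPVAR_ _RMSE_ _RSQ_;
run;
```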
Next, we will give some example options for the model statement, which are placed after a slash. Some of these options control what goes in the output, and some affect the modeling process. The "noint" option is used to fit a model with no intercept (recall the intercept is automatically included in the examples above). The "VIF" option adds a "Variance Inflation" column to the parameter table, and the "P" option gives a table of "Output Statistics" that includes predicted values of y (y-hats) and the "Residual," which is the difference between y and y-hat.
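The options described above could be written roughly as follows. The data set name "regdata" and the two labels are assumptions chosen for illustration.

```sas
proc reg data=regdata;                     /* regdata is an assumed data set name */
  NoIntercept: model y = x z / noint;      /* fit without an intercept */
  Diagnostics: model y = x z / vif p;      /* add VIF column and predicted/residual table */
run;
quit;
```

Several options can follow a single slash, separated by spaces.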
Another statement in proc reg is the output statement. This creates a data set, but unlike the "outest=" option in the proc statement, which gives one observation for each model, this data set will contain output statistics for each observation in the data, such as those printed in the example above. There is no slash in the output statement; the options simply follow the word "output." You should specify a data set name with the "out=" option, and then list the statistics you want included, such as predicted values, studentized residuals, etc. Each statistic has a keyword that requests it, and after the keyword you must specify a variable name to use in the output data set. Thus "p=yhat" means to include the predicted value using the variable name "yhat." Some other examples are "r=" for residuals, and "ucl=" and "lcl=" for the upper and lower confidence limits of the prediction. See the Online Documentation for the complete list.
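Here is a sketch of an output statement. The data set names "regdata" and "pred" and the new variable names (yhat, resid, stud, lower, upper) are all arbitrary choices made for this illustration.

```sas
proc reg data=regdata;                     /* regdata is an assumed data set name */
  model y = x z;
  /* keyword=name pairs: each keyword requests a statistic, each name is yours to choose */
  output out=pred p=yhat r=resid student=stud lcl=lower ucl=upper;
run;
quit;

/* pred holds the original variables plus the requested statistics. */
proc print data=pred;
run;
```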
Of course you can take this data set and make plots with proc plot. But proc reg also has its own plot statement built in. You can plot any of the variables in the original data set, plus the same new variables that are available in the output statement. These are named like the keyword that specifies them in the output statement, followed by a period. Thus, the predicted values are given by "p." and the studentized residuals are given by "student.", for example.
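A sketch of the plot statement is shown below, again assuming a data set named "regdata." One caution that you should check against the documentation for your release: on newer SAS installations where ODS Graphics is turned on, this traditional plot statement may not be honored unless ODS Graphics is turned off first.

```sas
proc reg data=regdata;           /* regdata is an assumed data set name */
  model y = x z;
  plot y*x;                      /* original variables: vertical*horizontal */
  plot student.*p.;              /* studentized residuals against predicted values */
run;
quit;
```

Note the trailing periods on "student." and "p."; they are what distinguish these computed statistics from ordinary variables in the data set.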
Use the used cars data from previous lessons. In proc reg, do the following (This should all be done in one program, with one proc reg step):
a) Compute a regression model for price based on miles and age of the car, and a second model for price based on miles alone. Use labels for the models.
b) Create a data set which contains the parameter estimates and rsquare values for each model.
c) Create a data set containing the predicted values and residuals (as well as the original data, which is included automatically).
d) Plot price vs miles and price vs year using a plot statement in proc reg.
e) Plot the residuals against the predicted values for each model using plot statements in proc reg (note: these plots use the residuals and predicted values produced by the immediately preceding model statement).
f) Print the data sets created by proc reg.
Copyright reserved by Dr. Dwight Galster, 2006. Please request permission for reprints (other than for personal use) from firstname.lastname@example.org. "SAS" is a registered trade name of SAS Institute, Cary North Carolina. All other trade names mentioned are the property of their respective owners. This document is a work in progress and comments are welcome. Please send an email if you find it useful or if your site links to it.