Lesson 14: Proc Transpose and Proc Report
Proc transpose is used when you need to "turn your data on its side." Basically, it turns observations into variables and variables into observations. Suppose we had the following sales data in Excel:
Now we can clearly see that this data consists of four columns and five rows. The header row across the top and the first column are not really data. However, when we read this into SAS, we will need to consider the first column to be a variable also, or we will lose the designation for the week that goes with each row. The following program will do:
And the result looks like this:
Now suppose what we really wanted was for each salesperson to be an observation, and each week to be a variable. The id statement tells proc transpose which column holds the new variable names. If no id is given, SAS will use default variable names like _COL1_ etc. Also notice that SAS puts the old variable names in a new column called _NAME_. This variable name can be changed using a "name=" option in the proc transpose statement.
Now suppose we have another variable, say "City," and within each city we have weekly sales data for our four salesmen. The number of weeks is reduced here for brevity and also to show how certain missing values are handled. We want to transpose the variables again, but this time within each city.
At first glance, proc report looks like it does the same thing as proc print. In the simplest form, the output can be very similar to proc print. But you might say proc report is like proc print on steroids--it has far more capabilities. The example below shows how the default settings of proc print and proc report compare. Proc report doesn't print observation numbers, doesn't skip a line below the headings, and aligns the columns differently. But that's just the beginning. The data for this example is here.
Notice the "nowindows" option in the above example. We will always use this option in this lesson. Proc report comes with its own window-based user interface, which will come up automatically if you do not specify "nowindows." We will not be studying the report window in this lesson.
The column statement is used to specify which variables are printed and the order in which they appear. A define statement is used for each variable to specify the the options associated with that variable. In the example below, name is defined as a group variable. Notice that the names are no longer repeated on each line (there will be more significance to the group variable later). For each variable, an expression in quotes has been included--this is the same as a label, and can be controlled with split characters similar to what we have seen in proc print. Notice that the last column has the label words split in an awkward way. This is due to a column width of 5, which comes from the format given in the define statement for the last column. Observe the three different formats that were used, and the effect of each.
Now for a few more options to dress up the output. First, in the proc report statement, we add the options "headskip" and "headline". They add a line under the headings and a "skip" or blank line beneath them. The formats have all been changed to accommodate the money amounts properly, but two new options are demonstrated in the define statements. For the cheese variable, a "width=10" option is included. This expands the column for that variable to 10, so that the column is now wider than the format specifies. For the cream variable, a "spacing=5" option is added. This puts five blank spaces in front of the cream column. Notice how this affects things differently, compared with widening the whole column.
Next, we have two new statements, break and rbreak. Break works together with the "group" designation given for name. It breaks the report for each change in the value of the group variable listed in the break statement. Notice that it says "break after". This gives the location of the break information, which can be either before or after the group it refers to. After the slash come some options. The "ol" means overline--it prints a line above the break line. There is also "ul" which means underline. Then we have "dul" which means double underline. This is done using equals signs. There is also "dol" for double overline. "Skip" means to insert a blank line after the break line, and summarize means to print summary statistics for the numeric columns. The default is sum, but many other statistics are available. The specific statistic is given as an option in the define statement, "sum" for sum, "min" for minimum, "max" for maximum, "mean" for mean, etc.
The rbreak command means "report break" so it provides summarization for the whole report. Similar options apply, except no variable is specified since it applies to the whole report and not the grouping of any variable. This provides the grand totals in our example.
The next example shows how you can duplicate a column for the purpose of getting different summary statistics. In the column statement, new columns are introduced using the syntax "variable=newvariable". The new variables are then given their own define statements, where the summary statistic "mean" is now included. The other thing that is changed is that the columns have been grouped, inside parentheses, and each group has a label specified as the first item in the parentheses. This provides a group heading for all the variables in the parentheses. Also, the labels have a minus sign at the beginning and end. This causes a line to be drawn out to the width of the column group for each label.
You can also calculate columns from the other columns. The new column must be listed in the column statement, and will have a define statement as well. To actually calculate the values, a compute block is used. It begins with the statement "compute newvariable" and ends with "endcomp;". The formulas for calculation use data step syntax, for the most part. However, there is a two-level name for some variables, as shown here. The variables are given as "variable.sum" because they are used as sum variables for summarization purposes. If we did not have the group variable, this would not be necessary.
Perhaps the idea of the two-level variable names, like cream.sum, will be more clear if we leave out the detail levels and show these variables only as summary variables. Taking the month variable, which relates to the detail rows, out of the column statement and deleting its define statement will do that. We also no longer need the break statement, and add an overline to the rbreak statement since all the dividing lines were in the break statement before.
Proc report has many more features and capabilities. We have only introduced some of the basic ideas here. If you are ever in a situation that calls for the generation of periodic reports on the same kind of data, it is worthwhile to spend the time to create a nice report that can be used over and over.
1. Begin with this data step:
data one; input trt $ s1 s2 s3 s4 s5; cards; Cont 4 5 5 4 6 Fast 5 5 6 6 5 Drug 7 7 6 5 6 ;
Suppose this is data from a designed experiment with three treatments and five subjects. To analyze the data in SAS, we will need it rearranged so that there are 15 observations, each having the treatment, subject identifier, and just one subject response. Use proc transpose to rearrange the data in this way.
2. Begin with this data step:
data one; input trt $ subject $ t1 t2 t3; cards; Cont s1 4 5 5 Cont s2 5 4 6 Cont s3 6 6 5 Fast s1 5 5 6 Fast s2 5 6 7 Fast s3 7 8 8 Drug s1 7 7 8 Drug s2 6 8 9 Drug s3 5 7 9 ;
Use proc transpose to turn the subject and time around within each treatment, calling the new time variable "time" and the using the subject values for the new subject variable names.
3. Refer to the used cars data from previous lessons (usedcars3.txt). Create a report that groups the cars by year and gives the average mileage and price for each year. Include nice headers and at least one example of each of the format, width, and spacing commands. Give each column a heading in the define statements.
4. Again using the used cars data, create a report that summarizes the prices for each make, giving the total and mean but omitting detail information.
Copyright reserved by Dr. Dwight Galster, 2006. Please request permission for reprints (other than for personal use) from firstname.lastname@example.org. "SAS" is a registered trade name of SAS Institute, Cary North Carolina. All other trade names mentioned are the property of their respective owners. This document is a work in progress and comments are welcome. Please send an email if you find it useful or if your site links to it.