Lesson 9: Proc Rank, Proc Contents, and Proc Format
Proc rank is used to generate rankings for observations. Ranks may be useful in their own right, or they may be needed for non-parametric statistical methods. The procedure is fairly straightforward.
A var statement names the variables to be ranked, and a ranks statement names the variables that will contain the ranks. Both data= and out= options are available as in proc sort, but there is a difference in default behavior that sometimes causes confusion. Suppose I submit the following program:
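The program itself is not reproduced here, so what follows is only a sketch of what such a submission might look like. The data set name one comes from the discussion; the variable names name, score, and grade, the rank variable srank, and all of the data values are assumptions made for illustration.

```sas
/* Hypothetical data: six students, with two students in each grade */
data one;
   input name $ score grade;
   datalines;
Ann 88 10
Bob 72 10
Cal 95 11
Dee 64 11
Eve 79 12
Fay 83 12
;
run;

/* No out= option is given, so the ranks do not go back into one */
proc rank data=one;
   var score;      /* variable to be ranked */
   ranks srank;    /* new variable that will hold the ranks */
run;

/* Prints the original data set, which has no srank variable */
proc print data=one;
run;
```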
Whereas proc sort would have given us a sorted data set one, proc rank didn't put the ranks in one. Where did they go? A look in the log shows us that a new data set called data1 was created. Proc rank is one of several SAS procedures that follow this convention: if you do not provide data set names for new data sets, they will be named sequentially as data1, data2, etc. Proc rank will not over-write an existing data set unless you supply a name.
If we had not specified a data set for proc print, it would simply have printed data1 since it is the most recently created data set. Specifying the data set is a good idea, though, because we can easily make mistakes by not paying attention to which data set is being processed. Here, we name the output data set, with ranks, two, and let proc print display it by default.
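A minimal sketch of that version, assuming a data set named one that contains a numeric variable called score (the variable names score and srank are assumptions here):

```sas
/* out=two names the output data set that will contain the ranks */
proc rank data=one out=two;
   var score;
   ranks srank;
run;

/* No data= option: proc print uses the most recently created data set, two */
proc print;
run;
```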
Look at the scores and the ranks. Are they what you expected? Perhaps not, as we often think of the highest score as being "number one," but here the lowest score is "number one." This is an ascending ranking. If you want the ranks to go the other way, you need a descending option to reverse the order of the ranking. Proc rank doesn't allow ascending and descending ranks in the same proc rank step. You can overcome this by using two steps, taking the output of the first as input for the second.
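The two-step approach might look something like the sketch below. The names two, srank, grank, and srank2 follow the discussion in this lesson; the input variables score and grade, and the final output name three, are assumptions.

```sas
/* Step 1: descending ranks, so the largest value gets rank 1 */
proc rank data=one out=two descending;
   var score grade;
   ranks srank grank;
run;

/* Step 2: read the ranked data set and add an ascending rank (the default) */
proc rank data=two out=three;
   var score;
   ranks srank2;
run;

proc print data=three;
run;
```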
Notice that srank and grank are both produced in the first rank step, and both give lower ranks to larger numbers (descending). The second step takes the ranked data set, two, as its input, and adds srank2, an ascending rank for the scores.
Did you notice the values for grank? You may wonder how or why you get ranks that are not whole numbers. This happens because some values are tied. In fact, there are two of each grade. SAS has taken all the tied observations and averaged their ranks. You can use a ties= option to specify what to do in case of ties. The possibilities are high, low, and mean. If you use high or low, it will take the highest or lowest rank of the tied cases. Be careful! The result you want may be affected by whether you are ranking in ascending or descending order.
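To see how ties= interacts with the ranking direction, here is a small sketch. It assumes a data set named one with a numeric variable grade that holds two 10s, two 11s, and two 12s; the output data set names lowties and highties are made up for this example.

```sas
/* Descending order with ties=low: each tied pair gets the smaller rank. */
/* The two 12s get rank 1, the two 11s get 3, and the two 10s get 5.    */
proc rank data=one out=lowties descending ties=low;
   var grade;
   ranks grank;
run;

/* Descending order with ties=high: each tied pair gets the larger rank. */
/* The two 12s get rank 2, the two 11s get 4, and the two 10s get 6.     */
proc rank data=one out=highties descending ties=high;
   var grade;
   ranks grank;
run;
```

With the default (mean), the same pairs would get 1.5, 3.5, and 5.5. Dropping the descending option reverses which grades receive the small ranks, so it is worth checking the printed result before relying on it.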
Getting Information About the Contents of a Data Set
Proc contents displays information about the variables in a data set, as well as various characteristics of the data set itself. The part you are most likely to be interested in is the third section, on variable attributes. The variables appear in alphabetical order. The "#" column indicates the order of the variables in the data set, while the "Pos" column gives each variable's starting position in bytes within an observation. "Type" should be obvious, and "Len" is the length in bytes.
Proc contents doesn't have many options, but here are a couple of them. Short gives a very short version of the output, which is actually just a list of the variables. Varnum causes the variables to be displayed in the order of their position instead of alphabetically.
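The three variations can be compared side by side. This sketch assumes a data set named two already exists in the work library; any data set name could be substituted.

```sas
/* Full output, with variables listed alphabetically */
proc contents data=two;
run;

/* short: just a list of the variable names */
proc contents data=two short;
run;

/* varnum: variables listed in the order they are stored in the data set */
proc contents data=two varnum;
run;
```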
Custom Formats and Informats
Proc format creates custom formats and informats. As we have seen, informats are used in reading data and determine how a value will be stored. Formats are used to determine how a value will be printed. Custom formats and informats allow grouping of values; for example, ranges of numbers could be recorded or printed as "Low," "Med," or "Hi" values. We will focus on formats, although similar commands can be used to produce informats.
Your format will need a name. There are some rules that must be observed, in addition to the normal rules for SAS names (you can use letters, numbers, and underscores, but can't start with a number). First of all, the name you choose cannot be the same as that of an existing format supplied by SAS. The length cannot exceed 32 characters, but this includes the "$" that must begin a character format, and an "@" prefix that SAS automatically adds to user-defined informats. (You may see this in the log.) Also, format names cannot end in a number, because SAS reads trailing digits as the width. Well, it's not likely you'll want to make any names that long anyway, and the "$" requirement is familiar, so if you make it a practice not to end with a number, you shouldn't have too much trouble. To avoid duplicating a name SAS already has, it is a good idea to include a short character combination that is unusual--perhaps your initials, business acronym, etc.--as part of the name.
Suppose you have a data set like the following, with a product number (a character variable) and a price. You'd like to print a report that contains the product description and the price.
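The data set is not reproduced here, so the following is a made-up stand-in. The data set name products, the variable names prodnum and price, and all of the values are assumptions for illustration only.

```sas
/* Hypothetical product data: a character product number and a numeric price */
data products;
   input prodnum $ price;
   datalines;
A100 12.50
B200 7.25
C300 199.99
;
run;
```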
A good way to do this would be to create a format that associates a description with each product number. In proc format, the value statement is the actual command that defines a format. (A similar invalue statement defines an informat.) In the example below, the expression following the key word "value" is the name of the format. Note that there is no period at the end of the format name here, but the period is used when the format is associated with a variable, as in the format statement under proc print. The expressions after the name are called value range sets. In this case each variable value is assigned one formatted value, but there are other possibilities. The formatted values can be up to 32,767 characters long, but some procedures only use the first 8 or 16 characters.
The original data did not contain labels, so here we show that label statements can be added in proc print (and some other procs) as well.
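A sketch of such a format and its use in proc print might look like this. It assumes a data set named products with a character variable prodnum and a numeric variable price; the format name $xyprod, the product numbers, and the descriptions are all hypothetical.

```sas
/* Define a character format: no period after the name in the value statement */
proc format;
   value $xyprod
      'A100' = 'Garden hose'
      'B200' = 'Hose nozzle'
      'C300' = 'Lawn mower';
run;

/* The label option tells proc print to use the labels as column headings */
proc print data=products label;
   var prodnum price;
   format prodnum $xyprod.;      /* the period is required here */
   label prodnum = 'Product'
         price   = 'Price';
run;
```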
A real world application like this would probably involve thousands of items. It would not be good to rebuild the format every time it was used, so user-defined formats can be permanently saved and accessed when needed. The simplest way to do this is to use the special libname library. If you include a libname statement like that shown below, together with the library=library option in proc format when creating the format, then put the same libname statement in any program that uses the format, SAS will store the format in the specified directory and will search for it there when you want to use it.
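As a rough sketch, storing the format permanently could look like the following. The directory path is a placeholder that you would replace with a folder of your own, and the format name and values are hypothetical.

```sas
/* library is the special libref; the path shown is only a placeholder */
libname library 'C:\myformats';

/* library=library stores the format in that directory instead of work */
proc format library=library;
   value $xyprod
      'A100' = 'Garden hose'
      'B200' = 'Hose nozzle'
      'C300' = 'Lawn mower';
run;
```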
The following program will find and use the format created above.
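A minimal sketch of such a program, assuming a format called $xyprod was previously stored in the directory named in the libname statement, and that a data set products with a character variable prodnum exists (the path and all names are placeholders):

```sas
/* Same libname statement as when the format was created */
libname library 'C:\myformats';

/* No proc format step is needed; SAS searches the library libref for $xyprod */
proc print data=products;
   format prodnum $xyprod.;
run;
```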
In the next example, we have a list of students in various grades. An informat is created to classify the values into categories representing the school level. This is a character informat, since the resulting values will be character strings. It may be tempting to think that the numbers representing grades are numeric, but they are treated as character values too. This is important when specifying ranges, because numbers and character strings do not sort the same way.
The first value range set, which defines the Elementary category, represents grades 1, 2, and 3. Multiple values can be listed on the left side, separated by commas, or ranges can be specified using a dash. SAS accepts these values without quote marks around them, but quotes can be included if desired, such as "1", "2", "3". There is an important reason why this range was not given as 1-3: in character sort order, 10, 11, and 12 come between 1 and 2! Using 1-3 would indicate that students in grades 10-12 were to be classified as Elementary, which is incorrect. Furthermore, these values are defined again later, and the overlap produces an error, so the informat is not created. Similarly, for the High Schl category, a range of 9-12 cannot be used. In this case SAS complains with an error message in the log that "Start is greater than end," because the character value 9 sorts after 12.
If a value occurs in the data that is not defined in the invalue statement, SAS uses a default informat, as occurs with the "K" grade level. You can also use other as a range for anything that does not fit what has been listed.
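The informat described above might be sketched as follows. The Elementary and High Schl categories and the "K" value come from the discussion; the informat name $xylevel, the Middle category for grades 4 through 8, the data set and variable names, and the student records are assumptions.

```sas
/* Character informat: each grade is listed individually rather than as a range */
proc format;
   invalue $xylevel
      '1', '2', '3'           = 'Elementary'
      '4', '5', '6', '7', '8' = 'Middle'
      '9', '10', '11', '12'   = 'High Schl';
run;

/* Hypothetical student list; K is not defined in the informat */
data students;
   length name $ 8 level $ 10;       /* make level wide enough for 'Elementary' */
   input name $ level : $xylevel.;   /* the colon reads up to the next blank */
   datalines;
Ann 2
Bob K
Cal 11
Dee 7
;
run;

proc print data=students;
run;
```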
The next example shows a numeric format. The ranges include the words low and high, which can be used for unspecified or infinite lower and upper bounds. Also, the less-than sign is used as a way of excluding endpoints. Each of the ranges (except the last) given here will exclude the upper endpoint. For example, a score of 79.99999 would get a value of 2, but a score of 80 would get a value of 3. If you want to exclude a lower endpoint, put the less-than sign before the minus, such as "60<-70."
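A format consistent with that description might look like the sketch below. The cutoffs at 70 and 80 follow from the example in the text; the format name xygrade, the remaining cutoffs, the formatted values other than 2 and 3, and the test data are assumptions.

```sas
/* Numeric format: -< excludes the upper endpoint of each range */
proc format;
   value xygrade
      low-<60 = '0'
      60-<70  = '1'
      70-<80  = '2'     /* 79.99999 falls here */
      80-<90  = '3'     /* 80 falls here */
      90-high = '4';    /* last range includes both endpoints */
run;

/* Hypothetical scores chosen to sit near the boundaries */
data scores;
   input score;
   datalines;
59.5
79.99999
80
95
;
run;

proc print data=scores;
   format score xygrade.;
run;
```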
Make use of the used cars data set created in the previous lesson. There should be no data step in this lesson.
1) Rank the prices so that the highest is number 1, and the miles so that the lowest is number 1, and send the result to a new data set and print it. (Note: this has to be done in two steps because all ranks in one step will go the same direction.)
2) Run proc contents on the data set created in the previous problem, displaying the variables in the order they exist in the data set.
3) Create a format that combines the colors into three categories, "light," "dark," and "other" for those that aren't specifically assigned. Use your own judgment in classifying the colors. Be sure to leave at least one out so there is something for the "other" category.
4) Create a format to classify the miles into categories of "high," "medium," and "low." Use your own judgment to define these categories.
5) Print the data using your new formats. Keep these formats for use in later exercises.
Copyright reserved by Dr. Dwight Galster, 2006. Please request permission for reprints (other than for personal use) from firstname.lastname@example.org . "SAS" is a registered trade name of SAS Institute, Cary North Carolina. All other trade names mentioned are the property of their respective owners. This document is a work in progress and comments are welcome. Please send an email if you find it useful or if your site links to it.