Lesson 17:  Generating Random Data


SAS provides several functions for generating pseudo-random data.  The most commonly used are the functions that generate values from the normal and uniform distributions.  Uniformly distributed values may be generated by the uniform(seed) function (alias ranuni), which gives random numbers in the interval [0,1).  Standard normal values are generated with the normal(seed) function (alias rannor).  In both cases, using a seed of 0 gives a random start based on the system clock.  For published work, specify a positive seed of your choice so that others can reproduce your values.
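As a minimal sketch of how these two functions might be used, the data step below generates ten uniform and ten standard normal values.  The data set name (random), the variable names (i, u, z), and the count of ten are our own choices for illustration, not part of the original lesson.

```sas
data random;
   do i = 1 to 10;          /* ten observations; the count is arbitrary   */
      u = uniform(0);       /* uniform value; seed 0 uses the system clock */
      z = normal(0);        /* standard normal value (mean 0, std dev 1)   */
      output;               /* write one observation per loop iteration    */
   end;
run;

proc print data=random;
run;
```

Replacing the 0 with a positive integer such as 12345 should give the same values on every run.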

Note that there is no source data (either raw or other SAS data set) for this data step.  There is only one iteration of the data step, so we control the entire process using a do loop.

The next example shows how to generate a simulated die toss.  The die tosses are, of course, integers from 1 to 6.  So we need to convert from a uniform distribution on the interval [0,1) to a discrete uniform distribution with values from 1 to 6.  Multiplying the uniform values by 6 gives the interval [0,6).  The "int" function keeps the integer part and discards the fractional part, so now we have integers from 0 to 5.  Adding one gives the desired result.
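A sketch of the die toss conversion is shown here.  The data set name (dice), the variable names (toss, die), and the number of tosses are assumptions made for illustration.

```sas
data dice;
   do toss = 1 to 100;                  /* 100 tosses; the count is arbitrary */
      die = int(uniform(0)*6) + 1;      /* [0,1) -> [0,6) -> 0..5 -> 1..6     */
      output;
   end;
run;

proc freq data=dice;                    /* each face should appear about 1/6 of the time */
   tables die;
run;
```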

If you want to generate random normal values to simulate a population, you need to know the standard deviation and mean.  You multiply the standard normal values by the standard deviation, then add the mean.  Say we wanted heights of male college students, and believed the mean was 70 inches and the standard deviation was 5 inches.  Then, the following program would give a good simulation.
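One way such a program might look is sketched below.  The data set name (heights), the sample size of 100, and the proc means step are our assumptions; the variable x and the mean and standard deviation come from the text.

```sas
data heights;
   do i = 1 to 100;              /* sample size is arbitrary                  */
      x = normal(0)*5 + 70;      /* scale by std dev 5, then shift to mean 70 */
      output;
   end;
run;

proc means data=heights mean std;   /* check against the target 70 and 5 */
   var x;
run;
```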

Perhaps it would have been more satisfying to write the equation above as x=70+normal(0)*5.  The result is the same, of course.  But we like to think of the mean as the value around which the population varies, so it makes sense to start with the mean, then add the term that creates the variation.

A similar strategy can be used to obtain function values for a series of numbers.  For example, suppose you wanted to make a table of the probabilities for a binomial distribution with n=10 and p=0.2.  The following program gives the cumulative probabilities.  Notice that the loop counter (index) is itself a variable we want to keep, and it is used in the calculations.
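A sketch of such a table is given here, assuming the probbnml(p, n, m) function, which returns the probability of m or fewer successes in n trials.  The data set name (binom) and the variable name cumprob are hypothetical.

```sas
data binom;
   do x = 0 to 10;                        /* x is both the loop counter and a kept variable */
      cumprob = probbnml(0.2, 10, x);     /* P(X <= x) for n=10, p=0.2                      */
      output;
   end;
run;

proc print data=binom noobs;
run;
```

The last row, at x=10, should show a cumulative probability of 1.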

Or, suppose you'd like to graph a parabola in SAS.  In this example, the loop counter is used in the calculations too, but this time we increment it in steps of 5, rather than 1, each time the loop executes.
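The sketch below assumes the parabola y = x**2 over the range -50 to 50; neither the equation nor the range is given in the text, so treat both as placeholders.  It also assumes proc sgplot is available (SAS 9.2 or later); on older installations, proc plot or proc gplot could be substituted.

```sas
data parabola;
   do x = -50 to 50 by 5;      /* the BY clause sets the step size to 5 */
      y = x**2;                /* assumed parabola; ** is exponentiation */
      output;
   end;
run;

proc sgplot data=parabola;
   series x=x y=y;             /* connect the 21 points with a line */
run;
```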

Suppose we'd like to simulate a discrete distribution with unequal probabilities for each value, such as the following:

x  P(x)
1  0.1
2  0.2
3  0.4
4  0.2
5  0.1

This can be done by "cutting up" the uniform interval and assigning different values to different sized parts of it.
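For the distribution above, the cut points are the cumulative probabilities 0.1, 0.3, 0.7, and 0.9.  A minimal sketch follows; the data set name (discrete), the variable u, and the sample size are our assumptions.

```sas
data discrete;
   do i = 1 to 1000;                     /* sample size is arbitrary */
      u = uniform(0);
      if      u < 0.1 then x = 1;        /* [0.0,0.1): width 0.1 */
      else if u < 0.3 then x = 2;        /* [0.1,0.3): width 0.2 */
      else if u < 0.7 then x = 3;        /* [0.3,0.7): width 0.4 */
      else if u < 0.9 then x = 4;        /* [0.7,0.9): width 0.2 */
      else                 x = 5;        /* [0.9,1.0): width 0.1 */
      output;
   end;
   drop u;                               /* keep only the simulated value */
run;

proc freq data=discrete;                 /* compare percents with P(x) */
   tables x;
run;
```

Because the conditions are chained with else, each test only needs the upper cut point of its piece.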



1.  Generate 1000 tosses of two dice, calculate the sums, and make a bar chart for the sums.

2.  Simulate 10,000 observations from the following distribution and print a frequency table of the results.
    x  P(x)
    0  0.2
    1  0.3
    2  0.5

3.  Suppose the population of male college students has a mean height of 69 inches and a standard deviation of 4.5 inches, while the population of female college students has a mean height of 64 inches and a standard deviation of 3.5 inches.  Simulate heights for 50 male and 50 female college students.  Each observation should include a gender variable and a height variable.  Use proc means to see how close the mean and standard deviation of your simulated values come to the specified values.

Copyright reserved by Dr. Dwight Galster, 2006.  Please request permission for reprints (other than for personal use) from dwight.galster@sdstate.edu.  "SAS" is a registered trade name of SAS Institute, Cary, North Carolina.  All other trade names mentioned are the property of their respective owners.  This document is a work in progress and comments are welcome.  Please send an email if you find it useful or if your site links to it.