Lesson 19-20:  Project Yahtzee Simulation


In the game of Yahtzee, five dice are tossed, and various combinations of numbers, similar to poker hands, are assigned point values.  In the game, dice can be selected and re-tossed, but we will focus on calculating the probabilities for the first toss only.  We will also deal only with the "lower half" of the score card in the game.  For the interested student, continuing this project to account for the complete rules of play would be an entertaining challenge.

Anyone not familiar with Yahtzee should try a web search for the rules of the game.  Some sites have applets that let you play online.

All you really need to know for this lesson, though, is which combinations are counted.  We will call these "hands," as the combinations in poker are called.  The hands in Yahtzee are:

Three of a Kind (three of one number and two others that are different)
Full House (three of one number and two of another number)
Four of a Kind (four of one number and one other)
Yahtzee (five of the same number)
Small Straight (four consecutive numbers)
Large Straight (five consecutive numbers)
Chance (anything that does not fit the above patterns)

We begin by creating an array to hold the die tosses.  There are five dice, so there will be five elements in the array.  A do loop can be used to "toss" the dice.
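As a minimal sketch of that first step (the data set name "yahtzee", the index variable "i", and the use of the rand function are assumptions here, not necessarily what the original lesson used), the array and loop might look like this:

```sas
/* One toss of five dice, stored in an array (illustrative sketch) */
data yahtzee;
  array x{5} x1-x5;                      /* five dice: x1, x2, ..., x5 */
  do i = 1 to 5;
    x{i} = ceil(6 * rand('uniform'));    /* random integer from 1 to 6 */
  end;
  drop i;                                /* the loop index is not needed in the output */
run;

proc print data=yahtzee;
run;
```

Because the data step ends right after the loop, SAS writes the single observation automatically.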

That gives us one toss.  To simulate the game, we will want to toss the dice many times and see what the probability of getting each scoring combination, or hand, is.  We will need another loop, surrounding this one, to give these repeated observations.  Note that an explicit output statement is now needed inside the outer loop, so that each toss is written as its own observation.

Since this program is going to get rather complicated, we will pay close attention to issues of style and readability.  Putting in comments to identify the beginning and end of major loops is helpful.  Care should be taken with indenting, to make sure all lines associated with a particular loop are indented at the same level.  The statement that begins a loop and the corresponding end should be at the same level of indenting, and statements within the loop should be indented two spaces from the level of the loop.  Statements within nested loops are indented again.
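The repeated tosses could be sketched as follows.  The variable name "toss" and the count of 10 are illustrative assumptions; the comments and indenting follow the style described above.

```sas
data yahtzee;
  array x{5} x1-x5;
  /* begin toss loop */
  do toss = 1 to 10;
    /* begin dice loop */
    do i = 1 to 5;
      x{i} = ceil(6 * rand('uniform'));   /* random integer from 1 to 6 */
    end;
    /* end dice loop */
    output;                               /* write one observation per toss */
  end;
  /* end toss loop */
  drop i;
run;

proc print data=yahtzee;
run;
```

Without the output statement, only the values from the last toss would be written when the data step ends.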

Suppose we systematically build up the identification of the hands.  There are many ways to do this.  Some are easier to program, and some are more efficient from a processing standpoint.  At the beginning, it may not be clear what the best method is, so you should try some of your own ideas before reading on.  The solution presented here is something of a compromise.  It may be neither the easiest to program nor the most efficient method.

To start, let's see if we can identify a Yahtzee.  Now, Yahtzees are quite rare, so we can't rely on getting one by doing 10 random tosses.  The best thing is to put in a temporary piece of code that will artificially give us a Yahtzee, so we are sure to have something to identify.
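One way this might look is sketched below.  Forcing the first toss to all 4s, and the character variable "hand" that holds the result, are assumptions made for illustration.

```sas
data yahtzee;
  array x{5} x1-x5;
  length hand $ 15;
  do toss = 1 to 10;
    do i = 1 to 5;
      x{i} = ceil(6 * rand('uniform'));
    end;
    /* TEMPORARY test code: force a Yahtzee on the first toss */
    if toss = 1 then do i = 1 to 5;
      x{i} = 4;
    end;
    /* reset hand on every toss, because values persist inside the loop */
    hand = ' ';
    if x1 = x2 and x2 = x3 and x3 = x4 and x4 = x5 then hand = 'Yahtzee';
    output;
  end;
  drop i;
run;

proc print data=yahtzee;
run;
```

The temporary lines should be removed once the identification has been checked.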

That was easy, huh?

OK, now you should take some time and think about what is required to identify a "Four of a Kind."  Consider all the possible ways that one would show up in the data.  How can you check for all the possibilities in an efficient way?  Is there something that can be done to make the search easier?

Don't read on until you've thought about it!





Well, I hope you thought about it.  Maybe you came up with the idea that it would be easier to identify the hands if the dice were sorted.  In fact, that is a very big help.  But sorting values across variables within an observation, rather than sorting the observations themselves, is not such a straightforward thing.  We can do something called a "Bubble Sort."  It is one of the simplest sort algorithms to program: it repeatedly compares neighboring values and swaps them if they are out of order.  For more information, look it up on the internet (Wikipedia has a good explanation).  The sort routine can be inserted after the data are generated, and before the identification part of the program.  Here we have included a set of test values that are exactly backwards.  The sort routine handles these correctly, along with all the random observations.  Examine the sort routine thoroughly so you understand how it works, and how it makes good use of the array structure.  (A drop option has also been added to the data set to streamline the output.)
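A bubble sort over the five dice might be sketched like this.  The helper variables "j" and "temp" and the particular backwards test values are assumptions for illustration.

```sas
data yahtzee (drop=i j temp);
  array x{5} x1-x5;
  do toss = 1 to 10;
    do i = 1 to 5;
      x{i} = ceil(6 * rand('uniform'));
    end;
    /* TEMPORARY test values, exactly backwards */
    if toss = 1 then do;
      x1 = 5; x2 = 4; x3 = 3; x4 = 2; x5 = 1;
    end;
    /* begin bubble sort: each pass moves the largest remaining value to the right */
    do i = 1 to 4;
      do j = 1 to 5 - i;
        if x{j} > x{j+1} then do;      /* neighbors out of order: swap them */
          temp   = x{j};
          x{j}   = x{j+1};
          x{j+1} = temp;
        end;
      end;
    end;
    /* end bubble sort */
    output;
  end;
run;

proc print data=yahtzee;
run;
```

After pass i, the last i positions hold their final values, which is why the inner loop stops at 5 - i.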

Sorting the dice means that all dice that are equal will be next to each other.  Thus, to check for a Yahtzee, all we need is to find out if x1=x5. If x1 and x5 are the same, it is not possible (in sorted order) for the numbers in between to be different.

Some examples of Four of a Kind (after sorting) are:
    1 2 2 2 2
    2 2 2 2 3
As you can see, either x1=x4 or x2=x5.  If it is not a Yahtzee, then these two conditions will identify Four of a Kind.

When it comes to Three of a Kind, we run into a little complication.  If we follow the strategy used for Four of a Kind, we would check if x1=x3, x2=x4, or x3=x5.  Consider the following examples:
    1 1 1 2 3
    1 2 2 2 5
    2 4 5 5 5
These would all be correctly identified.  But what about:
    1 1 1 2 2
    3 3 5 5 5
As you can see, the first of these fulfills the first condition proposed above (x1=x3) and the second fulfills the third condition (x3=x5), but both should be classified as Full House.  Therefore we need to check for a Full House before settling on Three of a Kind: the two dice outside the matching three must also be equal to each other.  The following identification routine checks for these types of hands.  At each stage, we have to be very careful that all possibilities are accounted for.
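A sketch of such a routine, built from the conditions worked out above, is given here.  It assumes the dice have already been sorted, and the variable name "hand" is illustrative.  The order of the tests matters, since each test relies on the earlier hands having been ruled out.

```sas
data yahtzee (drop=i j temp);
  array x{5} x1-x5;
  length hand $ 15;
  do toss = 1 to 10;
    do i = 1 to 5;
      x{i} = ceil(6 * rand('uniform'));
    end;
    /* bubble sort, so equal dice end up next to each other */
    do i = 1 to 4;
      do j = 1 to 5 - i;
        if x{j} > x{j+1} then do;
          temp = x{j}; x{j} = x{j+1}; x{j+1} = temp;
        end;
      end;
    end;
    /* begin identification (straights are not handled yet) */
    if x1 = x5 then hand = 'Yahtzee';
    else if x1 = x4 or x2 = x5 then hand = 'Four of a Kind';
    else if (x1 = x3 and x4 = x5) or (x1 = x2 and x3 = x5)
      then hand = 'Full House';
    else if x1 = x3 or x2 = x4 or x3 = x5 then hand = 'Three of a Kind';
    else hand = 'Chance';
    /* end identification */
    output;
  end;
run;

proc print data=yahtzee;
run;
```

The middle condition x2=x4 needs no Full House check: with the three middle dice equal, a Full House would require x1=x5, which would make the hand a Yahtzee.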

Now we are down to the straights. Large straights are simpler, as the only possibilities are 1-2-3-4-5 and 2-3-4-5-6.  Small straights have a number of different forms, such as 1-2-3-4-6 and 1-3-4-5-6, where none of the numbers are the same, and a number of possibilities involving numbers that are doubled, such as 1-1-2-3-4, 1-2-2-3-4, and 2-3-4-5-5, to give just a few examples.
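There are several ways to test for straights.  The sketch below uses one of them, which is not necessarily the approach the lesson had in mind: a second array of flags (called "has" here, an assumed name) records which of the numbers 1 to 6 appear in the toss, so doubled numbers take care of themselves.  The two test tosses are temporary.

```sas
data straights (drop=i k has1-has6);
  array x{5} x1-x5;
  array has{6} has1-has6;
  length hand $ 15;
  do toss = 1 to 10;
    do i = 1 to 5;
      x{i} = ceil(6 * rand('uniform'));
    end;
    /* TEMPORARY test values: one small and one large straight */
    if toss = 1 then do; x1 = 1; x2 = 2; x3 = 2; x4 = 3; x5 = 4; end;
    if toss = 2 then do; x1 = 2; x2 = 3; x3 = 4; x4 = 5; x5 = 6; end;
    /* flag which numbers are present: has{k} = 1 if some die shows k */
    do k = 1 to 6;
      has{k} = 0;
    end;
    do i = 1 to 5;
      has{x{i}} = 1;
    end;
    /* a large straight is 1-5 or 2-6; a small straight is any run of four */
    if has2 and has3 and has4 and has5 and (has1 or has6)
      then hand = 'Large Straight';
    else if (has1 and has2 and has3 and has4)
         or (has2 and has3 and has4 and has5)
         or (has3 and has4 and has5 and has6)
      then hand = 'Small Straight';
    else hand = 'Not a straight';
    output;
  end;
run;

proc print data=straights;
run;
```

In the complete identification routine, these two tests would go after Three of a Kind and before the final Chance.  A hand with four distinct numbers cannot contain three of a kind, so the straights never conflict with the earlier hands.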

Now we can change the number of observations to 10,000 and use proc freq to count the hands (DO NOT PRINT!).  Here is a table of the theoretical proportions.  These are given with 4 decimal places, which is convenient for a simulation of 10,000 tosses, because if you ignore the decimal point, each value is the expected number of that hand out of 10,000.

Yahtzee            .0008
Four of a Kind     .0193
Three of a Kind    .1543
Full House         .0386
Large Straight     .0309
Small Straight     .1235
Chance             .6326
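Putting the pieces together, the full simulation might be sketched as below.  The flag array "has" used for the straights, the seed passed to streaminit, and the variable names are all assumptions for illustration; your own identification routine can be substituted.

```sas
data yahtzee (keep=hand);
  array x{5} x1-x5;
  array has{6} has1-has6;
  length hand $ 15;
  call streaminit(2006);                  /* arbitrary seed, for repeatable results */
  do toss = 1 to 10000;
    do i = 1 to 5;
      x{i} = ceil(6 * rand('uniform'));
    end;
    /* bubble sort */
    do i = 1 to 4;
      do j = 1 to 5 - i;
        if x{j} > x{j+1} then do;
          temp = x{j}; x{j} = x{j+1}; x{j+1} = temp;
        end;
      end;
    end;
    /* flags for the straights: has{k} = 1 if some die shows k */
    do k = 1 to 6;
      has{k} = 0;
    end;
    do i = 1 to 5;
      has{x{i}} = 1;
    end;
    /* identification */
    if x1 = x5 then hand = 'Yahtzee';
    else if x1 = x4 or x2 = x5 then hand = 'Four of a Kind';
    else if (x1 = x3 and x4 = x5) or (x1 = x2 and x3 = x5)
      then hand = 'Full House';
    else if x1 = x3 or x2 = x4 or x3 = x5 then hand = 'Three of a Kind';
    else if has2 and has3 and has4 and has5 and (has1 or has6)
      then hand = 'Large Straight';
    else if (has1 and has2 and has3 and has4)
         or (has2 and has3 and has4 and has5)
         or (has3 and has4 and has5 and has6)
      then hand = 'Small Straight';
    else hand = 'Chance';
    output;
  end;
run;

/* count the hands; no proc print for 10,000 observations */
proc freq data=yahtzee;
  tables hand;
run;
```

The Percent column of the proc freq output, divided by 100, can be compared directly with the table; the simulated values will vary somewhat from run to run unless the seed is fixed.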



1.  Create another array called "d" (for differences) with four variables.  After the sort routine, load the "d" variables with the differences between the dice.  That is d1=x2-x1, d2=x3-x2, etc.  Rewrite the identification routine to use the differences rather than the original die values.  Plan your strategy by writing out what the differences look like for each hand, and try to come up with an efficient method of identifying the hands.  When finished, run 10,000 simulations and compare your results (use proc freq) with the theoretical values given above. 


Copyright reserved by Dr. Dwight Galster, 2006.  Please request permission for reprints (other than for personal use) from dwight.galster@sdstate.edu.  "SAS" is a registered trade name of SAS Institute, Cary, North Carolina.  All other trade names mentioned are the property of their respective owners.  This document is a work in progress and comments are welcome.  Please send an email if you find it useful or if your site links to it.