Lesson 41: Bootstrap II
Difference between regression and anova simulation
Download stateinfo data set.
Use proc univariate to examine the distribution of the area. It is not normal due in large part to outliers.
We will use this data to demonstrate how bootstrapping can estimate a population parameter, and then use simulation to compare the statistical performance of the bootstrap estimate with that of a "normal" (normal-theory) estimate.
We will consider the 50 state values to be the entire population. We will be sampling from this population, and studying the behavior of our estimates for the mean in terms of their ability to predict the population mean. We will use the area variable to do this.
Because this data is highly skewed with serious outliers, the median would ordinarily be the appropriate measure of central tendency. However, for the purpose of this example, we will focus on estimating the mean. The population mean (mu) can be read from the proc univariate output: it is 75894.1.
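Instead of copying the mean by hand, it can be stored in a macro variable. This is a minimal sketch, assuming the libref s and the variable area used in this lesson; the macro variable name trumean matches the one used in the lesson program, and the best12. format is an assumption chosen only to keep more digits than the default.

```sas
/* Assumes: libname s "c:\stat510"; has already been submitted */
/* and that s.stateinfo contains the numeric variable area.     */
proc sql noprint;
  select mean(area) format=best12. into :trumean
  from s.stateinfo;
quit;

%let trumean=&trumean;   /* re-assign to strip leading and trailing blanks */
%put Population mean of area: &trumean;
```

The value written to the log should agree with the mean shown by proc univariate, up to rounding.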
What is a confidence interval?
The next part of the process is a bit confusing because there are different levels of sampling. We begin by simulating taking a sample (without replacement) from the population. In this part we are simulating what happens when we really take a sample. The process is similar to what we studied the first time we did it in IML: assign a uniform random number to each observation, sort by it, and keep the first 20 observations (or whatever our sample size is). From this sample we can calculate a traditional x-bar and confidence interval based on normal theory.
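For reference, the normal-theory interval for a sample of size n with mean x-bar and standard deviation s can be written as follows. This corresponds to the tinv expression in the program for this lesson, which uses the 0.975 quantile (alpha = 0.05, a 95% interval); the symbol alpha is introduced here only for notation.

```latex
\bar{x} \;\pm\; t_{1-\alpha/2,\;n-1}\,\frac{s}{\sqrt{n}}
```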
Then we go one step further and bootstrap this sample: we resample from it, with replacement, many times, compute the mean of each resample, and find the percentiles of those means corresponding to the confidence level we want. (A 90% interval is convenient because P5 and P95 are easy to get.) These percentiles are the bootstrap lower and upper bounds. We then repeat the whole process (sample from the population again) and find a new P5 and P95. Do this many times and examine the coverage as well as the variability of the bounds. Does the bootstrap interval perform better than the normal-theory interval?
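In symbols, a sketch of the two quantities involved. The notation is an assumption made for illustration: B is the number of bootstrap resamples taken from one sample, S is the number of samples drawn from the population, and LB_j and UB_j are the bounds obtained from sample j.

```latex
% 90% bootstrap percentile interval from the B resample means
\left[\; \bar{x}^{*}_{(0.05)},\;\; \bar{x}^{*}_{(0.95)} \;\right]

% estimated coverage over S samples from the population
\widehat{\text{coverage}} \;=\; \frac{1}{S}\sum_{j=1}^{S}
  \mathbf{1}\{\, LB_j < \mu < UB_j \,\}
```

Here the subscripted x-bar-star terms denote the 5th and 95th percentiles of the B bootstrap means, and mu is the population mean.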
libname s "c:\stat510";
proc print data=s.stateinfo;
run;
options ls=80 nonotes;

*Treat data set as population data. Examine distributions, particularly area.;
proc univariate data=s.stateinfo normal plot;
  var area pop hipt;
run;

*Note non-normality, mainly due to Alaska.;
proc univariate data=s.stateinfo normal plot;
  var area;
  where numenter < 49;
run;

*We will study the efficiency of the bootstrap technique for building confidence intervals. Copy the mean from univariate output.;
*Let us use normal techniques for building confidence intervals and see what the true coverage is.;
%let trumean=75894.1;
%let n=20;
%let sims=10;
%let ds=s.stateinfo;

%macro CIstates;
  %do j=1 %to &sims; *This is the simulation loop;
    *This data step creates random numbers for simulation sample;
    data temp (keep=state area rand);
      set &ds;
      rand=uniform(0);
    run;
    proc sort data=temp out=sub1(drop=rand);
      by rand;
    run;
    data sub;
      set sub1(obs=&n);
    run;
    *proc print;run;
    *This sql step computes the normal-theory confidence limits;
    proc sql;
      create table est as
      select mean(area)-tinv(.975,(&n-1))*std(area)/sqrt(&n) as LB,
             mean(area)+tinv(.975,(&n-1))*std(area)/sqrt(&n) as UB
      from sub;
    quit;
    %if &j=1 %then %do; *Creates the summary data set on the first iteration;
      data summ;
        set est;
      run;
    %end;
    %else %do; *Adds to the summary data set on subsequent iterations;
      proc sql;
        insert into summ select * from est;
      quit;
    %end;
  %end; *end of simulation loop;
  *process summary data set;
  data summ2;
    set summ;
    cover=0;
    if lb<&trumean and ub>&trumean then cover=1;
  run;
  proc print data=summ2(obs=10);
  run;
  proc freq data=summ2;
    tables cover/nocum;
  run;
  proc means data=summ2 mean std cv;
    var lb ub;
  run;
%mend CIstates;

*Call the macro only after it has been defined;
%CIstates;

%macro CIboot;
  *This data step creates random numbers for simulation sample;
  data temp (keep=state area rand);
    set &ds;
    rand=uniform(0);
  run;
  proc sort data=temp out=sub1(drop=rand);
    by rand;
  run;
  data sub;
    set sub1(obs=&n);
    idr+1;
  run;
  %do j=1 %to &sims; *This is the simulation loop;
    *This data step creates random numbers for bootstrap sample;
    data rands(drop=i);
      do i=1 to &n;
        idr=int(uniform(0)*&n)+1;
        output;
      end;
    run;
    *This sql step creates the bootstrap sample;
    proc sql;
      * create table sub as ;
      select sub.* from sub, rands where sub.idr=rands.idr;
    quit;
    *left off here;
    %if &j=1 %then %do; *Creates the summary data set on the first iteration;
      data summ;
        set est;
    %end;
    %else %do; *Adds to the summary data set on subsequent iterations;
      proc sql;
        insert into summ select * from est;
    %end;
  %end; *end of simulation loop;
  *process summary data set;
  data summ2 (keep=modl);
    set summ;
  proc freq data=summ2;
    tables modl;
  run;
%mend CIboot;
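The CIboot macro above stops at "left off here", so the percentile step described in the text is never reached. The following is a minimal sketch of one way to carry the idea through. It is not the author's code: it swaps the macro loop and SQL join for DATA steps that read observations directly with the POINT= option. The data set names (pop, samples, bootmeans, bootci, bootci2), the macro variable boots, and the values chosen for sims and boots are all hypothetical. It assumes s.stateinfo has at least &n observations and no missing values of area. Note that it builds a 90% interval (P5, P95), as the text suggests, whereas CIstates uses tinv(.975, ...), which is a 95% interval.

```sas
libname s "c:\stat510";
%let trumean=75894.1;  /* population mean from the lesson           */
%let n=20;             /* sample size                               */
%let sims=200;         /* hypothetical: samples from the population */
%let boots=1000;       /* hypothetical: resamples per sample        */

/* Step 1: give every state a fresh random number for each simulation */
data pop;
  set s.stateinfo(keep=state area);
  do sim=1 to &sims;
    rand=uniform(0);
    output;
  end;
run;

proc sort data=pop;
  by sim rand;
run;

/* Keep the first &n states per simulation: a sample without replacement */
data samples(keep=sim idr area);
  set pop;
  by sim;
  if first.sim then idr=0;
  idr+1;
  if idr<=&n;
run;

/* Step 2: bootstrap each sample. Rows for simulation sim occupy    */
/* positions (sim-1)*&n+1 through sim*&n of samples, so a random    */
/* position in that block is a draw with replacement.               */
data bootmeans(keep=sim rep bmean);
  do sim=1 to &sims;
    do rep=1 to &boots;
      total=0;
      do i=1 to &n;
        p=(sim-1)*&n + int(uniform(0)*&n) + 1;
        set samples(keep=area) point=p;
        total=total+area;
      end;
      bmean=total/&n;   /* mean of one bootstrap resample */
      output;
    end;
  end;
  stop;                 /* required when reading with POINT= */
run;

/* Step 3: 5th and 95th percentiles of the bootstrap means, per sample */
proc univariate data=bootmeans noprint;
  by sim;
  var bmean;
  output out=bootci pctlpts=5 95 pctlpre=P;
run;

/* Step 4: coverage and variability of the bootstrap bounds */
data bootci2;
  set bootci;
  cover=(P5<&trumean and P95>&trumean);
run;

proc freq data=bootci2;
  tables cover/nocum;
run;

proc means data=bootci2 mean std cv;
  var P5 P95;
run;
```

The proc freq and proc means steps mirror the summary at the end of CIstates, so the two sets of output can be compared side by side once both intervals are built at the same confidence level.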
Exercise:
Copyright reserved by Dr. Dwight Galster, 2006. Please request permission for reprints (other than for personal use) from dwight.galster@sdstate.edu . "SAS" is a registered trade name of SAS Institute, Cary North Carolina. All other trade names mentioned are the property of their respective owners. This document is a work in progress and comments are welcome. Please send an email if you find it useful or if your site links to it.