Lesson 42:  Bootstrap III

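Normal-theory confidence intervals for a mean should use the t distribution.  Use tinv(p, df), but be careful about which tail probability the function expects, because this differs between software packages.  The SAS TINV function takes the left-tail (cumulative) probability, whereas the Excel TINV function takes the two-tailed probability.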
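As a small illustration of the tail convention, the sketch below computes the critical value for a 90% two-sided interval with n = 20. The variable names (n, alpha, tcrit) are chosen here only for illustration. Because SAS expects the left-tail probability, the argument is 1 - alpha/2 = 0.95, which matches the tinv(.95, (&n-1)) call used in the program below.

```sas
data _null_;
  n = 20;                            /* sample size (illustrative value) */
  alpha = 0.10;                      /* 1 - confidence level, i.e. a 90% interval */
  tcrit = tinv(1 - alpha/2, n - 1);  /* left-tail probability 0.95 with 19 df */
  put tcrit=;                        /* writes the critical value to the log */
run;
```

In Excel, the same critical value would be requested with the two-tailed probability, that is TINV(0.10, 19).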

Continue building the program from last time.

```
libname s "c:\stat510";
proc print data=s.stateinfo;
run;
options ls=80 nonotes;
*treat data set as population data.  Examine distributions, particularly area.;
proc univariate data=s.stateinfo normal plot;
var area pop hipt;
run;

*note non-normality, mainly due to Alaska.;
proc univariate data=s.stateinfo normal plot;
var area;
where numenter < 49;
run;

*We will study the efficiency of the bootstrap technique for building
confidence intervals.  Copy the mean from univariate output.;

*Let us use normal techniques for building confidence intervals and
see what the true coverage is.;
%let trumean=75894.1;
%let n=20;
%let sims=100;
%let ds=s.stateinfo;
*The CIstates macro defined below must be submitted before this call will run;
%cistates;
%macro CIstates;
%do j=1 %to &sims; *This is the simulation loop;
data temp (keep=state area rand);
set &ds;
rand=uniform(0);
*This data step creates random numbers for simulation sample;
proc sort data=temp out=sub1(drop=rand);
by rand;
data sub;
set sub1(obs=&n);
*proc print;run;
*This sql step computes the 90 percent t-based confidence limits from the sample;
proc sql;
create table est as
select mean(area)-tinv(.95,(&n-1))*std(area)/sqrt(&n) as LB,
mean(area)+tinv(.95,(&n-1))*std(area)/sqrt(&n) as UB from sub;
quit;
%if &j=1 %then %do;
*Creates the summary data set on the first iteration;
data summ;
set est;
%end;
%else %do;
*Adds to the summary data set on subsequent iterations;
proc sql;
insert into summ select * from est;
quit;
%end;
%end;  *end of simulation loop;
*process summary data set;
data summ2;
set summ;
cover=0;
if lb<&trumean and ub>&trumean then cover=1;
proc print data=summ2(obs=10);
run;
proc freq data=summ2;
tables cover/nocum;
proc means data=summ2 mean std cv;
var lb ub;
run;
%mend CIstates;

%let trumean=75894.1;
%let n=20;
%let sims=100;
%let boots=100;
%let ds=s.stateinfo;
*The CIboot macro defined below must be submitted before this call will run;
%ciboot;

%macro CIboot;
%do k=1 %to &sims;  *This is the simulation loop;
data temp (keep=state area rand);
set &ds;
rand=uniform(0);
*This data step creates random numbers for simulation sample;
proc sort data=temp out=sub1(drop=rand);
by rand;
data sub;
set sub1(obs=&n);
idr+1;
run;
%do j=1 %to &boots; *This is the bootstrap loop;
*This data step creates random numbers for bootstrap sample;
data rands(drop=i);
do i=1 to &n;
idr=int(uniform(0)*&n)+1;
output;
end;
*This sql step creates the bootstrap sample and calculates xbar;
%if &j=1 %then %do;
proc sql;
create table summ as
select mean(area)as xbar from sub, rands where sub.idr=rands.idr;
quit;
%end;
%else %do;
proc sql;
insert into summ
select mean(area)as xbar from sub, rands where sub.idr=rands.idr;
quit;
%end;
%end;  *end of bootstrap loop;
proc means data=summ noprint;
var xbar;
output out=summ2 p5=lb p95=ub;
run;
%if &k=1 %then %do;
data summ3;
set summ2;
run;
%end;
%else %do;
proc sql;
insert into summ3
select * from summ2;
quit;
%end;

%end; *simulation loop;
data summ4;
set summ3;
cover=0;
if lb<&trumean and ub>&trumean then cover=1;
proc print data=summ4(obs=10);
proc freq data=summ4;
tables cover/nocum;
proc means data=summ4 mean std cv;
var lb ub;
run;
%mend CIboot;
```

Exercise:

Using what we have learned, use simulations to compare the following, using the area variable in the stateinfo data set:

The bias and variance of the mean vs. the median as measures of central tendency.  Bias is E(sample statistic) - Parameter.  In the simulation, E(sample statistic) is estimated by the mean of the simulated values.  Do this for sample sizes of 10, 20, and 30.  Write a paragraph explaining your results.
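Written out, the simulation estimates might look like the following. The notation is an assumption made here for illustration: S is the number of simulations, \hat{\theta}_s is the statistic (mean or median) computed from simulated sample s, and \theta is the corresponding population parameter. The variance estimate shown is the usual sample variance of the simulated values, which is one reasonable choice rather than the only one.

```latex
\bar{\hat{\theta}} = \frac{1}{S} \sum_{s=1}^{S} \hat{\theta}_s ,
\qquad
\widehat{\mathrm{Bias}} = \bar{\hat{\theta}} - \theta ,
\qquad
\widehat{\mathrm{Var}} = \frac{1}{S-1} \sum_{s=1}^{S} \left( \hat{\theta}_s - \bar{\hat{\theta}} \right)^2
```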

Using the bootstrapping method of the example, investigate the effect of sample size (10, 20, 30) on the coverage of the bootstrap confidence intervals.  You may want to try to make the program more efficient.  You can even insert IML code in the macro if you wish.  In any case, be prepared to do other work while waiting for your simulations to finish :-)  Write a paragraph explaining your results.
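One possible way to speed things up, offered only as a sketch and not as the method used in the lesson, is to draw all bootstrap resamples in a single step with PROC SURVEYSELECT instead of looping over data steps and SQL joins. The data set sub and the macro variable &boots come from the example program. The names bootsamp and bootmeans and the seed value are assumptions made here for illustration.

```sas
/* Draw &boots resamples of the same size as SUB, with replacement. */
/* METHOD=URS is unrestricted random sampling (with replacement);   */
/* OUTHITS writes one row per selection, so repeats appear as rows. */
proc surveyselect data=sub out=bootsamp seed=12345
     method=urs samprate=1 outhits reps=&boots noprint;
run;

/* One mean per resample; SURVEYSELECT labels resamples with Replicate. */
proc means data=bootsamp noprint nway;
  class replicate;
  var area;
  output out=bootmeans mean=xbar;
run;

/* Percentile limits, as in the lesson's PROC MEANS step. */
proc means data=bootmeans noprint;
  var xbar;
  output out=summ2 p5=lb p95=ub;
run;
```

These three steps would stand in for the inner bootstrap loop; the outer simulation loop and the coverage calculation would stay as they are.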

Copyright reserved by Dr. Dwight Galster, 2006.  Please request permission for reprints (other than for personal use) from dwight.galster@sdstate.edu.  "SAS" is a registered trade name of SAS Institute, Cary, North Carolina.  All other trade names mentioned are the property of their respective owners.  This document is a work in progress and comments are welcome.  Please send an email if you find it useful or if your site links to it.