SAS Programming: October 2007

10/18/2007

Creating and Using Dummy Variables

A dummy variable is a numeric variable used in regression analysis to represent subgroups of the sample in your study, for example to distinguish different treatment groups. Dummy variables are useful because they let us use a single regression equation to represent multiple groups, so we don't need to write out a separate model for each subgroup. If we have "n" subgroups we need (n-1) dummy variables; the subgroup without its own dummy variable serves as the reference group.
EX:
Consider this simple data file with 9 subjects (sub) in 3 groups (iv) and a score for each subject (dv).

SAS program:

data dummy;
input sub iv dv;
cards;
1 1 48
2 1 49
3 1 50
4 2 17
5 2 20
6 2 23
7 3 28
8 3 30
9 3 32
;
run;
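data dummy2;
set dummy;
/* one 0/1 indicator variable per group */
if (iv=1) then iv1=1; else iv1=0;
if (iv=2) then iv2=1; else iv2=0;
if (iv=3) then iv3=1; else iv3=0;
run;
/* iv3 is left out of the model, so group 3 is the reference group */
proc reg data=dummy2;
model dv=iv1 iv2;
run;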
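
As a hedged sketch of what this regression estimates (the symbols \beta_0, \beta_1, \beta_2 and \varepsilon are notation introduced here, not part of the original post), the model statement above corresponds to one equation in which group 3, whose dummy iv3 is left out, is the reference group:

```latex
\begin{aligned}
dv &= \beta_0 + \beta_1\,iv1 + \beta_2\,iv2 + \varepsilon \\
\text{group 1: } E[dv] &= \beta_0 + \beta_1 \\
\text{group 2: } E[dv] &= \beta_0 + \beta_2 \\
\text{group 3: } E[dv] &= \beta_0
\end{aligned}
```

For the nine observations above the group means are 49, 20 and 30, so least squares should give an intercept of 30, a coefficient of 19 for iv1 and a coefficient of -10 for iv2.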

simple anova example

EX: the response time in milliseconds was determined for 3 different types of circuits.

sas program:

data time;
input circuit $ Time @@;
cards;
A 9 A 12 A 10 A 9 A 15
B 20 B 21 B 23 B 17 B 30
C 6 C 5 C 8 C 16 C 7
;
run;
proc anova data=time;
class circuit ;
model time= circuit;
means circuit / duncan scheffe alpha= 0.01 tukey lines;
run;

Analysis of variance

Proc Anova General Form

proc anova data=dataset;
by variables;
class variables;
model dependent variable=independent variables;
means effects ;

The class statement designates which variables are factors (classification variables) in the model. A class variable can have numeric or character values.
The model statement has the form dependent variable = effects, where the effects are built from the class variables.
The means statement computes means for each listed effect and, with options such as tukey or duncan, requests multiple comparisons. List after means each effect for which you want them.

10/04/2007

examples

example1:
The data set “fitness.dat” is required in this problem. Consider the simple straight-line regression of maxpulse (Y) on runpulse (X).

Physical Fitness Data
These measurements were made on 31 men involved in a physical
fitness course at NC State University. The variables are
AGE (years)
WEIGHT (KG)
OXY (Oxygen uptake rate, ML per KG body weight per minute)
RUNTIME (time to run 1.5 miles, minutes)
RSTPULSE (heart rate while resting)
RUNPULSE (heart rate while running)
MAXPULSE (maximum heart rate while running)

Determine the ANOVA table
Test the significance of the straight line using an F test.
Compute the estimated regression line.
Plot the data and overlay the estimated regression line.
Comment on how well the line fits the data. (HINT: Comment on everything that you see on your output.)
Perform a multiple regression analysis and comment on the output. (Y=maxpulse; X1=age, X2=weight, X3=oxy, X4=runtime, X5=rstpulse, X6=runpulse)

answer:

filename data 'c:\abc\fitness.dat';
data dataset;
infile data;
input age weight oxy runtime rstpulse runpulse maxpulse;
run;
proc print data=dataset;
run;
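/* Note: with runpulse as a class variable this is a one-way ANOVA over the distinct runpulse values. */
/* The ANOVA table and F test for the straight line are printed by proc reg below. */
proc anova data=dataset;
class runpulse;
model maxpulse=runpulse;
means runpulse;
run;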
proc reg data=dataset;
model maxpulse=runpulse;
run;
proc reg data=dataset;
model maxpulse=age weight oxy runtime rstpulse runpulse;
run;
proc reg data=dataset;
model maxpulse=runpulse;
plot predicted.*runpulse='p' maxpulse*runpulse='*' / overlay;
run;

example 2:

DATA D1;
INPUT STATUS $ SEX $ NUMBER;
DATALINES;
I BB 34
I BG 26
I GG 15
S BB 14
S BG 22
S GG 36
;
PROC FREQ DATA=D1;
TABLES STATUS*SEX / ALL;
WEIGHT NUMBER;
TITLE1 'JOHN DOE';
RUN;

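This program produces several tables. Because of the WEIGHT statement, NUMBER is treated as the count of observations for each combination of STATUS and SEX, so PROC FREQ builds the STATUS by SEX crosstabulation from those counts. The ALL option additionally requests the chi-square tests and measures of association for the table.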

advanced examples

example:

In a class of 88 students, the exam scores in “Mechanics (X1)”, “Vector Analysis (X2)”, “Algebra (X3)”, “Calculus (X4)” and “Statistics (X5)” are given in the file exam.dat.

Calculate the correlation matrix of all variables.
Calculate the descriptive statistics of “Mechanics” and “Statistics” by using PROC UNIVARIATE with options “normal” and “plot”.
Answer the following parts:

i) Is the average of Algebra results different than the average of Vector Analysis?

ii) Does any difference between the averages of Calculus and Statistics exist?

(HINT: Think of t-test for two samples’ mean.)
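
One possible way to start on the first two parts, sketched under assumptions: the path to exam.dat, the data set name exam and the variable names x1-x5 are hypothetical, and the file is assumed to hold five blank-separated numeric scores per student.

```sas
/* Assumption: hypothetical path; exam.dat has five numeric scores per line */
filename examfile 'c:\abc\exam.dat';
data exam;
infile examfile;
input x1 x2 x3 x4 x5;
run;
/* correlation matrix of all five scores */
proc corr data=exam;
var x1 x2 x3 x4 x5;
run;
/* descriptive statistics, normality test and plots for Mechanics and Statistics */
proc univariate data=exam normal plot;
var x1 x5;
run;
```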

Testing Differences Between Two Means

For a paired t-test use proc means or proc univariate. The paired t-test looks at the differences between two measures that are dependent or correlated and tests whether the mean difference equals zero. To use either procedure for a paired t-test, create a new variable that is the difference between the two measures and test whether its mean is equal to 0.
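
The paired approach described above can be sketched as follows. This assumes a data set named exam with the scores in variables x2 (Vector Analysis) and x3 (Algebra); those names are hypothetical.

```sas
/* Assumption: data set exam with numeric variables x2 and x3 (hypothetical names) */
data paired;
set exam;
diff = x3 - x2;   /* difference between the two measures on the same student */
run;
/* t and prt give the t statistic and p-value for H0: mean of diff = 0 */
proc means data=paired n mean stderr t prt;
var diff;
run;
```

If the p-value printed under prt is small, the mean difference is unlikely to be zero.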

proc ttest procedure

The proc ttest procedure tests whether two means are equal. It reports p-values for the case where the two variances are unequal (the unequal variances t-test) and for the case where the two variances are equal (the pooled t-test). For the unequal variances t-test it computes approximate degrees of freedom. Proc ttest also reports a test of whether the two variances are equal, which helps decide which t-test is appropriate.
The class statement identifies the variable that divides the data set into two groups. The class variable must have exactly two values (either numeric or character).

Proc ttest general form:
....
...
....
...
run;
proc ttest data=dataset;
by variables;
class variable;
var variables;
run;
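
As an illustration, the circuit data from the ANOVA example could be compared two groups at a time. This sketch assumes the data set time created earlier is still available; restricting it to circuits A and B with a where statement is a choice made here so that the class variable has exactly two values.

```sas
/* keep only two of the three circuits so that class has two levels */
proc ttest data=time;
where circuit in ('A', 'B');
class circuit;
var time;
run;
```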

one simple example

EXAMPLE:
Plant wages: Weekly wages($) of 60 wage earners in a plant during the week of Jan 5 were as follows:
609 601 592 604 569 625 655 582 583 610 582 589 586 625 610 598 608 600 595 598 589 621 605 650 610 602 627 600 599 576 591 621 603 597 605 565 627 579 601 610 578 615 575 646 587 572 618 645 575 609 631 631 653 615 607 635 586 637 609 585

Use SAS to produce descriptive statistics. Comment on the distribution of the given data set.

answer:
the program must be

data plant_wages;
/* list input: values are separated by blanks, so column positions do not matter */
INPUT id wages;
cards;
1 609
2 601
3 592
4 604
5 569
6 625
7 655
8 582
9 583
10 610
11 582
12 589
13 586
14 625
15 610
16 598
17 608
18 600
19 595
20 598
21 589
22 621
23 605
24 650
25 610
26 602
27 627
28 600
29 599
30 576
31 591
32 621
33 603
34 597
35 605
36 565
37 627
38 579
39 601
40 610
41 578
42 615
43 575
44 646
45 587
46 572
47 618
48 645
49 575
50 609
51 631
52 631
53 653
54 615
55 607
56 635
57 586
58 637
59 609
60 585
;
proc sort;
by wages;
run;
proc means;
var wages;
run;
proc univariate normal plot;
var wages;
run;

comment:
The data appear to be normally distributed.
Ho: the data are normally distributed.
H1: the data are not normally distributed.
Prob
In the box plot the median and the mean are approximately equal.

proc univariate statement

proc univariate generates descriptive statistics (skewness, kurtosis, t-test, sign test, signed rank test, median, mode, etc.).
The plot option generates a stem-and-leaf plot, a box plot and a normal probability plot. For large data sets a horizontal bar chart may be produced instead of the stem-and-leaf plot.
The normal option generates a statistic to test for normality and its p-value. Here the hypotheses are
Ho: the data set is distributed normally
H1: the data set is not distributed normally
By looking at the p-value we decide whether or not to reject Ho: if the p-value is less than the chosen significance level we reject Ho, otherwise we cannot reject it.
proc univariate general form:
....
....
...
run;
proc univariate data=dataset ;
by variable ;
var variable;
run;

proc means statement

proc means is the procedure to use when you are only interested in basic descriptive statistics.
proc means general form:
...
...
..
run;
proc means data=dataset ;
by variables;
var variables;
run;
There are some options. When we write only proc means, SAS gives us the sample size, mean, standard deviation, and minimum and maximum values.
Some useful options:
range: the range
sum: the sum
var: the variance
stderr: the standard error of the mean
prt: the p-value for the t-test that the mean is different from 0
clm: two-sided 95% confidence interval for the mean

example:

DATA WAGES;
INPUT SUBJECT WAGE;
DATALINES;
1 609
2 601
3 592
4 604
5 569
6 625
7 655
8 582
9 583
10 610
;
PROC MEANS DATA=WAGES clm;
VAR WAGE;
RUN;

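Without the clm option, this program gives us the 10 workers' sample size, mean wage, standard deviation, minimum and maximum.

With clm, as written above, it gives the two-sided confidence interval for the mean; when statistics are listed on the proc means statement, only those statistics are printed.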
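
To see the default statistics and the confidence limits together, the keywords can be listed explicitly. A small sketch using the WAGES data set from the example above; the alpha= value is written out only to show where the confidence level is set.

```sas
/* list every statistic wanted; alpha=0.05 gives 95% confidence limits */
proc means data=wages n mean std min max clm alpha=0.05;
var wage;
run;
```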

10/03/2007

proc print statement

this procedure tells SAS to print out certain variables in the data set.
proc print general form:

data dataset;
input variables;
datalines;
...
...
...
;
run;
proc print data=dataset;
by variables;
var variables;
run;

the keyword var is short for variable list. You list the variables you want to print after var in the order you want them printed.

Note: in the input statement, if a variable's values are not numbers, write "$" after the variable name. For example:
data dataname;
input name $ age city $; ---> name and city are character variables
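
A small runnable sketch of the note above; the data set name people and the three data lines are made up for illustration.

```sas
/* name and city are character variables, so each is followed by $ */
data people;
input name $ age city $;
datalines;
Ann 25 Ankara
Bob 31 Izmir
Cem 28 Bursa
;
run;
/* print only the character variables, in this order */
proc print data=people;
var name city;
run;
```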

general sas structure

general form of simple data step

data dataset; ---> the data statement names the data set.
input variables; ---> input is the keyword that defines the names of the variables in the data set.
datalines; ---> this statement signals the beginning of the lines of data.
the lines of data
...
...
...
;
run;

sas introduction

When SAS is started it has 4 main windows open:
1-the program editor
2-the log window
3-the output window
4-the explorer window

We write all code in the editor window.
A SAS program is organized into steps. There are two types of steps: DATA steps, which put data in a form that SAS can use, and PROC steps, which use procedures to do something to the data, such as sorting it, analyzing it or printing it.
A semicolon (;) is required to denote the end of a statement. Spacing does not matter: you can put as many spaces between the words/keywords of a statement as you like.
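
For instance, the following two data steps are read by SAS in the same way; the data set names a and b and the values are invented for this sketch.

```sas
/* one statement per line */
data a;
input x y;
datalines;
1 2
;
run;
/* same statements with different spacing; each statement still ends with a semicolon */
data   b;   input   x
   y;
datalines;
1 2
;
run;
```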

what is sas?

The SAS system is a combination of programs originally designed to perform statistical analysis of data. Compared to other programs, the SAS system provides a large number of statistical and nonstatistical functions. Using SAS is very simple compared to the others.
