10/18/2007
Creating and Using Dummy Variables
Example:
Consider this simple data file: 9 subjects (sub) in 3 groups (iv), each with a score (dv).
SAS program:
data dummy;
input sub iv dv;
cards;
1 1 48
2 1 49
3 1 50
4 2 17
5 2 20
6 2 23
7 3 28
8 3 30
9 3 32
;
run;
data dummy2;
set dummy;
if (iv=1) then iv1=1; else iv1=0;
if (iv=2) then iv2=1; else iv2=0;
if (iv=3) then iv3=1; else iv3=0;
run;
proc reg data=dummy2;
model dv=iv1 iv2;
run;
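The indicator variables can also be built without IF/ELSE, because a SAS comparison such as (iv = 1) evaluates to 1 when true and 0 when false. The sketch below repeats the data so that it runs on its own; the data set name dummy3 is made up for illustration. With iv1 and iv2 in the model, group 3 acts as the reference group. The group means are 49, 20 and 30, so the fitted intercept should be 30 (the group 3 mean), the iv1 coefficient 19 (49 - 30) and the iv2 coefficient -10 (20 - 30).

```sas
/* Same data as in the program above, repeated so this sketch is self-contained. */
data dummy;
  input sub iv dv;
  cards;
1 1 48
2 1 49
3 1 50
4 2 17
5 2 20
6 2 23
7 3 28
8 3 30
9 3 32
;
run;

/* A comparison returns 1 (true) or 0 (false), so no IF/ELSE is needed. */
data dummy3;   /* dummy3 is a hypothetical name */
  set dummy;
  iv1 = (iv = 1);
  iv2 = (iv = 2);
  iv3 = (iv = 3);
run;

/* iv3 is left out of the model, so group 3 is the reference group. */
proc reg data=dummy3;
  model dv = iv1 iv2;
run;
quit;
```

Only two of the three dummy variables enter the model: adding iv3 as well would make the predictors sum to the intercept column, so the model would not be of full rank.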
Simple ANOVA example
SAS program:
data time;
input circuit $ Time @@;
cards;
A 9 A 12 A 10 A 9 A 15
B 20 B 21 B 23 B 17 B 30
C 6 C 5 C 8 C 16 C 7
;
run;
proc anova data=time;
class circuit ;
model time= circuit;
means circuit / duncan scheffe alpha= 0.01 tukey lines;
run;
Analysis of variance
proc anova data=dataset;
by variables;
class variables;
model dependent variable=independent variables;
means effects / options;
run;
The CLASS statement designates which variables are factors in the model. A CLASS variable can have numeric or character values.
The MODEL statement has the form dependent variable = effects, where the effects are built from the CLASS variables.
The MEANS statement requests group means and, with options such as tukey or scheffe, multiple comparisons. After MEANS, list each effect for which you want multiple comparisons.
10/04/2007
Examples
The data set "fitness.dat" is required for this problem. Consider the simple straight-line regression of maxpulse (Y) on runpulse (X).
These measurements were made on 31 men involved in a physical
fitness course at NC State University. The variables are
AGE (years)
WEIGHT (KG)
OXY (Oxygen uptake rate, ML per KG body weight per minute)
RUNTIME (time to run 1.5 miles, minutes)
RSTPULSE (heart rate while resting)
RUNPULSE (heart rate while running)
MAXPULSE (maximum heart rate while running)
- Determine the ANOVA table
- Test the significance of the straight line using an F test.
- Compute the estimated regression line.
- Plot the data and overlay the estimated regression line.
- Comment on how well the line fits the data. (HINT: Comment on everything that you see on your output.)
- Perform a multiple regression analysis and comment on the output. (Y=maxpulse; X1=age, X2=weight, X3=oxy, X4=runtime, X5=rstpulse, X6=runpulse)
answer:
filename data 'c:\abc\fitness.dat';
data dataset;
infile data;
input age weight oxy runtime rstpulse runpulse maxpulse;
run;
proc print data=dataset;
run;
proc anova data=dataset;
class runpulse;
model maxpulse=runpulse;
means runpulse;
run;
proc reg data=dataset;
model maxpulse=runpulse;
run;
proc reg data=dataset;
model maxpulse=age weight oxy runtime rstpulse runpulse;
run;
proc reg data=dataset;
model maxpulse=runpulse;
plot predicted.*runpulse='p' maxpulse*runpulse='*' / overlay;
run;
example 2:
DATA D1;
INPUT STATUS $ SEX $ NUMBER;
DATALINES;
I BB 34
I BG 26
I GG 15
S BB 14
S BG 22
S GG 36
;
RUN;
PROC FREQ DATA=D1;
TABLES STATUS*SEX / ALL;
WEIGHT NUMBER;
TITLE1 'JOHN DOE';
RUN;
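Running this program produces several tables. PROC FREQ builds a two-way table of STATUS by SEX. Because of the WEIGHT statement, each cell holds the total of NUMBER for that combination rather than the number of data lines. The ALL option adds the chi-square tests and the measures of association.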
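If only the chi-square test of independence is of interest, the ALL option can be swapped for a narrower request. The sketch below is one way to do that, reusing the same six data lines; CHISQ asks for the chi-square tests and EXPECTED prints the expected cell counts under independence.

```sas
/* Same counts as in example 2, repeated so this sketch runs on its own. */
data d1;
  input status $ sex $ number;
  datalines;
I BB 34
I BG 26
I GG 15
S BB 14
S BG 22
S GG 36
;
run;

proc freq data=d1;
  weight number;                       /* each line stands for NUMBER observations */
  tables status*sex / chisq expected;  /* chi-square tests and expected cell counts */
run;
```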
Advanced examples
In a class of 88 students, the exam scores for "Mechanics (X1)", "Vector Analysis (X2)", "Algebra (X3)", "Calculus (X4)" and "Statistics (X5)" are given in the file exam.dat.
- Calculate the correlation matrix of all variables.
- Calculate the descriptive statistics of "Mechanics" and "Statistics" by using PROC UNIVARIATE with the options "normal" and "plot".
- Answer the following parts:
i) Is the average of Algebra results different than the average of Vector Analysis?
ii) Does any difference between the averages of Calculus and Statistics exist?
(HINT: Think of t-test for two samples’ mean.)
Testing Differences Between Two Means
The PROC TTEST procedure
The CLASS statement identifies the variable that divides the data set into two groups. The CLASS variable must have exactly two values (either numeric or character).
PROC TTEST general form:
....
...
....
...
run;
proc ttest data=dataset;
by variables;
class variable;
var variables;
run;
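As a concrete instance of the general form, the sketch below reuses the circuit data from the simple ANOVA example. That data has three circuits, so a WHERE statement keeps only two of them (the choice of A and B is arbitrary and only for illustration), since the CLASS variable must have exactly two values.

```sas
/* Circuit data from the ANOVA example, repeated so this sketch runs on its own. */
data time;
  input circuit $ time @@;
  cards;
A 9 A 12 A 10 A 9 A 15
B 20 B 21 B 23 B 17 B 30
C 6 C 5 C 8 C 16 C 7
;
run;

proc ttest data=time;
  where circuit in ('A', 'B');  /* keep two groups only: CLASS needs exactly two values */
  class circuit;                /* grouping variable */
  var time;                     /* variable whose two group means are compared */
run;
```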
one simple example
Plant wages: Weekly wages($) of 60 wage earners in a plant during the week of Jan 5 were as follows:
609 601 592 604 569 625 655 582 583 610 582 589 586 625 610 598 608 600 595 598 589 621 605 650 610 602 627 600 599 576 591 621 603 597 605 565 627 579 601 610 578 615 575 646 587 572 618 645 575 609 631 631 653 615 607 635 586 637 609 585
Use SAS to produce descriptive statistics. Comment on the distribution of the given data set.
answer:
The program is:
data plant_wages;
INPUT id wages; /* list input: values are separated by blanks, so column positions do not matter */
cards;
1 609
2 601
3 592
4 604
5 569
6 625
7 655
8 582
9 583
10 610
11 582
12 589
13 586
14 625
15 610
16 598
17 608
18 600
19 595
20 598
21 589
22 621
23 605
24 650
25 610
26 602
27 627
28 600
29 599
30 576
31 591
32 621
33 603
34 597
35 605
36 565
37 627
38 579
39 601
40 610
41 578
42 615
43 575
44 646
45 587
46 572
47 618
48 645
49 575
50 609
51 631
52 631
53 653
54 615
55 607
56 635
57 586
58 637
59 609
60 585
;
proc sort;
by wages;
run;
proc means;
var wages;
run;
proc univariate normal plot;
var wages;
run;
comment:
The distribution of this data set appears to be normal.
H1: the data are not normally distributed.
Prob
In the box plot, the median and the mean are approximately the same.
PROC UNIVARIATE statement
The plot option generates a stem-and-leaf plot, a box plot and a normal probability plot. For large data sets, a horizontal bar chart may be produced instead of the stem-and-leaf plot.
The normal option generates test statistics for normality together with their p-values. Here our hypotheses are
Ho: the data set is distributed normally
H1: the data set is not distributed normally
By looking at the p-value we decide whether or not to reject Ho.
If the p-value is less than the chosen significance level, we reject Ho.
proc univariate general form:
....
....
...
run;
proc univariate data=dataset;
by variable ;
var variable;
run;
proc means statement
proc means general statement:
...
...
..
run;
proc means data=dataset;
by variables;
var variables;
run;
PROC MEANS accepts options. When we write only proc means, SAS gives us the sample size, the mean, the standard deviation, and the minimum and maximum values.
Some useful options:
range: the range
sum: the sum
var: the variance
stderr: the standard error of the mean
prt: the p-value for the t test of whether the mean differs from 0
clm: two-sided 95% confidence limits for the mean
example:
DATA WAGES;
INPUT SUBJECT WAGE;
DATALINES;
1 609
2 601
3 592
4 604
5 569
6 625
7 655
8 582
9 583
10 610
;
PROC MEANS DATA=WAGES clm;
VAR WAGE;
RUN;
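Without any statistic keywords, PROC MEANS reports the sample size, mean, standard deviation, minimum and maximum wage of the 10 workers.
Adding clm requests a two-sided confidence interval for the mean. Once a keyword such as clm is listed, only the listed statistics are printed.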
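To see the default statistics and the confidence limits side by side, the keywords can be listed explicitly. The sketch below repeats the ten wage values so it runs on its own; alpha=0.05 is the default level and is written out only to show where the confidence level is set.

```sas
/* Same ten workers as above, repeated so this sketch is self-contained. */
data wages;
  input subject wage;
  datalines;
1 609
2 601
3 592
4 604
5 569
6 625
7 655
8 582
9 583
10 610
;
run;

/* List the default statistics explicitly and add the confidence limits. */
proc means data=wages n mean std min max clm alpha=0.05;
  var wage;
run;
```

Changing alpha=0.05 to alpha=0.01 would give 99% limits instead.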
10/03/2007
proc print statement
proc print general form:
data dataset;
input variables;
datalines;
...
...
...
;
run;
proc print data=dataset;
by variables;
var variables;
run;
The keyword var is short for variables. After var, list the variables you want to print, in the order you want them printed.
Note: in the INPUT statement, if a variable's values are not numbers, write "$" after the variable name. For example:
data dataname;
input name $ age city $; ---> name and city are character (non-numeric) variables
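A complete version of this fragment might look like the sketch below. The data set name people and the three data lines are made up purely for illustration; only the INPUT statement comes from the note above.

```sas
data people;                /* hypothetical data set name */
  input name $ age city $;  /* $ marks name and city as character variables */
  datalines;
Ali 25 Ankara
Ayse 31 Izmir
John 40 Boston
;
run;

/* Print the variables in a chosen order. */
proc print data=people;
  var name city age;
run;
```

With this simple list input, character values are read up to the next blank, so a value containing a space would need a different input style.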
general sas structure
data dataset; ---> the DATA statement names the data set.
input variables; ---> INPUT is the keyword that defines the names of the variables in the data set.
datalines; ---> this statement signals the beginning of the data lines.
the lines of data
...
...
...
;
run;
sas introduction
1-the program editor
2-the log window
3-the output window
4-the explorer window

We write all code in the editor window.
A SAS program is organized into steps. There are two types of steps: DATA steps, which put data in a form that SAS can use, and PROC steps, which use procedures to do something to the data, such as sorting, analyzing or printing it.
A semicolon (;) is required to denote the end of every statement. Spacing does not matter: you can put as many spaces between the words and keywords of a statement as you like.
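The sketch below illustrates the point about semicolons and spacing: the two PROC PRINT steps are laid out very differently but mean the same thing to SAS. The data set small and its values are made up for illustration.

```sas
/* A tiny hypothetical data set. The data lines themselves must start
   on the line after DATALINES and end with a line holding a semicolon. */
data small;
  input x y;
  datalines;
1 2
3 4
;
run;

/* Compact layout: several statements on one line. */
proc print data=small; var x y; run;

/* Same step, spread out with extra spaces and line breaks. */
proc   print
   data = small ;
   var   x
         y ;
run ;
```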