From: Bighappy (快乐大大大), Board: Statistics
Subject: Help generating a SAS variable
Posted at: BBS 未名空间站 (Wed Dec 20 15:31:31 2006)
I currently have a dataset as follows:
to be connected. All dots will be connected. What matters is where the line ends.
It's the _expected_ value that is important (in a chi-square test). Another good reference is Ian Campbell (http://www.iancampbell.co.uk/), who has researched the history (30-odd tests), but it can be summarised as:
(1) Where all expected numbers are at least 1, analyse by the 'N - 1' chi-squared test (the K. Pearson chi-squared test but with N replaced by N - 1).
(2) Otherwise, analyse by the Fisher-Irwin test, with two-sided tests carried out by Irwin's rule (taking tables from either tail as likely, or less, as that observed).
There is an online calculator for the 'N-1' chi-squared test.
I think that's a bit more explicit!
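The two-step rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it handles a 2x2 table (as far as I know, that is the case Campbell's recommendation addresses), and the function names and the example counts are made up for this sketch, not taken from the post or from the online calculator.

```python
from math import comb, erfc, sqrt

def expected_counts(a, b, c, d):
    """Expected cell counts of the 2x2 table [[a, b], [c, d]] under independence."""
    n = a + b + c + d
    r1, r2, c1, c2 = a + b, c + d, a + c, b + d
    return [r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n]

def n_minus_1_chi2(a, b, c, d):
    """Pearson chi-squared with N replaced by N - 1; returns (statistic, p-value)."""
    n = a + b + c + d
    r1, r2, c1, c2 = a + b, c + d, a + c, b + d
    stat = (n - 1) * (a * d - b * c) ** 2 / (r1 * r2 * c1 * c2)
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p = erfc(sqrt(stat / 2))
    return stat, p

def fisher_irwin_p(a, b, c, d):
    """Two-sided Fisher-Irwin p-value by Irwin's rule: sum the probabilities of
    all tables (same margins) that are as likely as, or less likely than,
    the observed one."""
    r1, r2, c1 = a + b, c + d, a + c
    total = comb(r1 + r2, c1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(r1, x) * comb(r2, c1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # The small relative tolerance guards against floating-point ties.
    tail = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, tail)

def analyse(a, b, c, d):
    """Pick the test by the smallest expected count, as in rules (1) and (2)."""
    if min(expected_counts(a, b, c, d)) >= 1:
        return "N-1 chi-squared", n_minus_1_chi2(a, b, c, d)[1]
    return "Fisher-Irwin", fisher_irwin_p(a, b, c, d)

# Hypothetical counts, for illustration only.
print(analyse(10, 20, 30, 40))
print(analyse(2, 0, 0, 1))
```

Note that the choice between the two tests depends only on the expected counts, never on the observed ones, which is the point made above.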
Consider two studies that look at the relationship between
smoking and number of colds in 2004.
i) The first gives a questionnaire to n=150 people and asks them:
How much do you smoke?
a. not at all
b. a pack or less of cigarettes per day
c. more than a pack of cigarettes per day
How many colds did you have last year?
a. none
b. 1
c. 2
d. 3 or more
The 150 people were then put in a 3 by 4 contingency table:
                  # of colds in 2004
             |  0  |  1  |  2  | >=3 |
             --------------------------
No cigs      | n11 | n12 | n13 | n14 | n1+
             |------------------------|
1 pack/day   | n21 | n22 | n23 | n24 | n2+
             |------------------------|
>1 pack/day  | n31 | n32 | n33 | n34 | n3+
             |------------------------|
               n+1   n+2   n+3   n+4   n=150
This is a single multinomial situation with 12
cells and therefore 11 free parameters
pi11, pi12, ..., pi34. It looks like 12 parameters
but pi11+pi12+ ...+ pi34=1. So because of this constraint,
there are only 11. Note that n1+, n2+, n3+, n+1, etc. are random.
This is a survey or cross-sectional study. It might be called
retrospective in the sense that they were asked to report on
the previous year even though the survey is taken at one point
in time. It is observational.
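To make the single-multinomial picture concrete, here is a small simulation sketch. The cell probabilities (uniform, 1/12 each), the seed, and the function names are assumptions chosen for illustration; the post gives no actual data.

```python
import random

def free_parameters_single(r, c):
    """One multinomial over r*c cells: probabilities sum to 1, so r*c - 1 are free."""
    return r * c - 1

def sample_single_multinomial(n, cell_probs, rng):
    """Draw n people and drop each into one of the r x c cells.

    cell_probs is an r x c nested list of joint probabilities pi_ij.
    """
    r, c = len(cell_probs), len(cell_probs[0])
    cells = [(i, j) for i in range(r) for j in range(c)]
    weights = [cell_probs[i][j] for i, j in cells]
    table = [[0] * c for _ in range(r)]
    for i, j in rng.choices(cells, weights=weights, k=n):
        table[i][j] += 1
    return table

rng = random.Random(0)                      # arbitrary seed
probs = [[1 / 12] * 4 for _ in range(3)]    # hypothetical pi_ij
table = sample_single_multinomial(150, probs, rng)
row_totals = [sum(row) for row in table]

print("free parameters:", free_parameters_single(3, 4))
print("row totals (random from run to run):", row_totals)
print("grand total (fixed by design):", sum(row_totals))
```

Only the grand total n=150 is fixed by the design; rerunning with a different seed changes every row and column total.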
ii) The second study interviews people at the beginning of 2004
and chooses 50 nonsmokers, 50 smokers of less than one pack a day,
and 50 smokers of more than one pack a day. They are asked to keep
a diary of the colds they get during 2004. At the end of 2004,
they are asked to give the number of colds they had. This
data is put into a contingency table that looks pretty much
the same as for the first study:
                  # of colds in 2004
             |  0  |  1  |  2  | >=3 |
             --------------------------
No cigs      | n11 | n12 | n13 | n14 | n1+=50
             |------------------------|
1 pack/day   | n21 | n22 | n23 | n24 | n2+=50
             |------------------------|
>1 pack/day  | n31 | n32 | n33 | n34 | n3+=50
             |------------------------|
               n+1   n+2   n+3   n+4   n=150
The main difference is that the row totals are fixed at 50 each.
Also, the rows are independent multinomials, with 3 x 3 = 9 free
parameters in total rather than 11:
mult(n=50;pi1|1, pi2|1, pi3|1, pi4|1) 3 free parameters
mult(n=50;pi1|2, pi2|2, pi3|2, pi4|2) 3 free parameters
mult(n=50;pi1|3, pi2|3, pi3|3, pi4|3) 3 free parameters
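The three independent row multinomials above can be simulated the same way, one row at a time. Again this is just a sketch: the conditional probabilities pi_j|i, the seed, and the function names are hypothetical values picked for illustration.

```python
import random

def free_parameters_product(r, c):
    """r independent multinomials with c categories each: c - 1 free per row."""
    return r * (c - 1)

def sample_product_multinomial(row_sizes, row_probs, rng):
    """Sample each row separately, with its total fixed in advance.

    row_probs[i] holds the conditional probabilities pi_j|i for row i.
    """
    table = []
    for n_i, probs in zip(row_sizes, row_probs):
        counts = [0] * len(probs)
        for j in rng.choices(range(len(probs)), weights=probs, k=n_i):
            counts[j] += 1
        table.append(counts)
    return table

rng = random.Random(0)  # arbitrary seed
# Hypothetical conditional distributions of colds given smoking level.
row_probs = [
    [0.4, 0.3, 0.2, 0.1],   # No cigs
    [0.3, 0.3, 0.2, 0.2],   # 1 pack/day
    [0.2, 0.3, 0.2, 0.3],   # >1 pack/day
]
table = sample_product_multinomial([50, 50, 50], row_probs, rng)

print("free parameters:", free_parameters_product(3, 4))
print("row totals (fixed by design):", [sum(row) for row in table])
print("column totals (still random):", [sum(col) for col in zip(*table)])
```

Here the row totals are 50, 50, 50 on every run, while the column totals still vary.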
This is a cohort study. It is prospective. It is observational.
You really can't do a clinical trial on colds and smoking unless
you could actually force people to smoke or not smoke. Only bad
guys can carry out such clinical trials.
A case-control study wouldn't make sense here either. With a
case-control study, you typically are interested in a rare event like
cancer or heart attack. So you could get a group of people who
had lung cancer in 2004 and find out their smoking habits. Then
get a group of people without lung cancer (the controls) but similar
in other ways to the cases, and ask about their smoking behavior.
That would result in a table like:
             |  Lung  |  No   |
             | Cancer | L.Ca. |
             ------------------
No cigs      |  n11   |  n12  | n1+
             |----------------|
1 pack/day   |  n21   |  n22  | n2+
             |----------------|
>1 pack/day  |  n31   |  n32  | n3+
             |----------------|
                100      100    n=200
Notice that we now have two independent multinomial columns, with the
column totals fixed at 100 each. We could use "local" odds ratios:
No cigs vs. 1 pack/day, and No cigs vs. >1 pack/day.
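A short sketch of those two odds ratios, each comparing a smoking level against the "No cigs" row. The counts below are invented so that each column sums to 100, and the function name is hypothetical. The odds ratio is the natural measure here because it comes out the same whether the rows or the columns were fixed by the design.

```python
def odds_ratio_vs_baseline(table, row, baseline=0):
    """Odds ratio of being a case for `row` relative to the `baseline` row.

    Each row of `table` is [cases, controls].
    """
    cases, controls = table[row]
    base_cases, base_controls = table[baseline]
    return (cases * base_controls) / (controls * base_cases)

# Hypothetical counts: [lung cancer, no lung cancer]; columns sum to 100 each.
table = [
    [20, 50],   # No cigs
    [30, 30],   # 1 pack/day
    [50, 20],   # >1 pack/day
]

print("1 pack/day vs. No cigs: ", odds_ratio_vs_baseline(table, 1))
print(">1 pack/day vs. No cigs:", odds_ratio_vs_baseline(table, 2))
```

With real data a zero cell would make the ratio undefined, so some adjustment would be needed before dividing; the sketch does not handle that.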