Wednesday, January 25, 2006

categorical data analysis

Consider two studies that look at the relationship between
smoking and the number of colds in 2004.

i) The first gives a questionnaire to n=150 people and asks them:

How much do you smoke?
a. not at all
b. a pack or less of cigarettes per day
c. more than a pack of cigarettes per day

How many colds did you have last year?
a. none
b. 1
c. 2
d. 3 or more

The 150 people were then put in a 3 by 4 contingency table:

                 # of colds in 2004
            |  0  |  1  |  2  | >=3 |
            -------------------------
No cigs     | n11 | n12 | n13 | n14 | n1+
            |-----------------------|
1 pack/day  | n21 | n22 | n23 | n24 | n2+
            |-----------------------|
>1 pack/day | n31 | n32 | n33 | n34 | n3+
            |-----------------------|
              n+1   n+2   n+3   n+4   n=150

This is a single multinomial situation with 12 cells and
12 cell probabilities pi11, pi12, ..., pi34. Because of the
constraint pi11+pi12+ ...+ pi34=1, only 11 of them are free
parameters. Note that the margins n1+, n2+, n3+, n+1, etc. are
random; only the overall total n=150 is fixed.
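The single-multinomial setup can be sketched in a few lines of Python. This is only an illustration: the equal cell probabilities (1/12 each), the seed, and the function name simulate_cross_sectional are assumptions made for the example, not values from either study.

```python
import random


def simulate_cross_sectional(probs, n=150, seed=0):
    """Draw n people from ONE multinomial over all cells of the table."""
    rng = random.Random(seed)
    rows, cols = len(probs), len(probs[0])
    cells = [(i, j) for i in range(rows) for j in range(cols)]
    weights = [probs[i][j] for i, j in cells]
    table = [[0] * cols for _ in range(rows)]
    # Each person lands in exactly one (smoking, colds) cell.
    for i, j in rng.choices(cells, weights=weights, k=n):
        table[i][j] += 1
    return table


# Hypothetical cell probabilities: all 12 cells equally likely.
probs = [[1 / 12] * 4 for _ in range(3)]

table = simulate_cross_sectional(probs)
row_totals = [sum(row) for row in table]        # random, not fixed in advance
col_totals = [sum(col) for col in zip(*table)]  # also random
free_parameters = 3 * 4 - 1                     # 12 cells minus 1 constraint
```

Rerunning with a different seed changes row_totals and col_totals, but their sum is always n=150.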

This is a survey or cross-sectional study. It might be called
retrospective in the sense that they were asked to report on
the previous year even though the survey is taken at one point
in time. It is observational.


ii) The second study interviews people at the beginning of 2004
and chooses 50 nonsmokers, 50 less-than-one-pack-a-day smokers,
and 50 more-than-one-pack-a-day smokers. They are asked to keep
a diary of the colds they get during 2004. At the end of 2004,
they are asked to give the number of colds they had. These
data are put into a contingency table that looks pretty much
the same as for the first study:

                 # of colds in 2004
            |  0  |  1  |  2  | >=3 |
            -------------------------
No cigs     | n11 | n12 | n13 | n14 | n1+=50
            |-----------------------|
1 pack/day  | n21 | n22 | n23 | n24 | n2+=50
            |-----------------------|
>1 pack/day | n31 | n32 | n33 | n34 | n3+=50
            |-----------------------|
              n+1   n+2   n+3   n+4   n=150

The main difference is that the row totals are fixed at 50 each.
Also, the rows are independent multinomials, one per smoking group,
where pij|i is the probability of cold category j given row i:

mult(n=50;pi1|1, pi2|1, pi3|1, pi4|1) 3 free parameters
mult(n=50;pi1|2, pi2|2, pi3|2, pi4|2) 3 free parameters
mult(n=50;pi1|3, pi2|3, pi3|3, pi4|3) 3 free parameters
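As a rough sketch of the difference, the same kind of simulation can be done one row at a time. Again, the conditional probabilities (0.25 in every column), the seed, and the name simulate_cohort are assumptions chosen only for illustration.

```python
import random


def simulate_cohort(row_probs, n_per_row=50, seed=0):
    """Draw each row from its OWN multinomial with a fixed row total."""
    rng = random.Random(seed)
    table = []
    for probs in row_probs:
        counts = [0] * len(probs)
        # n_per_row people in this smoking group, each gets one cold category.
        for j in rng.choices(range(len(probs)), weights=probs, k=n_per_row):
            counts[j] += 1
        table.append(counts)
    return table


# Hypothetical conditional probabilities pi_j|i, one list per smoking group.
row_probs = [
    [0.25, 0.25, 0.25, 0.25],  # No cigs
    [0.25, 0.25, 0.25, 0.25],  # 1 pack/day
    [0.25, 0.25, 0.25, 0.25],  # >1 pack/day
]

table = simulate_cohort(row_probs)
row_totals = [sum(row) for row in table]        # fixed by design
col_totals = [sum(col) for col in zip(*table)]  # still random
free_parameters = 3 * (4 - 1)                   # 3 free per row, 3 rows
```

Here row_totals is [50, 50, 50] for every seed, and the three rows together carry 9 free parameters rather than the 11 of the single multinomial.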

This is a cohort study. It is prospective. It is observational.

You really can't do a clinical trial on colds and smoking unless
you could actually force people to smoke or not smoke. Only bad
guys can carry out such clinical trials.

A case-control study wouldn't make sense here either. With a
case-control study, you typically are interested in a rare event like
cancer or heart attack. So you could get a group of people who
had lung cancer in 2004 (the cases) and find out their smoking habits.
Then get a group of people without lung cancer (the controls) but similar
in other ways to the cases, and ask about their smoking behavior.
That would result in a table like

              Lung    No
            |Cancer| L.Ca.|
            ---------------
No cigs     | n11  | n12  | n1+
            |-------------|
1 pack/day  | n21  | n22  | n2+
            |-------------|
>1 pack/day | n31  | n32  | n3+
            |-------------|
              100    100    n=200

Notice that we now have two independent multinomial columns, with the
column totals fixed at 100 each and the row totals random. We could
use "local odds" ratios, No cigs vs. 1 pack/day, and No cigs vs.
>1 pack/day.
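A minimal sketch of those two odds ratios, each comparing a smoking row with the No cigs row. The counts below are made up for illustration (they only respect the column totals of 100) and the helper name odds_ratio is hypothetical.

```python
def odds_ratio(table, row, baseline=0):
    """Odds ratio of lung cancer for `row` relative to the `baseline` row.

    Each row of `table` is [cases, controls].
    """
    a, b = table[row]       # cases, controls in the row of interest
    c, d = table[baseline]  # cases, controls in the baseline row
    return (a * d) / (b * c)


# Hypothetical counts, NOT real data: columns are [lung cancer, no lung cancer]
# and each column sums to 100, as in the design above.
table = [
    [20, 50],  # No cigs
    [30, 30],  # 1 pack/day
    [50, 20],  # >1 pack/day
]

or_1pack = odds_ratio(table, 1)   # 1 pack/day vs. No cigs
or_gt1pack = odds_ratio(table, 2) # >1 pack/day vs. No cigs
print(or_1pack, or_gt1pack)       # prints 2.5 6.25
```

With these invented counts, the odds of lung cancer are 2.5 times as high for 1 pack/day and 6.25 times as high for >1 pack/day as for No cigs.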
