Thursday, May 04, 2023

c index implementations

 copied from here

Assessment of Discrimination in Survival Analysis (C-statistics, etc)

References

Discrimination

  • Differentiating those who will have events and those who will not have events.
  • AUC is an established method for logistic regression (higher probability for cases than for non-cases).
  • In survival data, the concept of AUC can be defined in different ways.

Different C-statistics and related measures implemented in R

See individual examples below for links to the original papers.

  • Summary measures
  • Logistic AUC ignoring time-to-event data
  • Harrell’s C or concordance (Hmisc::rcorrcens or survival::survConcordance),
  • C-statistic by Begg et al. (survAUC::BeggC)
  • C-statistic by Uno et al. (survC1::Inf.Cval; survAUC::UnoC)
  • Gonen and Heller Concordance Index for Cox models (survAUC::GHCI, CPE::phcpe, clinfun::coxphCPE)
  • Integrated AUC (survAUC::IntAUC for AUC.cd, AUC.hc, AUC.sh, AUC.uno)
  • \( R^2 \)-type coefficents (survAUC::OXS, Nagelk, XO)
  • IDI, NRI, and median improvement (survIDINRI::IDI.INF)
  • .
  • Time-dependent measures
  • Cumulative case/dynamic control ROC/AUC by Heagerty et al. 2000 (survivalROC::svivalROC)
  • Incident case/dynamic control ROC/AUC by Heagerty et al. 2005 (risksetROC::risksetROC)
  • Cumulative case/dynamic control AUC by Chambless and Diao 2006 (survAUC::AUC.cd)
  • Cumulative case/dynamic control AUC by Hung and Chiang 2010 (survAUC::AUC.hc)
  • Incident case or Cumulative case/dynamic control AUC by Song and Zhou 2008 (survAUC::AUC.sh)
  • Cumulative case/dynamic control AUC by Uno et al. 2007 (survAUC::AUC.uno)

For the explanation of the difference between cumulative/dynamic AUCs vs incident/dynamic AUCs, see the paper by Heargery et al ( http://www.statmed.medicina.unimib.it/statisticalps2011/materiale/Heagerty%20and%20Zheng,%20Biometrics%202005.pdf ).

As I understand it,

  • Cumulative/dynamic AUCs assess discrimination of t-year cumulative incidence (risk),
  • Incident/dynamic AUCs discrimination of incidence (not cumulative incidence) at t years among those who survived until that point (risk set).

Thus, the idea behind incident/dynamic AUCs is closer to the idea of hazard (dynamically changing instantaneous incidence at a given time), and it can handle time-varying predictors. But the question answered by cumulative/dynamic AUCs may be more clinically relevant, e.g., does this model discriminate if I will survive next five years .

CRAN Task View: Survival Analysis: Predictions and Prediction Performance

  • http://cran.r-project.org/web/views/Survival.html
  • The pec package provides utilities to plot prediction error curves for several survival models
  • peperr implements prediction error techniques which can be computed in a parallelised way. Useful for high-dimensional data.
  • survivalROC computes time-dependent ROC curves and time-dependent AUC from censored data using Kaplan-Meier or Akritas’s nearest neighbour estimation method (Cumulative sensitivity and dynamic specificity).
  • risksetROC implements time-dependent ROC curves, AUC and integrated AUC of Heagerty and Zheng (Biometrics, 2005).
  • Various time-dependent true/false positive rates and Cumulative/Dynamic AUC are implemented in the survAUC package.
  • The survcomp package (Bioconductor) provides several functions to assess and compare the performance of survival models.
  • C-statistics for risk prediction models with censored survival data can be computed via the survC1 package.
  • The survIDINRI package implements the integrated discrimination improvement index and the category-less net reclassification index for comparing competing risks prediction models.

PBC dataset

       age:       in years
       albumin:   serum albumin (g/dl)
       alk.phos:  alkaline phosphotase (U/liter)
       ascites:   presence of ascites
       ast:       aspartate aminotransferase, once called SGOT (U/ml)
       bili:      serum bilirunbin (mg/dl)
       chol:      serum cholesterol (mg/dl)
       copper:    urine copper (ug/day)
       edema:     0 no edema, 0.5 untreated or successfully treated
                  1 edema despite diuretic therapy
       hepato:    presence of hepatomegaly or enlarged liver
       id:        case number
       platelet:  platelet count
       protime:   standardised blood clotting time
       sex:       m/f
       spiders:   blood vessel malformations in the skin
       stage:     histologic stage of disease (needs biopsy)
       status:    status at endpoint, 0/1/2 for censored, transplant, dead
       time:      number of days between registration and the earlier of death,
                  transplantion, or study analysis in July, 1986
       trt:       1/2/NA for D-penicillmain, placebo, not randomised
       trig:      triglycerides (mg/dl)

Load the PBC dataset and modify for later use

library(survival)
data(pbc)

pbc <- within(pbc, {
    ## transplant (1) and death (2) are considered events, and marked 1
    event <- as.numeric(status %in% c(1,2))

    ## Create a survival vector
    Surv <- Surv(time, event)
})

Kaplan-Meier plots

par(mar = c(2,2,2,2))
layout(matrix(1:4, byrow = T, ncol = 2))
library(rms)
survplot(survfit(Surv ~ 1, data = pbc))
survplot(survfit(Surv ~ age >= median(age), data = pbc),         label.curves = list(method = "arrow", cex = 1.2))
survplot(survfit(Surv ~ sex, data = pbc),                        label.curves = list(method = "arrow", cex = 1.2))
survplot(survfit(Surv ~ albumin >= median(albumin), data = pbc), label.curves = list(method = "arrow", cex = 1.2))

plot of chunk unnamed-chunk-3

AUC by logistic regression models

Completely ignore the time variable

It is the simplest method. Logistic regression is used instead of Cox regression model.

Completely ignore the time variable and use the outcome variable as a binary outcome variable. It does not take into acount the variable length of follow-up.

## Load epicalc package to calcuate AUC
library(epicalc)

## Model with age and sex
logit.age.sex <- glm(event ~ age + sex, data = pbc, family = binomial)
lroc(logit.age.sex, graph = F)$auc
[1] 0.5879
## Model with age, sex, and albumin
logit.age.sex.albumin <- glm(event ~ age + sex + albumin, data = pbc, family = binomial)
lroc(logit.age.sex.albumin, graph = F)$auc
[1] 0.6751

Cut the follow up at a specifict time point

Only events that occured within two years are considered events and others are treated as non-events. This method can be valid if the specified time is short enough so that there are few censored subjects. Most people have complete follow-up in this situation.

## Create a variable indicating 2-year event
pbc <- within(pbc, {
    outcome2yr <- NA
    outcome2yr[(event == 1) & (time <= 2 * 365)] <- 1 # event+ within two years
    outcome2yr[(event == 0) | (time  > 2 * 365)] <- 0 # otherwise
})

## 2-year outcome model with age and sex
logit.age.sex <- glm(outcome2yr ~ age + sex, data = pbc, family = binomial)
lroc(logit.age.sex, graph = F)$auc
[1] 0.6484
## 2-year outcome model with age, sex, and albumin
logit.age.sex.albumin <- glm(outcome2yr ~ age + sex + albumin, data = pbc, family = binomial)
lroc(logit.age.sex.albumin, graph = F)$auc
[1] 0.7918

Fit Cox regression models for later use

The linear predictors (lp), when exponetiated, will provide the predicted hazard ratios for individuals. Thus these can be used as the summary predictors calculated from multiple raw predictors.

## Null model
coxph.null             <- coxph(Surv ~ 1, data = pbc)
coxph.null
Call:  coxph(formula = Surv ~ 1, data = pbc)

Null model
  log likelihood= -1009 
  n= 418 
## Model with age and sex
coxph.age.sex          <- coxph(Surv ~ age + sex, data = pbc)
coxph.age.sex
Call:
coxph(formula = Surv ~ age + sex, data = pbc)

        coef exp(coef) se(coef)     z      p
age   0.0221     1.022  0.00727  3.04 0.0024
sexf -0.3000     0.741  0.20975 -1.43 0.1500

Likelihood ratio test=12  on 2 df, p=0.0025  n= 418, number of events= 186 
## Model with age, sex, and albumin
coxph.age.sex.albumin  <- coxph(Surv ~ age + sex + albumin, data = pbc)
coxph.age.sex.albumin
Call:
coxph(formula = Surv ~ age + sex + albumin, data = pbc)

           coef exp(coef) se(coef)     z       p
age      0.0148     1.015  0.00749  1.97 4.9e-02
sexf    -0.3467     0.707  0.21079 -1.64 1.0e-01
albumin -1.4066     0.245  0.17096 -8.23 2.2e-16

Likelihood ratio test=73.9  on 3 df, p=6.66e-16  n= 418, number of events= 186 
## These models are significantly different by likelihood ratio test
anova(coxph.age.sex, coxph.age.sex.albumin, test = "LRT")
Analysis of Deviance Table
 Cox model: response is  Surv
 Model 1: ~ age + sex
 Model 2: ~ age + sex + albumin
  loglik Chisq Df P(>|Chi|)    
1  -1003                       
2   -972  61.9  1   3.7e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## Put linear predictors ("lp") into pbc dataset
pbc$lp.null            <- predict(coxph.null, type = "lp")
pbc$lp.age.sex         <- predict(coxph.age.sex, type = "lp")
pbc$lp.age.sex.albumin <- predict(coxph.age.sex.albumin, type = "lp")

Harrell’s C or concordance

Trying to compare (test) the difference in C is not recommended.

http://stats.stackexchange.com/questions/17480/how-to-do-roc-analysis-in-r-with-a-cox-model/17517#17517

Hmisc::rcorrcens

A larger marker value is considered to be associated with a longer survival by this function. Thus, the linear predictor (the higher, the worse) needs to be negated

library(Hmisc)

## Model with age and sex
rcorrcens(formula = Surv ~ I(-1 * lp.age.sex), data = pbc)

Somers' Rank Correlation for Censored Data    Response variable:Surv

                       C   Dxy  aDxy    SD    Z      P   n
I(-1 * lp.age.sex) 0.569 0.138 0.138 0.043 3.18 0.0015 418
## Model with age, sex, and albumin
rcorrcens(formula = Surv ~ I(-1 * lp.age.sex.albumin), data = pbc)

Somers' Rank Correlation for Censored Data    Response variable:Surv

                               C   Dxy  aDxy    SD    Z P   n
I(-1 * lp.age.sex.albumin) 0.703 0.407 0.407 0.041 9.86 0 418

survival package

Actually, the summary method for coxph objects prints “Concordance” (five lines from bottom), which is the same thing as the Harrells’C, and \( R^2 \).

## Model with age and sex
summary(coxph.age.sex)
Call:
coxph(formula = Surv ~ age + sex, data = pbc)

  n= 418, number of events= 186 

         coef exp(coef) se(coef)     z Pr(>|z|)   
age   0.02210   1.02234  0.00727  3.04   0.0024 **
sexf -0.29995   0.74085  0.20975 -1.43   0.1527   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

     exp(coef) exp(-coef) lower .95 upper .95
age      1.022      0.978     1.008      1.04
sexf     0.741      1.350     0.491      1.12

Concordance= 0.569  (se = 0.023 )
Rsquare= 0.028   (max possible= 0.992 )
Likelihood ratio test= 12  on 2 df,   p=0.0025
Wald test            = 12.2  on 2 df,   p=0.0022
Score (logrank) test = 12.3  on 2 df,   p=0.00214
## Model with age, sex, and albumin
summary(coxph.age.sex.albumin)
Call:
coxph(formula = Surv ~ age + sex + albumin, data = pbc)

  n= 418, number of events= 186 

            coef exp(coef) se(coef)     z Pr(>|z|)    
age      0.01476   1.01487  0.00749  1.97    0.049 *  
sexf    -0.34672   0.70700  0.21079 -1.64    0.100 .  
albumin -1.40664   0.24496  0.17096 -8.23  2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

        exp(coef) exp(-coef) lower .95 upper .95
age         1.015      0.985     1.000     1.030
sexf        0.707      1.414     0.468     1.069
albumin     0.245      4.082     0.175     0.342

Concordance= 0.703  (se = 0.023 )
Rsquare= 0.162   (max possible= 0.992 )
Likelihood ratio test= 73.9  on 3 df,   p=6.66e-16
Wald test            = 83.1  on 3 df,   p=0
Score (logrank) test = 82.2  on 3 df,   p=0

Gonen and Heller Concordance Index for Cox models

  • Gonen M, et al. Concordance probability and discriminatory power in proportional hazards regression. Biometrika 2005;92:965-970.

CPE package

library(CPE)
## Model with age and sex
phcpe(coxph.age.sex, CPE.SE = TRUE)
$CPE
[1] 0.5724

$CPE.SE
[1] 0.02023
## Model with age, sex, and albumin
phcpe(coxph.age.sex.albumin, CPE.SE = TRUE)
$CPE
[1] 0.6615

$CPE.SE
[1] 0.01531

clinfun package

library(clinfun)
## Model with age and sex
coxphCPE(coxph.age.sex)
       CPE smooth.CPE     se.CPE 
   0.57244    0.57236    0.02023 
## Model with age, sex, and albumin
coxphCPE(coxph.age.sex.albumin)
       CPE smooth.CPE     se.CPE 
   0.66150    0.66128    0.01531 

Cumulative case/dynamic control ROC/AUC using survivalROC package (Heagerty, 2000)

library(survivalROC)

## Define a function
fun.survivalROC <- function(lp, t) {
    res <- with(pbc,
                survivalROC(Stime        = time,
                            status       = event,
                            marker       = get(lp),
                            predict.time = t,
                            method       = "KM"))       # KM method without smoothing

    ## Plot ROCs
    with(res, plot(TP ~ FP, type = "l", main = sprintf("t = %.0f, AUC = %.2f", t, AUC)))
    abline(a = 0, b = 1, lty = 2)

    res
}

## 2 x 5 layout
layout(matrix(1:10, byrow = T, ncol = 5))

## Model with age and sex
res.survivalROC.age.sex <- lapply(1:10 * 365.25, function(t) {
    fun.survivalROC(lp = "lp.age.sex", t)
})

plot of chunk unnamed-chunk-11


## Model with age, sex, and albumin
res.survivalROC.age.sex.albumin <- lapply(1:10 * 365.25, function(t) {
    fun.survivalROC(lp = "lp.age.sex.albumin", t)
})

plot of chunk unnamed-chunk-11

Incident case/dynamic control ROC/AUC using risksetROC package (Heagerty, 2005)

  • Heagerty PJ, et al., Survival model predictive accuracy and ROC curves., Biometrics. 2005 Mar;61(1):92-105.
  • http://www.ncbi.nlm.nih.gov/pubmed/15737082
  • http://faculty.washington.edu/heagerty/Software/SurvROC/

  • risksetAUC(): This function creates risksetAUC from a survival data set.

  • risksetROC(): This function creates risksetROC from a survival data set.

  • This package calculates the incidence-based time-dependent ROC among the risk set (subpopulation) at time t.

  • Cases are those who died at time t (incident cases).

  • Controls are those who survived until time t (dynamic controls)

  • Use of incident cases rather than cumulative cases allows for assessment of time-dependent predictors.

1- to 10-year ROCs

library(risksetROC)

## Define a function
fun.risksetROC <- function(lp, t) {
    res <- with(pbc,
                risksetROC(Stime        = time,
                           status       = event,
                           marker       = get(lp),
                           predict.time = t,
                           plot         = FALSE))

    ## Plot ROCs
    with(res, plot(TP ~ FP, type = "l", main = sprintf("t = %.0f, AUC = %.2f", t, AUC)))
    abline(a = 0, b = 1, lty = 2)

    res
}

## 2 x 5 layout
layout(matrix(1:10, byrow = T, ncol = 5))

## Model with age and sex
risksetROC.age.sex <- lapply(365.25 * 1:10, function(t) {
    fun.risksetROC(lp = "lp.age.sex", t)
})

plot of chunk unnamed-chunk-12


## Model with age, sex, and albumin
risksetROC.age.sex.albumin <- lapply(365.25 * 1:10, function(t) {
    fun.risksetROC(lp = "lp.age.sex.albumin", t)
})

plot of chunk unnamed-chunk-12

Up to 10-year AUCs

## 1 x 2 layout
layout(matrix(1:2, byrow = T, ncol = 2))

## Model with age and sex
risksetAUC.age.sex <- with(pbc,
                           risksetAUC(Stime        = time,
                                      status       = event,
                                      marker       = lp.age.sex,
                                      tmax         = 10 * 365.25))
title(sprintf("t = %.0f, iAUC = %.2f", 10 * 365.25, risksetAUC.age.sex$Cindex))

## Model with age, sex, and albumin
risksetAUC.age.sex.albumin <- with(pbc,
                           risksetAUC(Stime        = time,
                                      status       = event,
                                      marker       = lp.age.sex.albumin,
                                      tmax         = 10 * 365.25))
title(sprintf("t = %.0f, iAUC = %.2f", 10 * 365.25, risksetAUC.age.sex.albumin$Cindex))

plot of chunk unnamed-chunk-13

Various methods provided survAUC package (Potapov et al)

These can calculate multiple time-dependent ROC at once, and also compute summary measures of a time-dependent AUC curve (iAUC)

These need a training dataset and a test dataset. The same data can be given to both, and it works although I am not sure if this is correct.

Functions

  • AUC.cd(): AUC estimator proposed by Chambless and Diao
  • AUC.hc(): AUC estimator proposed by Hung and Chiang
  • AUC.sh(): AUC estimator proposed by Song and Zhou
  • AUC.uno(): AUC estimator proposed by Uno et al.
  • BeggC(): C-statistic by Begg et al.
  • GHCI(): Gonen and Heller’s Concordance Index for Cox models
  • IntAUC(): Integration of time-dependent AUC curves
  • OXS(): R2-type coefficients for Cox proportional hazards models
  • plot.survAUC(): Plot method for survAUC and survErr Objects
  • predErr(): Distance-based estimators of survival predictive accuracy
  • schemper(): Distance-based estimator of survival predictive accuracy proposed by Schemper and Henderson
  • UnoC() C-statistic by Uno et al.

Time-dependent AUCs for the age sex model are calculated by various methods. (1- to 10-year AUCs)

library(survAUC)

## *Cumulative case*/dynamic control AUC by Chambless and Diao (Stat Med 2006;25:3474-3486.)
res.AUC.cd <- AUC.cd(Surv.rsp     = pbc$Surv,
                     Surv.rsp.new = pbc$Surv,
                     lp           = coxph.age.sex$linear.predictors,
                     lpnew        = coxph.age.sex$linear.predictors,
                     times        = 1:10 * 365.25)
res.AUC.cd$iauc
[1] 0.5895
plot(res.AUC.cd)

plot of chunk unnamed-chunk-14


## *Cumulative case*/dynamic control AUC by Hung and Chiang (Can J Stat 2010;38:8-26)
res.AUC.hc <- AUC.hc(Surv.rsp     = pbc$Surv,
                     Surv.rsp.new = pbc$Surv,
                     ## lp           = coxph.age.sex$linear.predictors,
                     lpnew        = coxph.age.sex$linear.predictors,
                     times        = 1:10 * 365.25)
res.AUC.hc$iauc
[1] 0.6005
plot(res.AUC.hc)

plot of chunk unnamed-chunk-14


## *Incident case* or *Cumulative case*/dynamic control AUC by Song and Zhou (Biometrics 2011;67:906-16)
res.AUC.sh <- AUC.sh(Surv.rsp     = pbc$Surv,
                     Surv.rsp.new = pbc$Surv,
                     lp           = coxph.age.sex$linear.predictors,
                     lpnew        = coxph.age.sex$linear.predictors,
                     times        = 1:10 * 365.25)
res.AUC.sh$iauc
[1] 0.573
plot(res.AUC.sh)

plot of chunk unnamed-chunk-14


## *Cumulative case*/dynamic control AUC by Uno et al.
## (http://biostats.bepress.com/cgi/viewcontent.cgi?article=1041&context=harvardbiostat)
res.AUC.uno <- AUC.uno(Surv.rsp     = pbc$Surv,
                       Surv.rsp.new = pbc$Surv,
                       ## lp           = coxph.age.sex$linear.predictors,
                       lpnew        = coxph.age.sex$linear.predictors,
                       times        = 1:10 * 365.25)
res.AUC.uno$iauc
[1] 0.5811
plot(res.AUC.uno)

plot of chunk unnamed-chunk-14

Summary measures (10 years when applicable)

## C-statistic by Uno et al. (Stat Med 2011;30:1105-1117)
res.UnoC <- UnoC(Surv.rsp     = pbc$Surv,
                 Surv.rsp.new = pbc$Surv,
                 ## lp           = coxph.age.sex$linear.predictors,
                 lpnew        = coxph.age.sex$linear.predictors,
                 time        = 10 * 365.25)       # Upper limit, not time points
res.UnoC
[1] 0.5608

## C-statistic by Begg et al. (Stat Med 2000;19:1997-2014)
res.BeggC <- BeggC(Surv.rsp     = pbc$Surv,
                   Surv.rsp.new = pbc$Surv,
                   lp           = coxph.age.sex$linear.predictors,
                   lpnew        = coxph.age.sex$linear.predictors)
res.BeggC
[1] 0.5519

## Gonen and Heller’s Concordance Index for Cox PH models (Biometrika 2005;92:965-970)
res.GHCI <- GHCI(## Surv.rsp     = pbc$Surv,
                 ## Surv.rsp.new = pbc$Surv,
                 ## lp           = coxph.age.sex$linear.predictors,
                 lpnew        = coxph.age.sex$linear.predictors)
res.GHCI
[1] 0.5717

*R2-type Coefficients for Cox models *

## O'Quigley et al. (Stat Med 2005;24:479-489)
res.OXS <- OXS(Surv.rsp = pbc$Surv,
               lp       = coxph.age.sex$linear.predictors,
               lp0      = coxph.null$linear.predictors)
res.OXS
[1] 0.06092

## Nagelkerke (Biometrika 1991;78:691-692)
res.Nagelk <- Nagelk(Surv.rsp = pbc$Surv,
                     lp       = coxph.age.sex$linear.predictors,
                     lp0      = coxph.null$linear.predictors)
res.Nagelk
[1] 0.02779

## Xu et al. (Journal of Nonparametric Statistics 1999;12:83-107)
res.XO <- XO(Surv.rsp = pbc$Surv,
             lp       = coxph.age.sex$linear.predictors,
             lp0      = coxph.null$linear.predictors)
res.XO
[1] 0.04166

Uno methods for C-statistics and IDI/NRI implemented in survC1 and survIDINRI

## Create numeric variable for sex
pbc$female <- as.numeric(pbc$sex == "f")

C-statistics (10-years follow up) using survC1 package

  • Hajime Uno, Tianxi Cai, Michael J. Pencina, Ralph B. DAgostino, and LJ Wei. On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Statistics in Medicine 2011, 30:1105-16.
library(survC1)
## C-statistic for age sex model
unoC.age.female         <- Est.Cval(mydata = pbc[,c("time","event","age","female")],           tau = 10 * 365.25)
unoC.age.female$Dhat
[1] 0.5618
## C-statistic for age sex albumin model
unoC.age.female.albumin <- Est.Cval(mydata = pbc[,c("time","event","age","female","albumin")], tau = 10 * 365.25)
unoC.age.female.albumin$Dhat
[1] 0.672

Comparison of C-statistics

  mydata: Input data. The 1st column should be time-to-event, and the
          2nd column is event indicator (1=event, 0=censor).
   covs0: A matrix that consists of a set of predictors for a base
          model (Model 0)
   covs1: A matrix that consists of a set of predictors for a new model
          (Model 1)
     tau: Truncation time. The resulting C tells how well the given
          prediction model works in predicting events that occur in the
          time range from 0 to ‘tau’. Note that the survival function
          for the underlying censoring time distribution needs to be
          positive at ‘tau’.
     itr: Iteration of perturbation-resampling.
## Comaprison of C-statistics
uno.C.delta <- Inf.Cval.Delta(mydata = pbc[,c("time","event")],
                              covs0  = pbc[,c("age","female")],           # age sex model
                              covs1  = pbc[,c("age","female","albumin")], # age sex albumin model
                              tau    = 10 * 365.25,                       # Trucation time (max time to consider)
                              itr = 1000)                                 # Iteration of perturbation-resampling
uno.C.delta
          Est      SE Lower95 Upper95
Model1 0.6720 0.02137 0.63009  0.7138
Model0 0.5618 0.02266 0.51736  0.6061
Delta  0.1102 0.02367 0.06384  0.1566

IDI, continous NRI, and median improvement (10-years follow up) using survIDINRI

  • Uno H, Tian L, Cai T, Kohane IS, Wei LJ. A unified inference procedure for a class of measures to assess improvement in risk prediction systems with survival data, Statistics in Medicine 2012. doi:10.1002/sim.5647
library(survIDINRI)
res.IDI.INF <- IDI.INF(indata = pbc[,c("time","event")],
                       covs0 = pbc[,c("age","female")],           # age sex model
                       covs1 = pbc[,c("age","female","albumin")], # age sex albumin model
                       t0 = 10 * 365.25,
                       npert = 300, npert.rand = NULL, seed1 = NULL, alpha = 0.05)

## M1 IDI; M2 continuous NRI; M3 median improvement
IDI.INF.OUT(res.IDI.INF)
    Est. Lower Upper
M1 0.101 0.027 0.162
M2 0.254 0.015 0.453
M3 0.094 0.018 0.180
## M1 red area; M2 distance between black points; M3 distance between gray points
IDI.INF.GRAPH(res.IDI.INF)

plot of chunk unnamed-chunk-20

Sunday, August 28, 2022

estimable contrast

 An old technical document from sas captures the relationship among Type I, II and III sum of squares.

Thursday, January 08, 2015

qq plot interpretation

SAS doc gives the references and a few simple rules, and R community gives more examples around these rules. I think a better way to remember these associations is to understand the mechanism. In this way, wikipedia is doing a better job:
A simple case is where one has two data sets of the same size. In that case, to make the Q–Q plot, one orders each set in increasing order, then pairs off and plots the corresponding values
So it is almost a scatter plot of observed value vs simulated value from a distribution. Still it is not always easy to interpret, especially from a small dataset

Besides, the qqPlot function from r package 'car' includes CIs for observed quantiles by default, or more specialized qqplot for log10(p-values) from pQQ function out of Haplin

Tuesday, November 25, 2014

glimmix models with permutation

Assume a factor coded under a genotypic model as the main effect without interaction
Estimate ‘dominant’  geno  -1 0.5 0.5  ‘recessive’ geno  -0.5 -0.5 1  ‘additive’   geno   0  0.3333333 0.6666667/adj=simulate;

The function may be available in other SAS Proc

Thursday, November 06, 2014

genome build

from here

What is GRCh37?

Friday, October 31, 2014

R Onto-Tools suite iPathwayGuide

From iPathwayGuide, our pathway analysis approach is based on novel Impact Analysis method. This paper, published in Genome Research, explains the underlying analytic method and also demonstrates how Impact Analysis avoids costly false positive results as generated by other methods such as Over-Representation Analysis. The tool is also available from ROntoTools

Tuesday, October 28, 2014

advanced r wiki

From hadley, who is like a god in R.
The discussion of environment is really helpful. I am aware of other excellent resources, but still did not get it from there

Tuesday, October 14, 2014

graphic output for non-pdf file under unix without X11

bitmap('convergence%03d.png', height=11, width=8, res=600);
plot(post[, selectNode2Plot])
dev.off();
Created by Pretty R at inside-R.org

Wednesday, October 01, 2014

sas datasets IO with R

R code lines below are run without SAS installed.

Wednesday, September 17, 2014

consistency property of lasso

This is about the meaning of | as in the "irrepresentable condition"  from On Model Selection Consistency of Lasso
It seems the operation is |x| = abs(sum(x)). I have never seen such a norm. But their Section "3.1 Simulation Example 1" makes me think so: 
... in two settings: (a) β1 = 2, β2 = 3 ; and (b) β1 =  2, β2 = 3. In both settings, X(1) = (X1;X2), X(2) = X3 and through (2), it is easy to get C21inv(C11)= (2/3,2/3). Therefore Strong Irrepresentable Condition fails for setting (a) and holds for setting (b).
Their Proof for Corollary 1 makes me think so:
There is a similar use here.

Sunday, August 17, 2014

oracle

Fan and Li (2001), the SCAD estimator, with appropriate choice of the regularization (tuning) parameter, possesses a sparsity property, i.e., it estimates zero components of the true parameter vector exactly as zero with probability approaching one as sample size increases while still being consistent for the non-zero components...In other words, with appropriate choice of the regularization parameter, the asymptotic distribution of the SCAD estimator based on the overall model and that of the SCAD estimator derived from the most parsimonious correct model coincide. Fan and Li (2001) have dubbed this property the “oracle property”....It is well-known for Hodges’ estimator that the maximal (scaled) mean squared error grows without bound as sample size increases (e.g., Lehmann and Casella (1998), p.442), whereas the standard maximum likelihood estimator has constant finite quadratic risk. In this note we show that a similar unbounded risk result is in fact true for any estimator possessing the sparsity property. This means that there is a substantial price to be paid for sparsity even though the oracle property (misleadingly) seems to suggest otherwise. 
In "Modern statistical estimation via oracle inequalities":
Theorem 4.1. The James–Stein estimate obeys .

In other words, the James–Stein estimator is almost as good as the ideal estimator in a mean-squared error sense.The inequality (4.2) is an oracle inequality. An oracle inequality relates the performance of a real estimator with that of an ideal estimator which relies on perfect information supplied by an oracle, and which is not available in practice.

Tuesday, June 10, 2014

mean, median and mode

this page explains a mean minimizes the $\ell^2$ norm of the residual:$\min_{m_2} \sum_i (m_2-d_i)^2$ 
a median minimizes its $\ell^1$ norm and a mode minimizes the zero norm of the residual, namely $\ell^0=\vert m_0-d_i\vert^0$.See the wikipedia page about median.

from here, it was further explained that
Inder Jeet Taneja’s book draft has a nice survey of the results: if you fix the upper and lower boundary, and maximize entropy, you’ll get the uniform distribution. If you fix the mean and the expected L2 norm (d^2) between the mean and the distribution, maximizing the entropy you’ll get the Gaussian. If you fix the expected L1 norm (|d|) between the mean and the distribution, maximizing the entropy you’ll get the Laplace (also referred to as Double Exponential). Moreover, log(1+d^2) norm will yield the Cauchy distribution – a special case of the standard heavy-tailed Student distribution.

Thursday, June 05, 2014

check points when reviewing a genetic screening report


  1. title and footnote, ensuring it describes the analysis population, the outcome variable and the class of genetic markers; 
  2. eyeball examples:
    • 1 example of x chr snp
    • 1 example of autosomal snp with only 2 genotypes
    • 1 example of top association 
    • 1 example of a random association
  3. use the excel output to check the value ranges for each column, pay attention to
    • extreme values
    • empty cells
    • characters indicating missing: -,NA, 0
  4. Cosmetic issues
    • decimal places
    • check line ends for character cut off

Saturday, December 21, 2013

interaction coding

In sas: we have

Nested Effects


Nested effects are generated in the same manner as crossed effects. Hence, the design columns generated by the following statements are the same (but the ordering of the columns is different):
model y=a b(a);   (B nested within A)
model y=a a*b;   (omitted main effect for B)

Thursday, December 19, 2013

merge

These should be common sense about merge for a sas programmer. I need refresh my knowledge here.
“REPEAT OF BY-VALUES"