Sunday, August 28, 2022

estimable contrast

 An old technical document from sas captures the relationship among Type I, II and III sum of squares.

Thursday, January 08, 2015

qq plot interpretation

SAS doc gives the references and a few simple rules, and R community gives more examples around these rules. I think a better way to remember these associations is to understand the mechanism. In this way, wikipedia is doing a better job:
A simple case is where one has two data sets of the same size. In that case, to make the Q–Q plot, one orders each set in increasing order, then pairs off and plots the corresponding values
So it is almost a scatter plot of observed value vs simulated value from a distribution. Still it is not always easy to interpret, especially from a small dataset

Besides, the qqPlot function from r package 'car' includes CIs for observed quantiles by default, or more specialized qqplot for log10(p-values) from pQQ function out of Haplin

Tuesday, November 25, 2014

glimmix models with permutation

Assume a factor coded under a genotypic model as the main effect without interaction
Estimate ‘dominant’  geno  -1 0.5 0.5  ‘recessive’ geno  -0.5 -0.5 1  ‘additive’   geno   0  0.3333333 0.6666667/adj=simulate;

The function may be available in other SAS Proc

Thursday, November 06, 2014

genome build

from here

What is GRCh37?

Friday, October 31, 2014

R Onto-Tools suite iPathwayGuide

From iPathwayGuide, our pathway analysis approach is based on novel Impact Analysis method. This paper, published in Genome Research, explains the underlying analytic method and also demonstrates how Impact Analysis avoids costly false positive results as generated by other methods such as Over-Representation Analysis. The tool is also available from ROntoTools

Tuesday, October 28, 2014

advanced r wiki

From hadley, who is like a god in R.
The discussion of environment is really helpful. I am aware of other excellent resources, but still did not get it from there

Tuesday, October 14, 2014

graphic output for non-pdf file under unix without X11

bitmap('convergence%03d.png', height=11, width=8, res=600);
plot(post[, selectNode2Plot])
dev.off();
Created by Pretty R at inside-R.org

Wednesday, October 01, 2014

sas datasets IO with R

R code lines below are run without SAS installed.

Wednesday, September 17, 2014

consistency property of lasso

This is about the meaning of | as in the "irrepresentable condition"  from On Model Selection Consistency of Lasso
It seems the operation is |x| = abs(sum(x)). I have never seen such a norm. But their Section "3.1 Simulation Example 1" makes me think so: 
... in two settings: (a) β1 = 2, β2 = 3 ; and (b) β1 =  2, β2 = 3. In both settings, X(1) = (X1;X2), X(2) = X3 and through (2), it is easy to get C21inv(C11)= (2/3,2/3). Therefore Strong Irrepresentable Condition fails for setting (a) and holds for setting (b).
Their Proof for Corollary 1 makes me think so:
There is a similar use here.

Sunday, August 17, 2014

oracle

Fan and Li (2001), the SCAD estimator, with appropriate choice of the regularization (tuning) parameter, possesses a sparsity property, i.e., it estimates zero components of the true parameter vector exactly as zero with probability approaching one as sample size increases while still being consistent for the non-zero components...In other words, with appropriate choice of the regularization parameter, the asymptotic distribution of the SCAD estimator based on the overall model and that of the SCAD estimator derived from the most parsimonious correct model coincide. Fan and Li (2001) have dubbed this property the “oracle property”....It is well-known for Hodges’ estimator that the maximal (scaled) mean squared error grows without bound as sample size increases (e.g., Lehmann and Casella (1998), p.442), whereas the standard maximum likelihood estimator has constant finite quadratic risk. In this note we show that a similar unbounded risk result is in fact true for any estimator possessing the sparsity property. This means that there is a substantial price to be paid for sparsity even though the oracle property (misleadingly) seems to suggest otherwise. 
In "Modern statistical estimation via oracle inequalities":
Theorem 4.1. The James–Stein estimate obeys .

In other words, the James–Stein estimator is almost as good as the ideal estimator in a mean-squared error sense.The inequality (4.2) is an oracle inequality. An oracle inequality relates the performance of a real estimator with that of an ideal estimator which relies on perfect information supplied by an oracle, and which is not available in practice.

Tuesday, June 10, 2014

mean, median and mode

this page explains a mean minimizes the $\ell^2$ norm of the residual:$\min_{m_2} \sum_i (m_2-d_i)^2$ 
a median minimizes its $\ell^1$ norm and a mode minimizes the zero norm of the residual, namely $\ell^0=\vert m_0-d_i\vert^0$.See the wikipedia page about median.

from here, it was further explained that
Inder Jeet Taneja’s book draft has a nice survey of the results: if you fix the upper and lower boundary, and maximize entropy, you’ll get the uniform distribution. If you fix the mean and the expected L2 norm (d^2) between the mean and the distribution, maximizing the entropy you’ll get the Gaussian. If you fix the expected L1 norm (|d|) between the mean and the distribution, maximizing the entropy you’ll get the Laplace (also referred to as Double Exponential). Moreover, log(1+d^2) norm will yield the Cauchy distribution – a special case of the standard heavy-tailed Student distribution.

Thursday, June 05, 2014

check points when reviewing a genetic screening report


  1. title and footnote, ensuring it describes the analysis population, the outcome variable and the class of genetic markers; 
  2. eyeball examples:
    • 1 example of x chr snp
    • 1 example of autosomal snp with only 2 genotypes
    • 1 example of top association 
    • 1 example of a random association
  3. use the excel output to check the value ranges for each column, pay attention to
    • extreme values
    • empty cells
    • characters indicating missing: -,NA, 0
  4. Cosmetic issues
    • decimal places
    • check line ends for character cut off

Saturday, December 21, 2013

interaction coding

In sas: we have

Nested Effects


Nested effects are generated in the same manner as crossed effects. Hence, the design columns generated by the following statements are the same (but the ordering of the columns is different):
model y=a b(a);   (B nested within A)
model y=a a*b;   (omitted main effect for B)

Thursday, December 19, 2013

merge

These should be common sense about merge for a sas programmer. I need refresh my knowledge here.
“REPEAT OF BY-VALUES"

Tuesday, October 29, 2013

chain information impacts summary statistics from coda

It seems chain information is not critical in mean, sd and quantiles; and not critical even for time series SE when convergence.

library(coda) 
x1 <- cbind(rnorm(100, 1, 1), rnorm(100,  1,1), rnorm(100,1 ,1 ),rnorm(100,1,1), cumsum(rnorm(100,1,1)))
x2 <- cbind(rnorm(100, 1,10), rnorm(100, 10,1), rnorm(100,10,10),rnorm(100,1,1), cumsum(rnorm(100,1,1)))
 
colnames(x1) <- colnames(x2) <- c('unequalSD','unequalMean','unequalboth','equal','autocorr')
 
o1 <- mcmc.list(mcmc(x1), mcmc(x2)) #maintain chain information. to add more chain: o1[[3]] <- mcmc(x3);
summary(o1)
 
summary(mcmc(rbind(x1,x2))) #no chain information
Created by Pretty R at inside-R.org

Friday, September 06, 2013

Quickly insert all sheet names in cells with VBA

from here:

If you want to inset all sheet names in cells, VBA macro is a good choice.
Step 1: Hold down the Alt + F11 keys in Excel, and it opens the Microsoft Visual Basic for Applications window.
Step 2: Click Insert > Module, and paste the following macro in the Module Window.
VBA for inserting all worksheets' names in cells:
Sub SheetNames()
Columns(1).Insert
For i = 1 To Sheets.Count
Cells(i, 1) = Sheets(i).Name
Next i
End Sub

Friday, June 14, 2013

seriation

Seriation, also the name for an R package 
is to arrange all objects in a set in a linear order given available data and some loss or merit function in order to reveal structural information. Together with cluster analysis and variable selection, seriation is an important problem in the field of combinatorial data analysis
This post discussed the application of seriation on plotting binary matrices, together with other clustering algorithms.

Saturday, May 18, 2013

Prioritizing GWAS Results

This review paper mentioned 3 analytic methods to prioritizing GWAS results (for follow-up studies):
  • Meta-analysis seeks to pool information from multiple GWAS to increase the chances of finding true positives among the false positives and provides a way to combine associations across GWAS, even when the original data are unavailable.

  • Testing for epistasis within a single GWAS study can identify the stronger results that are revealed when genes interact.

  • Pathway analysis of GWAS results is used to prioritize genes and pathways within a biological context. Following a GWAS, association results can be assigned to pathways and tested in aggregate with computational tools and pathway databases.

Wednesday, May 15, 2013


The default plot function from glmnet is quite basic and plain. The following function tries to improve the plots
#this function is to achieve the same purpose of plot.glmnet()
#with the following enhancement:

# 1) show variable number instead of column number
# 2) show the value of lambda that gives smallest cross validation error with a solid line
# 3) show the largest lambda such that the error is within 1 se of the minimal cross validation error with a dashed line
 
#input variables: a glmnet object from glmnet(), a cv. glmnet object from cv.glmnet(), and a possible plot title
 

Friday, April 26, 2013

rjags conflicts with igraph

#Start a new R session with --vanilla;
 
library(igraph) # version 0.6-2
n1 <- read.graph ('network.txt', format="ncol", directed=F, name=T); 
a1 <- get.adjacency="" n1="" pre="" type="both">
if (is.connected(n1, mode="weak")){ 
    #do something 
} 
 
library(rjags) #version 3.5
data(LINE)
LINE$recompile() #example analysis included in rjags
 
#Start a new R session with --vanilla, and the following line runs smoothly;
 
library(rjags) #version 3.5
data(LINE)
LINE$recompile()
Created by Pretty R at inside-R.org

Thursday, April 11, 2013

inline rjags

from here


Friday, February 22, 2013

igraph tutorial

http://horicky.blogspot.com/2012/04/basic-graph-analytics-using-igraph.html
http://igraph.sourceforge.net/igraphbook/
leaves = V(g)[degree(g, mode="out")==0]

Wednesday, February 20, 2013

from here:
The methods under consideration included observed case MMRM, per protocol visits MMRM, interval last observation carried forward (LOCF) MMRM, and a hybrid of the per protocol visits and interval LOCF MMRM approaches.Simulation results reveal that the method that best controls the type I error rate is the per protocol visits method.

Thursday, January 17, 2013

HapMap 3: more people ~ more genetic variation

from here.
Centre d’Etude du Polymorphisme Humain collected in Utah, USA, with ancestry from northern and western Europe (CEU)
Han Chinese in Beijing, China (CHB)
Japanese in Tokyo, Japan (JPT)
Yoruba in Ibadan, Nigeria (YRI)

African ancestry in the southwestern USA (ASW)
Chinese in metropolitan Denver, Colorado, USA (CHD)
Gujarati Indians in Houston, Texas, USA (GIH)
Luhya in Webuye, Kenya (LWK)
Maasai in Kinyawa, Kenya (MKK)
Mexican ancestry in Los Angeles, California, USA (MXL)
Tuscans in Italy (Toscani in Italia, TSI)
So memorize some of those abbreviations! One particular difference across these populations is that some are parent-offspring trios, and some are not. So the CEU sample are trios, while the TSI are not.

Sunday, December 30, 2012

R factor

Convert a character vector into 1:xxx

f1 <- class="st" span="span" unlist="unlist">strsplit("cabbage", split=""))
as.numeric(unclass(factor(f1))) # levels are coded 1:n according to their alphabetical order
as.numeric(unclass(factor(f1, levels=unique(f1)))) # levels are coded 1:n according to their first appearence

ggplot2 trick

Here's some code from Hadley's book that assumes you've created ggplot objects a, b, and c. It puts a in the top row, with b and c in the bottom row.
grid.newpage()
pushViewport(viewport(layout=grid.layout(2,2)))
vplayout<-function(x,y){
    viewport(layout.pos.row=x,layout.pos.col=y) 
}
print(a,vp=vplayout(1,1:2))
print(b,vp=vplayout(2,1))
print(c,vp=vplayout(2,2))
Created by Pretty R at inside-R.org

  • different line style from geo_smooth in different facet_wrap
p1 <- ggplot(ds2, aes(x=visit, y=change))+ geom_point(shape=1, 
aes(color=treat), position=position_jitter(width=0.4,height=0)) 
+ facet_grid(snp~race)+ opts(title = 'rsxxxx')
 
p1 <- p1 + geom_smooth(data=subset(ds2, race == 'CA'), method='lm', 
se=F, color='blue', linetype = ifelse(pValues[pValues$'dbsnprsid' 
== 'rsxxxx', 'slope1_p'] < 0.05 ,"longdash", "solid" ))
 
Created by Pretty R at inside-R.org
  •  background color in the plotting area
+ theme_grey() - the default theme, with a grey background
+ theme_bw() - a theme with a white background 
Created by Pretty R at inside-R.org


Thursday, December 20, 2012

Saturday, August 25, 2012

sas dataset merging

from sas 9.1.3 language reference

Thursday, May 10, 2012

mean corr is equal to sample corr

at least for independent pairs. For prove see here.

Thursday, February 16, 2012

winbugs notes

  1. in winbugs script, the file path cannot be longer than 108 characters; otherwise, there will be an incompatible copy error message
  2. from R2winbugs, even seed is set to NULL (the default), chains from winbugs are identical given identical initial values and data sets. Guess winbugs figure out a fixed seed based upon these input.
  3.  a good reference for specify covariance matrix prior for multivariate normal. However, there are comments that promote scaled inverse Wishart instead of inverse Wishart as the prior for precision. An illustration against inverse Wishart can be found here
  4. scaled inversed wishart recommended by Gelman. Need set df = K+ 1 to make individual correlation coefficient uniform on [-1,1], where K is the dimension of the Cov matrix (although df=K is the minimum allowed df). A winbugs implementation can be found here: it actually scales the y variable, which indirectly scale the covariance matrix.
  5. From winbugs manual: "Note that WinBUGS simulates each node in turn: this can make convergence very slow and the program very inefficient for models with strongly related parameters, such as hidden-Markov and other time series structures." Thus, strongly related parameters should be put into one multivariate node.
  6. For parameters for repeated measures, reference coding of effect (treating a visit as a reference, and other visits as deviation from the reference) seems to perform worse than cell mean effect coding (just a separate parameter for each visit). This is probably because the later gives rise to a orthogonal design matrix. 
  7. to convert a object (eg. wb) returned from function 'bugs' in R2WinBUGS (codaPkg=FALSE ), we can use coda1 <- as.mcmc.list(wb). See the documentation of 'codaPkg' for more details.
  8. openbugs is not necessary better than winbugs in perfomance and scalibility. 
    1. openbugs has no automatic graphic output, which is good when we are not worried about convergence and when we run it through many cycles (eg. in simulations)
    2. openbugs runs in unix.
  9. Missing values are treated as a stochastic node and effectively 'imputed' by winbugs. However, such imputed data will not inflate precision (since they are not treated as data, but parameters by winbugs). It is possible to take this imputation procedure out: run multiple imputation and feed full data into winbugs to mimic the internal imputation procedure implemented by winbugs.We will have additional data from imputation (hence increase precision), but the variation among different imputed datasets, which will be fed one by one and posterior distributions merged, may compensate for the increased precision.
  10. A good resource for DIC

Tuesday, February 07, 2012

mmrm model example

proc mixed data=dsin;
class treat;
model change = baseline treat genotype treat*genotype;
estimate 'additive effect' genotype 1 /e;
run;

proc mixed data=dsin;
class visit;
model change = baseline visit genotype visit*genotype;
estimate 'additive effect at the last visit' genotype 1 genotype*visit 0 0 0 0 1/e;

run;

Thursday, January 26, 2012

Hello, Bugs world!

Here is my first Bayesian code modified from the sample code for the Chapter 3 of Bayesian Adaptive Methods for Clinical Trials. The R code depends on BRugs that interfaces with Openbugs (manual is here).

Monday, January 16, 2012

IBS alternative

The publicly available IBS (Identity by State) macro in SAS is quite slow, and has a bug that IBS is not adjusted for call rate. I made an alternative implementation which is about 1/2 or 1/3 faster. However, SAS apparently handles the standard correlation calculations much much faster, and resulting Pearson correlation, Kendall correlation and Spearman correlation may give similar measurements to IBS (code is here, data include about 1000 markers clustered in candidate genes in 400 subjects, primary Caucasians). Therefor I guess in certain Kernel application, we can use usual correlation to replace IBS.

Thursday, January 12, 2012

64bit perl debugger

It turns out Active Perl does not support 64bit debugger in 64 bit windows under PDK. From here:
The PDK's debugger was written for 32-bit Windows only. It's not available on ANY other platform, hence the 64-bit Windows PDK does not include a debugger. This has been documented in the release notes since the first 64-bit Windows PDK appeared (PDK 8.0).
There are no plans to update the PDK debugger, since a viable (and better) alternative exists in Komodo IDE. If you want a debugger for 64-bit, the multi-platform, multi-language Komodo IDE debugger is our suggested choice.
But still we can install 32 bit Active Perl for windows and use an older version of PDK (version 6 works).

sample code - MMRM


the following code implements the analyses suggested at the lower right corner of page 10 (i.e., page 312) of this paper for genetic predictors.

Tuesday, January 10, 2012

pairs() and more

Getting genetics done has a piece of code to make a more informative pairwise scatter plot than pairs()
 
# panel.smooth function is built in.
# panel.cor puts correlation in upper panels, size proportional to correlation
panel.cor <- function(x, y, digits=2, prefix="", cex.cor, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits=digits)[1]
    txt <- paste(prefix, txt, sep="")
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}
 
# Plot #2: same as above, but add loess smoother in lower and correlation in upper
pairs(~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width, 
data=iris,lower.panel=panel.smooth, upper.panel=panel.cor, 
      pch=20, main="Iris Scatterplot Matrix")
 
From the comments, we see alternatives are pairs.panels() from package psych and chart.Correlation() from package PerformanceAnalytics.

Saturday, December 24, 2011

cgm to ps

http://technologytales.com/tag/ralcgm/
RALCGM is a handy command line tool that can covert from CGM to Postscript on its own without any need for ImageMagick at all.

Monday, October 17, 2011

NGS reading list (Oct 2011)

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2581791/
http://cs.calstatela.edu/wiki/images/5/5f/Sequence_census_methods_for_functional_genomics.pdf
http://raindancetech.com/documents/Next-Generation-Sequencing_Nature-Genetics_Jan-2010.pdf

Wednesday, October 05, 2011

Thursday, September 22, 2011

test linear hypothesis in proc phreg

The idea of LSmean becomes so integrated part of statistical reporting that we feel so clumsy in proc phreg from sas 9.1.3 or earlier, where only numerical predictors are allowed. In the following test, a few examples of TEST statement for some common linear hypotheses are presented. The results have been comfirmed using proc phreg in sas 9.2, where a CLASS statement is available. Assuming a genotype effect with 3 levels (AA, AB and BB) and a treatment effect with 2 levels (0, 1) (and it seems p-value for the genetic effect across treatment arm will be different if the 2 treatment levels are codes as 1 and 2), a Cox proportional hazards model of survival time as a function of genotype, treatment and genotype-by-treatment interaction is fitted.

Friday, September 09, 2011

conditional exact logistic regression

Conditional Logistic Regression:
Taking the stratification into account by "conditioning out" (and not estimating) the stratum-specific intercepts gives consistent and asymptotically normal MLEs for the slope coefficients. If your nuisance parameters are not just stratum-specific intercepts, you can perform an exact conditional logistic regression.
The likelihood function to be maximized feels really like that in proc phreg with a discrete option. Each item (for a stratum) in the likelihood function is the conditional probability of observing subject 1 to h who actually had an event conditioning on h event happened (for any h subjects) in the stratum. The concept of sufficient statistic is not explicitly. Conditional logistic regression is called with the strata statement. More examples are from here and here.

Exact Conditional Logistic Regression
Exact conditional inference is based on generating the conditional distribution for the sufficient statistics of the parameters of interest. This distribution is called the permutation or exact conditional distribution. One example from sas suggests that we move from conditional regression to exact conditional regression when "you believe the data set is too small or too sparse for the usual asymptotics to hold.":
  • by "sparse", they probably refer to the problem of complete separation of quasi-complete separation, and the exact method is one way to handle this, especially when the firth option is only available in sas 9.1.3;
  • by "usual asymptotics", they probably refer to the issue of  large-sample asymptotic normality when we do not have a large sample size compared to the number of parameters to estimate.

Thursday, August 25, 2011

%plotit macro

a good document with graphic examples are here.
this one is more technical.

Friday, August 19, 2011

use %scan function to cycle through a list of items


There are many examples on line.  But I have difficulty to break from the loop following them in one SAS system.

Monday, August 15, 2011

model selection for mmrm

it seems glmselect cannot handle this. See a proposed procedure here.

Tuesday, August 02, 2011

Geometric Statistics, geometric CV, intra-subject variation

found from onbiostatistics and here. Additional discussion can be found here.
also for a log normal variable x, there are
y=log(x)~N(mu, sigma**2)
E(x)=exp(mu+sigma**2/2) and
sd(x)=exp(mu+sigma**2/2)*sqrt(exp(sigma**2)-1)
cv(x)=sqrt(exp(sigma**2)-1)

Tuesday, July 05, 2011

left join A and B may still lose obs in A

Imaging data set A has the demog information, and data set B has multiple outcome variable and covariate information. B has less subjects than A. To create data set C which is merged from A and B and contains all subjects in A (even the subject is not in B), the following sql statement would be fine
proc sql,
create table c (drop=subjid) as;
select a.*, b.* from a left join b (rename=(pat_id=subjid))
on a.pat_id=b.subjid;
quit;
However, since B has multiple outcome variables, there can be an intent to filter out one outcome variable like the following

proc sql,
create table c (drop=subjid) as;
select a.*, b.* from a left join b (rename=(pat_id=subjid))
on a.pat_id=b.subjid
where b.outcomeName='outcome A';
quit;
This is wrong, because the field of outcomeName will be missing (and filtered out) for subjects in A but not in B. The solution can be either to filter in the separate data step, or the following sql statement

proc sql,
create table c (drop=subjid) as;
select a.*, b.* from a left join b (rename=(pat_id=subjid) where=(outcomeName='outcome A'))
on a.pat_id=b.subjid;

quit;

Sunday, June 12, 2011

create a design matrix

Usage Note 23217: If I know the model, how can I create the corresponding design matrix (dummy, indicator, or design variables) in a data set?

Friday, May 20, 2011

check whether a variable exists in a sas dataset

VARNUM checks whether the variable &name exists

Thursday, May 19, 2011

Modify the default ODS graphics behavior from statistical procedures

Those ODS graphics in the statistical procedures are predefined template using Graphics template language(GTL). Using ods trace on to find the name of that graph template. Then you have two ways to modify them.

split data



1000 size sample, variable: ID_no from 1-1000
how to split it to two new data sets: one with even numbers for ID_no, the othe one with odds number for ID_no?

data set1 set2;
set yourdata;
if mod(id, 2) = 0 then output set1;
else output set2;
run;

estimate statements

In your LSMEANS statement, add an option e, SAS should print out an estimate statement for you. That will be an excellent example.

write a sas dataset to a flat file


filename myfile "d:\temp\a.txt";
data _null_;
set mydata;
file myfile;
put id 1-3 var2 4-10 var3 11-15 ;
run;

change the default number of digits in P values

proc template;
edit Common.PValue;
format=best16.;
end;
run;

download data with URL


FILENAME myurl URL "http://ichart.finance.yahoo.com/table.csv?s=&tic";

DATA &tic;
INFILE myurl FIRSTOBS=2 missover dsd;
format date yymmdd10.;
INPUT Date: yymmdd10. Open High Low Close Volume Adj_Close
;
if date>=today()-180;

RUN;

randomly select 300 samples

Proc Surveyselect
data=old data method=sys/srs/etc. sampsize=300 out=new data;
run;

superscript to ODS RTF


Google is your best friend:
http://www2.sas.com/proceedings/forum2008/033-2008.pdf

ods rtf file="C:\test.rtf";
ods escapechar= '\';
proc tabulate data=sashelp.class style={cellwidth=100};
class sex;
var age;
table sex,age;
label age= "age\{super 1}";
run;
ods rtf close;

select statement example

SELECT (payclass);
     WHEN ('monthly') amt=salary;
     WHEN ('hourly') DO;
            amt=hrlywage*min(hrs,40);
            IF hrs>40 THEN PUT 'Check Timecard';
      END; /* end of do */
      OTHERWISE PUT 'Problem Observation';
END;

Monday, May 16, 2011

concatenate a selected list of datasets

libname test 'C:\test2';

proc sql;
   select 'test.'||memname into : dlist separated BY ' '  from dictionary.tables
   where libname='TEST' and memname contains 'selection_rule';
quit;

data one;
set &dlist;
run;

Saturday, April 23, 2011

Thursday, April 21, 2011

built-in multiple testing mechanism in SAS

proc multtest by Peter H. Westfall and Russell D. Wolfinger
the adjust option in proc mixed, or proc glm
exact statement from PROC NPAR1WAY

Saturday, March 19, 2011

R - Good programming practice

  1. Instead of x[ind ,], use x[ind, , drop = FALSE]

Monday, March 14, 2011

SAS SQL union and intersection

http://support.sas.com/documentation/cdl/en/sqlproc/62086/HTML/default/a001361224.htm

Monday, February 28, 2011

“ANYDATE” INFORMATS

http://www.lexjansen.com/wuss/2006/SAS_essentials/ESS-Carroll.pdf
“ANYDATE” INFORMATS
Sometimes you are very lucky in that the raw data you receive contains dates that are the same format. Sometimes you will encounter a messy data file where the dates are all different types of formats. The “anydate” informats are designed to allow you to read in a variety of date forms including:
• DATE, DATETIME, and TIME
• DDMMYY, MMDDYY, and YYMMDD
• JULIAN, MONYY, and YYQ

Using the anydate informats can be particularly useful when you are reading in data that contains a mixture of date forms and you want certain parts of the dates you are reading in. Anydate informats include:
• ANYDTDTE. Extracts the date portion
• ANYDTDTM. Extracts the datetime portion
• ANYDTTME. Extracts the time portion

Personalized Medicine: A Discussion

http://blogs.forbes.com/matthewherper/2011/02/25/personalized-medicine-a-discussion/

Friday, February 25, 2011

regular expression in SAS

www2.sas.com/proceedings/sugi29/265-29.pdf

Thursday, February 10, 2011

R code to show UU' != I

U matrix as in SVD X=UDV' is only column orthogonal but not necessarily row orthogonal. U matrix, a N by p matrix, does not contain all the eigenvectors of XX', but only the first p vectors. X is also a N by p matrix, and XX' may have as many as N eigenvectors.

Monday, December 06, 2010

load delimited file with :

From here
If your data is longer than the default length, you need to use informats. For example, date or time values, names, and addresses can be longer than eight characters. In such cases, you either need to add an INFORMAT statement to the DATA step or add informats directly in the INPUT statement. However, when the informats are used in the INPUT statement, care must be taken to honor the function of the delimiter to prevent read errors. If you add informats in an INPUT statement, you must add a colon (:) in front of the informat, as shown in this example:

Tuesday, November 23, 2010

Gelman on Bayesian adaptive methods for clinical trials

http://www.stat.columbia.edu/~cook/movabletype/archives/2010/11/bayesian_adapti.html

Wednesday, November 17, 2010

proc means example

proc means data=one nway n mean std median min max noprint;
by treatment;
var variable;
output out=variable n=N mean=mean std=std median=median min=min max=max;
run;

proc transpose data=variable out=t&variable;
var n mean std median min max;
run;

proc sql table update

alter and update

Monday, November 15, 2010

NYC chinese food



纽约大四川饭店

Monday, November 08, 2010

Monday, October 11, 2010

run regression in R

Gelman: I really hate to think that there are people out there running regressions in R and not using display() and coefplot() to look at the output.

Wednesday, October 06, 2010

Metric MDS starting from eigen()

this is an exercise to figure the details of MDS, or more specifically, what the coordinates are that are used in plotting. More explanations can be found here.

geometric interpretation of vector operatiom

here

Tuesday, October 05, 2010

average heterzygosity

from Ascertainment bias in studies of human genome-wide polymorphism
A simple comparison of the HapMap and Perlegen genotype data was done by considering the 5682 windows of 500 kb across the entire genome and, for each window, tallying the SFS and calculating summary statistics such as average heterozygosity for each population and FST for each population pair and for the trio of samples.

The average uncorrected heterozygosity within the three population groups for the HapMap data were 0.281, 0.247, and 0.268 for the Yoruban, Chinese, and European samples. The corresponding figures for the uncorrected Perlegen data are 0.251, 0.211, and 0.229 for the African American, Chinese, and European samples.

histograms are like this.

Monday, October 04, 2010

2D plotting in SAS

This example shows a regression plot with prediction and confidence limits.
proc sgplot data=sashelp.class;
  reg x=height y=weight / CLM CLI;
run;

Tuesday, September 28, 2010

R inferno

Common mistakes in R programming
The R Inferno

Wednesday, September 22, 2010

Critical Chain Project Management

In CCPM two durations are estimated for each: an aggressive duration based on how long the task would take given full focus on the task and no problems, and a “safe” duration given full focus and typical  variation with each task. The differences between aggressive and “safe” durations for each critical task contribute to a pooled “project buffer” which is adjusted for the project as a whole. The end of the project buffer is the team’s “commit date” and the buffer protects the project from uncertainty
Managers and leadership need to provide clear project and task priorities and a work environment that enables single-task focus, so that each task can be completed quickly and with high quality.

Tuesday, September 21, 2010

meta analysis

Jadad scale to measure methological quality of a clinical trial
tool: CMA
publication standards: quorum (eg) and moose (eg)

proc mixed can be used in meta analysis.

Wednesday, September 15, 2010

histogram alternatives

Beanplot

hist() + rug(): add one dimensional scatter plot below the histograms

for discrete data: barplot(table(a)), where a is a discrete vector. Or barplot (tabulate (a))

ecdf (Empirical CDF) summarizes the data into something like a smooth CDF line while graphing all the data points.

dhist in ggplot2

more discussion from Gelman here and also here

Wednesday, August 25, 2010

Pseudoautosomal regions gene nomenclature

http://www.genenames.org/genefamily/par.php

main effect of a continous variable

In both proc mixed and glimmix (see the code example below), the "Solution for Fixed Effects" generated by option /SOLUTION for the continous variable 'binary' does not estimate the main/marginal effect when the value of binary changes from 0 to 1. It is because of the interaction term between binary and visit. To find the main/marginal effect, we can code the variable 'binary' as a class/categorical variable and find this LSMEANS.

Wednesday, June 23, 2010

distances between vector elements

distance<-function(x,y) {(x-y)^2}
 outer(A,A,distance)

Friday, June 18, 2010

sas missing data categories

http://studysas.blogspot.com/2010/04/special-missing-values.html

Thursday, May 20, 2010

SAS command line

"C:\Program Files\SAS\SAS 9.1\sas" 1.sas -log "1.log.txt" -print "1.result.txt"

SAS IO

/*===========================
export;
===========================*/

Thursday, May 13, 2010

matching program

Case control matching, probably implementing the idea of propensity scores
R
matchIT
Matching
optmatch optmatch presentation1; optmatch presentation2;

SAS
gmatch,vmatch,dist
a sugi paper 165-29: Performing a 1:N Case-Control Match on Propensity Score

PaperOn the Estimation and Use of Propensity Scores in Case-Control and Case-Cohort Studies
[using] cases plus controls in a case-control study... should give consistent estimates of the true propensity score under the null hypothesis, but not otherwise.

Tuesday, May 11, 2010

good clinical trial simulation practices

http://cdds.ucsf.edu/research/sddgpreport.php#_Toc457223476

link function for proc logistic

SAS use the following options to explicitly decide whether the endpoint is ordinal or nomial

Monday, May 10, 2010

A Draft Sequence of the Neandertal Genome.

http://www.researchblogging.org/post/gotourl/id/213933
http://www.researchblogging.org/post/gotourl/id/213689
http://www.researchblogging.org/post/gotourl/id/213509

Viagra could help women too

http://www.helenjaques.co.uk/blog/2010/viagra-breast-cancer-drug-delivery/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+Insicknessandinhealth+%28In+sickness+and+in+health%29

Researchers in California have shown that sildenafil (Viagra) and a similar drug called vardenafil can improve the delivery of the chemotherapeutic drug Herceptin (trastuzumab) in women with breast cancer that has spread to the brain.

PubMed vs. Google Scholar

http://www.researchblogging.org/post/gotourl/id/214477
http://neurodojo.blogspot.com/2010/05/pubmed-vs-google-scholar.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+Neurodojo+%28NeuroDojo%29

Antidepressants Not Effective for Some Types of Depression

http://www.researchblogging.org/post/gotourl/id/214478
http://brainblogger.com/2010/05/10/antidepressants-not-effective-for-some-types-of-depression/

Friday, April 09, 2010

Kendall/Spearman rank correlation

copied from here.
Spearman's rho comes from Charles Spearman's background of psychology, IQ testing and eugenics - it's an easy to calculate robust measure of whether there is an association.