An old technical document from sas captures the relationship among Type I, II and III sum of squares.
Sunday, August 28, 2022
Thursday, January 08, 2015
qq plot interpretation
SAS doc gives the references and a few simple rules, and R community gives more examples around these rules. I think a better way to remember these associations is to understand the mechanism. In this way, wikipedia is doing a better job:
Besides, the qqPlot function from r package 'car' includes CIs for observed quantiles by default, or more specialized qqplot for log10(p-values) from pQQ function out of Haplin
A simple case is where one has two data sets of the same size. In that case, to make the Q–Q plot, one orders each set in increasing order, then pairs off and plots the corresponding valuesSo it is almost a scatter plot of observed value vs simulated value from a distribution. Still it is not always easy to interpret, especially from a small dataset
Besides, the qqPlot function from r package 'car' includes CIs for observed quantiles by default, or more specialized qqplot for log10(p-values) from pQQ function out of Haplin
Tuesday, November 25, 2014
glimmix models with permutation
Assume a factor coded under a genotypic model as the main effect without interaction
Estimate ‘dominant’ geno -1 0.5 0.5 ‘recessive’ geno -0.5 -0.5 1 ‘additive’ geno 0 0.3333333
0.6666667/adj=simulate;
The function may be available in other SAS Proc
Thursday, November 06, 2014
Friday, October 31, 2014
R Onto-Tools suite iPathwayGuide
From iPathwayGuide, our pathway analysis approach is based on novel Impact Analysis method. This paper, published in Genome Research, explains the underlying analytic method and also demonstrates how Impact Analysis avoids costly false positive results as generated by other methods such as Over-Representation Analysis. The tool is also available from ROntoTools
Tuesday, October 28, 2014
advanced r wiki
From hadley, who is like a god in R.
The discussion of environment is really helpful. I am aware of other excellent resources, but still did not get it from there.
The discussion of environment is really helpful. I am aware of other excellent resources, but still did not get it from there.
Tuesday, October 14, 2014
Wednesday, October 01, 2014
Wednesday, September 17, 2014
consistency property of lasso
This is about the meaning of | as in the "irrepresentable condition" from On Model Selection Consistency of Lasso
It seems the operation is |x| = abs(sum(x)). I have never seen such a norm. But their Section "3.1 Simulation Example 1" makes me think so:
... in two settings: (a) β1 = 2, β2 = 3 ; and (b) β1 = 2, β2 = 3. In both settings, X(1) = (X1;X2), X(2) = X3 and through (2), it is easy to get C21inv(C11)= (2/3,2/3). Therefore Strong Irrepresentable Condition fails for setting (a) and holds for setting (b).Their Proof for Corollary 1 makes me think so:
There is a similar use here.
Sunday, August 17, 2014
oracle
Fan and Li (2001), the SCAD estimator, with appropriate choice of the regularization (tuning) parameter, possesses a sparsity property, i.e., it estimates zero components of the true parameter vector exactly as zero with probability approaching one as sample size increases while still being consistent for the non-zero components...In other words, with appropriate choice of the regularization parameter, the asymptotic distribution of the SCAD estimator based on the overall model and that of the SCAD estimator derived from the most parsimonious correct model coincide. Fan and Li (2001) have dubbed this property the “oracle property”....It is well-known for Hodges’ estimator that the maximal (scaled) mean squared error grows without bound as sample size increases (e.g., Lehmann and Casella (1998), p.442), whereas the standard maximum likelihood estimator has constant finite quadratic risk. In this note we show that a similar unbounded risk result is in fact true for any estimator possessing the sparsity property. This means that there is a substantial price to be paid for sparsity even though the oracle property (misleadingly) seems to suggest otherwise.In "Modern statistical estimation via oracle inequalities":
Theorem 4.1. The James–Stein estimate obeys .
In other words, the James–Stein estimator is almost as good as the ideal estimator in a mean-squared error sense.The inequality (4.2) is an oracle inequality. An oracle inequality relates the performance of a real estimator with that of an ideal estimator which relies on perfect information supplied by an oracle, and which is not available in practice.
Tuesday, June 10, 2014
mean, median and mode
this page explains a mean minimizes the
norm of the residual:
a median minimizes its
norm and a mode minimizes the zero norm of the residual, namely
.See the wikipedia page about median.
from here, it was further explained that
a median minimizes its
from here, it was further explained that
Inder Jeet Taneja’s book draft has a nice survey of the results: if you fix the upper and lower boundary, and maximize entropy, you’ll get the uniform distribution. If you fix the mean and the expected L2 norm (d^2) between the mean and the distribution, maximizing the entropy you’ll get the Gaussian. If you fix the expected L1 norm (|d|) between the mean and the distribution, maximizing the entropy you’ll get the Laplace (also referred to as Double Exponential). Moreover, log(1+d^2) norm will yield the Cauchy distribution – a special case of the standard heavy-tailed Student distribution.
Thursday, June 05, 2014
check points when reviewing a genetic screening report
- title and footnote, ensuring it describes the analysis population, the outcome variable and the class of genetic markers;
- eyeball examples:
- 1 example of x chr snp
- 1 example of autosomal snp with only 2 genotypes
- 1 example of top association
- 1 example of a random association
- use the excel output to check the value ranges for each column, pay attention to
- extreme values
- empty cells
- characters indicating missing: -,NA, 0
- Cosmetic issues
- decimal places
- check line ends for character cut off
Saturday, December 21, 2013
interaction coding
In sas: we have
Nested Effects
Nested effects are generated in the same manner as crossed effects. Hence, the design columns generated by the following statements are the same (but the ordering of the columns is different):
model y=a b(a); (B nested within A) model y=a a*b; (omitted main effect for B)
Thursday, December 19, 2013
Tuesday, October 29, 2013
chain information impacts summary statistics from coda
It seems chain information is not critical in mean, sd and quantiles; and not critical even for time series SE when convergence.
Created by Pretty R at inside-R.org
library(coda)
x1 <- cbind(rnorm(100, 1, 1), rnorm(100, 1,1), rnorm(100,1 ,1 ),rnorm(100,1,1), cumsum(rnorm(100,1,1))) x2 <- cbind(rnorm(100, 1,10), rnorm(100, 10,1), rnorm(100,10,10),rnorm(100,1,1), cumsum(rnorm(100,1,1))) colnames(x1) <- colnames(x2) <- c('unequalSD','unequalMean','unequalboth','equal','autocorr') o1 <- mcmc.list(mcmc(x1), mcmc(x2)) #maintain chain information. to add more chain: o1[[3]] <- mcmc(x3); summary(o1) summary(mcmc(rbind(x1,x2))) #no chain information
Friday, September 06, 2013
Quickly insert all sheet names in cells with VBA
from here:
Step 1: Hold down the Alt + F11 keys in Excel, and it opens the Microsoft Visual Basic for Applications window.
Step 2: Click Insert > Module, and paste the following macro in the Module Window.
VBA for inserting all worksheets' names in cells:
If you want to inset all sheet names in cells, VBA macro is a good choice.
Step 1: Hold down the Alt + F11 keys in Excel, and it opens the Microsoft Visual Basic for Applications window.
Step 2: Click Insert > Module, and paste the following macro in the Module Window.
VBA for inserting all worksheets' names in cells:
Sub SheetNames()
Columns(1).Insert
For i = 1 To Sheets.Count
Cells(i, 1) = Sheets(i).Name
Next i
End Sub
Columns(1).Insert
For i = 1 To Sheets.Count
Cells(i, 1) = Sheets(i).Name
Next i
End Sub
Friday, June 14, 2013
seriation
Seriation, also the name for an R package
is to arrange all objects in a set in a linear order given available data and some loss or merit function in order to reveal structural information. Together with cluster analysis and variable selection, seriation is an important problem in the field of combinatorial data analysisThis post discussed the application of seriation on plotting binary matrices, together with other clustering algorithms.
Saturday, May 18, 2013
Prioritizing GWAS Results
This review paper mentioned 3 analytic methods to prioritizing GWAS results (for follow-up studies):
- Meta-analysis seeks to pool information from multiple GWAS to increase the chances of finding true positives among the false positives and provides a way to combine associations across GWAS, even when the original data are unavailable.
- Testing for epistasis within a single GWAS study can identify the stronger results that are revealed when genes interact.
- Pathway analysis of GWAS results is used to prioritize genes and pathways within a biological context. Following a GWAS, association results can be assigned to pathways and tested in aggregate with computational tools and pathway databases.
Friday, May 17, 2013
Wednesday, May 15, 2013
The default plot function from glmnet is quite basic and plain. The following function tries to improve the plots
#this function is to achieve the same purpose of plot.glmnet() #with the following enhancement: # 1) show variable number instead of column number # 2) show the value of lambda that gives smallest cross validation error with a solid line # 3) show the largest lambda such that the error is within 1 se of the minimal cross validation error with a dashed line #input variables: a glmnet object from glmnet(), a cv. glmnet object from cv.glmnet(), and a possible plot title
Friday, April 26, 2013
rjags conflicts with igraph
#Start a new R session with --vanilla; library(igraph) # version 0.6-2 n1 <- read.graph ('network.txt', format="ncol", directed=F, name=T);
a1 <- get.adjacency="" n1="" pre="" type="both">if (is.connected(n1, mode="weak")){#do something }library(rjags) #version 3.5 data(LINE) LINE$recompile() #example analysis included in rjags #Start a new R session with --vanilla, and the following line runs smoothly; library(rjags) #version 3.5 data(LINE) LINE$recompile()->
Thursday, April 11, 2013
Friday, February 22, 2013
igraph tutorial
http://horicky.blogspot.com/2012/04/basic-graph-analytics-using-igraph.html
http://igraph.sourceforge.net/igraphbook/
leaves = V(g)[degree(g, mode="out")==0]
http://igraph.sourceforge.net/igraphbook/
leaves = V(g)[degree(g, mode="out")==0]
Wednesday, February 20, 2013
from here:
The methods under consideration included observed case MMRM, per protocol visits MMRM, interval last observation carried forward (LOCF) MMRM, and a hybrid of the per protocol visits and interval LOCF MMRM approaches.Simulation results reveal that the method that best controls the type I error rate is the per protocol visits method.
Thursday, January 17, 2013
HapMap 3: more people ~ more genetic variation
from here.
Centre d’Etude du Polymorphisme Humain collected in Utah, USA, with ancestry from northern and western Europe (CEU)
Han Chinese in Beijing, China (CHB)
Japanese in Tokyo, Japan (JPT)
Yoruba in Ibadan, Nigeria (YRI)
African ancestry in the southwestern USA (ASW)
Chinese in metropolitan Denver, Colorado, USA (CHD)
Gujarati Indians in Houston, Texas, USA (GIH)
Luhya in Webuye, Kenya (LWK)
Maasai in Kinyawa, Kenya (MKK)
Mexican ancestry in Los Angeles, California, USA (MXL)
Tuscans in Italy (Toscani in Italia, TSI)
So memorize some of those abbreviations! One particular difference across these populations is that some are parent-offspring trios, and some are not. So the CEU sample are trios, while the TSI are not.
Centre d’Etude du Polymorphisme Humain collected in Utah, USA, with ancestry from northern and western Europe (CEU)
Han Chinese in Beijing, China (CHB)
Japanese in Tokyo, Japan (JPT)
Yoruba in Ibadan, Nigeria (YRI)
African ancestry in the southwestern USA (ASW)
Chinese in metropolitan Denver, Colorado, USA (CHD)
Gujarati Indians in Houston, Texas, USA (GIH)
Luhya in Webuye, Kenya (LWK)
Maasai in Kinyawa, Kenya (MKK)
Mexican ancestry in Los Angeles, California, USA (MXL)
Tuscans in Italy (Toscani in Italia, TSI)
So memorize some of those abbreviations! One particular difference across these populations is that some are parent-offspring trios, and some are not. So the CEU sample are trios, while the TSI are not.
Sunday, December 30, 2012
R factor
Convert a character vector into 1:xxx
f1 <- class="st" span="span" unlist="unlist">strsplit("cabbage", split=""))->
as.numeric(unclass(factor(f1))) # levels are coded 1:n according to their alphabetical order
as.numeric(unclass(factor(f1, levels=unique(f1)))) # levels are coded 1:n according to their first appearence
ggplot2 trick
- aes_string
- when facet_wrap does not work:
grid.newpage() pushViewport(viewport(layout=grid.layout(2,2))) vplayout<-function(x,y){ viewport(layout.pos.row=x,layout.pos.col=y)
}
print(a,vp=vplayout(1,1:2)) print(b,vp=vplayout(2,1)) print(c,vp=vplayout(2,2))
- different line style from geo_smooth in different facet_wrap
p1 <- ggplot(ds2, aes(x=visit, y=change))+ geom_point(shape=1,
aes(color=treat), position=position_jitter(width=0.4,height=0))
+ facet_grid(snp~race)+ opts(title = 'rsxxxx') p1 <- p1 + geom_smooth(data=subset(ds2, race == 'CA'), method='lm',
se=F, color='blue', linetype = ifelse(pValues[pValues$'dbsnprsid'
== 'rsxxxx', 'slope1_p'] < 0.05 ,"longdash", "solid" ))
- background color in the plotting area
Thursday, December 20, 2012
Saturday, August 25, 2012
Thursday, May 10, 2012
Thursday, February 16, 2012
winbugs notes
- in winbugs script, the file path cannot be longer than 108 characters; otherwise, there will be an incompatible copy error message
- from R2winbugs, even seed is set to NULL (the default), chains from winbugs are identical given identical initial values and data sets. Guess winbugs figure out a fixed seed based upon these input.
- a good reference for specify covariance matrix prior for multivariate normal. However, there are comments that promote scaled inverse Wishart instead of inverse Wishart as the prior for precision. An illustration against inverse Wishart can be found here.
- scaled inversed wishart recommended by Gelman. Need set df = K+ 1 to make individual correlation coefficient uniform on [-1,1], where K is the dimension of the Cov matrix (although df=K is the minimum allowed df). A winbugs implementation can be found here: it actually scales the y variable, which indirectly scale the covariance matrix.
- From winbugs manual: "Note that WinBUGS simulates each node in turn: this can make convergence very slow and the program very inefficient for models with strongly related parameters, such as hidden-Markov and other time series structures." Thus, strongly related parameters should be put into one multivariate node.
- For parameters for repeated measures, reference coding of effect (treating a visit as a reference, and other visits as deviation from the reference) seems to perform worse than cell mean effect coding (just a separate parameter for each visit). This is probably because the later gives rise to a orthogonal design matrix.
- to convert a object (eg. wb) returned from function 'bugs' in R2WinBUGS (codaPkg=FALSE
), we can use coda1 <- as.mcmc.list(wb). See the documentation of 'codaPkg' for more details. - openbugs is not necessary better than winbugs in perfomance and scalibility.
- openbugs has no automatic graphic output, which is good when we are not worried about convergence and when we run it through many cycles (eg. in simulations)
- openbugs runs in unix.
- Missing values are treated as a stochastic node and effectively 'imputed' by winbugs. However, such imputed data will not inflate precision (since they are not treated as data, but parameters by winbugs). It is possible to take this imputation procedure out: run multiple imputation and feed full data into winbugs to mimic the internal imputation procedure implemented by winbugs.We will have additional data from imputation (hence increase precision), but the variation among different imputed datasets, which will be fed one by one and posterior distributions merged, may compensate for the increased precision.
- A good resource for DIC
Tuesday, February 07, 2012
mmrm model example
proc mixed data=dsin;
class treat;
model change = baseline treat genotype treat*genotype;
estimate 'additive effect' genotype 1 /e;
run;
class treat;
model change = baseline treat genotype treat*genotype;
estimate 'additive effect' genotype 1 /e;
run;
proc mixed data=dsin;
class visit;
model change = baseline visit genotype
visit*genotype;
estimate 'additive effect at the last visit' genotype
1 genotype*visit 0 0 0 0 1/e;
run;
Thursday, January 26, 2012
Hello, Bugs world!
Here is my first Bayesian code modified from the sample code for the Chapter 3 of Bayesian Adaptive Methods for Clinical Trials. The R code depends on BRugs that interfaces with Openbugs (manual is here).
Monday, January 16, 2012
IBS alternative
The publicly available IBS (Identity by State) macro in SAS is quite slow, and has a bug that IBS is not adjusted for call rate. I made an alternative implementation which is about 1/2 or 1/3 faster. However, SAS apparently handles the standard correlation calculations much much faster, and resulting Pearson correlation, Kendall correlation and Spearman correlation may give similar measurements to IBS (code is here, data include about 1000 markers clustered in candidate genes in 400 subjects, primary Caucasians). Therefor I guess in certain Kernel application, we can use usual correlation to replace IBS.
Thursday, January 12, 2012
64bit perl debugger
It turns out Active Perl does not support 64bit debugger in 64 bit windows under PDK. From here:
The PDK's debugger was written for 32-bit Windows only. It's not available on ANY other platform, hence the 64-bit Windows PDK does not include a debugger. This has been documented in the release notes since the first 64-bit Windows PDK appeared (PDK 8.0).
There are no plans to update the PDK debugger, since a viable (and better) alternative exists in Komodo IDE. If you want a debugger for 64-bit, the multi-platform, multi-language Komodo IDE debugger is our suggested choice.But still we can install 32 bit Active Perl for windows and use an older version of PDK (version 6 works).
sample code - MMRM
the following code implements the analyses suggested at the lower right corner of page 10 (i.e., page 312) of this paper for genetic predictors.
Tuesday, January 10, 2012
pairs() and more
Getting genetics done has a piece of code to make a more informative pairwise scatter plot than pairs()
# panel.smooth function is built in. # panel.cor puts correlation in upper panels, size proportional to correlation panel.cor <- function(x, y, digits=2, prefix="", cex.cor, ...) { usr <- par("usr"); on.exit(par(usr)) par(usr = c(0, 1, 0, 1)) r <- abs(cor(x, y)) txt <- format(c(r, 0.123456789), digits=digits)[1] txt <- paste(prefix, txt, sep="") if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt) text(0.5, 0.5, txt, cex = cex.cor * r) } # Plot #2: same as above, but add loess smoother in lower and correlation in upper pairs(~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width,
data=iris,lower.panel=panel.smooth, upper.panel=panel.cor, pch=20, main="Iris Scatterplot Matrix")
From the comments, we see alternatives are pairs.panels() from package psych and chart.Correlation() from package PerformanceAnalytics.Saturday, December 24, 2011
cgm to ps
http://technologytales.com/tag/ralcgm/
RALCGM is a handy command line tool that can covert from CGM to Postscript on its own without any need for ImageMagick at all.
RALCGM is a handy command line tool that can covert from CGM to Postscript on its own without any need for ImageMagick at all.
Monday, October 17, 2011
NGS reading list (Oct 2011)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2581791/
http://cs.calstatela.edu/wiki/images/5/5f/Sequence_census_methods_for_functional_genomics.pdf
http://raindancetech.com/documents/Next-Generation-Sequencing_Nature-Genetics_Jan-2010.pdf
http://cs.calstatela.edu/wiki/images/5/5f/Sequence_census_methods_for_functional_genomics.pdf
http://raindancetech.com/documents/Next-Generation-Sequencing_Nature-Genetics_Jan-2010.pdf
Wednesday, October 05, 2011
Thursday, September 22, 2011
test linear hypothesis in proc phreg
The idea of LSmean becomes so integrated part of statistical reporting that we feel so clumsy in proc phreg from sas 9.1.3 or earlier, where only numerical predictors are allowed. In the following test, a few examples of TEST statement for some common linear hypotheses are presented. The results have been comfirmed using proc phreg in sas 9.2, where a CLASS statement is available. Assuming a genotype effect with 3 levels (AA, AB and BB) and a treatment effect with 2 levels (0, 1) (and it seems p-value for the genetic effect across treatment arm will be different if the 2 treatment levels are codes as 1 and 2), a Cox proportional hazards model of survival time as a function of genotype, treatment and genotype-by-treatment interaction is fitted.
Tuesday, September 20, 2011
Friday, September 09, 2011
conditional exact logistic regression
Conditional Logistic Regression:
Exact Conditional Logistic Regression
Exact conditional inference is based on generating the conditional distribution for the sufficient statistics of the parameters of interest. This distribution is called the permutation or exact conditional distribution. One example from sas suggests that we move from conditional regression to exact conditional regression when "you believe the data set is too small or too sparse for the usual asymptotics to hold.":
Taking the stratification into account by "conditioning out" (and not estimating) the stratum-specific intercepts gives consistent and asymptotically normal MLEs for the slope coefficients. If your nuisance parameters are not just stratum-specific intercepts, you can perform an exact conditional logistic regression.The likelihood function to be maximized feels really like that in proc phreg with a discrete option. Each item (for a stratum) in the likelihood function is the conditional probability of observing subject 1 to h who actually had an event conditioning on h event happened (for any h subjects) in the stratum. The concept of sufficient statistic is not explicitly. Conditional logistic regression is called with the strata statement. More examples are from here and here.
Exact Conditional Logistic Regression
Exact conditional inference is based on generating the conditional distribution for the sufficient statistics of the parameters of interest. This distribution is called the permutation or exact conditional distribution. One example from sas suggests that we move from conditional regression to exact conditional regression when "you believe the data set is too small or too sparse for the usual asymptotics to hold.":
- by "sparse", they probably refer to the problem of complete separation of quasi-complete separation, and the exact method is one way to handle this, especially when the firth option is only available in sas 9.1.3;
- by "usual asymptotics", they probably refer to the issue of large-sample asymptotic normality when we do not have a large sample size compared to the number of parameters to estimate.
Thursday, August 25, 2011
Friday, August 19, 2011
use %scan function to cycle through a list of items
There are many examples on line. But I have difficulty to break from the loop following them in one SAS system.
Monday, August 15, 2011
Tuesday, August 02, 2011
Geometric Statistics, geometric CV, intra-subject variation
found from onbiostatistics and here. Additional discussion can be found here.
also for a log normal variable x, there are
also for a log normal variable x, there are
y=log(x)~N(mu, sigma**2) E(x)=exp(mu+sigma**2/2) and sd(x)=exp(mu+sigma**2/2)*sqrt(exp(sigma**2)-1)
cv(x)=sqrt(exp(sigma**2)-1)
Tuesday, July 05, 2011
left join A and B may still lose obs in A
Imaging data set A has the demog information, and data set B has multiple outcome variable and covariate information. B has less subjects than A. To create data set C which is merged from A and B and contains all subjects in A (even the subject is not in B), the following sql statement would be fine
proc sql,However, since B has multiple outcome variables, there can be an intent to filter out one outcome variable like the following
create table c (drop=subjid) as;
select a.*, b.* from a left join b (rename=(pat_id=subjid))
on a.pat_id=b.subjid;
quit;
proc sql,This is wrong, because the field of outcomeName will be missing (and filtered out) for subjects in A but not in B. The solution can be either to filter in the separate data step, or the following sql statement
create table c (drop=subjid) as;
select a.*, b.* from a left join b (rename=(pat_id=subjid))
on a.pat_id=b.subjid
where b.outcomeName='outcome A';
quit;
proc sql,
create table c (drop=subjid) as;
select a.*, b.* from a left join b (rename=(pat_id=subjid) where=(outcomeName='outcome A'))
on a.pat_id=b.subjid;
quit;
Sunday, June 12, 2011
create a design matrix
Usage Note 23217: If I know the model, how can I create the corresponding design matrix (dummy, indicator, or design variables) in a data set?
Friday, May 20, 2011
Thursday, May 19, 2011
Modify the default ODS graphics behavior from statistical procedures
Those ODS graphics in the statistical procedures are predefined template using Graphics template language(GTL). Using ods trace on to find the name of that graph template. Then you have two ways to modify them.
split data
1000 size sample, variable: ID_no from 1-1000
how to split it to two new data sets: one with even numbers for ID_no, the othe one with odds number for ID_no?
data set1 set2;
set yourdata;
if mod(id, 2) = 0 then output set1;
else output set2;
run;
estimate statements
In your LSMEANS statement, add an option e, SAS should print out an estimate statement for you. That will be an excellent example.
write a sas dataset to a flat file
filename myfile "d:\temp\a.txt";
data _null_;
set mydata;
file myfile;
put id 1-3 var2 4-10 var3 11-15 ;
run;
change the default number of digits in P values
proc template;
edit Common.PValue;
format=best16.;
end;
run;
edit Common.PValue;
format=best16.;
end;
run;
download data with URL
FILENAME myurl URL "http://ichart.finance.yahoo.com/table.csv?s=&tic";
DATA &tic;
INFILE myurl FIRSTOBS=2 missover dsd;
format date yymmdd10.;
INPUT Date: yymmdd10. Open High Low Close Volume Adj_Close
;
if date>=today()-180;
RUN;
randomly select 300 samples
Proc Surveyselect
data=old data method=sys/srs/etc. sampsize=300 out=new data;
run;
data=old data method=sys/srs/etc. sampsize=300 out=new data;
run;
superscript to ODS RTF
Google is your best friend:
http://www2.sas.com/proceedings/forum2008/033-2008.pdf
ods rtf file="C:\test.rtf";
ods escapechar= '\';
proc tabulate data=sashelp.class style={cellwidth=100};
class sex;
var age;
table sex,age;
label age= "age\{super 1}";
run;
ods rtf close;
select statement example
SELECT (payclass);
WHEN ('monthly') amt=salary;
WHEN ('hourly') DO;
amt=hrlywage*min(hrs,40);
IF hrs>40 THEN PUT 'Check Timecard';
END; /* end of do */
OTHERWISE PUT 'Problem Observation';
END;
WHEN ('monthly') amt=salary;
WHEN ('hourly') DO;
amt=hrlywage*min(hrs,40);
IF hrs>40 THEN PUT 'Check Timecard';
END; /* end of do */
OTHERWISE PUT 'Problem Observation';
END;
Monday, May 16, 2011
concatenate a selected list of datasets
libname test 'C:\test2';
proc sql;
select 'test.'||memname into : dlist separated BY ' ' from dictionary.tables
where libname='TEST' and memname contains 'selection_rule';
quit;
data one;
set &dlist;
run;
proc sql;
select 'test.'||memname into : dlist separated BY ' ' from dictionary.tables
where libname='TEST' and memname contains 'selection_rule';
quit;
data one;
set &dlist;
run;
Saturday, April 23, 2011
cross product
in R, cross product is defined similar to inner product, but is different in other areas.
Thursday, April 21, 2011
built-in multiple testing mechanism in SAS
proc multtest by Peter H. Westfall and Russell D. Wolfinger
the adjust option in proc mixed, or proc glm
exact statement from PROC NPAR1WAY
the adjust option in proc mixed, or proc glm
exact statement from PROC NPAR1WAY
Saturday, March 19, 2011
Monday, March 14, 2011
SAS SQL union and intersection
http://support.sas.com/documentation/cdl/en/sqlproc/62086/HTML/default/a001361224.htm
Monday, February 28, 2011
“ANYDATE” INFORMATS
http://www.lexjansen.com/wuss/2006/SAS_essentials/ESS-Carroll.pdf
“ANYDATE” INFORMATS
Sometimes you are very lucky in that the raw data you receive contains dates that are the same format. Sometimes you will encounter a messy data file where the dates are all different types of formats. The “anydate” informats are designed to allow you to read in a variety of date forms including:
• DATE, DATETIME, and TIME
• DDMMYY, MMDDYY, and YYMMDD
• JULIAN, MONYY, and YYQ
Using the anydate informats can be particularly useful when you are reading in data that contains a mixture of date forms and you want certain parts of the dates you are reading in. Anydate informats include:
• ANYDTDTE. Extracts the date portion
• ANYDTDTM. Extracts the datetime portion
• ANYDTTME. Extracts the time portion
“ANYDATE” INFORMATS
Sometimes you are very lucky in that the raw data you receive contains dates that are the same format. Sometimes you will encounter a messy data file where the dates are all different types of formats. The “anydate” informats are designed to allow you to read in a variety of date forms including:
• DATE, DATETIME, and TIME
• DDMMYY, MMDDYY, and YYMMDD
• JULIAN, MONYY, and YYQ
Using the anydate informats can be particularly useful when you are reading in data that contains a mixture of date forms and you want certain parts of the dates you are reading in. Anydate informats include:
• ANYDTDTE. Extracts the date portion
• ANYDTDTM. Extracts the datetime portion
• ANYDTTME. Extracts the time portion
Personalized Medicine: A Discussion
http://blogs.forbes.com/matthewherper/2011/02/25/personalized-medicine-a-discussion/
Friday, February 25, 2011
Thursday, February 10, 2011
R code to show UU' != I
U matrix as in SVD X=UDV' is only column orthogonal but not necessarily row orthogonal. U matrix, a N by p matrix, does not contain all the eigenvectors of XX', but only the first p vectors. X is also a N by p matrix, and XX' may have as many as N eigenvectors.
Monday, December 06, 2010
load delimited file with :
From here
If your data is longer than the default length, you need to use informats. For example, date or time values, names, and addresses can be longer than eight characters. In such cases, you either need to add an INFORMAT statement to the DATA step or add informats directly in the INPUT statement. However, when the informats are used in the INPUT statement, care must be taken to honor the function of the delimiter to prevent read errors. If you add informats in an INPUT statement, you must add a colon (:) in front of the informat, as shown in this example:
If your data is longer than the default length, you need to use informats. For example, date or time values, names, and addresses can be longer than eight characters. In such cases, you either need to add an INFORMAT statement to the DATA step or add informats directly in the INPUT statement. However, when the informats are used in the INPUT statement, care must be taken to honor the function of the delimiter to prevent read errors. If you add informats in an INPUT statement, you must add a colon (:) in front of the informat, as shown in this example:
Tuesday, November 23, 2010
Gelman on Bayesian adaptive methods for clinical trials
http://www.stat.columbia.edu/~cook/movabletype/archives/2010/11/bayesian_adapti.html
Wednesday, November 17, 2010
proc means example
proc means data=one nway n mean std median min max noprint;
by treatment;
var variable;
output out=variable n=N mean=mean std=std median=median min=min max=max;
run;
by treatment;
var variable;
output out=variable n=N mean=mean std=std median=median min=min max=max;
run;
proc transpose data=variable out=t&variable;
var n mean std median min max;
run;
Monday, November 15, 2010
Monday, November 08, 2010
Monday, October 11, 2010
run regression in R
Gelman: I really hate to think that there are people out there running regressions in R and not using display() and coefplot() to look at the output.
Wednesday, October 06, 2010
Metric MDS starting from eigen()
this is an exercise to figure the details of MDS, or more specifically, what the coordinates are that are used in plotting. More explanations can be found here.
Tuesday, October 05, 2010
average heterzygosity
from Ascertainment bias in studies of human genome-wide polymorphism
A simple comparison of the HapMap and Perlegen genotype data was done by considering the 5682 windows of 500 kb across the entire genome and, for each window, tallying the SFS and calculating summary statistics such as average heterozygosity for each population and FST for each population pair and for the trio of samples.
The average uncorrected heterozygosity within the three population groups for the HapMap data were 0.281, 0.247, and 0.268 for the Yoruban, Chinese, and European samples. The corresponding figures for the uncorrected Perlegen data are 0.251, 0.211, and 0.229 for the African American, Chinese, and European samples.
histograms are like this.
A simple comparison of the HapMap and Perlegen genotype data was done by considering the 5682 windows of 500 kb across the entire genome and, for each window, tallying the SFS and calculating summary statistics such as average heterozygosity for each population and FST for each population pair and for the trio of samples.
The average uncorrected heterozygosity within the three population groups for the HapMap data were 0.281, 0.247, and 0.268 for the Yoruban, Chinese, and European samples. The corresponding figures for the uncorrected Perlegen data are 0.251, 0.211, and 0.229 for the African American, Chinese, and European samples.
histograms are like this.
Monday, October 04, 2010
2D plotting in SAS
This example shows a regression plot with prediction and confidence limits.
proc sgplot data=sashelp.class; reg x=height y=weight / CLM CLI; run;
Tuesday, September 28, 2010
Wednesday, September 22, 2010
Critical Chain Project Management
In CCPM two durations are estimated for each: an aggressive duration based on how long the task would take given full focus on the task and no problems, and a “safe” duration given full focus and typical variation with each task. The differences between aggressive and “safe” durations for each critical task contribute to a pooled “project buffer” which is adjusted for the project as a whole. The end of the project buffer is the team’s “commit date” and the buffer protects the project from uncertainty
Managers and leadership need to provide clear project and task priorities and a work environment that enables single-task focus, so that each task can be completed quickly and with high quality.
Managers and leadership need to provide clear project and task priorities and a work environment that enables single-task focus, so that each task can be completed quickly and with high quality.
Tuesday, September 21, 2010
meta analysis
Jadad scale to measure methological quality of a clinical trial
tool: CMA
publication standards: quorum (eg) and moose (eg)
proc mixed can be used in meta analysis.
tool: CMA
publication standards: quorum (eg) and moose (eg)
proc mixed can be used in meta analysis.
Wednesday, September 15, 2010
histogram alternatives
Beanplot
hist() + rug(): add one dimensional scatter plot below the histograms
for discrete data: barplot(table(a)), where a is a discrete vector. Or barplot (tabulate (a))
ecdf (Empirical CDF) summarizes the data into something like a smooth CDF line while graphing all the data points.
dhist in ggplot2
more discussion from Gelman here and also here
hist() + rug(): add one dimensional scatter plot below the histograms
for discrete data: barplot(table(a)), where a is a discrete vector. Or barplot (tabulate (a))
ecdf (Empirical CDF) summarizes the data into something like a smooth CDF line while graphing all the data points.
dhist in ggplot2
more discussion from Gelman here and also here
Wednesday, August 25, 2010
main effect of a continous variable
In both proc mixed and glimmix (see the code example below), the "Solution for Fixed Effects" generated by option /SOLUTION for the continous variable 'binary' does not estimate the main/marginal effect when the value of binary changes from 0 to 1. It is because of the interaction term between binary and visit. To find the main/marginal effect, we can code the variable 'binary' as a class/categorical variable and find this LSMEANS.
Wednesday, July 28, 2010
Wednesday, June 23, 2010
Friday, June 18, 2010
Thursday, May 20, 2010
Thursday, May 13, 2010
matching program
Case control matching, probably implementing the idea of propensity scores
R
matchIT
Matching
optmatch optmatch presentation1; optmatch presentation2;
SAS
gmatch,vmatch,dist
a sugi paper 165-29: Performing a 1:N Case-Control Match on Propensity Score
PaperOn the Estimation and Use of Propensity Scores in Case-Control and Case-Cohort Studies
[using] cases plus controls in a case-control study... should give consistent estimates of the true propensity score under the null hypothesis, but not otherwise.
R
matchIT
Matching
optmatch optmatch presentation1; optmatch presentation2;
SAS
gmatch,vmatch,dist
a sugi paper 165-29: Performing a 1:N Case-Control Match on Propensity Score
PaperOn the Estimation and Use of Propensity Scores in Case-Control and Case-Cohort Studies
[using] cases plus controls in a case-control study... should give consistent estimates of the true propensity score under the null hypothesis, but not otherwise.
Tuesday, May 11, 2010
good clinical trial simulation practices
http://cdds.ucsf.edu/research/sddgpreport.php#_Toc457223476
Monday, May 10, 2010
A Draft Sequence of the Neandertal Genome.
http://www.researchblogging.org/post/gotourl/id/213933
http://www.researchblogging.org/post/gotourl/id/213689
http://www.researchblogging.org/post/gotourl/id/213509
http://www.researchblogging.org/post/gotourl/id/213689
http://www.researchblogging.org/post/gotourl/id/213509
Viagra could help women too
http://www.helenjaques.co.uk/blog/2010/viagra-breast-cancer-drug-delivery/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+Insicknessandinhealth+%28In+sickness+and+in+health%29
Researchers in California have shown that sildenafil (Viagra) and a similar drug called vardenafil can improve the delivery of the chemotherapeutic drug Herceptin (trastuzumab) in women with breast cancer that has spread to the brain.
Researchers in California have shown that sildenafil (Viagra) and a similar drug called vardenafil can improve the delivery of the chemotherapeutic drug Herceptin (trastuzumab) in women with breast cancer that has spread to the brain.
PubMed vs. Google Scholar
http://www.researchblogging.org/post/gotourl/id/214477
http://neurodojo.blogspot.com/2010/05/pubmed-vs-google-scholar.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+Neurodojo+%28NeuroDojo%29
http://neurodojo.blogspot.com/2010/05/pubmed-vs-google-scholar.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+Neurodojo+%28NeuroDojo%29
Antidepressants Not Effective for Some Types of Depression
http://www.researchblogging.org/post/gotourl/id/214478
http://brainblogger.com/2010/05/10/antidepressants-not-effective-for-some-types-of-depression/
http://brainblogger.com/2010/05/10/antidepressants-not-effective-for-some-types-of-depression/
Friday, April 30, 2010
Biomarkers and surrogate endpoints
A good article discussing surrogate endpoints and the gap between scientific theory and clinical benefit.
Friday, April 09, 2010
Kendall/Spearman rank correlation
copied from here.
Spearman's rho comes from Charles Spearman's background of psychology, IQ testing and eugenics - it's an easy to calculate robust measure of whether there is an association.
Spearman's rho comes from Charles Spearman's background of psychology, IQ testing and eugenics - it's an easy to calculate robust measure of whether there is an association.
Subscribe to:
Posts (Atom)