An old technical document from SAS captures the relationship among Type I, II and III sums of squares.
Sunday, August 28, 2022
Thursday, January 08, 2015
qq plot interpretation
The SAS documentation gives the references and a few simple rules, and the R community gives more examples around these rules. I think a better way to remember these associations is to understand the mechanism. In this respect, Wikipedia does a better job:
A simple case is where one has two data sets of the same size. In that case, to make the Q–Q plot, one orders each set in increasing order, then pairs off and plots the corresponding values.
So it is almost a scatter plot of the observed values against values simulated from a distribution. Still, it is not always easy to interpret, especially for a small dataset.
Besides, the qqPlot function from the R package 'car' includes CIs for the observed quantiles by default, and the pQQ function from Haplin gives a more specialized Q–Q plot for log10(p-values).
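The equal-size case quoted above can be sketched in base R. This is only an illustration: the sample size, the seed and the names obs, sim and qq are my own choices, not from the quoted sources.

```r
# Two samples of the same size: one "observed", one simulated from the
# reference distribution (both standard normal here, purely for illustration).
set.seed(1)
obs <- rnorm(50)
sim <- rnorm(50)

# Order each set in increasing order, then pair off the corresponding values.
qq <- data.frame(simulated = sort(sim), observed = sort(obs))

# Base R's qqplot() returns the same pairs when the two samples have equal size.
ref <- qqplot(sim, obs, plot.it = FALSE)

# Draw the pairs only in an interactive session.
if (interactive()) {
  plot(qq$simulated, qq$observed, xlab = "simulated", ylab = "observed")
  abline(0, 1)  # points close to this line suggest similar distributions
}
```

With only 50 points the tails usually wander away from the line even when both samples come from the same distribution, which is why a confidence envelope helps.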
Tuesday, November 25, 2014
glimmix models with permutation
Assume a factor coded under a genotypic model as the main effect without interaction
estimate 'dominant'  geno -1 0.5 0.5,
         'recessive' geno -0.5 -0.5 1,
         'additive'  geno 0 0.3333333 0.6666667 / adj=simulate;
The same option may be available in other SAS procedures.
Thursday, November 06, 2014
Friday, October 31, 2014
R Onto-Tools suite iPathwayGuide
From iPathwayGuide: our pathway analysis approach is based on the novel Impact Analysis method. This paper, published in Genome Research, explains the underlying analytic method and also demonstrates how Impact Analysis avoids costly false positive results as generated by other methods such as Over-Representation Analysis. The tool is also available from ROntoTools.
Tuesday, October 28, 2014
advanced r wiki
From Hadley, who is like a god in R.
The discussion of environments is really helpful. I was aware of other excellent resources, but still did not get it from them.
Tuesday, October 14, 2014
Wednesday, October 01, 2014
Wednesday, September 17, 2014
consistency property of lasso
This is about the meaning of |·| in the "irrepresentable condition" from On Model Selection Consistency of Lasso.
It seems the operation is |x| = abs(sum(x)). I have never seen such a norm. But their Section "3.1 Simulation Example 1" makes me think so:
... in two settings: (a) β1 = 2, β2 = 3; and (b) β1 = −2, β2 = 3. In both settings, X(1) = (X1, X2), X(2) = X3 and through (2), it is easy to get C21 inv(C11) = (2/3, 2/3). Therefore Strong Irrepresentable Condition fails for setting (a) and holds for setting (b).
Their proof for Corollary 1 makes me think so:
There is a similar use here.
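The arithmetic behind the quoted simulation example can be checked with a few lines of base R. This is a sketch under assumptions: C21 inv(C11) = (2/3, 2/3) comes from the quote, while taking C11 as the identity matrix (X1 and X2 uncorrelated with unit variance) and β1 = −2 in setting (b) are my reading of the example. The helper name irrep is hypothetical.

```r
# Population covariance blocks for X(1) = (X1, X2) and X(2) = X3.
C11 <- diag(2)                          # assumption: X1, X2 uncorrelated, unit variance
C21 <- matrix(c(2/3, 2/3), nrow = 1)    # so that C21 %*% solve(C11) = (2/3, 2/3)

# Left-hand side of the irrepresentable condition: |C21 inv(C11) sign(beta)|.
# X(2) is a single variable here, so the result is a scalar, i.e. the
# absolute value of a sign-weighted sum.
irrep <- function(beta) abs(drop(C21 %*% solve(C11) %*% sign(beta)))

lhs_a <- irrep(c(2, 3))    # setting (a): both signs positive
lhs_b <- irrep(c(-2, 3))   # setting (b): opposite signs

# The strong condition needs the value to stay below 1.
c(a = lhs_a, b = lhs_b)
```

Setting (a) gives 2/3 + 2/3 = 4/3, which exceeds 1, while setting (b) gives −2/3 + 2/3 = 0, which matches the quoted conclusion that the condition fails for (a) and holds for (b).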
Sunday, August 17, 2014
oracle
Fan and Li (2001), the SCAD estimator, with appropriate choice of the regularization (tuning) parameter, possesses a sparsity property, i.e., it estimates zero components of the true parameter vector exactly as zero with probability approaching one as sample size increases while still being consistent for the non-zero components...
In other words, with appropriate choice of the regularization parameter, the asymptotic distribution of the SCAD estimator based on the overall model and that of the SCAD estimator derived from the most parsimonious correct model coincide. Fan and Li (2001) have dubbed this property the “oracle property”....
It is well-known for Hodges’ estimator that the maximal (scaled) mean squared error grows without bound as sample size increases (e.g., Lehmann and Casella (1998), p.442), whereas the standard maximum likelihood estimator has constant finite quadratic risk. In this note we show that a similar unbounded risk result is in fact true for any estimator possessing the sparsity property. This means that there is a substantial price to be paid for sparsity even though the oracle property (misleadingly) seems to suggest otherwise.
In "Modern statistical estimation via oracle inequalities":
Theorem 4.1. The James–Stein estimate obeys .
In other words, the James–Stein estimator is almost as good as the ideal estimator in a mean-squared error sense. The inequality (4.2) is an oracle inequality. An oracle inequality relates the performance of a real estimator with that of an ideal estimator which relies on perfect information supplied by an oracle, and which is not available in practice.
Tuesday, June 10, 2014
mean, median and mode
This page explains that a mean minimizes the $L_2$ norm of the residuals:
$\arg\min_c \sum_i (x_i - c)^2$,
a median minimizes their $L_1$ norm:
$\arg\min_c \sum_i |x_i - c|$,
and a mode minimizes the zero norm of the residuals, namely
$\arg\min_c \sum_i |x_i - c|^0$ (with the convention $0^0 = 0$).
See the Wikipedia page about median.
From here, it was further explained that
Inder Jeet Taneja’s book draft has a nice survey of the results: if you fix the upper and lower boundary, and maximize entropy, you’ll get the uniform distribution. If you fix the mean and the expected L2 norm (d^2) between the mean and the distribution, maximizing the entropy you’ll get the Gaussian. If you fix the expected L1 norm (|d|) between the mean and the distribution, maximizing the entropy you’ll get the Laplace (also referred to as Double Exponential). Moreover, log(1+d^2) norm will yield the Cauchy distribution – a special case of the standard heavy-tailed Student distribution.
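The mean, median and mode statements above can be checked numerically with a brute-force grid search in base R. This is a rough sketch: the toy data, the grid step and the tolerance used to decide whether a residual is zero are all my own choices.

```r
# Toy data with a clear mode (2) and an outlier that pulls the mean away.
x <- c(1, 2, 2, 3, 10)

# Candidate centres on a fine grid covering the data range.
grid <- seq(min(x), max(x), by = 0.01)

# Loss of each candidate under the three "norms" of the residuals.
l2 <- sapply(grid, function(c0) sum((x - c0)^2))        # squared error
l1 <- sapply(grid, function(c0) sum(abs(x - c0)))       # absolute error
l0 <- sapply(grid, function(c0) sum(abs(x - c0) > 1e-9)) # count of non-zero residuals

best_l2 <- grid[which.min(l2)]  # should sit next to mean(x)
best_l1 <- grid[which.min(l1)]  # should sit next to median(x)
best_l0 <- grid[which.min(l0)]  # should sit next to the most frequent value

round(c(L2 = best_l2, L1 = best_l1, L0 = best_l0), 2)
```

The grid search is deliberately naive; it only confirms that the three minimizers land on the mean, the median and the mode of this small sample.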
Thursday, June 05, 2014
check points when reviewing a genetic screening report
- title and footnote, ensuring they describe the analysis population, the outcome variable and the class of genetic markers;
- eyeball examples:
  - 1 example of an X-chromosome SNP
  - 1 example of an autosomal SNP with only 2 genotypes
  - 1 example of a top association
  - 1 example of a random association
- use the Excel output to check the value ranges for each column, paying attention to
  - extreme values
  - empty cells
  - characters indicating missing values: -, NA, 0
- cosmetic issues
  - decimal places
  - check line ends for character cut-off
Saturday, December 21, 2013
interaction coding
In SAS, we have:
Nested Effects
Nested effects are generated in the same manner as crossed effects. Hence, the design columns generated by the following statements are the same (but the ordering of the columns is different):
model y=a b(a);   (B nested within A)
model y=a a*b;    (omitted main effect for B)
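R's formula interface behaves analogously, which gives a quick way to see the point. The sketch below uses two hypothetical two-level factors and R's default treatment contrasts, so the columns are not the same as the less-than-full-rank columns SAS generates; only the equivalence between the nested and the crossed-without-main-effect specification is illustrated.

```r
# Hypothetical toy factors: 2 levels of a, 2 levels of b, fully crossed.
a <- factor(rep(c("a1", "a2"), each = 4))
b <- factor(rep(c("b1", "b2"), times = 4))

# b nested within a, written with R's nesting operator.
nested <- model.matrix(~ a / b)

# Crossed effect with the main effect of b omitted.
crossed <- model.matrix(~ a + a:b)

# R expands a/b to a + a:b, so the two design matrices coincide.
same_design <- isTRUE(all.equal(nested, crossed))

# Both span the same space as the full factorial model a*b.
rank_nested <- qr(nested)$rank
rank_full   <- qr(model.matrix(~ a * b))$rank

c(same_design = same_design, rank_nested = rank_nested, rank_full = rank_full)
```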
Thursday, December 19, 2013
Tuesday, October 29, 2013
chain information impacts summary statistics from coda
It seems chain information is not critical for the mean, SD and quantiles, and not critical even for the time-series SE once the chains have converged.
library(coda)
x1 <- cbind(rnorm(100, 1, 1), rnorm(100, 1, 1), rnorm(100, 1, 1), rnorm(100, 1, 1), cumsum(rnorm(100, 1, 1)))
x2 <- cbind(rnorm(100, 1, 10), rnorm(100, 10, 1), rnorm(100, 10, 10), rnorm(100, 1, 1), cumsum(rnorm(100, 1, 1)))
colnames(x1) <- colnames(x2) <- c('unequalSD', 'unequalMean', 'unequalboth', 'equal', 'autocorr')
o1 <- mcmc.list(mcmc(x1), mcmc(x2)) # maintain chain information; to add more chains: o1[[3]] <- mcmc(x3)
summary(o1)
summary(mcmc(rbind(x1, x2))) # no chain information
Friday, September 06, 2013
Quickly insert all sheet names in cells with VBA
from here:
If you want to insert all sheet names in cells, a VBA macro is a good choice.
Step 1: Hold down the Alt + F11 keys in Excel, and it opens the Microsoft Visual Basic for Applications window.
Step 2: Click Insert > Module, and paste the following macro in the Module Window.
VBA for inserting all worksheets' names in cells:
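Sub SheetNames()
    Dim i As Long
    ' Insert a new first column to hold the names
    Columns(1).Insert
    ' Write the name of each sheet into column A, one per row
    For i = 1 To Sheets.Count
        Cells(i, 1) = Sheets(i).Name
    Next i
End Sub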
Friday, June 14, 2013
seriation
Seriation, also the name of an R package, is
to arrange all objects in a set in a linear order given available data and some loss or merit function in order to reveal structural information. Together with cluster analysis and variable selection, seriation is an important problem in the field of combinatorial data analysis.
This post discussed the application of seriation to plotting binary matrices, together with other clustering algorithms.
Saturday, May 18, 2013
Prioritizing GWAS Results
This review paper mentioned 3 analytic methods for prioritizing GWAS results (for follow-up studies):
- Meta-analysis seeks to pool information from multiple GWAS to increase the chances of finding true positives among the false positives and provides a way to combine associations across GWAS, even when the original data are unavailable.
- Testing for epistasis within a single GWAS study can identify the stronger results that are revealed when genes interact.
- Pathway analysis of GWAS results is used to prioritize genes and pathways within a biological context. Following a GWAS, association results can be assigned to pathways and tested in aggregate with computational tools and pathway databases.
Friday, May 17, 2013
Wednesday, May 15, 2013
The default plot function from glmnet is quite basic and plain. The following function tries to improve the plots:
# this function is to achieve the same purpose as plot.glmnet()
# with the following enhancements:
# 1) show variable number instead of column number
# 2) show the value of lambda that gives the smallest cross validation error with a solid line
# 3) show the largest lambda such that the error is within 1 se of the minimal cross validation error with a dashed line
# input variables: a glmnet object from glmnet(), a cv.glmnet object from cv.glmnet(), and a possible plot title