Tuesday, November 25, 2014

glimmix models with permutation

Assume a factor coded under a genotypic model enters as a main effect with no interaction:
estimate 'dominant' geno -1 0.5 0.5, 'recessive' geno -0.5 -0.5 1, 'additive' geno 0 0.3333333 0.6666667 / adj=simulate;

The same option may be available in other SAS procedures.
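
To see what each ESTIMATE row above computes, here is a minimal R sketch that applies the same coefficients to a vector of genotype means. The means are made-up numbers, and the level order (reference homozygote, heterozygote, alternate homozygote) is an assumption, since the post does not say how `geno` is ordered.

```r
# Hypothetical genotype means; the order AA, AB, BB is an assumption.
geno_means <- c(AA = 1.0, AB = 1.4, BB = 2.0)

# Coefficient rows copied from the ESTIMATE statement.
L <- rbind(
  dominant  = c(-1, 0.5, 0.5),    # average of AB and BB minus AA
  recessive = c(-0.5, -0.5, 1),   # BB minus average of AA and AB
  additive  = c(0, 1/3, 2/3)      # as written in the post; this row sums to 1, not 0
)

# Each estimate is the coefficient row times the genotype means.
est <- drop(L %*% geno_means)
print(est)
```

This only illustrates the linear combinations; the simulation-based multiplicity adjustment requested by `adj=simulate` is not reproduced here.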

Thursday, November 06, 2014

genome build

from here

What is GRCh37?

Friday, October 31, 2014

R Onto-Tools suite iPathwayGuide

From iPathwayGuide: our pathway analysis approach is based on the novel Impact Analysis method. This paper, published in Genome Research, explains the underlying analytic method and also demonstrates how Impact Analysis avoids the costly false positives generated by other methods such as Over-Representation Analysis. The tool is also available in R as ROntoTools.

Tuesday, October 28, 2014

advanced r wiki

From Hadley, who is like a god in R.
The discussion of environments is really helpful. I am aware of other excellent resources, but I still did not get it from them.

Tuesday, October 14, 2014

graphic output for non-pdf file under unix without X11

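# bitmap() renders through Ghostscript, so no X11 display is needed
bitmap('convergence%03d.png', height=11, width=8, res=600)  # %03d numbers the output pages
plot(post[, selectNode2Plot])  # post and selectNode2Plot are defined elsewhere
dev.off()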

Wednesday, October 01, 2014

sas datasets IO with R

R code lines below are run without SAS installed.
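
As a minimal sketch of one way to do this (not necessarily the approach originally used here), the `foreign` package, a recommended package bundled with most R distributions, can exchange data with SAS without a SAS installation. The data frame, column names and the `mydata.xpt` path are placeholders for illustration.

```r
library(foreign)

# Hypothetical example data; the column names are made up.
df <- data.frame(id = 1:3, dose = c(0.5, 1.0, 2.0))

# Export: write.foreign() writes a plain-text data file plus a SAS program
# that reads it, so SAS is only needed later, on the receiving side.
datafile <- tempfile(fileext = ".csv")
codefile <- tempfile(fileext = ".sas")
write.foreign(df, datafile = datafile, codefile = codefile, package = "SAS")

# Import: read.xport() reads SAS XPORT transport files.
# "mydata.xpt" is a placeholder path, so the call is guarded.
xpt_path <- "mydata.xpt"
if (file.exists(xpt_path)) {
  sas_data <- read.xport(xpt_path)
}
```

For native `.sas7bdat` files, `read_sas()` from the `haven` package is another option, assuming that package is installed.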

Wednesday, September 17, 2014

consistency property of lasso

This is about the meaning of | as in the "irrepresentable condition"  from On Model Selection Consistency of Lasso
It seems the operation is |x| = abs(sum(x)). I have never seen such a norm. But their Section "3.1 Simulation Example 1" makes me think so: 
... in two settings: (a) β1 = 2, β2 = 3; and (b) β1 = −2, β2 = 3. In both settings, X(1) = (X1, X2), X(2) = X3 and through (2), it is easy to get C21 inv(C11) = (2/3, 2/3). Therefore Strong Irrepresentable Condition fails for setting (a) and holds for setting (b).
Their Proof for Corollary 1 makes me think so:
There is a similar use here.
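
A short worked check of the quoted example, as a sketch. The paper states the Strong Irrepresentable Condition element-wise as $\vert C_{21} C_{11}^{-1}\,\mathrm{sign}(\beta_{(1)})\vert \le 1-\eta$ for some $\eta>0$. Here there is a single irrelevant variable (X3), so the product is a scalar, and setting (b) is taken to be β1 = −2, β2 = 3:

$$
\text{(a)}\quad \left\vert \left(\tfrac{2}{3},\tfrac{2}{3}\right)\begin{pmatrix}1\\ 1\end{pmatrix}\right\vert = \tfrac{4}{3} > 1,
\qquad
\text{(b)}\quad \left\vert \left(\tfrac{2}{3},\tfrac{2}{3}\right)\begin{pmatrix}-1\\ 1\end{pmatrix}\right\vert = 0 \le 1-\eta .
$$

So the bars act as an ordinary absolute value on each entry of the sign-weighted product. That looks like abs(sum(x)) only because each entry is itself a signed sum.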

Sunday, August 17, 2014

oracle

Fan and Li (2001), the SCAD estimator, with appropriate choice of the regularization (tuning) parameter, possesses a sparsity property, i.e., it estimates zero components of the true parameter vector exactly as zero with probability approaching one as sample size increases while still being consistent for the non-zero components...In other words, with appropriate choice of the regularization parameter, the asymptotic distribution of the SCAD estimator based on the overall model and that of the SCAD estimator derived from the most parsimonious correct model coincide. Fan and Li (2001) have dubbed this property the “oracle property”....It is well-known for Hodges’ estimator that the maximal (scaled) mean squared error grows without bound as sample size increases (e.g., Lehmann and Casella (1998), p.442), whereas the standard maximum likelihood estimator has constant finite quadratic risk. In this note we show that a similar unbounded risk result is in fact true for any estimator possessing the sparsity property. This means that there is a substantial price to be paid for sparsity even though the oracle property (misleadingly) seems to suggest otherwise. 
In "Modern statistical estimation via oracle inequalities":
Theorem 4.1. The James–Stein estimate obeys .

In other words, the James–Stein estimator is almost as good as the ideal estimator in a mean-squared error sense. The inequality (4.2) is an oracle inequality. An oracle inequality relates the performance of a real estimator with that of an ideal estimator which relies on perfect information supplied by an oracle, and which is not available in practice.

Tuesday, June 10, 2014

mean, median and mode

This page explains that a mean minimizes the (squared) $\ell^2$ norm of the residual, $\min_{m_2} \sum_i (m_2-d_i)^2$,
a median minimizes its $\ell^1$ norm, $\min_{m_1} \sum_i \vert m_1-d_i\vert$, and a mode minimizes the zero "norm" of the residual, $\min_{m_0} \sum_i \vert m_0-d_i\vert^0$, with the convention $0^0=0$. See the Wikipedia page about the median.
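
A small numerical check of the three statements, as a sketch on made-up data: each loss is evaluated on an integer grid and the minimizer is compared with the mean, median and mode. Note that R evaluates `0^0` as 1, so the zero "norm" is coded here as a count of non-zero residuals.

```r
# Made-up data with mean 3, median 2 and mode 1.
d <- c(1, 1, 2, 4, 7)

# Sum of |m - d_i|^p, using the 0^0 = 0 convention when p = 0.
lp_loss <- function(m, p) {
  r <- abs(m - d)
  if (p == 0) sum(r != 0) else sum(r^p)
}

# Candidate locations; a coarse grid is enough for this example.
grid <- 0:8
argmin <- function(p) grid[which.min(sapply(grid, lp_loss, p = p))]

m2 <- argmin(2)  # minimizer of the squared l2 loss
m1 <- argmin(1)  # minimizer of the l1 loss
m0 <- argmin(0)  # minimizer of the zero "norm"
print(c(l2 = m2, l1 = m1, l0 = m0))
```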

from here, it was further explained that
Inder Jeet Taneja’s book draft has a nice survey of the results: if you fix the upper and lower boundary, and maximize entropy, you’ll get the uniform distribution. If you fix the mean and the expected L2 norm (d^2) between the mean and the distribution, maximizing the entropy you’ll get the Gaussian. If you fix the expected L1 norm (|d|) between the mean and the distribution, maximizing the entropy you’ll get the Laplace (also referred to as Double Exponential). Moreover, log(1+d^2) norm will yield the Cauchy distribution – a special case of the standard heavy-tailed Student distribution.
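
A sketch of why each constraint leads to its distribution, using the standard Lagrange-multiplier argument. The notation is introduced here: $d = x-\mu$, $g$ is the constrained function and $\lambda$ its multiplier. Maximizing $-\int p\log p$ subject to $\int p = 1$ and $E[g(d)] = c$ gives

$$
p(x) \propto \exp\{-\lambda\, g(d)\},
$$

so that

$$
g = d^2 \Rightarrow p \propto e^{-\lambda d^2}\ \text{(Gaussian)},\qquad
g = \vert d\vert \Rightarrow p \propto e^{-\lambda \vert d\vert}\ \text{(Laplace)},\qquad
g = \log(1+d^2) \Rightarrow p \propto (1+d^2)^{-\lambda}\ \text{(Cauchy when } \lambda = 1\text{)}.
$$

With only the bounds fixed there is no $g$ term, and $p$ is constant on the interval, which is the uniform distribution.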

Thursday, June 05, 2014

check points when reviewing a genetic screening report


  1. title and footnote: ensure they describe the analysis population, the outcome variable and the class of genetic markers;
  2. eyeball examples:
    • 1 example of an X-chromosome SNP
    • 1 example of an autosomal SNP with only 2 genotypes
    • 1 example of a top association
    • 1 example of a random association
  3. use the Excel output to check the value ranges for each column, paying attention to
    • extreme values
    • empty cells
    • characters indicating missing values: -, NA, 0
  4. cosmetic issues
    • decimal places
    • check line ends for character cut-off
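
Step 3 can be partly automated. The following is a minimal R sketch, assuming the report has been read in with every column as text. The data frame, its column names and the SNP labels are made up for illustration.

```r
# Hypothetical report rows, kept as text so that odd cells survive intact.
report <- data.frame(
  snp     = c("rs1", "rs2", "rs3", "rs4"),
  beta    = c("0.12", "-", "", "35.1"),
  p_value = c("0.04", "NA", "0", "1e-9"),
  stringsAsFactors = FALSE
)

# Strings that may stand for a missing value in the report.
missing_codes <- c("-", "NA", "0")

# Per-column counts of empty cells and missing codes, plus the numeric range.
check_column <- function(x) {
  num <- suppressWarnings(as.numeric(x))
  c(empty        = sum(x == ""),
    missing_code = sum(x %in% missing_codes),
    min          = min(num, na.rm = TRUE),
    max          = max(num, na.rm = TRUE))
}

checks <- sapply(report[c("beta", "p_value")], check_column)
print(checks)
```

The resulting table has one column per report column, so an extreme value such as the 35.1 in `beta` or a literal 0 in `p_value` stands out at a glance.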