%% \VignetteIndexEntry{gap: genetic analysis package} %% \Vignettekeywords{genetic data analysis} %% \VignettePackage{gap} \documentclass[article,shortnames]{jss} \usepackage{thumbpdf,graphicx} \author{Jing Hua Zhao\\MRC Epidemiology Unit} \Plainauthor{Jing Hua Zhao} \title{\pkg{gap}: Genetic Analysis Package} \Plaintitle{gap: Genetic Analysis Package} \Abstract{A preliminary attempt at collecting tools and utilities for genetic data as an \proglang{R} package called \pkg{gap} is described. Genomewide association is then described as a specific example, linking the work of \cite{risch96}, \cite{ll97} for family-based and population-based studies, and the counterpart for case-cohort design established by \cite{cz04}. Analysis of staged design as outlined by \cite{skol06} and associate methods are discussed. The package is flexible, customizable, and should prove useful to researchers especially in its application to genomewide association studies.} \Keywords{genetic data analysis, genomewide association, \proglang{R}} \Plainkeywords{genetic data analysis, genomewide association, R} \Volume{23} \Issue{8} \Month{December} \Year{2007} \Submitdate{2007-01-03} \Acceptdate{2007-06-08} \Address{ Jing Hua Zhao\\ MRC Epidemiology Unit\\ Institute of Metabolic Science\\ Addenbrooke's Hospital\\ Hills Road\\ Cambridge CB2 0QQ\\ United Kingdom\\ E-mail: \email{jinghua.zhao@mrc-epid.cam.ac.uk}\\ } \begin{document} \section{Introduction} Approaches to understanding the genetic basis of human diseases have been widely discussed, e.g., \cite{morton83}, \cite{khoury93}, \cite{thomas04}. Methods include the assessment of familial aggregation for heritability, identification of major gene effect, study of cosegregation of genetic marker with putative disease-predisposing loci in the so-called linkage studies and association studies in search of frequency differences between cases and controls and/or correlation between genotype and phenotype as a quantitative trait. Recently, owing to the availability of large number of genetic variants and particularly single nucleotide polymorphisms (SNPs), attention has focused on association designs including both families and unrelated individuals from general populations. Three initiatives of interest are: \begin{enumerate} \item The hapmap project (\url{http://www.hapmap.org/}), a partnership of scientists and funding agencies from Canada, China, Japan, Nigeria, the United Kingdom and the United States to develop a public resource that will help researchers find genes associated with human disease and response to pharmaceuticals. \item The Wellcome Trust Case Control Consortium (WTCCC, \url{http://www.wtccc.org.uk/}), a collaboration of human geneticists who analyse thousands of DNA samples from patients suffering with different diseases to identify common genetic variations for each condition. It is hoped that by identifying these genetic signposts, researchers will be able to understand which people are most at risk, and also produce more effective treatments. The WTCCC searches for the genetic signposts for tuberculosis, coronary heart disease, type 1 diabetes, type 2 diabetes, rheumatoid arthritis, Crohn's disease, bipolar disorder and hypertension. The research is conducted at a number of institutes throughout the UK, including the Wellcome Trust Sanger Institute, Cambridge University and Oxford University. \item The genetic association information network (GAIN, \url{http://www.fnih.org/GAIN2/home_new.shtml}), a public-private partnership of the Foundation for the National Institutes of Health, Inc., which include corporations, private foundations, advocacy groups, concerned individuals, and the National Institutes of Health. This initiative will take the next step in the search to understand the genetic factors influencing risk for complex human diseases. \end{enumerate} Following successes in localization of Mendelian diseases such as Huntington's disease \citep{guesella83}, cystic fibrosis \citep{tsui85}, some recent successes in non-Mendelian diseases come from breast cancer \citep{hall90}, macular degenration \citep{klein05}, and non-insulin-dependent (Type-2) diabetes \citep{grant06}. Here, we will focus specifically on methodological issues concerning the design and analysis of genomewide association studies as largely foreseen by the seminal paper of \cite{risch96}. As it dealt with the power of association studies using family-based designs, an immediate argument for case-control designs as an alternative to family-based designs was made by \cite{ll97} on the basis of (1). its ease of implementation; (2). the increased prospects for extension from established epidemiological cohorts; (3). for late onset diseases such as Type-2 diabetes and hypertension, the difficulty of typing parents for family-based studies; (4). the issue that, for equivalent power, the number of individuals genotyped needs to be doubled for linkage, tripled for singleton, and quadrupled for sib-pair designs, assuming both parents are genotyped in an affected sibling study. Concerns over cost-efficiency have led to the adoption of staged design in some studies. In the most popular two-staged design, the first stage uses only a proportion of individuals and are genotyped at all SNPs, among which a percentage of SNPs showing statistical significance is carried over to the second stage. For analysis of two-staged design, \cite{skol06} recognized that joint analysis could be more powerful. The high cost of genotyping has motivated researchers to seek generic controls for a range of traits as in a case-cohort design, in which a small random sample of the whole cohort and all the diseased subjects are used. It is possible the subcohort also contains some cases but it is fully representative of the population and can be used in conjunction with a range of case definitions. The merits of such a design in genetic association studies have only recently been recognized \citep{langholz99,manolio06}. Some of the terms used in this paper can be found in the paper by \cite{bgm06}, and more generally in a review paper by \cite{ea06} on recent developments in statistical genetics, and in a tutorial specifically for genetic association analysis by \cite{balding06}. Analysis for population association studies generally involves preliminary analyses such as quality control, Hardy-Weinberg equilibrium tests, examination of linkage disequilibrium and recombination, to be followed by tests of association for single and/or multiple SNPs, both may involve case-control or binary phenotype or continuous outcomes. Further issues involve handling of large data, multiple testing, among others. The implementation of analytical approaches to these problems has so far scattered and not integrated with established software systems. Preliminary reviews have shown that \proglang{R} is a strong alternative which offers many desirable facilities and can be used in conjunction with other software systems \citep{zt06b}. It is possible that in the near future, many tools for genetic data analysis will be available on these software systems. We have recently carried out a study of obesity using Affymetrix (\url{http://www.affymetrix.com/}) 500K and Illumina (\url{http://www.illumina.com/}) 317K systems following an earlier pilot using Perlegen (\url{http://www.perlegen.com/}) platform with about 250K SNPs. The samples were based on European Prospective Investigation of Cancer (EPIC) Norfolk study (\url{http://www.srl.cam.ac.uk/epic/}). Our particular concern is cost-efficiency, not only because the experiment is expensive but we wish to take full advantage of it. We adopted a two-stage case-cohort design which is advantageous in cost-efficiency over the standard case-control design. While seeking for appropriate methods for joint analyses, we have recast the problem within the missing data framework. We believe our approach will be invaluable to other colleagues. Here, I describe a genetic analysis package, \pkg{gap}, created and maintained by the author and is available from the Comprehensive \proglang{R} Archive Network (CRAN, \url{http://CRAN.R-project.org/}), taking advantages of the portable environment of \proglang{R} for data management, analysis, graphics and object-oriented programming. The functions implemented for genetic association analysis include Hardy-Weinberg equilibrium tests, measures of linkage disequilibrium between SNPs or multiallelic markers and haplotype analysis. I will give a brief overview of the package, and a specific example concerning design and analysis of genetic association, which was used in the design of the case-cohort study aforementioned. I will also use a dataset from our case-control association study of Type-2 diabetes with chromosome 20, to illustrate joint analysis within \proglang{R}. It is clear that many statistical issues can be handled with ease and used in conjunction with other packages available from CRAN. The calculation as in some original work is also given for reference, with results not seen in the genetic analysis literature. At the end I will provide a brief summary. \section{Overview of implemented methods} This package was designed in 2003 to integrate some programs in \proglang{C}/\proglang{Fortran} and \proglang{SAS} \citep{SASstat} I have written or used over the years, by taking advantage of the ability of \proglang{R} to use foreign language routines in \proglang{C}/\proglang{Fortran} or its facilities similar to \proglang{SAS}. The recent implementations are more native to \proglang{R}. The collection therefore somewhat reflects the evolution of (or more exactly my appreciation of) statistical genetics (or genetic epidemiology) over years. The description of the main functions is given in Table~\ref{pkggap}, while a list of data sets is given in Table~\ref{dataset}. They can be conveniently obtained with \proglang{R} command \code{library(help = "gap")} once \pkg{gap} is installed. \begin{table} \begin{center} \begin{tabular}{ll} \hline Function name & Description\\ \hline \texttt{BFDP} & Bayesian false-discovery probability \\ \texttt{FPRP} & False-positive report probability\\ \texttt{bt} & Bradley-Terry model for contingency table\\ \texttt{ccsize} & Power and sample size for case-cohort design\\ \texttt{chow.test}& Chow's test for heterogeneity in two regressions\\ \texttt{fbsize} & Sample size for family-based linkage and association design\\ \texttt{gc.em} & Gene counting for haplotype analysis\\ \texttt{gcontrol} & genomic control\\ \texttt{gcp} & Permutation tests using \pkg{GENECOUNTING}\\ \texttt{genecounting} & Gene counting for haplotype analysis\\ \texttt{gif} & Kinship coefficient and genetic index of familiality\\ \texttt{hap} & Haplotype reconstruction\\ \texttt{hap.em} & Gene counting for haplotype analysis\\ \texttt{hap.score}& Score statistics for association of traits with haplotypes\\ \texttt{htr} & Haplotype trend regression\\ \texttt{hwe} & Hardy-Weinberg equilibrium test for a multiallelic marker\\ \texttt{hwe.hardy} & Hardy-Weinberg equilibrium test using MCMC\\ \texttt{kbyl} & Linkage disequilibrium statistics for two multiallelic loci\\ \texttt{kin.morgan} & Kinship matrix for simple pedigree\\ \texttt{makeped} & A function to prepare pedigrees in post-\pkg{MAKEPED} format\\ \texttt{mia} & Multiple imputation analysis for hap\\ \texttt{mtdt} & Transmission/disequilibrium test of a multiallelic marker\\ \texttt{muvar} & Means and variances under 1- and 2- locus (biallelic) QTL model\\ \texttt{pbsize} & Power for population-based association design\\ \texttt{pedtodot} & Converting pedigree(s) to \texttt{dot} file(s)\\ \texttt{pfc} & Probability of familial clustering of disease\\ \texttt{pfc.sim} & Probability of familial clustering of disease\\ \texttt{pgc} & Preparing weight for \pkg{GENECOUNTING}\\ \texttt{plot.hap.score} & Plot haplotype frequencies versus haplotype score statistics\\ \texttt{print.hap.score}& Print a \texttt{hap.score} object\\ \texttt{s2k} & Statistics for 2 by K table\\ \texttt{tbyt} & Linkage disequilibrium statistics for two SNPs\\ \texttt{tscc} & Power calculation for two-stage case-control design\\ \texttt{twinan90} & Classic twin models\\ \texttt{whscore} & Whittemore-Halpern scores for allele-sharing\\ \hline \end{tabular} \caption{\label{pkggap} A list of functions in \pkg{gap}.} \end{center} \end{table} \begin{table} \begin{center} \begin{tabular}{ll} \hline Data sets & Description\\ \hline \texttt{aldh2} & ALDH2 markers and alcoholism\\ \texttt{apoeapoc} & APOE/APOC1 markers and schizophrenia\\ \texttt{cf} & Cystic fibrosis data\\ \texttt{crohn} & Crohn's disease data\\ \texttt{fa} & Friedreich ataxia data\\ \texttt{fsnps} & A case-control data involving four SNPs with missing genotype\\ \texttt{hla} & HLA markers and schizophrenia\\ \texttt{mao} & A study of Parkinson's disease and MAO gene\\ \texttt{nep499} & A study of Alzheimer's disease with eight SNPs and APOE\\ \texttt{snca} & A study of Parkinson's disease and SNCA markers\\ \hline \end{tabular} \caption{\label{dataset} A list of data sets in \pkg{gap}.} \end{center} \end{table} Some of the functions are highlighted as follows: \begin{itemize} \item \texttt{BFDP}: Bayesian false-discovery probability according to \cite{wakefield07}. \item \texttt{FPRP}: False-positive report probability according to \cite{wacholder04}. \item \texttt{gif}: kinship coefficient and genetic index of familiality according to \cite{gholamic94}. \item \texttt{genecounting}, \texttt{gcp}, \texttt{gc.em}, \texttt{pgc}: expectation-maximization (EM) method to infer haplotype and case-control haplotype association similar to \proglang{SAS}/\pkg{Genetics} \citep{sasgenetics} as reported in \cite{zhao02}, \cite{zhao04}. \item \texttt{hap}, \texttt{hap.em}, \texttt{mia}: a multiallelic version of progressive EM algorithm for haplotype inference as reported in \cite{zhao04}. \item \texttt{tbyt}, \texttt{kbyl}: linkage disequilibrium statistics between two biallelic or multiallelic markers, including standard error for $D'$ as reported in \cite{zhao04}. \item \texttt{pfc}, \texttt{pfc.sim}: probability of familial clustering of diseases according to \cite{yz02}. \item \texttt{htr}: haplotype-trend regression as accorded to \cite{zaykin02}. \item \texttt{hap.score}, \texttt{print.hap.score}, \texttt{plot.hap.score}: score statistics for haplotype-trait association with revised algorithms according to \cite{srt02}. \item \texttt{pbsize}, \texttt{fbsize}, \texttt{ccsize}: sample size/power for population-based, family-based case-control designs, or population-based case-cohort design, according to \cite{ll97}, \cite{risch96}, \cite{cz04}. \item \texttt{s2k}: statistics from $2\times K$ table according to \cite{hirotsu01}. \item \texttt{hwe}, \texttt{hwe.hardy}: Hardy-Weinberg equilibrium test for multiallelic markers, the latter implements a Markov chain Monte Carlo (MCMC) algorithm according to \cite{guo92}. \item \texttt{makeped}: a utility to prepare for post-\pkg{MAKEPED} format file as required by \pkg{LINKAGE} \citep{lathrop84} or other computer programs, a version distributed with \pkg{LINKAGE} by Wentian Li. \item \texttt{pedtodot}: a utility to produce \pkg{Graphviz} \citep{graphviz} command file based on a \proglang{GAWK} script by David Duffy. \item \texttt{tscc}: power of two-stage case-control design under several models, similar to \cite{skol06}. \item \texttt{muvar}, \texttt{whscore}: utility functions to obtain mean and variance under two-locus model, and score \citep{whittemore94b}. \end{itemize} A brief summary of a given function, including appropriate reference(s), can be obtained with \code{help(function-name, package = "gap")} within \proglang{R}. There are a number of example datasets. For instance, the cystic fibrosis data \citep{kerem89} has been used in many papers on fine-mapping. \section{Example: Study design and analysis of genetic association} Since the range of problems to be tackled by \pkg{gap} is quite broad, I now give specific examples of how some functions have been developed to facilitate our design and analysis of genetic association. The three popular designs outlined earlier are shown in Figure~\ref{3design}. \begin{figure} \begin{center} \includegraphics{figure1.jpg} \caption{\label{3design} Three common genetic association designs involving unrelated individuals (left), nuclear families with affected singletons (middle) and affected sib-pairs (right). Males and females are denoted by squares and circles with affected individuals filled with black and unaffect individuals being empty.} \end{center} \end{figure} Because of the importance of work by \cite{risch96} and \cite{ll97}, functions \texttt{pbsize} and \texttt{fbsize} have been written to implement their methods and correct a programming error in \cite{risch96}. The table to be obtained could be used as a quick reference and the code be tailored to particular designs. Power for case-cohort design established by \cite{cz04} is given in function \texttt{ccsize}. The method by \cite{skol06} for two-stage design has been implemented in function \texttt{tscc}. \subsection{Some preliminaries} A widely used general model consists of a disease susceptible locus with two alleles (A and a) and population frequencies $p$, $q=1-p$, and penetrances as $f_0$, $f_1$ and $f_2$ (i.e., the conditional probability of getting the disease, the subscripts 0, 1 and 2 are copies of the disease-predisposing allele A). Hardy-Weinberg equilibrium states that the distribution of the three genotypes, aa, Aa, and AA are according to $(p+q)^2$, so that the population prevalence of the disease $K = p^2f^2+2pqf_1+q^2f_0$. Other statistics, which are often used, are additive $V_A = 2pq[p(f_2-f_1)+q(f1-f_0)]^2$ and dominance $V_D = p^2q^2(f_2-2f_1+f_0)^2$, variances, respectively. It is also common to specify further the relationship between the three genotypes. As for multiplicative models to be considered here, the genotypic relative risk (GRR) is $\gamma$ and $\gamma^2$ for genotypes Aa and AA compared to aa as baseline, the population disease prevalence, the additive and dominance variances are therefore $K=p^2\gamma^2+2pq\gamma+q^2=(p\gamma+q)^2$, $V_A=2pq(\gamma-1)^2(p\gamma+q)^2$ and $V_D=p^2q^2(\gamma-1)^4$, respectively. These can be used to define offspring/sibling relative risks, which are the probability ratios of a offspring/sibling of an affected parent/child being affected relative to the general population and they are shown to be $\lambda_O=1+{1/2V_A}/{K^2}$ and $\lambda_S=1+{(V_A/2+V_D/4)}/{K^2}$ so $\lambda_O=1+w$ and $\lambda_S=(1+1/2w)^2$, where $w={(pq(\gamma-1)^2)}/{(p\gamma+q)^2}$. Linkage and association designs using family data were the state-of-the-art designs for localising disease-predisposing loci. A key concept relates to genes shared by relatives from common ancestor and is called identity by descent (IBD), which is associated with a model-free approach of linkage whereby allele-sharing between affected members in a pedigree is compared to test for linkage between a marker locus and a disease locus. The simplest allele-sharing method uses affected sib-pairs. Assuming the recombination fraction between the disease locus and a marker to be $\theta$, and defining $\Psi = \theta^2+(1-\theta)^2$, \cite{suarez78} derived the IBD probabilities for affected sib-pair, $P(IBD=0) = {1}/{4}- [(\Psi-1/2)V_A+(2 \Psi - \Psi^2 - 3/4)V_D] / [4(K^2 + V_A/2 + V_D/4)]$, $P(IBD=1) = {1}/{2}- [2(\Psi^2-\Psi+1/4)V_D]/ [4(K^2 + V_A/2 + V_D/4)]$, $P(IBD=2) = {1}/{4}+ [(\Psi-1/2)V_A+(\Psi^2 - 1/4)V_D] / [4(K^2 + V_A/2 + V_D/4)]$, respectively. Under no linkage, the probabilities of affected sib pair sharing 0, 1, and 2 alleles IBD are 1/4, 1/2, and 1/4. This result can be used to form a nonparametric test for linkage. The corresponding expressions under multiplicative model can also be obtained, where probabilities of siblings sharing none or one allele by descent is $z_0=1/(4\lambda_S)$ and $z_1={\lambda_O}/(2\lambda_S)$, the nonshared probability is $Y=1-z_1/2-z_0=1-{(\lambda_O+1)}/(4\lambda_S)={(1+w)}/{(2+w)}$. When a candidate or dense collection of markers is available, one can use nuclear families in which affected offsprings and their parents are genotyped, the transmitted versus nontransmitted alleles from parents to offsprings are compared in the so-called transmission disequilibrium test (TDT), which can be based on nuclear families with affected singletons only. In the following standard results related to normal distribution will be used. For a set of $N$ independent identically distributed random variables $B_i$ with mean and variance being 0 and 1 under the null hypothesis, $\mu$ and $\sigma^2$ under the alternative hypothesis, the statistic $\sum_{i=1}^N B_i/N$ has mean 0 and variance 1 under the null but mean $\sqrt{N}\mu$ and variance $\sigma^2$ under the alternative. The sample size ($N$) for a given significance level $\alpha$ and power $1-\beta$ can be estimated by $(Z_\alpha-\sigma Z_{1-\beta})^2/\mu^2$. In the actual calculation below, the type I error rate ($\alpha$) and type II error rate ($\beta$) are $5\times 10^{-8}$ and 0.2, respectively. \subsection{Power/sample size for family and case-control/case-cohort designs} We recall some results on affected sib-pair linkage analysis, and define the alleles shared and nonshared from the $i$th parent as a random variable $B_i, i=1,\ldots,N$, scoring $1$ and $-1$. The mean ($\mu$) and variance ($\sigma^2$) of $B_i$ are 0 and 1 under the null hypothesis as the shared and nonshared each has probability 0.5, and $2Y-1$ and $4Y(1-Y)$ under the alternative. Assuming sharing of alleles from both parent to be independent, the required sample size ($N$) for affected sib-pair under $\theta=0$ and no linkage disequilibrium is ${(Z_\alpha-\sigma Z_{1-\beta})^2}/{2\mu^2}$, where $Y$ and $w$ are defined as before. As with TDT, we assume that the disease locus and a nearby locus are in complete disequilibrium, the number of transmissions of allele A are scored from heterozygous parents, where the probability ($h$) of a parent of an affected child being heterozygous is given by ${pq(\gamma+1)}/{(p\gamma+q)}$ and ${pq(\gamma+1)^2}/{[2(\gamma p+q)^2+pq(\gamma-1)^2]}$ for singletons and sib-pairs, respectively. For singletons, a random variable $B_i$ is defined taking values $1/\sqrt{h}$ if parent is heterozygous and transmits A, $0$ if parent is homozygous, $-1/\sqrt{h}$ if parent is heterozygous and transmits a. The mean and variance of $B_i$ are 0 and 1 under the null hypothesis, $\sqrt{h}{(\gamma-1)}/{(\gamma+1)}$ and $1-h[{(\gamma-1)}/{(\gamma+1)}]^2$ under the alternative. When sib-pairs instead of singletons are used in TDT analysis, the same formula for sample size calculation can be applied and the required number of families is half the expected number since there are two independent affected sibs. The results of \code{example("fbsize", package = "gap")} are shown Table~\ref{asptdt}. Column $N_L$ contains the correct calculation corresponding to the original paper. The Alzheimer's disease model is based on \cite{sph97}. \begin{table}[h] \begin{center} \begin{tabular}{ccccccccccc} \hline &&\multicolumn{2}{c}{Linkage}&&\multicolumn{4}{c}{Association} \\ \cline{3-4} \cline{6-8} $\gamma$& $p$ &$Y$ & $N_{L}$ & $P_A$& $H_1$ & $N_{\mathit{tdt}}$ & $H_2$ &$N_{\mathit{asp}/\mathit{tdt}}$&$\lambda_o$ & $\lambda_s$ \\ \hline \\ 4.00 & 0.01 & 0.520 & 6400 & 0.800 & 0.048 & 1098 & 0.112 & 235 & 1.08 & 1.09\\ & 0.10 & 0.597 & 277 & 0.800 & 0.346 & 151 & 0.537 & 48 & 1.48 & 1.54\\ & 0.50 & 0.576 & 445 & 0.800 & 0.500 & 104 & 0.424 & 62 & 1.36 & 1.39\\ & 0.80 & 0.529 & 3023 & 0.800 & 0.235 & 223 & 0.163 & 162 & 1.12 & 1.13\\ \\ 2.00 & 0.01 & 0.502 & 445839 & 0.667 & 0.029 & 5824 & 0.043 & 1970 & 1.01 & 1.01\\ & 0.10 & 0.518 & 8085 & 0.667 & 0.245 & 696 & 0.323 & 265 & 1.07 & 1.08\\ & 0.50 & 0.526 & 3752 & 0.667 & 0.500 & 340 & 0.474 & 180 & 1.11 & 1.11\\ & 0.80 & 0.512 & 17904 & 0.667 & 0.267 & 640 & 0.217 & 394 & 1.05 & 1.05\\ \\ 1.50 & 0.01 & 0.501 & 6942837 & 0.600 & 0.025 & 19321 & 0.031 & 7777 & 1.00 & 1.00\\ & 0.10 & 0.505 & 101898 & 0.600 & 0.214 & 2219 & 0.253 & 941 & 1.02 & 1.02\\ & 0.50 & 0.510 & 27041 & 0.600 & 0.500 & 950 & 0.490 & 485 & 1.04 & 1.04\\ & 0.80 & 0.505 & 101898 & 0.600 & 0.286 & 1663 & 0.253 & 941 & 1.02 & 1.02\\ \\ \multicolumn{3}{c}{Alzheimer's:}\\ \\ 4.50 & 0.15 & 0.626 & 163 & 0.818 & 0.460 & 100 & 0.621 & 37 & 1.67 & 1.78\\ \\ \hline \end{tabular} \caption{\label{asptdt} Comparison of linkage and association in nuclear families required for identification of disease gene: $\gamma$=genotypic risk ratio; $p$=frequency of disease allele A; $Y$=probability of allele sharing; $N_{L}$=number of ASP families required for linkage; $P_A$=probability of transmitting disease allele A; $H_1$, $H_2$=proportions of heterozygous parents; $N_{\mathit{tdt}}$=number of family trios; $N_{\mathit{asp}/\mathit{tdt}}$=number of ASP. families} \end{center} \end{table} It turns out that with $\gamma\le 2$, the expected marker-sharing only marginally exceeds 50\% for any allele frequency ($p$). The use of linkage would need large sample size, but direct tests of association with a disease locus itself can still be quite strong although it may involve large amount of statistical testing of associated alleles. Clearly, it is most favorable for diseases that are relatively common, which has important implications for complex traits. Now we consider the case-control design with a statistic directly testing association between marker and disease. Following \cite{ll97} and supposing we have a randomly ascertained population sample, the frequencies of the three disease genotypes AA, Aa and aa in cases are $\pi\gamma^2 p^2$, $2\pi\gamma pq$, and $\pi q^2$, respectively, where $\pi$ is the ``baseline'' probability that an individual with $aa$ genotype being affected. Similarly, the three frequencies in controls are $(1-\pi\gamma^2)p^2$, $2(1-\pi\gamma)pq$ and $(1-\pi)q^2$. A unit $\chi^2$ statistic can be constructed using Table~\ref{exprop} as $X^2=\sum (O-E)^2/E$, where $E$s indicate expected unit frequencies ($\pi p(\gamma p+q)^2$, $\pi q(\gamma p+q)^2$, $p-\pi p(\gamma p+q)^2$ and $q-\pi q(\gamma p+q)^2$), and the discrepancies between observed and expected frequencies all have factor $\pi pq(\gamma p+q)(\gamma-1)$ but with negative sign before the second and the third items. The statistic $X^2$ has $\chi_1^2$ distribution under the null hypothesis ($\gamma=1$), and noncentral $\chi_{1,\delta}^2$ distribution with noncentrality parameter $\delta=[\pi pq(\gamma-1)^2]/[1-\pi (\gamma p+q)^2]$ or $[\gamma^2 p+q-(\gamma p+q)^2]/[1-\pi (\gamma p+q)^2]$ under the alternative ($\gamma>1$). It is equivalent to derive the power by $Y=\sqrt{X^2}\sim N(\sqrt{\delta},1)$. $1-\beta=\Phi(-ZC_1)P(|z_2|>C_2,sign(z_1)=sign(z_2))$ and $P(|z_1|>C_1)P(|z_j|>C_j||z_1|>C_1)$ for replication-based and joint analyses, respectively. As our primary interest is the power for the two types of analyses, we implement it in function \texttt{tscc} whose format is \begin{verbatim} tscc(model, GRR, p1, n1, n2, M, alpha.genome, pi.samples, pi.markers, K) \end{verbatim} which requires specification of disease model (multiplicative, additive, dominant, recessive), genotypic relative risk (GRR), the estimated risk allele frequency in cases ($p_1$), total number of cases ($n_1$) total number of controls ($n_2$), total number of markers ($M$), the false positive rate at genome level ($\alpha_\mathit{genome}$), the proportion of markers to be selected ($\pi_\mathit{markers}$, also used as the false positive rate at stage 1) and the population prevalence ($K$). Note the disease risks involving the three genotypes for additive, dominant and recessive models are calculated as $(1,\mathit{GRR},2\cdot\mathit{GRR}-1)$, $(1,\mathit{GRR},\mathit{GRR})$, $(1,1,\mathit{GRR})$, respectively. Power for a number of scenarios for two-stage designs can be obtained with slight modification of example code in the documentation of \texttt{tscc}. Interestingly, the \proglang{R} implementation is much shorter than the \proglang{C} program by \cite{skol06}. \cite{lin06} recently proposed a method of analysis for two-stage designs. Like other contributions, the focus was on the markers genotyped at both stages. Given that in most studies such as the case-cohort design outlined above, there is a rich collection of covariate information and perhaps an equally important aspect is to use all markers and covariates at both stages in a unified analysis. This amounts to a standard analysis with missing independent variables with which statistical methods have been well developed, e.g., \cite{ibrahim90}, \cite{schafer96}, \cite{vbk99}. The packages for multiple imputation available from CRAN, such as \pkg{cat}, \pkg{mix}, \pkg{norm}, \pkg{pan} \citep{cat, mix, norm, pan}, \pkg{mice} \citep{mice}, \pkg{mitools} \citep{mitools} can be used. A useful point relates to prospective versus retrospective methods. The retrospective counterpart requires more attention but far from clear \citep{ea06}, although the equivalence of retrospective versus prospective methods is known \citep{pp79,sr04}. It has been shown \citep{kt00,tcc05} that prospective likelihood can give biased estimate due to over representation of cases or the extremes of the traits compared with the general population, so that a retrospective model of allele/genotype/haplotype frequencies conditional on the disease phenotype via logistic model is preferred. The model is applicable to a wide range of traits and provides unbiased estimates. This is in contrast to prospective methods \citep{srt02,sm05}. These two aspects are well illustrated with a concrete example. We have recently conducted a study of Type-2 diabetes involving 5,013 Ashkenazi and four UK populations using two-staged design and 4,570 SNPs across a 10Mb region of chromosome 20q. A subset of 2,502 individuals were genotyped on these SNPs at stage one, from which SNPs with significance level 0.1 together and those within regions of interest (e.g., HNF4A) were further genotyped on remaining sample at stage two. In the analysis, a meta-analysis was applied to take into account the heterogeneity between populations. The two-stage method by \cite{lin06} would be rather complicated. A useful prospective formulation for a particular marker not genotyped at stage two is to treat it as missing and considered in a weighted regression on trait, where the weight is associated with the three genotypes, similar to \cite{srt02}. For the retrospective method, one can augment the missing genotype similarly, in much the same way as \cite{bgm06}. However, recall that this is exactly the kind of problem multiple imputation will deal with, such that we simply apply the appropriate packages aforementioned which accounts for the uncertainty due to missing genotypes. This has not been used in staged design although it was recognized recently in haplotype analysis \citep{sorensen06,cordell06}. Now with our data contained in \texttt{chr20}, and stage one marker rs1419383 is to be analyzed, we can use \pkg{mice} as follows, \begin{verbatim} R> data("chr20") R> imp <- mice(chr20) R> fit <- lm.mids(cc ~ rs1419383 + ethnicity, data = imp) R> pool(fit) \end{verbatim} The function \texttt{mice} by default performs five imputations and therefore five analyses whose results are combined with \texttt{pool} command. We can also load Bioconductor (\url{http://www.bioconductor.org/}) package \pkg{RSNPper} \citep{Carey06} and report metadata (e.g., genome annotation) and population information based on hapmap, as follows. \begin{verbatim} R> library("RSNPper") R> mysnp <- SNPinfo("rs1419383") R> popDetails(mysnp) R> geneInfo("HNF4A") \end{verbatim} Markers involved in a multistage design can largely be seen as monotonic missing by design and can be dealt with similarly. For large data from a typical genomewide association involving much more SNPs, we can use database connection mechanisms such as open database connectivity (ODBC). \subsection{The EPIC-Norfolk study of obesity} Our EPIC-Norfolk genomewide association study of obesity serves as a good example which many principles just outlined apply. The study was initially designed as a case-control study with the following definition of cases and controls: cases are those with body mass index (BMI) $> 30 kg/m^2$ and controls with $20 kg/m^2 \leq \mbox{BMI} < 25 kg/m^2$. This leads to all obese individuals of the EPIC-Norfolk cohort (3425), and 3400 controls. Half of the samples are to be genotyped at stage one, to be followed by the remaining half at stage two. When adapting this into a case-cohort design, we wished to know the power/sample size required and this turned to be straightforward to obtain. Part of calculation in \cite{cz04} and the required sample size or power in our case can be obtained with \code{example("ccsize", package = "gap")}. It turned out a comparable case-cohort design has approximately 2500 controls. Interestingly, this also corresponds to about 10\% of the size of our cohort ($n$), which would expect to give stable estimate for a range of covariates measured in the cohort. \section{Summary} This short report shows that the design and implementation of the \pkg{gap} package allows for a wide range of functions available in a unified fashion. The package is still under development and can be seen as a prototype for a future more fully established environment. As a reviewer pointed out that ``A strength of the `\pkg{gap}' package is that it aims to be the seed point for a general compilation of such methods, which means it implements many different kinds of methods, not just the ones developed by the author'', the author would like to take this opportunity to invite comments, suggestions and contributions to consolidate the package. Because of the broad scope of applications, we focus on genomewide association as a particular application. Part of Table~\ref{asptdt}, Table~\ref{ccpower}, and the use of multiple imputation in staged design, have not been seen in the literature. The latter requires more elaborate development with respect to probability weighting. Function \texttt{tscc} is a much more transparent implementation than the original authors' \proglang{C}/\proglang{C++} software. For power calculation, it is desirable to implement other commonly used models such as additive, dominant and recessive models. Recently, \cite{skol07} further examined the design issues associated with two-stage design. I am aware of related efforts which may be useful, such as \pkg{pbatR} \citep{hoffmann06}, \pkg{SNPassoc} \citep{gonzalez07}, \pkg{GenABEL} \citep{ aulchenko07} and \pkg{snpMatrix} \citep{clayton07}. Among others, the \pkg{RSNPper} \citep{Carey06} is useful for accessing SNP metadata while \pkg{biomaRt} \citep{Durinck06} enables a large collection of biological data to be available in \proglang{R}. The focus here is largely single point analysis, but multilocus methods such as haplotype analysis have been implemented in \pkg{gap}. For instance, the function \texttt{hap.score} is an adaptation of \texttt{haplo.score} function based on \cite{srt02} which can easily take haplotype frequency estimates and haplotype assignments from other source and use them within the generalized linear models framework. Although the availability of SNP data is overwhelming, the \pkg{gap} package was conceived such that it will handle not only SNPs but multiallelic loci and X-chromosome data. Examples include \texttt{genecounting} and \texttt{hap}, with the former being flexible enough to include features such as X-linked data and the latter being able to handle large number of SNPs. But they are unable to recode allele labels automatically, so functions \texttt{gc.em} and \texttt{hap.em} are in \texttt{haplo.em} format and used by a modified function \texttt{hap.score} in association testing. Owing to limitation in timing, we have used \proglang{SAS} for many tasks in analyzing our obesity data \citep{zhao07} which partly contributes to the delay in consolidating SNP specific analyses in the package. Nevertheless, it will be of value to adapt some of the utilities developed in \proglang{SAS}. After the package was posted to CRAN, other packages have appeared with somewhat overlapping functions, notably \pkg{powerpkg} \citep{powerpkg}; a recent version of \pkg{genetics} package \citep{warnes03} also contains a function \texttt{power.casectrl}. It would be useful to make some comparisons. Another point relates to the sequence of development of functions for a package. In retrospect, it would have been easier if all codes were implemented in native \proglang{R} language and avoiding foreign language calls, but the latter may be advantageous in speed given some tuning is done. Notably, the following functions need further work: \texttt{bt}, \texttt{kin.morgan}, \texttt{twinan90} \citep{wcn92}, \texttt{gcontrol} \citep{dr99}, \texttt{mtdt} \citep{sham97}, \texttt{s2k} \citep{hirotsu01}. The overall target would involve better memory management, maintaining numerical precision or extensions, and use more native \code{.Call} and \proglang{S4} classes of \proglang{R}. It is likely that existing functions and data would continue to evolve. The major driving force to integrate many tools for genetic data analysis is in line with the observation that statistical genetic methods or genetic epidemiology is increasingly becoming part of the general epidemiology with focus on both genetic and nongenetic factors for diseases, where statistical and computational methods are more established. On this point, I wish to emphasize the benefit of shifting to or working on \proglang{R} from my own experience, which has been a very rewarding one, e.g., \cite{zhao05} and \cite{zhao06} in relation to package \pkg{kinship} \citep{therneau03}. I noted a similar but more formal effort has been launched (\url{http://rgenetics.org/}) but certainly an informal approach also has its place, as for example in the case of \pkg{haplo.stats} \citep{sinnwell07}. The best data structures have yet to be established and collaborative work would be beneficial. I hope eventually this will be part of a bigger effort to fulfill most of the requirements foreseen by many e.g., \cite{guo00}. \section*{Acknowledgments} The \proglang{R} development team, particularly Prof.~Kurt Hornik, gave many inputs during the package development. Dr.~Anthony Long kindly provided me their results similar to Table~\ref{ccpower} based on an approximation algorithm. Dr.~Jian'an Luan provided the chromosome 20 data. Dr.~Mike Weale kindly read through the manuscript and provided useful comments. Comments from anonymous referees and the editor Achim Zeileis also help to improve the manuscript. %\bibliographystyle{jss} \bibliography{jss} \end{document}