\documentclass[11pt, a4paper]{article}
\usepackage[a4paper, text={16cm,25cm}]{geometry}
%\VignetteIndexEntry{covMcd() -- Generalizing the FastMCD}
%\VignetteDepends{robustbase}
\SweaveOpts{prefix.string=mcd, eps=FALSE, pdf=TRUE, strip.white=true}
\SweaveOpts{width=6, height=4.1}
\usepackage{amsmath}
\usepackage{amsfonts}% \mathbb
\usepackage{mathtools}% -> \floor, \ceil
\usepackage[utf8]{inputenc}
%% The following is partly R's share/texmf/Rd.sty
\usepackage{color}
\definecolor{Blue}{rgb}{0,0,0.8}
\definecolor{Red}{rgb}{0.7,0,0}
\usepackage{hyperref}
\hypersetup{%
  hyperindex,%
  colorlinks={true},%
  pagebackref,%
  linktocpage,%
  plainpages={false},%
  linkcolor={Blue},%
  citecolor={Blue},%
  urlcolor={Red},%
  pdfstartview={Fit},%
  pdfview={XYZ null null null}%
}
\usepackage{natbib}
\usepackage[noae]{Sweave}
%----------------------------------------------------
\DeclarePairedDelimiter{\ceil}{\lceil}{\rceil}
\DeclarePairedDelimiter{\floor}{\lfloor}{\rfloor}
\DeclareMathOperator{\sign}{sign}
\newcommand{\abs}[1]{\left| #1 \right|}
\newtheorem{definition}{Definition}
\newcommand{\byDef}{\mathrm{by\ default}}
\newcommand{\R}{{\normalfont\textsf{R}}{}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand*{\pkg}[1]{\texttt{#1}}
\newcommand*{\CRANpkg}[1]{\href{https://CRAN.R-project.org/package=#1}{\pkg{#1}}}
%----------------------------------------------------
\begin{document}
\setkeys{Gin}{width=0.9\textwidth}
\setlength{\abovecaptionskip}{-5pt}

\title{covMcd() -- Considerations about Generalizing the FastMCD}
\author{Martin M\"achler}
\maketitle
%\tableofcontents
%%
%% Pison, G., Van Aelst, S., and Willems, G. (2002)
%% Small Sample Corrections for LTS and MCD.
%% Metrika
% ~/save/papers/robust-diverse/Pison_VanAelst_Willems.pdf
%%
<<echo=FALSE>>=
# set margins for plots
options(SweaveHooks =
            list(fig = function() par(mar = c(3,3,1.4,0.7),
                                      mgp = c(1.5, 0.5, 0))),
        width = 75)
@

\section{Introduction}

The context is robust multivariate ``location and scatter'' estimation,
which corresponds to estimating the first two moments in the cases where
they exist.  We assume data and a model
\begin{align}
  \label{eq:data-model}
  x_i & \in \mathbb{R}^p, \ \ i=1,2,\dots,n \\
  x_i & \sim \mathcal{F}(\mu, \Sigma), \ \ \mathrm{i.i.d.};\ \
  \mu \in \mathbb{R}^p, \ \ \Sigma \in \mathbb{R}^{p \times p},
  \ \textrm{positive definite},
\end{align}
where a conceptual null model is the $p$-dimensional normal distribution.
One typical assumption is that $\mathcal{F}$ is a mixture whose majority
component (the ``good data'') is $\mathcal{N}_p(\mu, \Sigma)$, the other
components modeling ``the outliers''.  In other words, we want estimates
$\bigl(\hat{\mu}, \hat{\Sigma}\bigr)$ which should be close to the true
``good data'' $(\mu, \Sigma)$ --- and do not say more here.

\section{MCD and ``the Fast'' MCD (= \textsc{fastmcd}) Algorithm}

The \CRANpkg{robustbase} \R{} package has featured the function
\code{covMcd()} since its beginnings (Feb.~2006); it has been an interface
to the Fortran routine provided by the original authors and (partly)
described in \citet{RouPvD99}.
%% Rousseeuw, P. J. and van Driessen, K. (1999)
%% A fast algorithm for the minimum covariance determinant estimator.
%% Technometrics {41}, 212--223.
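As a first illustration of what \code{covMcd()} returns, consider a
minimal call; the \code{hbk} data set (shipped with \CRANpkg{robustbase}
and known to contain outliers) is used here purely as an example, and any
numeric data matrix would do:

<<>>=
require(robustbase)
data(hbk)
cMcd <- covMcd(hbk[, 1:3]) # default: alpha = 1/2
cMcd$center # robust location estimate
cMcd$cov    # (reweighted) robust scatter matrix estimate
@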
%% >> ~/save/papers/robust-diverse/Rousseeuw_VanD-FastMCD_1999.pdf
% ------------------------------------------------------------

We briefly describe how the algorithm works, partly building on the
documentation provided in the source (R, S, and Fortran) codes:
%% R CMD Rdconv --type=latex ../../man/covMcd.Rd > covMcd.tex
The minimum covariance determinant estimator of location and scatter (MCD)
implemented in \code{covMcd()} is similar to the \R{} function
\code{cov.mcd()} in \CRANpkg{MASS}.  The (``theoretical'') MCD looks for
the $h = h_\alpha$ ($> n/2$) out of $n$ observations whose classical
covariance matrix has the lowest possible determinant.  In more detail, we
will use $h = h_\alpha = h(\alpha,n,p) \approx \alpha \cdot (n+p+1)$,
whereas \citet{RouPvD99} mainly use (the default) $\alpha = \frac 1 2$,
where $h = h(1/2, n, p) = \floor[\Big]{\frac{n+p+1}{2}}$.  For general
$\alpha \ge \frac 1 2$, the \R{} implementation (derived from their
original S code) uses $h = h(\alpha,n,p) =$ \code{h.alpha.n(alpha,n,p)}
(function in \pkg{robustbase}), which is
\begin{eqnarray}
  \label{eq:def-h}
  h = h_\alpha = h(\alpha,n,p) := \floor{2 n_2 - n + 2 \alpha (n - n_2)},
  \ \mathrm{where\ } n_2 := \floor[\Big]{\frac{n+p+1}{2}}%
  %= (n+p+1)/2 \ \ (\mathrm{\ where ``/'' denotes \emph{integer} division})
  .
\end{eqnarray}
The fraction $\alpha \ge \frac 1 2$ can be chosen by the user, where
$\alpha = \frac 1 2$ is the most robust, and indeed, $h_{1/2} = n_2 =
\floor[\Big]{\frac{n+p+1}{2}}$.  Even in general, as long as $n \gg p$,
$\alpha$ is approximately the \emph{proportion} of the subsample size $h$
in the full sample (size $n$):
\begin{equation}
  \label{eq:h.approx}
  h \approx \alpha \cdot n \iff \alpha \approx \frac{h}{n}.
\end{equation}
<<>>=
require(robustbase)
n <- c(5, 10, 20, 30, 50, 100, 200, 500)
hmat <- function(alpha, p)
    cbind(n, h.alpha = h.alpha.n(alpha, n, p),
             h.      = floor(alpha * (n + p + 1)),
             alpha.n = round(alpha * n))
hmat(alpha = 1/2, p = 3)
hmat(alpha = 3/4, p = 4)
@
The breakdown point (for $h > \frac{n}{2}$) then is
\begin{eqnarray}
  \label{eq:breakdown}
  \epsilon_{*} = \frac{n-h+1}{n},
\end{eqnarray}
which is less than but close to $\frac 1 2$ for $\alpha = \frac 1 2$; in
general, since $h/n \approx \alpha$, the breakdown point is approximately
\begin{eqnarray}
  \label{eq:eps-approx}
  \epsilon_{*} = \frac{n-h+1}{n} \approx \frac{n-h}{n} = 1 - \frac{h}{n}
  \approx 1 - \alpha.
\end{eqnarray}
The raw MCD estimate of location, say $\hat{\mu}_0$, is then the average
of these $h$ points, whereas the raw MCD estimate of scatter,
$\hat{\Sigma}_0$, is their covariance matrix, multiplied by a consistency
factor (\code{.MCDcons(p, h/n)}) and (by default) a finite sample
correction factor (\code{.MCDcnp2(p, n, alpha)}), to make it consistent at
the normal model and unbiased for small samples.
%% Both rescaling factors (consistency and finite sample) are returned in
%% the length-2 vector \code{raw.cnp2}.

In practice, for reasonably sized $n$, $p$ and hence $h$, it is not
feasible to search the full space of all ${n \choose h}$ $h$-subsets of
$n$ observations.  Rather, the implementation of \code{covMcd} uses the
Fast MCD algorithm of \citet{RouPvD99} to approximate the minimum
covariance determinant estimator, see Section~\ref{sec:fastMCD}.
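The following small computation illustrates both the breakdown point
(\ref{eq:breakdown}) and why an exhaustive search over all $h$-subsets is
hopeless; the choices $n = 100$, $p = 3$ are arbitrary, for demonstration
only:

<<>>=
n <- 100 ; p <- 3 ; alpha <- 1/2
(h <- h.alpha.n(alpha, n, p)) # subsample size h
(n - h + 1) / n               # breakdown point, close to 1/2
choose(n, h)                  # number of h-subsets: far too many to enumerate
@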
Based on these raw MCD estimates,
$\bigl(\hat{\mu}_0, \hat{\Sigma}_0\bigr)$,
% (unless argument \code{raw.only} is true),
a reweighting step is performed, i.e., \code{V <- cov.wt(x,w)}, where
\code{w} are weights determined by ``outlyingness'' with respect to the
scaled raw MCD, using the ``Mahalanobis''-like, robust distances
$d_i\bigl(\hat{\mu}_0, \hat{\Sigma}_0\bigr)$, see (\ref{eq:Maha}).  Again,
a consistency factor and %(if \code{use.correction} is true)
a finite sample correction factor %(\code{.MCDcnp2.rew(p, n, alpha)})
are applied.  The reweighted covariance is typically considerably more
efficient than the raw one, see \citet{PisGvAW02}.  The two rescaling
factors for the reweighted estimates are returned in \code{cnp2}.  Details
on the computation of the finite sample correction factors can be found in
\citet{PisGvAW02}.

\section{Fast MCD Algorithm -- General notation}\label{sec:fastMCD}

\paragraph{Note:} In the following, apart from the mathematical notation,
we also use variable names, e.g., \code{kmini}, as they appear in the
Fortran and sometimes the \R{} function code of the \R{} package
\CRANpkg{robustbase}.

Instead of directly searching for $h$-subsets (among ${n \choose h}
\approx {n \choose n/2}$ candidates), the basic idea is to start with
small subsets of size $p+1$, compute their center $\mu$ and covariance
matrix $\Sigma$, and the corresponding $h$-subset of the $h$ observations
with smallest (squared) (``Mahalanobis''-like) distances
\begin{align}
  \label{eq:Maha}
  d_i = d_i(\mu,\Sigma) := (x_i - \mu)' \Sigma^{-1} (x_i - \mu),
  \ \ i=1,2,\dots,n,
\end{align}
and then to use concentration steps (``C~steps'') to (locally) improve the
chosen set by iteratively computing $\mu$, $\Sigma$, new distances $d_i$,
and a new set of size $h$ with smallest distances $d_i(\mu,\Sigma)$.  Each
C~step provably decreases the determinant $\det(\Sigma)$ if $\mu$ and
$\Sigma$ changed at all.  Consequently, convergence to a local minimum is
certain, as the number of $h$-subsets is finite.

To make the algorithm \emph{fast} for non-small sample size $n$, the data
set is split into ``groups'' or ``sub-datasets'' as soon as
\begin{eqnarray}
  \label{eq:nmini}
  n \ge 2 n_0, \ \mathrm{where}\ \ n_0 := \mathtt{nmini} \ ( = 300, \byDef),
\end{eqnarray}
i.e., the default cutoff for ``non small'' is at $n = 600$.
%%
The \emph{number} of such subsets in the original algorithm is maximally
5, and we now use
\begin{eqnarray}
  \label{eq:kmini}
  k_M = \code{kmini} \ (= 5, \byDef),
\end{eqnarray}
as upper limit.  As above, we assume from now on that $n \ge 2 n_0$, and
let
\begin{eqnarray}
  \label{eq:k-def}
  k := \floor[\Big]{\frac{n}{n_0}} \ge 2,
\end{eqnarray}
and now distinguish the two cases,
\begin{eqnarray}
  \label{eq:cases}
  \begin{cases}
    A. & k  <  k_M \iff n  <  k_M \cdot n_0 \\
    B. & k \ge k_M \iff n \ge k_M \cdot n_0
  \end{cases}
\end{eqnarray}
\begin{description}
\item[In case A,] $k$ (\code{= ngroup}) subsets aka ``groups'' or ``sub
  datasets'' are used, $k \in\{2,3,\dots,k_M-1\}$, of group sizes $n_j$,
  $j=1,\dots,k$ (see below).  Note that case~A may be empty because of
  $2 \le k < k_M$, namely if $k_M=2$.  Hence, in case~A, we have
  $k_M \ge 3$.
\item[In case B,] $k_M$ (\code{= ngroup}) groups, each of size $n_0$, are
  built, and in the first stage, only a \emph{subset} of $k_M \cdot n_0
  \le n$ observations is used.
\end{description}
In both cases, the disjoint groups (``sub datasets'') are chosen at random
from the $n$ observations.
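For concreteness, here is a small, purely illustrative \R{} transcription
of this splitting rule; the function \code{grpSizes()} is ad hoc and
\emph{not} part of \pkg{robustbase}, with defaults mirroring
$n_0 = \code{nmini} = 300$ and $k_M = \code{kmini} = 5$.  The group sizes
it returns anticipate (\ref{eq:n1})--(\ref{r-rest}) below:

<<>>=
grpSizes <- function(n, n0 = 300, kM = 5) {
    stopifnot(n >= 2 * n0)       # splitting only happens for n >= 2 * nmini
    k <- n %/% n0                # number of groups in case A, see (eq:k-def)
    if (k < kM) {                # case A: k groups of sizes n1 or n1 + 1
        n1 <- n %/% k
        r  <- n - k * n1         # integer division remainder in {0,...,k-1}
        rep(c(n1, n1 + 1), c(k - r, r))
    } else                       # case B: kM groups of size n0 each
        rep(n0, kM)
}
grpSizes( 601)  # case A, k = 2
grpSizes(1000)  # case A, k = 3, r = 1
grpSizes( 899)  # case A, k = 2: maximal n1 = floor((3*300 - 1)/2) = 449
grpSizes(3000)  # case B: only kM * n0 = 1500 of the n observations are used
@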
%%
For the group sizes in case~A, $n_j$, $j=1,\dots,k$, we have
\begin{align}
  n_1 = \; & \floor[\Big]{\frac n k}
  = \floor[\bigg]{\frac{n}{\floor[\big]{\frac{n}{n_0}}}} \ \ (\; \ge n_0
  \label{eq:n1})\\
  n_j = \; & n_1,\hspace*{2.8em} j = 2,\dots,j_* \\
  n_j = \; & n_1 + 1, \ \ \ j = j_* +1,\dots,k, \label{n1-plus-1}\\
  & \qquad \mathrm{where}\ \ j_* := k - r \ \in \{1,\dots,k\}, \label{jstar}\\
  & \qquad \mathrm{and}\ \ r := n - k n_1 = \label{r-rest}
  n - k\floor[\big]{\frac n k} \in \{0,1,\dots,k-1\},
\end{align}
where the range of $j_*$, $1,\dots,k$ in (\ref{jstar}) is a consequence of
the range of the integer division remainder $r \in \{0,1,\dots,k-1\}$ in
(\ref{r-rest}).  Consequently, (\ref{n1-plus-1}) may be empty, namely iff
$r=0$ ($\iff n = k \cdot n_1$ is a multiple of $k$): then $j_* = k$, and
all $n_j \equiv n_1$.

Considering the range of $n_j$ in case~A, the minimum $n_1 \ge n_0$ in
(\ref{eq:n1}) is easy to verify.  What is the maximal value of $n_j$,
i.e., an upper bound for $n_{\max} := n_1+1 \ge \max_j n_j$? \
%%
%% This is all correct but not useful:
%% From (\ref{eq:n1}), $ n/k - 1 < n_1 \le n/k $, and
%% from (\ref{eq:k-def}), $n/n_0 - 1 < k \le n/n_0$.
%% Putting these two together, we get
%% \begin{eqnarray}
%%   \label{eq:n1-ineq}
%%   \frac{n^2}{n_0} - 1 \le n/k - 1 < n_1 \le n/k < \frac{n n_0}{n - n_0},
%% \end{eqnarray}
%% (the first $\le$ from $\frac{1}{k} \ge \frac{n_0}{n}$; the last $<$ from
%% $\frac{1}{k} < \frac 1{n/n_0 -1} = \frac{n_0}{n-n_0}$.)  Also,
%% from (\ref{eq:k-def}), $n \ge k n_0$ and $n-n_0 \ge (k-1)n_0$ and since we
%% are in case~A, $n < n_0 k_M$, which combines to
%% \begin{eqnarray}
%%   \label{eq:nn0}
%%   \frac{n n_0}{n - n_0} < \frac{(n_0 k_M) n_0}{(k-1)n_0} = \frac{n_0 k_M}{k-1}.
%% \end{eqnarray}
Consider $n_{1,\max}(k) = \max_{n, \mathrm{given\ } k} n_1
 = \max_{n, \mathrm{given\ } k} \floor{\frac n k}$.
Given $k$, the maximal $n$ still fulfilling
$\floor[\big]{\frac{n}{n_0}} = k$ is $n = (k+1)n_0 - 1$, where
$\floor[\big]{\frac{n}{n_0}} = k + \floor[\big]{1 - \frac{1}{n_0}} = k$.
Hence, $n_{1,\max}(k) = \floor[\big]{\frac{(k+1)n_0 - 1}{k}}
 = n_0 + \floor[\big]{\frac{n_0 - 1}{k}}$, and as $k \ge 2$, the maximum
is attained at $k=2$: $\max n_1 = \max_k n_{1,\max}(k)
 = n_0 + \floor[\big]{\frac{n_0 - 1}{2}} = \floor[\big]{\frac{3 n_0 - 1}{2}}$.
Taken together, as $n_j = n_1+1$ is possible, we have
\begin{align}
  \label{eq:nj-range}
  n_0 \le & n_1 \le \floor[\Big]{\frac{3 n_0 - 1}{2}} \nonumber\\
  n_0 \le & n_j \le \floor[\Big]{\frac{3 n_0 + 1}{2}}, \ \ j \ge 2.
\end{align}
Note that indeed, $\floor[\big]{\frac{3 n_0 + 1}{2}}$ is the length of the
auxiliary vector \code{subndex} in the Fortran code.

\bibliographystyle{chicago}
\bibliography{robustbase}

\end{document}