Regression depth


Peter J. Rousseeuw and Mia Hubert introduced a new statistical concept called regression depth in their article "Regression Depth". In this section, we define this concept and explain why it is interesting from a geometric point of view.


The problem

In bivariate statistics one often needs to estimate the relationship between two statistical variables. A common tool for this is simple linear regression. The simple regression line is the line that minimizes the sum of the squared residuals of a bivariate dataset.

Simple linear regression
Given a dataset $Z_n = \{(x_i, y_i), i = 1,...,n \} \subset \rm I\!R^2$, the simple regression line $f$ is given by \[ f \equiv y = \alpha x + \beta \] where $\alpha$ and $\beta$ are chosen such that they minimize $LS(\alpha, \beta)$ \[ LS(\alpha, \beta) = \sum^n_{i=1} (y_i - \beta - \alpha x_i)^2 \]
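To make this concrete, here is a minimal Python sketch (the function name is ours, chosen for illustration) that computes $\alpha$ and $\beta$ from the closed-form solution of the normal equations:

    def least_squares_fit(points):
        """Return (alpha, beta) minimizing LS(alpha, beta)."""
        n = len(points)
        mean_x = sum(x for x, _ in points) / n
        mean_y = sum(y for _, y in points) / n
        # Closed-form solution of the normal equations
        # (assumes the x_i are not all equal).
        alpha = (sum((x - mean_x) * (y - mean_y) for x, y in points)
                 / sum((x - mean_x) ** 2 for x, _ in points))
        beta = mean_y - alpha * mean_x
        return alpha, beta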

In the following sections we present an alternative to simple regression introduced by Rousseeuw and Hubert.


Fit and non-fit

Our goal is to find a line that fits a given dataset $Z_n = \{(x_i, y_i), i = 1,...,n \} \subset \rm I\!R^2$.

Given a candidate fit $l$ such that $l \equiv y = mx + b$, we define its residuals (as in the previous section) by $r_i(l) = y_i - m x_i - b$ for every point $(x_i, y_i) \in Z_n$.

Definition: non-fit
A candidate fit $l = (m, b)$ to $Z_n$ is called a non-fit iff there exists a real number $v$ that does not coincide with any $x_i$ and such that $\phi_v(l)$ is satisfied, where \[ \phi_v(l) = ((\forall x_i < v, r_i(l) < 0) \wedge (\forall x_i > v, r_i(l) > 0)) \vee ((\forall x_i < v, r_i(l) > 0) \wedge (\forall x_i > v, r_i(l) < 0)) \]

Note: We use $l = (m, b)$ as a notation for a line $l \equiv y = mx + b$.
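Since which points lie to the left or right of $v$ only changes when $v$ crosses some $x_i$, it suffices to test one candidate $v$ per gap between consecutive distinct abscissae, plus one value on each side of the data. The following Python sketch checks the definition directly (helper names are ours):

    def residual(line, point):
        """Signed residual of a point with respect to the line l = (m, b)."""
        m, b = line
        x, y = point
        return y - m * x - b

    def is_nonfit(line, points):
        """Check the non-fit condition by testing one tilting value v
        per gap between consecutive distinct x-coordinates."""
        if not points:
            return True  # vacuously a non-fit
        xs = sorted({x for x, _ in points})
        # Candidate v values that avoid every x_i.
        candidates = [xs[0] - 1.0, xs[-1] + 1.0]
        candidates += [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]
        for v in candidates:
            left = [residual(line, p) for p in points if p[0] < v]
            right = [residual(line, p) for p in points if p[0] > v]
            if all(r < 0 for r in left) and all(r > 0 for r in right):
                return True
            if all(r > 0 for r in left) and all(r < 0 for r in right):
                return True
        return False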

The following plot shows two non-fits $\Theta$ and $\Psi$ and the corresponding v-values $v_{\Theta}$ and $v_{\Psi}$. From this plot we see that a non-fit can be interpreted as a line for which there exists a tilting point (red cross) around which the line can be rotated to the vertical without crossing any point of $Z_n$.


Definition of regression depth

Definition: regression depth
The regression depth $rdepth(l, Z_n)$ of any candidate fit $l$ relative to a dataset $Z_n$ is the minimal number of observations from $Z_n$ that should be removed so that $l$ becomes a non-fit.

To illustrate this principle, the following plot is similar to the previous one except that a point has been added (in purple). In this example, $\Theta$ is no longer a non-fit and $rdepth(\Theta, Z_n) = 1$, since removing the purple point would make $\Theta$ a non-fit again.
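As a sanity check of the definition, regression depth can be computed naively by trying all subsets of points to remove, in increasing size, until the remaining dataset makes $l$ a non-fit. This exponential-time Python sketch (reusing is_nonfit from above) is only practical for small $n$:

    from itertools import combinations

    def rdepth_bruteforce(line, points):
        """Smallest number of points whose removal turns the line
        into a non-fit (exponential time; for small n only)."""
        n = len(points)
        for k in range(n + 1):
            # Keep every subset of size n - k, i.e. remove k points.
            for kept in combinations(points, n - k):
                if is_nonfit(line, list(kept)):
                    return k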


A first algorithm

In this section we present an algorithm computing the regression depth of a given fit. First we need to define some notions.

Definition: $L^-(a), L^+(a), R^-(a), R^+(a)$
Given a line $l = (m, b)$, we define:
  • $L^-(a) = \#\{j \mid x_j \le a$ and $r_j(l) \le 0\}$
  • $L^+(a) = \#\{j \mid x_j \le a$ and $r_j(l) \ge 0\}$
  • $R^-(a) = \#\{j \mid x_j > a$ and $r_j(l) \le 0\}$
  • $R^+(a) = \#\{j \mid x_j > a$ and $r_j(l) \ge 0\}$
Each of the defined regions is depicted in the following plot for a fit $\Theta$ and the point shown in red.
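In Python, these four counts can be computed directly from the definition (reusing the residual helper above):

    def count_regions(line, points, a):
        """Compute L-(a), L+(a), R-(a), R+(a) for the given fit.
        A point with residual exactly 0 is counted in both the
        minus and plus counts, matching the definition."""
        l_minus = l_plus = r_minus = r_plus = 0
        for p in points:
            r = residual(line, p)
            if p[0] <= a:
                if r <= 0: l_minus += 1
                if r >= 0: l_plus += 1
            else:
                if r <= 0: r_minus += 1
                if r >= 0: r_plus += 1
        return l_minus, l_plus, r_minus, r_plus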



Now that we have defined these regions, we can describe the algorithm as follows.

Algorithm: Regression depth (1)
Input:
  • $l = (m, b)$ a fit
  • $Z_n = \{(x_i, y_i), i = 1,...,n \} \subset \rm I\!R^2$
Steps:
  1. Sort all elements of $Z_n$ by their $x$-coordinate such that $x_1 < x_2 < ... < x_n$
  2. Compute $rdepth(l, Z_n) = \min_{1 \le i \le n} (\min \{L^+(x_i) + R^-(x_i), L^-(x_i) + R^+(x_i)\})$

Given the previous algorithm, one can compute the regression depth of a fit in $\mathcal{O}(n \log(n))$ time: the sort in step 1 costs $\mathcal{O}(n \log(n))$, and the four counts can be updated incrementally while sweeping over the sorted $x_i$, so step 2 only takes linear time (recomputing them from scratch at each $x_i$ would cost $\mathcal{O}(n^2)$).
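The following Python sketch implements this sweep (it reuses the residual helper from above and assumes, as in step 1, that the abscissae are distinct):

    def rdepth(line, points):
        """Regression depth of the line relative to the points,
        computed in O(n log n) time."""
        pts = sorted(points)  # step 1: sort by x-coordinate
        depth = len(pts)
        # Start with a to the left of all points: every point is
        # counted in the right-hand counts R-(a) and R+(a).
        l_minus = l_plus = 0
        r_minus = sum(residual(line, p) <= 0 for p in pts)
        r_plus = sum(residual(line, p) >= 0 for p in pts)
        for p in pts:
            # Move the current point into the left-hand counts, so the
            # four counters now correspond to a = x_i (step 2).
            r = residual(line, p)
            if r <= 0:
                l_minus += 1
                r_minus -= 1
            if r >= 0:
                l_plus += 1
                r_plus -= 1
            depth = min(depth, l_plus + r_minus, l_minus + r_plus)
        return depth

For small datasets one can check that rdepth agrees with the brute-force rdepth_bruteforce sketched earlier.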


Example

Here is an application implementing the previous algorithm. Points and lines can be added, and the regression depth of any line can be obtained by selecting the corresponding option and hovering the mouse over the line.