# Math behind LinearExplainer with correlation feature perturbation

When we use LinearExplainer(model, prior, feature_perturbation="correlation_dependent") we do not use $$E[f(x) \mid do(X_S = x_S)]$$ to measure the impact of a set $$S$$ of features, but rather use $$E[f(x) \mid X_S = x_s]$$ under the assumption that the random variable $$X$$ (representing the input features) follows a multivariate guassian distribution. To compute SHAP values this way we need to compute conditional expectations under the multivariate guassian distribution for all subset of features. This would be a lot of matrix match for an exponential number of terms, and it hence intractable for models with more than just a few features.

This document briefly outlines the math we have used to precompute all of the required linear algebra using a sampling procedure that can be done just once, and then applied to as many samples as we like. This drastically speed up the computation compared to a brute force approach. Note that all these calculations depend on the fact that we are explaining a linear model $$f(x) = \beta x$$.

The permutation definition of SHAP values in the interventional form used by most explainers is

$\phi_i = \frac{1}{M!} \sum_R E[f(X) \mid do(X_{S_i^R \cup i} = x_{S_i^R \cup i})] - E[f(X) \mid do(X_{S_i^R} = x_{S_i^R})]$

but here we will use the non-interventional conditional expectation form (where we have simplified the notation by dropping the explicit reference to the random variable $$X$$).

$\phi_i = \frac{1}{M!} \sum_R E[f(x) \mid x_{S_i^R \cup i}] - E[f(x) \mid x_{S_i^R}]$

where $$f(x) = \beta x + b$$ with $$\beta$$ a row vector and $$b$$ a scalar.

If we replace f(x) with the linear function definition we get:

:nbsphinx-math:begin{align} phi_i = frac{1}{M!} sum_R E[beta x + b mid x_{S_i^R cup i}] - E[beta x + b mid x_{S_i^R}] \

= beta frac{1}{M!} sum_R E[x mid x_{S_i^R cup i}] - E[x mid x_{S_i^R}]

end{align}

Assume the inputs $$x$$ follow a multivariate normal distribution with mean $$\mu$$ and covariance $$\Sigma$$. Denote the projection matrix that selects a set $$S$$ as $$P_S$$, then we get:

\begin{align} E[x \mid x_S] = [P_{\bar S} \mu + P_{\bar S} \Sigma P_S^T (P_S \Sigma P_S^T)^{-1} ( P_S x - P_S \mu)] P_{\bar S} + x P_S^T P_S \\ = [P_{\bar S} \mu + P_{\bar S} \Sigma P_S^T (P_S \Sigma P_S^T)^{-1} P_S (x - \mu)] P_{\bar S} + x P_S^T P_S \\ = [\mu + \Sigma P_S^T (P_S \Sigma P_S^T)^{-1} P_S (x - \mu)] P_{\bar S}^T P_{\bar S} + x P_S^T P_S \\ = P_{\bar S}^T P_{\bar S} [\mu + \Sigma P_S^T (P_S \Sigma P_S^T)^{-1} P_S (x - \mu)] + P_S^T P_S x \\ = P_{\bar S}^T P_{\bar S} \mu + P_{\bar S}^T P_{\bar S} \Sigma P_S^T (P_S \Sigma P_S^T)^{-1} P_S x - P_{\bar S}^T P_{\bar S} \Sigma P_S^T (P_S \Sigma P_S^T)^{-1} P_S \mu + P_S^T P_S x \\ = [P_{\bar S}^T P_{\bar S} - P_{\bar S}^T P_{\bar S} \Sigma P_S^T (P_S \Sigma P_S^T)^{-1} P_S] \mu + [P_S^T P_S + P_{\bar S}^T P_{\bar S} \Sigma P_S^T (P_S \Sigma P_S^T)^{-1} P_S] x \end{align}

if we let $$R_S = P_{\bar S}^T P_{\bar S} \Sigma P_S^T (P_S \Sigma P_S^T)^{-1} P_S$$ and $$Q_S = P_S^T P_S$$ then we can write

\begin{align} E[x \mid x_S] = [Q_{\bar S} - R_S] \mu + [Q_S + R_S] x \end{align}

or

\begin{align} E[x \mid x_{S_i^R \cup i}] = [Q_{\bar{S_i^R \cup i}} - R_{S_i^R \cup i}] \mu + [Q_{S_i^R \cup i} + R_{S_i^R \cup i}] x \end{align}

leading to the Shapley equation of

\begin{align} \phi_i = \beta \frac{1}{M!} \sum_R [Q_{\bar{S_i^R \cup i}} - R_{S_i^R \cup i}] \mu + [Q_{S_i^R \cup i} + R_{S_i^R \cup i}] x - [Q_{\bar{S_i^R}} - R_{S_i^R}] \mu - [Q_{S_i^R} + R_{S_i^R}] x \\ = \beta \frac{1}{M!} \sum_R ([Q_{\bar{S_i^R \cup i}} - R_{S_i^R \cup i}] - [Q_{\bar{S_i^R}} - R_{S_i^R}]) \mu + ([Q_{S_i^R \cup i} + R_{S_i^R \cup i}] - [Q_{S_i^R} + R_{S_i^R}]) x \\ = \beta \left [ \frac{1}{M!} \sum_R ([Q_{\bar{S_i^R \cup i}} - R_{S_i^R \cup i}] - [Q_{\bar{S_i^R}} - R_{S_i^R}]) \right ] \mu + \beta \left [ \frac{1}{M!} \sum_R ([Q_{S_i^R \cup i} + R_{S_i^R \cup i}] - [Q_{S_i^R} + R_{S_i^R}]) \right ] x \end{align}

$\phi = \beta T x$

This means that we can precompute the transform matrix $$T$$ by drawing random permutations $$R$$ many times and averaging our results. Once we have computed $$T$$ we can explain any number of samples (or models for that matter) by just using matrix multiplication.