shap.models.TopKLM

class shap.models.TopKLM(model, tokenizer, k=10, generate_topk_token_ids=None, batch_size=128, device=None)

Generates scores (log odds) for the top-k tokens of a Causal/Masked LM.

__init__(model, tokenizer, k=10, generate_topk_token_ids=None, batch_size=128, device=None)

Take Causal/Masked LM model and tokenizer and build a log odds output model for the top-k tokens.

Parameters:
model: object or function

An object of a pretrained transformer model which is to be explained.

tokenizer: object

A tokenizer object (PreTrainedTokenizer/PreTrainedTokenizerFast).

k: int

Number of top tokens for which log odds are computed (default=10).

generate_topk_token_ids: function

A function used to generate custom top-k token ids. Log odds will be generated for these token ids.

batch_size: int

Batch size for model inference and computing log odds (default=128).

device: str

By default, it infers whether the system has a GPU and sets the device accordingly. Should be ‘cpu’ or ‘cuda’.

Returns:
numpy.ndarray

The scores (log odds) of generating top-k token ids using the model.

Methods

__init__(model, tokenizer[, k, ...])

Take Causal/Masked LM model and tokenizer and build a log odds output model for the top-k tokens.

generate_topk_token_ids(X)

Generates top-k token ids for Causal/Masked LM.

get_inputs(X[, padding_side])

Tokenizes the source sentences.

get_lm_logits(X)

Evaluates a Causal/Masked LM model and returns logits corresponding to the next word/masked word.

get_logodds(logits)

Calculates log odds from logits.

get_output_names_and_update_topk_token_ids(X)

Gets the token names for top-k token ids for Causal/Masked LM.

load(in_file[, instantiate])

This is meant to be overridden by subclasses and called with super.

save(out_file)

Save the model to the given file stream.

update_cache_X(X)

Updates the original input (X) and the top-k token ids for the Causal/Masked LM.

generate_topk_token_ids(X)

Generates top-k token ids for Causal/Masked LM.

Parameters:
X: numpy.ndarray

X is the original input sentence for an explanation row.

Returns:
np.ndarray

An array of top-k token ids.
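
For intuition, here is a minimal numpy sketch of the top-k selection. The real method scores candidates with the underlying LM; `next_token_logits` below is stand-in data, not model output:

```python
import numpy as np

def topk_token_ids(logits, k=10):
    """Return the ids of the k highest-scoring tokens, highest first."""
    # np.argsort is ascending, so take the last k entries and reverse them
    return np.argsort(logits)[-k:][::-1]

# Stand-in next-token logits over a toy 6-token vocabulary
next_token_logits = np.array([0.1, 2.5, -1.0, 3.2, 0.7, 1.9])
ids = topk_token_ids(next_token_logits, k=3)
# ids → [3, 1, 5]
```

These ids are what the model caches per explanation row; log odds are then computed only for these tokens.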

get_inputs(X, padding_side='right')

Tokenizes the source sentences.

Parameters:
X: numpy.ndarray

X is a batch of text.

Returns:
dict

Dictionary of padded source sentence ids and attention mask as tensors (“pt” or “tf”, based on similarity_model_type).
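
For illustration, a minimal numpy sketch of the padding behavior described above (the real method delegates to the Hugging Face tokenizer; the token ids below are made up):

```python
import numpy as np

def pad_batch(token_id_lists, pad_id=0, padding_side="right"):
    """Pad variable-length id lists to a rectangle and build an attention mask."""
    max_len = max(len(ids) for ids in token_id_lists)
    input_ids = np.full((len(token_id_lists), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros_like(input_ids)
    for row, ids in enumerate(token_id_lists):
        if padding_side == "right":
            input_ids[row, :len(ids)] = ids
            attention_mask[row, :len(ids)] = 1
        else:  # left padding, typically used when generating with causal LMs
            input_ids[row, max_len - len(ids):] = ids
            attention_mask[row, max_len - len(ids):] = 1
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[5, 6, 7], [8, 9]], padding_side="right")
# batch["input_ids"]      → [[5, 6, 7], [8, 9, 0]]
# batch["attention_mask"] → [[1, 1, 1], [1, 1, 0]]
```

The `padding_side` choice mirrors this method’s parameter: right padding suits masked LMs, while left padding keeps the final positions aligned for causal next-token prediction.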

get_lm_logits(X)

Evaluates a Causal/Masked LM model and returns logits corresponding to the next word/masked word.

Parameters:
X: numpy.ndarray

An array containing a list of masked inputs.

Returns:
numpy.ndarray

Logits corresponding to the next word/masked word.

get_logodds(logits)

Calculates log odds from logits.

This function passes the logits through softmax and then computes log odds for the top-k token ids.

Parameters:
logits: numpy.ndarray

An array of logits generated from the model.

Returns:
numpy.ndarray

Log odds for the corresponding top-k token ids.
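
The softmax-then-log-odds step can be sketched in numpy as follows. The `logits` below are toy data; the real method operates on the LM’s vocabulary-sized logits and the cached top-k ids:

```python
import numpy as np

def logodds_for_topk(logits, topk_token_ids):
    """Softmax the logits, then return log(p / (1 - p)) for the chosen ids."""
    # Numerically stable softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    p = probs[..., topk_token_ids]
    # log1p(-p) computes log(1 - p) accurately for p near 1
    return np.log(p) - np.log1p(-p)

logits = np.array([[2.0, 1.0, 0.5, -1.0]])
out = logodds_for_topk(logits, np.array([0, 1]))
# token 0 has probability > 0.5, so its log odds are positive;
# token 1 has probability < 0.5, so its log odds are negative
```

Working in log odds rather than raw probabilities makes the output approximately additive, which is the scale SHAP values are computed on.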

get_output_names_and_update_topk_token_ids(X)

Gets the token names for top-k token ids for Causal/Masked LM.

Parameters:
X: np.ndarray

Input (text) for an explanation row.

Returns:
list

A list of output tokens.

classmethod load(in_file, instantiate=True)

This is meant to be overridden by subclasses and called with super.

We return constructor argument values when not being instantiated. Since there are no constructor arguments for the Serializable class we just return an empty dictionary.

save(out_file)

Save the model to the given file stream.

update_cache_X(X)

Updates the original input (X) and the top-k token ids for the Causal/Masked LM.

It mimics a caching mechanism: the original input and the top-k token ids being explained are stored and refreshed for every new explanation row.

Parameters:
X: np.ndarray

Input (text) for an explanation row.