shap.models.TopKLM

class shap.models.TopKLM(model, tokenizer, k=10, generate_topk_token_ids=None, batch_size=128, device=None)

Generates scores (log odds) for the top-k tokens of a Causal/Masked LM.

__init__(model, tokenizer, k=10, generate_topk_token_ids=None, batch_size=128, device=None)

Take Causal/Masked LM model and tokenizer and build a log odds output model for the top-k tokens.

Parameters:
model: object or function

An object of a pretrained transformer model which is to be explained.

tokenizer: object

A tokenizer object (PreTrainedTokenizer/PreTrainedTokenizerFast).

k: int

Number of top tokens for which log odds are computed (default=10).

generate_topk_token_ids: function

A function used to generate custom top-k token ids. Log odds will be generated for these token ids.

batch_size: int

Batch size for model inference and computing log odds (default=128).

device: str

By default, it infers whether the system has a GPU and sets the device accordingly. Should be ‘cpu’ or ‘cuda’.

Returns:
numpy.ndarray

The scores (log odds) of generating top-k token ids using the model.

Methods

__init__(model, tokenizer[, k, ...])

Take Causal/Masked LM model and tokenizer and build a log odds output model for the top-k tokens.

generate_topk_token_ids(X)

Generates top-k token ids for Causal/Masked LM.

get_inputs(X[, padding_side])

Tokenizes the source sentences.

get_lm_logits(X)

Evaluates a Causal/Masked LM model and returns logits corresponding to the next word/masked word.

get_logodds(logits)

Calculates log odds from logits.

get_output_names_and_update_topk_token_ids(X)

Gets the token names for top-k token ids for Causal/Masked LM.

load(in_file[, instantiate])

This is meant to be overridden by subclasses and called with super.

save(out_file)

Save the model to the given file stream.

update_cache_X(X)

Updates the original input (X) and the top-k token ids for the Causal/Masked LM.

generate_topk_token_ids(X)

Generates top-k token ids for Causal/Masked LM.

Parameters:
X: numpy.ndarray

X is the original input sentence for an explanation row.

Returns:
np.ndarray

An array of top-k token ids.
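
For intuition, here is a minimal numpy sketch of the top-k selection. The real method scores candidates with the underlying LM; `next_token_logits` below is stand-in data, not model output:

```python
import numpy as np

def topk_token_ids(logits, k=10):
    """Return the ids of the k highest-scoring tokens, highest first."""
    # np.argsort is ascending, so take the last k entries and reverse them
    return np.argsort(logits)[-k:][::-1]

# Stand-in next-token logits over a toy 6-token vocabulary
next_token_logits = np.array([0.1, 2.5, -1.0, 3.2, 0.7, 1.9])
ids = topk_token_ids(next_token_logits, k=3)
# ids → [3, 1, 5]
```

These ids are what the model caches per explanation row; log odds are then computed only for these tokens.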

get_inputs(X, padding_side='right')

Tokenizes the source sentences.

Parameters:
X: numpy.ndarray

X is a batch of text.

Returns:
dict

Dictionary of padded source sentence ids and attention mask as tensors (“pt” or “tf”, based on similarity_model_type).
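
For illustration, a minimal numpy sketch of the padding behavior described above (the real method delegates to the Hugging Face tokenizer; the token ids below are made up):

```python
import numpy as np

def pad_batch(token_id_lists, pad_id=0, padding_side="right"):
    """Pad variable-length id lists to a rectangle and build an attention mask."""
    max_len = max(len(ids) for ids in token_id_lists)
    input_ids = np.full((len(token_id_lists), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros_like(input_ids)
    for row, ids in enumerate(token_id_lists):
        if padding_side == "right":
            input_ids[row, :len(ids)] = ids
            attention_mask[row, :len(ids)] = 1
        else:  # left padding, typically used when generating with causal LMs
            input_ids[row, max_len - len(ids):] = ids
            attention_mask[row, max_len - len(ids):] = 1
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[5, 6, 7], [8, 9]], padding_side="right")
# batch["input_ids"]      → [[5, 6, 7], [8, 9, 0]]
# batch["attention_mask"] → [[1, 1, 1], [1, 1, 0]]
```

The `padding_side` choice mirrors this method’s parameter: right padding suits masked LMs, while left padding keeps the final positions aligned for causal next-token prediction.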

get_lm_logits(X)

Evaluates a Causal/Masked LM model and returns logits corresponding to the next word/masked word.

Parameters:
X: numpy.ndarray

An array containing a list of masked inputs.

Returns:
numpy.ndarray

Logits corresponding to the next word/masked word.

get_logodds(logits)

Calculates log odds from logits.

This function passes the logits through softmax and then computes log odds for the top-k token ids.

Parameters:
logits: numpy.ndarray

An array of logits generated from the model.

Returns:
numpy.ndarray

Log odds for the corresponding top-k token ids.
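
The softmax-then-log-odds step can be sketched in numpy as follows. The `logits` below are toy data; the real method operates on the LM’s vocabulary-sized logits and the cached top-k ids:

```python
import numpy as np

def logodds_for_topk(logits, topk_token_ids):
    """Softmax the logits, then return log(p / (1 - p)) for the chosen ids."""
    # Numerically stable softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    p = probs[..., topk_token_ids]
    # log1p(-p) computes log(1 - p) accurately for p near 1
    return np.log(p) - np.log1p(-p)

logits = np.array([[2.0, 1.0, 0.5, -1.0]])
out = logodds_for_topk(logits, np.array([0, 1]))
# token 0 has probability > 0.5, so its log odds are positive;
# token 1 has probability < 0.5, so its log odds are negative
```

Working in log odds rather than raw probabilities makes the output approximately additive, which is the scale SHAP values are computed on.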

get_output_names_and_update_topk_token_ids(X)

Gets the token names for top-k token ids for Causal/Masked LM.

Parameters:
X: np.ndarray

Input (text) for an explanation row.

Returns:
list

A list of output tokens.

classmethod load(in_file, instantiate=True)

This is meant to be overridden by subclasses and called with super.

We return constructor argument values when not being instantiated. Since there are no constructor arguments for the Serializable class we just return an empty dictionary.

save(out_file)

Save the model to the given file stream.

update_cache_X(X)

Updates the original input (X) and the top-k token ids for the Causal/Masked LM.

It mimics a caching mechanism: the original input and the top-k token ids being explained are stored and refreshed for every new explanation row.

Parameters:
X: np.ndarray

Input (text) for an explanation row.