shap.maskers.Text

class shap.maskers.Text(tokenizer=None, mask_token=None, collapse_mask_token='auto', output_type='string')

This masks out tokens according to the given tokenizer.

The masked variables are the tokens of the input string, as produced by the tokenizer.

output_type : “string” (default) or “token_ids”

Whether the masker returns masked inputs as strings or as arrays of token ids.

__init__(tokenizer=None, mask_token=None, collapse_mask_token='auto', output_type='string')

Build a new Text masker given an optional passed tokenizer.

Parameters:
tokenizer : callable or None

The tokenizer used to break apart strings during masking. The passed tokenizer must support a minimal subset of the HuggingFace Transformers PreTrainedTokenizerBase API. This minimal subset means the tokenizer must return a dictionary with ‘input_ids’ and then either include an ‘offset_mapping’ entry in the same dictionary or provide a .convert_ids_to_tokens or .decode method.

mask_token : string, int, or None

The sub-string or integer token id used to mask out portions of a string. If None it will use the tokenizer’s .mask_token attribute, if defined, or “…” if the tokenizer does not have a .mask_token attribute.

collapse_mask_token : True, False, or “auto”

If True, when several consecutive tokens are masked only one mask token is used to replace the entire series of original tokens.
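
As a usage illustration (not part of the original reference), the sketch below builds a Text masker from a HuggingFace tokenizer; the checkpoint name “distilbert-base-uncased” is only an illustrative assumption:

    import shap
    import transformers

    # Any tokenizer meeting the minimal API described above works; this
    # pretrained checkpoint is only an illustrative choice.
    tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert-base-uncased")

    # Defaults: mask_token comes from the tokenizer (or "..."), consecutive
    # masked tokens may be collapsed, and masked samples come back as strings.
    masker = shap.maskers.Text(tokenizer)

    # The masker is typically handed to an explainer together with a text model:
    #   explainer = shap.Explainer(model, masker)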

Methods

__init__([tokenizer, mask_token, ...])

Build a new Text masker given an optional passed tokenizer.

clustering(s)

Compute the clustering of tokens for the given string.

data_transform(s)

Called by explainers to convert input data to better match the masking process (here, this means tokenizing).

feature_names(s)

The names of the features for each mask position for the given input string.

invariants(s)

Indicates which mask positions are invariant (always kept, e.g. special tokens) for the given input string.

load(in_file[, instantiate])

Load a Text masker from a file stream.

mask_shapes(s)

The shape of the masks we expect.

save(out_file)

Save a Text masker to a file stream.

shape(s)

The shape of the masker's output for the given input string.

token_segments(s)

Returns the substrings associated with each token in the given string.

clustering(s)

Compute the clustering of tokens for the given string.
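
A hedged sketch of inspecting the clustering; treating the result as a linkage-style array with one row per merge is an assumption of this example, not a documented guarantee:

    import numpy as np
    import shap, transformers

    # Setup as in the constructor example above (checkpoint is illustrative).
    masker = shap.maskers.Text(
        transformers.AutoTokenizer.from_pretrained("distilbert-base-uncased"))

    # Assumed: a hierarchical clustering over the tokens in a linkage-style
    # array; we only inspect its shape here.
    pt = masker.clustering("The acting was superb")
    print(np.asarray(pt).shape)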

data_transform(s)

Called by explainers to convert input data to better match the masking process (here, this means tokenizing).
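
For example (the exact return structure is an implementation detail, so this sketch only prints it):

    import shap, transformers

    # Setup as in the constructor example above (checkpoint is illustrative).
    masker = shap.maskers.Text(
        transformers.AutoTokenizer.from_pretrained("distilbert-base-uncased"))

    # data_transform tokenizes the raw string so explainers see the same
    # per-token view the masker uses internally.
    print(masker.data_transform("The acting was superb"))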

feature_names(s)

The names of the features for each mask position for the given input string.
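
For a text masker these names are the token substrings; a small sketch (the exact output structure is assumed):

    import shap, transformers

    # Setup as in the constructor example above (checkpoint is illustrative).
    masker = shap.maskers.Text(
        transformers.AutoTokenizer.from_pretrained("distilbert-base-uncased"))

    # One name per mask position, i.e. per token of the input string.
    print(masker.feature_names("The acting was superb"))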

invariants(s)

Indicates which mask positions are invariant (always kept, e.g. special tokens) for the given input string.

classmethod load(in_file, instantiate=True)

Load a Text masker from a file stream.

mask_shapes(s)

The shape of the masks we expect.
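
A quick sketch; that the result is a list of shape tuples is an assumption carried over from the general Masker API:

    import shap, transformers

    # Setup as in the constructor example above (checkpoint is illustrative).
    masker = shap.maskers.Text(
        transformers.AutoTokenizer.from_pretrained("distilbert-base-uncased"))

    # Expected: one shape whose single entry is the number of token positions.
    print(masker.mask_shapes("The acting was superb"))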

save(out_file)

Save a Text masker to a file stream.
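
Together with the load classmethod above, save supports a simple round trip through a binary file stream; the file name here is arbitrary:

    import shap, transformers

    # Setup as in the constructor example above (checkpoint is illustrative).
    masker = shap.maskers.Text(
        transformers.AutoTokenizer.from_pretrained("distilbert-base-uncased"))

    # Serialize the masker, then restore it with the load classmethod.
    with open("text_masker.bin", "wb") as f:
        masker.save(f)

    with open("text_masker.bin", "rb") as f:
        restored = shap.maskers.Text.load(f)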

shape(s)

The shape of the masker's output for the given input string.

Note we only return a single sample, so there is no expectation averaging.
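
A sketch of the single-sample shape; the second entry (the number of mask positions) depends on the tokenizer:

    import shap, transformers

    # Setup as in the constructor example above (checkpoint is illustrative).
    masker = shap.maskers.Text(
        transformers.AutoTokenizer.from_pretrained("distilbert-base-uncased"))

    # Expected: (1, n_token_positions), since only a single sample is returned.
    print(masker.shape("The acting was superb"))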

token_segments(s)

Returns the substrings associated with each token in the given string.
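
A final sketch; unpacking the result into substrings and token ids is an assumption about the return structure:

    import shap, transformers

    # Setup as in the constructor example above (checkpoint is illustrative).
    masker = shap.maskers.Text(
        transformers.AutoTokenizer.from_pretrained("distilbert-base-uncased"))

    # Assumed return structure: (substrings, token_ids), aligned per token.
    segments, ids = masker.token_segments("The acting was superb")
    print(list(zip(segments, ids)))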