shap.datasets.imdb

shap.datasets.imdb(n_points: int | None = None) -> tuple[list[str], ndarray]

Return the classic IMDB sentiment analysis training data as a list of review texts and an array of labels.

Used in binary text classification tasks.

Parameters:
n_points : int, optional

Number of data points to sample. If provided, that many points are sampled at random; if None (the default), the full dataset is returned.

Returns:
X : list of strings

Text data, where each string is a movie review.

y : np.ndarray

The target variable. Contains booleans, where True indicates a positive sentiment and False indicates a negative sentiment.

Notes

Full data is at: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

Paper to cite when using the data is: http://www.aclweb.org/anthology/P11-1015

Examples

To get the processed text data and labels:

import shap
text_data, labels = shap.datasets.imdb()
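A quick sanity check on the returned pair is to count how many reviews carry each label. The sketch below is illustrative only: `summarize_labels` is a hypothetical helper, not part of shap, and the sample reviews are made-up stand-ins with the documented shape (a list of strings plus booleans) so that the code runs without downloading the dataset.

```python
from collections import Counter


def summarize_labels(texts, labels):
    """Count positive and negative reviews in an (X, y) pair.

    Hypothetical helper for illustration; it is not part of shap.
    """
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    # bool() normalizes both Python bools and NumPy boolean scalars.
    counts = Counter(bool(label) for label in labels)
    return {"positive": counts[True], "negative": counts[False]}


# Made-up stand-in data, not taken from the IMDB dataset.
sample_texts = [
    "A moving, well-acted film.",
    "Dull and far too long.",
    "Great fun from start to finish.",
]
sample_labels = [True, False, True]

summary = summarize_labels(sample_texts, sample_labels)
print(summary)

# With shap installed, the same helper should accept the real data, e.g.:
#   import shap
#   text_data, labels = shap.datasets.imdb(n_points=100)
#   print(summarize_labels(text_data, labels))
```

Passing n_points keeps the check fast, since only a sample of the reviews is returned.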