shap.datasets.imdb

shap.datasets.imdb(n_points=None)

Return the classic IMDB sentiment analysis training data as text and labels.

Parameters:
n_points : int, optional

Number of data points to sample. If None, the entire dataset is used.

Returns:
A tuple of two elements: a list containing the text data, and a NumPy array containing the labels.

Notes

Full data is at: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

Paper to cite when using the data is: http://www.aclweb.org/anthology/P11-1015

Examples

To get the processed text data and labels:

import shap
text_data, labels = shap.datasets.imdb()
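
As a minimal sketch of working with the returned pair, the helper below tallies how many examples carry each label value. `label_counts` is a hypothetical name and is not part of shap. It assumes only that the labels form an iterable with the same length as the text list. Toy stand-in data is used so the sketch runs without downloading the dataset.

```python
from collections import Counter


def label_counts(text_data, labels):
    """Hypothetical helper (not part of shap): count examples per label value."""
    labels = list(labels)
    if len(text_data) != len(labels):
        raise ValueError("text_data and labels must have the same length")
    # Convert NumPy scalars to plain Python values where possible,
    # so the resulting dictionary keys are ordinary Python objects.
    plain = (lab.item() if hasattr(lab, "item") else lab for lab in labels)
    return dict(Counter(plain))


# Toy stand-in for the (text_data, labels) pair; not real IMDB data.
toy_text = ["great film", "terrible plot", "loved it"]
toy_labels = [True, False, True]
counts = label_counts(toy_text, toy_labels)
print(counts)
```

With shap installed, the same helper could be applied to a sample of the real data, for example `label_counts(*shap.datasets.imdb(n_points=100))`.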