shap.datasets.adult

shap.datasets.adult(display: bool = False, n_points: int | None = None) → tuple[DataFrame, ndarray]

Return the Adult census data in a structured format.

Used in binary classification tasks.

Parameters:
display : bool, optional

If True, return the raw data without the target and redundant columns. Defaults to False, which returns the processed data.

n_points : int, optional

Number of data points to sample at random. If None (the default), no sampling is applied.

Returns:
X : pd.DataFrame

If display is True, X contains the raw data without the ‘Education’, ‘Target’, and ‘fnlwgt’ columns. Otherwise, X contains the processed data without the ‘Target’ and ‘fnlwgt’ columns.

y : np.ndarray

The ‘Target’ column returned as an array.

Notes

  • The original data includes the following columns:

    • Age (float) : Age in years.

    • Workclass (category) : Type of employment.

    • fnlwgt (float) : Final weight; the number of units in the target population that the record represents.

    • Education (category) : Highest level of education achieved.

    • Education-Num (float) : Numeric representation of education level.

    • Marital Status (category) : Marital status of the individual.

    • Occupation (category) : Type of occupation.

    • Relationship (category) : Relationship status.

    • Race (category) : Ethnicity of the individual.

    • Sex (category) : Gender of the individual.

    • Capital Gain (float) : Capital gains recorded.

    • Capital Loss (float) : Capital losses recorded.

    • Hours per week (float) : Number of hours worked per week.

    • Country (category) : Country of origin.

    • Target (category) : Binary target variable indicating whether the individual earns more than 50K.

  • The ‘Education’ column is redundant with ‘Education-Num’ and is dropped for simplicity.

  • The ‘Target’ column is converted to binary (True/False) where ‘>50K’ is True and ‘<=50K’ is False.

  • Certain categorical columns are encoded for numerical representation.
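
The processing steps listed above can be sketched with pandas on a toy frame. This is only an illustration of the described behaviour, not the library's actual implementation: the `preprocess` helper, the toy rows, and the use of pandas category codes as the numerical encoding are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def preprocess(raw: pd.DataFrame) -> tuple[pd.DataFrame, np.ndarray]:
    """Hypothetical helper mirroring the steps described in the Notes."""
    # 'Education' duplicates 'Education-Num', so drop it.
    data = raw.drop(columns=["Education"])
    # Convert the target to booleans: '>50K' -> True, '<=50K' -> False.
    y = (data["Target"].astype(str).str.strip() == ">50K").to_numpy()
    # Remove the target and the sampling weight from the features.
    features = data.drop(columns=["Target", "fnlwgt"])
    # Assumption: categorical features are replaced by integer category codes.
    for col in features.select_dtypes(include="category").columns:
        features[col] = features[col].cat.codes
    return features, y


# Toy rows with a subset of the columns, invented for illustration only.
raw = pd.DataFrame(
    {
        "Age": [39.0, 50.0, 38.0],
        "fnlwgt": [77516.0, 83311.0, 215646.0],
        "Education": pd.Categorical(["Bachelors", "Bachelors", "HS-grad"]),
        "Education-Num": [13.0, 13.0, 9.0],
        "Sex": pd.Categorical(["Male", "Female", "Male"]),
        "Target": pd.Categorical(["<=50K", ">50K", "<=50K"]),
    }
)

X, y = preprocess(raw)
print(X.dtypes)
print(y)
```

The actual encoding used by the library for each column may differ from plain category codes; the sketch only shows the general shape of the transformation.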

Examples

To get the processed data and target labels:

import shap

data, target = shap.datasets.adult()

To get the raw data for display:

raw_data, target = shap.datasets.adult(display=True)
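
To work with a smaller random subset, pass `n_points`. The sketch below wraps that call in a hypothetical `load_adult_sample` helper and adds a hypothetical `summarize` helper that checks the returned features and labels are aligned. It assumes `shap` is installed and that the dataset can be fetched on first use; neither helper is part of the library.

```python
import numpy as np
import pandas as pd


def load_adult_sample(n_points: int = 100):
    """Hypothetical wrapper: load a random subset of the Adult data."""
    import shap  # imported lazily so summarize() works without shap

    return shap.datasets.adult(n_points=n_points)


def summarize(X: pd.DataFrame, y: np.ndarray) -> dict:
    """Hypothetical helper: basic sanity checks on a features/target pair."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    return {
        "n_rows": len(X),
        "n_features": X.shape[1],
        "positive_rate": float(np.mean(y)),  # share of True (>50K) labels
    }
```

For instance, `summarize(*load_adult_sample(100))` should report 100 rows when the call succeeds, along with the fraction of sampled individuals labelled True.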