{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### TabularExplainer for house-price prediction (regression)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The class `TabularExplainer` is designed for tabular data, acting as a factory of the supported tabular explainers such as LIME, SHAP and MACE. `TabularExplainer` provides a unified easy-to-use interface for all the supported explainers. In practice, we recommend applying `TabularExplainer` to generate explanations instead of using a specific explainer in the package `omnixai.explainers.tabular`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# This default renderer is used for sphinx docs only. Please delete this cell in IPython.\n", "import plotly.io as pio\n", "pio.renderers.default = \"png\"" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import sklearn\n", "import sklearn.ensemble\n", "from sklearn.datasets import fetch_california_housing\n", "\n", "from omnixai.data.tabular import Tabular\n", "from omnixai.preprocessing.base import Identity\n", "from omnixai.preprocessing.tabular import TabularTransform\n", "from omnixai.explainers.tabular import TabularExplainer\n", "from omnixai.visualization.dashboard import Dashboard" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset used in this example is for the house-price prediction (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html). We recommend using `Tabular` to represent a tabular dataset that can be constructed from a pandas dataframe or a numpy array. To create a `Tabular` instance given a pandas dataframe, one needs to specify the dataframe, the categorical feature names (if exists) and the target/label column name (if exists). The package `omnixai.preprocessing` provides several useful preprocessing functions for a `Tabular` data. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \\\n", "0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 \n", "1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 \n", "2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 \n", "3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 \n", "4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 \n", "... ... ... ... ... ... ... ... \n", "20635 1.5603 25.0 5.045455 1.133333 845.0 2.560606 39.48 \n", "20636 2.5568 18.0 6.114035 1.315789 356.0 3.122807 39.49 \n", "20637 1.7000 17.0 5.205543 1.120092 1007.0 2.325635 39.43 \n", "20638 1.8672 18.0 5.329513 1.171920 741.0 2.123209 39.43 \n", "20639 2.3886 16.0 5.254717 1.162264 1387.0 2.616981 39.37 \n", "\n", " Longitude target \n", "0 -122.23 4.526 \n", "1 -122.22 3.585 \n", "2 -122.24 3.521 \n", "3 -122.25 3.413 \n", "4 -122.25 3.422 \n", "... ... ... 
\n", "20635 -121.09 0.781 \n", "20636 -121.21 0.771 \n", "20637 -121.22 0.923 \n", "20638 -121.32 0.847 \n", "20639 -121.24 0.894 \n", "\n", "[20640 rows x 9 columns]\n" ] } ], "source": [ "housing = fetch_california_housing()\n", "df = pd.DataFrame(\n", " np.concatenate([housing.data, housing.target.reshape((-1, 1))], axis=1),\n", " columns=list(housing.feature_names) + ['target']\n", ")\n", "tabular_data = Tabular(df, target_column='target')\n", "print(tabular_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`TabularTransform` is a special transform designed for tabular data. By default, it converts categorical features into one-hot encoding, and keeps continuous-valued features (if one wants to normalize continuous-valued features, set the parameter `cont_transform` in `TabularTransform` to `Standard` or `MinMax`). The `transform` method of `TabularTransform` will transform a `Tabular` instance into a numpy array. If the `Tabular` instance has a target/label column, the last column of the transformed numpy array will be the target/label. \n", "\n", "If some other transformations that are not supported in the library are necessary, one can simply convert the `Tabular` instance into a pandas dataframe by calling `Tabular.to_pd()` and try different transformations with it.\n", "\n", "After data preprocessing, we can train a random forest regressor for this task. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training data shape: (16512, 8)\n", "Test data shape: (4128, 8)\n", "MSError when predicting the mean 1.3515599004967849\n", "Random Forest MSError 0.255011664583401\n" ] } ], "source": [ "transformer = TabularTransform(\n", " target_transform=Identity()\n", ").fit(tabular_data)\n", "x = transformer.transform(tabular_data)\n", "\n", "x_train, x_test, y_train, y_test = \\\n", " sklearn.model_selection.train_test_split(x[:, :-1], x[:, -1], train_size=0.80)\n", "print('Training data shape: {}'.format(x_train.shape))\n", "print('Test data shape: {}'.format(x_test.shape))\n", "\n", "rf = sklearn.ensemble.RandomForestRegressor(n_estimators=200)\n", "rf.fit(x_train, y_train)\n", "print('MSError when predicting the mean', np.mean((y_train.mean() - y_test) ** 2))\n", "print('Random Forest MSError', np.mean((rf.predict(x_test) - y_test) ** 2))\n", "\n", "# Convert the transformed data back to Tabular instances\n", "train_data = transformer.invert(x_train)\n", "test_data = transformer.invert(x_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To initialize `TabularExplainer`, we need to set the following parameters:\n", "\n", " - `explainers`: The names of the explainers to apply, e.g., [\"lime\", \"shap\", \"sensitivity\", \"pdp\"].\n", " - `data`: The data used to initialize explainers. ``data`` is the training dataset for training the machine learning model. If the training dataset is too large, ``data`` can be a subset of it by applying `omnixai.sampler.tabular.Sampler.subsample`.\n", " - `model`: The ML model to explain, e.g., a scikit-learn model, a tensorflow model, a pytorch model or a black-box prediction function.\n", " - `preprocess`: The preprocessing function converting the raw data (a `Tabular` instance) into the inputs of `model`.\n", " - `postprocess` (optional): The postprocessing function transforming the outputs of ``model`` to a user-specific form, e.g., the predicted probability for each class. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "The preprocessing function takes a `Tabular` instance as its input and outputs the processed features that the ML model consumes. In this example, we simply call `transformer.transform`. If one uses special transforms on pandas dataframes, the preprocessing function has this kind of format: `lambda z: some_transform(z.to_pd())`." ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "preprocess = lambda z: transformer.transform(z)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We are now ready to create a `TabularExplainer`. The parameter `params` in `TabularExplainer` allows us to set parameters for each explainer applied here; for example, \"kernel_width\" for LIME is set to 3.\n", "\n", "In this example, LIME and SHAP generate local explanations while PDP (partial dependence plot) and sensitivity analysis generate global explanations. `explainers.explain` returns the local explanations, and `explainers.explain_global` returns the global explanations. `TabularExplainer` hides the details of the underlying explainers, so we can simply call these two methods to generate explanations." ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "f9709c5c5bd94b56976cec22da0a20bc", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/5 [00:00