{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### SHAP for time series anomaly detection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The SHAP explainer for time series data supports time series anomaly detection and forecasting. If using this explainer, please cite the original work: https://github.com/slundberg/shap." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# This default renderer is used for sphinx docs only. Please delete this cell in IPython.\n", "import plotly.io as pio\n", "pio.renderers.default = \"png\"" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import os\n", "import numpy as np\n", "import pandas as pd\n", "from omnixai.data.timeseries import Timeseries\n", "from omnixai.explainers.timeseries import ShapTimeseries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The time series data used here is a synthetic univariate time series dataset. We recommend using `Timeseries` to represent a time series dataset. A `Timeseries` object holds one univariate or multivariate time series and can be constructed from a pandas dataframe, where the index gives the timestamps and the columns are the variables." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " values\n", "timestamp \n", "1970-01-01 00:00:00 1.928031\n", "1970-01-01 00:05:00 -1.156620\n", "1970-01-01 00:10:00 -0.390650\n", "1970-01-01 00:15:00 0.400804\n", "1970-01-01 00:20:00 -0.874490\n", "... 
...\n", "1970-02-04 16:55:00 0.362724\n", "1970-02-04 17:00:00 2.657373\n", "1970-02-04 17:05:00 1.472341\n", "1970-02-04 17:10:00 1.033154\n", "1970-02-04 17:15:00 2.950466\n", "\n", "[10000 rows x 1 columns]\n" ] } ], "source": [ "# Load the time series dataset\n", "df = pd.read_csv(os.path.join(\"../data\", \"timeseries.csv\"))\n", "df[\"timestamp\"] = pd.to_datetime(df[\"timestamp\"], unit='s')\n", "df = df.rename(columns={\"horizontal\": \"values\"})\n", "df = df.set_index(\"timestamp\")\n", "df = df.drop(columns=[\"anomaly\"])\n", "print(df)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Split the dataset into training and test splits\n", "train_df = df.iloc[:9150]\n", "test_df = df.iloc[9150:9300]\n", "# A simple threshold for detecting anomalous data points\n", "threshold = np.percentile(train_df[\"values\"].values, 90)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The outputs of the detector are anomaly scores instead of anomaly labels (0 or 1). A data point is more anomalous if it has a higher anomaly score. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# A simple detector for determining whether a window of time series is anomalous.\n", "# The score is the fraction of points in the window that exceed the threshold.\n", "def detector(ts: Timeseries):\n", "    anomaly_scores = np.sum((ts.values > threshold).astype(int))\n", "    return anomaly_scores / ts.shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To initialize a SHAP explainer, we need to set:\n", " \n", " - `training_data`: The data used to initialize a SHAP explainer. ``training_data`` can be the dataset used to train the machine learning model.\n", " - `predict_function`: The prediction function corresponding to the model to explain. The input of ``predict_function`` should be a `Timeseries` instance. 
The outputs of ``predict_function`` are anomaly scores (higher scores indicate more anomalous data) for anomaly detection or predicted values for forecasting.\n", " - `mode`: The task type, e.g., \"anomaly_detection\" or \"forecasting\"." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "explainer = ShapTimeseries(\n", "    training_data=Timeseries.from_pd(train_df),\n", "    predict_function=detector,\n", "    mode=\"anomaly_detection\"\n", ")\n", "test_x = Timeseries.from_pd(test_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SHAP generates local explanations: `explainer.explain` is called on the test instances. `ipython_plot` plots the generated explanations in IPython. The parameter `index` indicates which instance in `test_x` to plot, e.g., `index = 0` plots the first instance in `test_x`." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "a8d08c51ca424a2c8ec4c954e853b46c", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/1 [00:00