{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### LIME for text classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook demonstrates the LIME explainer on a text classification task. If you use this explainer, please cite the original work: https://github.com/marcotcr/lime."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import sklearn.ensemble\n",
    "import sklearn.metrics\n",
    "from sklearn.datasets import fetch_20newsgroups\n",
    "\n",
    "from omnixai.data.text import Text\n",
    "from omnixai.explainers.nlp import LimeText\n",
    "from omnixai.preprocessing.text import Tfidf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use a `Text` object to represent a batch of texts or sentences. The package `omnixai.preprocessing.text` provides transforms for text data, such as `Tfidf`."
   ]
  },
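  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of what a TF-IDF transform computes, the sketch below implements the textbook weighting `tf * log(N / df)` with the standard library only. It is not the `Tfidf` implementation from `omnixai.preprocessing.text`, whose tokenization, smoothing, and normalization may differ; the toy corpus `docs` and the helper `tfidf` are hypothetical."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "from collections import Counter\n",
    "\n",
    "# Hypothetical toy corpus, not taken from 20 Newsgroups\n",
    "docs = ['god exists', 'god does not exist', 'science explains the world']\n",
    "\n",
    "def tfidf(corpus):\n",
    "    '''Return one {term: weight} dict per document using tf * log(N / df).'''\n",
    "    tokenized = [doc.lower().split() for doc in corpus]\n",
    "    # Document frequency: number of documents that contain each term\n",
    "    df = Counter(term for tokens in tokenized for term in set(tokens))\n",
    "    n_docs = len(tokenized)\n",
    "    vectors = []\n",
    "    for tokens in tokenized:\n",
    "        counts = Counter(tokens)\n",
    "        vectors.append({\n",
    "            term: (count / len(tokens)) * math.log(n_docs / df[term])\n",
    "            for term, count in counts.items()\n",
    "        })\n",
    "    return vectors\n",
    "\n",
    "vectors = tfidf(docs)\n",
    "# Terms that occur in fewer documents receive larger weights\n",
    "print(vectors[0])"
   ]
  },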
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the training and test datasets\n",
    "categories = ['alt.atheism', 'soc.religion.christian']\n",
    "newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)\n",
    "newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)\n",
    "\n",
    "x_train = Text(newsgroups_train.data)\n",
    "y_train = newsgroups_train.target\n",
    "x_test = Text(newsgroups_test.data)\n",
    "y_test = newsgroups_test.target\n",
    "class_names = ['atheism', 'christian']\n",
    "# A TF-IDF transform fitted on the training texts\n",
    "transform = Tfidf().fit(x_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this classification task, we train a random forest classifier with TF-IDF feature vectors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Test F1 score: 0.9230769230769231\n"
     ]
    }
   ],
   "source": [
    "# Vectorize the texts and train a random forest on the TF-IDF features\n",
    "train_vectors = transform.transform(x_train)\n",
    "test_vectors = transform.transform(x_test)\n",
    "model = sklearn.ensemble.RandomForestClassifier(n_estimators=500)\n",
    "model.fit(train_vectors, y_train)\n",
    "# Chain the TF-IDF transform and the model so the function accepts raw text\n",
    "predict_function = lambda x: model.predict_proba(transform.transform(x))\n",
    "\n",
    "# The reported metric is the binary F1 score, not accuracy\n",
    "predictions = model.predict(test_vectors)\n",
    "print('Test F1 score: {}'.format(\n",
    "    sklearn.metrics.f1_score(y_test, predictions, average='binary')))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To initialize `LimeText`, we need to set the following parameter:\n",
    "\n",
    "  - `predict_function`: The prediction function of the machine learning model to explain. For classification tasks, `predict_function` should return the class probabilities."
   ]
  },
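  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the expected output concrete, here is a minimal sketch of a function with the same output contract: one row per input text and one column per class, with each row summing to 1. The keyword scorer `toy_predict_function` is hypothetical and takes plain Python strings for simplicity, whereas the `predict_function` defined in this notebook receives a `Text` object and applies the fitted TF-IDF transform before calling the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Hypothetical keyword scorer standing in for a trained classifier\n",
    "def toy_predict_function(texts):\n",
    "    '''Map a batch of strings to an (n_samples, 2) array of class probabilities.'''\n",
    "    # Count occurrences of one keyword per text (illustrative only)\n",
    "    scores = np.array([float(t.lower().split().count('church')) for t in texts])\n",
    "    # Squash the shifted count into a probability for the second class\n",
    "    p_christian = 1.0 / (1.0 + np.exp(-(scores - 0.5)))\n",
    "    return np.column_stack([1.0 - p_christian, p_christian])\n",
    "\n",
    "probs = toy_predict_function(['the church choir sang', 'no evidence was given'])\n",
    "print(probs.shape)\n",
    "print(np.allclose(probs.sum(axis=1), 1.0))"
   ]
  },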
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>Instance 0: Class atheism</div>\n",
       "<div><span style='color:rgb(65,254,65)'>Host </span><span style='color:rgb(70,243,70)'>NNTP </span><span style='color:rgb(90,203,90)'>Posting </span><span style='color:rgb(107,170,107)'>edu </span><span style='color:rgb(122,139,122)'>Organization </span><span style='color:rgb(123,138,123)'>an </span><span style='color:rgb(137,123,123)'>Lines </span><span style='color:rgb(123,137,123)'>post </span><span style='color:rgb(134,125,125)'>Subject </span><span style='color:rgb(134,125,125)'>anyone</span></div><br>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "idx = 83\n",
    "explainer = LimeText(predict_function=predict_function)\n",
    "explanations = explainer.explain(x_test[idx:idx+4])\n",
    "explanations.ipython_plot(index=0, class_names=class_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>Instance 1: Class christian</div>\n",
       "<div><span style='color:rgb(65,254,65)'>Christians </span><span style='color:rgb(71,241,71)'>sin </span><span style='color:rgb(235,74,74)'>murder </span><span style='color:rgb(78,228,78)'>of </span><span style='color:rgb(81,222,81)'>the </span><span style='color:rgb(210,87,87)'>writes </span><span style='color:rgb(203,90,90)'>Re </span><span style='color:rgb(198,93,93)'>said </span><span style='color:rgb(196,94,94)'>more </span><span style='color:rgb(180,102,102)'>nation</span></div><br>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "explanations.ipython_plot(index=1, class_names=class_names)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}