Examples of data objects

The library supports four data types: Tabular, Image, Text and Timeseries, defined in the modules omnixai.data.tabular, omnixai.data.image, omnixai.data.text and omnixai.data.timeseries, respectively. Every supported explainer takes one of these data objects as its input when generating explanations, e.g., an explainer for vision tasks takes an Image object as its input. These data objects can be constructed easily from pandas dataframes, numpy arrays, Pillow images or strings. This notebook shows how to use them.

Tabular data

Suppose we have a pandas dataframe representing a tabular dataset with both categorical and continuous-valued features:

[1]:
import pandas as pd

df = pd.DataFrame(
    data=[[1, 2, 3, 'male'], [4, 5, 6, 'female']],
    columns=['a', 'b', 'c', 'd']
)

The first three columns are continuous-valued features and the last column is a categorical feature. Given this dataframe, we can construct a Tabular instance by passing the dataframe and the names of the categorical columns:

[2]:
from omnixai.data.tabular import Tabular

x = Tabular(
    data=df,
    categorical_columns=['d']
)
print(x)
   a  b  c       d
0  1  2  3    male
1  4  5  6  female

If we want to construct a Tabular instance from a numpy array, we also need to set the feature columns, because a plain numpy array carries no column names:

[3]:
x = Tabular(
    data=df.values,
    feature_columns=['a', 'b', 'c', 'd'],
    categorical_columns=['d']
)
print(x)
   a  b  c       d
0  1  2  3    male
1  4  5  6  female

Tabular has several useful attributes:

[4]:
# Get the data shape
print(f"Shape: {x.shape}")
# Get the raw data values
print(f"Raw values:\n {x.values}")
# Get the categorical feature columns
print(f"Categorical features: {x.categorical_columns}")
# Get the continuous-valued feature columns
print(f"Continuous-valued features: {x.continuous_columns}")
# Get all the feature columns
print(f"All feature columns: {x.feature_columns}")
Shape: (2, 4)
Raw values:
 [[1 2 3 'male']
 [4 5 6 'female']]
Categorical features: ['d']
Continuous-valued features: ['a', 'b', 'c']
All feature columns: ['a', 'b', 'c', 'd']
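
The output above is consistent with a simple rule: the continuous-valued columns are the feature columns that are not declared categorical. The following is a minimal sketch of that rule in plain Python. The helper `split_columns` is hypothetical and not part of OmniXAI, and excluding the target column from both groups is an assumption.

```python
def split_columns(feature_columns, categorical_columns, target_column=None):
    """Split column names into (categorical, continuous) lists.

    Hypothetical helper for illustration only; not the OmniXAI implementation.
    """
    unknown = [c for c in categorical_columns if c not in feature_columns]
    if unknown:
        raise ValueError(f"Unknown categorical columns: {unknown}")
    categorical = list(categorical_columns)
    # Assumption: the target column, if any, belongs to neither group.
    continuous = [
        c for c in feature_columns
        if c not in categorical and c != target_column
    ]
    return categorical, continuous


categorical, continuous = split_columns(['a', 'b', 'c', 'd'], ['d'])
print(categorical)  # ['d']
print(continuous)   # ['a', 'b', 'c']
```

Declaring a categorical column that does not exist raises an error in this sketch, which catches typos in column names early.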

A Tabular instance can be converted into a pandas dataframe or a numpy array:

[5]:
print(x.to_pd())
print(x.to_numpy())
   a  b  c       d
0  1  2  3    male
1  4  5  6  female
[[1 2 3 'male']
 [4 5 6 'female']]

The dataset represented by Tabular may have a target/label column, e.g., class labels in classification tasks. In the following example, the last column is the target/label column.

[6]:
df = pd.DataFrame(
    data=[[1, 2, 3, 'male', 'yes'], [4, 5, 6, 'female', 'no']],
    columns=['a', 'b', 'c', 'd', 'label']
)

To construct a Tabular instance, besides setting categorical feature columns, we also need to set the target/label column:

[7]:
x = Tabular(
    data=df,
    categorical_columns=['d'],
    target_column='label'
)
print(x)
print(f"Target column: {x.target_column}")
   a  b  c       d label
0  1  2  3    male   yes
1  4  5  6  female    no
Target column: label

To get a subset of the rows of x, index it with an integer or a list of integers:

[8]:
print("The first row:")
print(x[0])
print("The second row:")
print(x[1])
print("Swap the two rows:")
print(x[[1, 0]])
The first row:
   a  b  c     d label
0  1  2  3  male   yes
The second row:
   a  b  c       d label
1  4  5  6  female    no
Swap the two rows:
   a  b  c       d label
1  4  5  6  female    no
0  1  2  3    male   yes
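
In the output above, indexing always returns another table, and each selected row keeps its original index label. A minimal sketch of that behavior in plain Python follows. The class `MiniTable` is a hypothetical stand-in written for this illustration, not the way Tabular is implemented.

```python
class MiniTable:
    """Toy row container that keeps original index labels (illustrative only)."""

    def __init__(self, columns, rows, index=None):
        self.columns = list(columns)
        self.rows = [list(r) for r in rows]
        # Default labels are the row positions 0, 1, 2, ...
        self.index = list(index) if index is not None else list(range(len(self.rows)))

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, key):
        # Normalize an int, a slice, or a list of ints into a list of positions.
        if isinstance(key, int):
            positions = [key]
        elif isinstance(key, slice):
            positions = list(range(len(self.rows))[key])
        else:
            positions = list(key)
        return MiniTable(
            self.columns,
            [self.rows[p] for p in positions],
            [self.index[p] for p in positions],
        )


t = MiniTable(
    ['a', 'b', 'c', 'd', 'label'],
    [[1, 2, 3, 'male', 'yes'], [4, 5, 6, 'female', 'no']],
)
print(t[1].index)       # [1]
print(t[[1, 0]].index)  # [1, 0]
```

Returning a table even for a single integer means downstream code can treat every selection as a batch, whatever its size.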

Image data

An Image object can be constructed from a numpy array (a batch of images) or a Pillow image. For example, the following numpy array contains a batch of MNIST digit images:

[9]:
import torchvision

test_data = torchvision.datasets.MNIST(root='../data', train=False, download=True)
imgs = test_data.data.numpy()
print(imgs.shape)
(10000, 28, 28)

An Image object can be created as follows:

[10]:
from omnixai.data.image import Image

# `batched = True` means `data` contains a batch of images with
# shape `(batch_size, height, width)` or `(batch_size, height, width, channel)`.
images = Image(data=imgs, batched=True)

Image exposes shape attributes and supports indexing, slicing and iteration:

[11]:
print(f"Data shape: {images.shape}")
print(f"Image shape: {images.image_shape}")

print("The first image (Pillow):")
display(images[0].to_pil())
print("The second image (Pillow):")
display(images[1].to_pil())

print("Loop:")
for im in images[:5]:
    print(im.shape)
Data shape: (10000, 28, 28, 1)
Image shape: (28, 28, 1)
The first image (Pillow):
[Image: the first MNIST test digit]
The second image (Pillow):
[Image: the second MNIST test digit]
Loop:
(1, 28, 28, 1)
(1, 28, 28, 1)
(1, 28, 28, 1)
(1, 28, 28, 1)
(1, 28, 28, 1)
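
The shapes printed above suggest that an Image keeps its data in a `(batch_size, height, width, channel)` layout, adding a channel axis for grayscale input. The sketch below only does the shape bookkeeping for that idea. The function `to_nhwc_shape` is hypothetical, and the handling of unbatched numpy input (`batched=False`) is an assumption that the notebook does not demonstrate.

```python
def to_nhwc_shape(shape, batched):
    """Return the (batch, height, width, channel) shape for an input array shape.

    Hypothetical helper for illustration; not part of OmniXAI.
    """
    shape = tuple(shape)
    if batched:
        if len(shape) == 3:   # (N, H, W): grayscale batch, add a channel axis
            return shape + (1,)
        if len(shape) == 4:   # (N, H, W, C): already in the target layout
            return shape
    else:
        # Assumption: a single image gains a leading batch axis of size 1.
        if len(shape) == 2:   # (H, W)
            return (1,) + shape + (1,)
        if len(shape) == 3:   # (H, W, C)
            return (1,) + shape
    raise ValueError(f"Unsupported shape {shape} for batched={batched}")


print(to_nhwc_shape((10000, 28, 28), batched=True))  # (10000, 28, 28, 1)
print(to_nhwc_shape((28, 28), batched=False))        # (1, 28, 28, 1)
```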

We can also convert Image into a numpy array:

[12]:
print(f"Numpy array shape: {images.to_numpy().shape}")
print(f"Numpy array shape: {images.to_numpy(keepdim=True).shape}")
print(f"Numpy array shape: {images.to_numpy(hwc=False, keepdim=True).shape}")
Numpy array shape: (10000, 28, 28)
Numpy array shape: (10000, 28, 28, 1)
Numpy array shape: (10000, 1, 28, 28)

A color image example:

[13]:
from PIL import Image as PilImage

img = Image(PilImage.open('../data/images/camera.jpg').convert('RGB'))
print(f"Data shape: {img.shape}")
print(f"Image shape: {img.image_shape}")
print("The image (Pillow):")
# `to_pil` returns a single Pillow image if `batch_size = 1` or a list of Pillow images if `batch_size > 1`.
display(img.to_pil())
Data shape: (1, 224, 224, 3)
Image shape: (224, 224, 3)
The image (Pillow):
[Image: camera.jpg]

[14]:
print(f"Numpy array shape: {img.to_numpy().shape}")
print(f"Numpy array shape: {img.to_numpy(keepdim=True).shape}")
print(f"Numpy array shape: {img.to_numpy(hwc=False, keepdim=True).shape}")
Numpy array shape: (1, 224, 224, 3)
Numpy array shape: (1, 224, 224, 3)
Numpy array shape: (1, 3, 224, 224)
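
The shapes printed for the grayscale and color examples follow a pattern: `keepdim` controls whether a single-channel axis is kept, and `hwc` controls whether the channel axis comes last or directly after the batch axis. The sketch below encodes that pattern as a pure shape calculation. The function `numpy_shape` is hypothetical, and the result for `hwc=False, keepdim=False` on a single-channel image is an assumption, since the notebook does not show that combination.

```python
def numpy_shape(stored_shape, hwc=True, keepdim=False):
    """Predict the shape of the exported array from the stored NHWC shape.

    Hypothetical helper reconstructed from the printed outputs; not OmniXAI code.
    """
    n, h, w, c = stored_shape
    if c == 1 and not keepdim:
        # Single-channel images drop the channel axis unless keepdim is set.
        # Assumption: this also applies when hwc=False.
        return (n, h, w)
    return (n, h, w, c) if hwc else (n, c, h, w)


gray = (10000, 28, 28, 1)
color = (1, 224, 224, 3)
print(numpy_shape(gray))                            # (10000, 28, 28)
print(numpy_shape(gray, keepdim=True))              # (10000, 28, 28, 1)
print(numpy_shape(gray, hwc=False, keepdim=True))   # (10000, 1, 28, 28)
print(numpy_shape(color))                           # (1, 224, 224, 3)
print(numpy_shape(color, hwc=False, keepdim=True))  # (1, 3, 224, 224)
```

The channel-first layout is the one PyTorch models typically expect, while the channel-last layout matches Pillow and most plotting tools.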

Text data

A Text object represents a batch of texts or sentences stored in a list. For example,

[15]:
from omnixai.data.text import Text

x = Text([
    "What a great movie! if you have no taste.",
    "it was a fantastic performance!",
    "best film ever",
    "such a great show!",
    "it was a horrible movie",
    "i've never watched something as bad"
])

Text supports len(), indexing and access to the raw strings:

[16]:
print(f"Number of sentences: {len(x)}")
print(f"The first sentence: {x[0]}")
print(f"Raw strings: {x.values}")
Number of sentences: 6
The first sentence: ['What a great movie! if you have no taste.']
Raw strings: ['What a great movie! if you have no taste.', 'it was a fantastic performance!', 'best film ever', 'such a great show!', 'it was a horrible movie', "i've never watched something as bad"]

Texts can be converted into lists of tokens, one list per sentence:

[17]:
print(x.to_tokens())
[['what', 'a', 'great', 'movie', '!', 'if', 'you', 'have', 'no', 'taste', '.'], ['it', 'was', 'a', 'fantastic', 'performance', '!'], ['best', 'film', 'ever'], ['such', 'a', 'great', 'show', '!'], ['it', 'was', 'a', 'horrible', 'movie'], ['i', "'ve", 'never', 'watched', 'something', 'as', 'bad']]
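
To make the token output above easier to read, here is a simplified regex tokenizer that lowercases the text, splits off punctuation, and keeps contraction suffixes such as `'ve` together. It is only a sketch: the function `simple_tokenize` is hypothetical, and it is not the tokenizer that `to_tokens` uses internally. It reproduces the output above for these six sentences but may differ on other input.

```python
import re

# Order matters: try a contraction suffix ('ve, 's) first, then a word,
# then any single character that is neither a word character nor whitespace.
_TOKEN_PATTERN = re.compile(r"'\w+|\w+|[^\w\s]")


def simple_tokenize(sentences):
    """Lowercase each sentence and split it into word and punctuation tokens."""
    return [_TOKEN_PATTERN.findall(s.lower()) for s in sentences]


tokens = simple_tokenize([
    "What a great movie! if you have no taste.",
    "i've never watched something as bad",
])
print(tokens[0])
# ['what', 'a', 'great', 'movie', '!', 'if', 'you', 'have', 'no', 'taste', '.']
print(tokens[1])
# ['i', "'ve", 'never', 'watched', 'something', 'as', 'bad']
```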

Time series data

The Timeseries class represents a time series. The values of metrics/variables are stored in a numpy array with shape (timestamps, num_variables). We can construct a Timeseries instance from a pandas dataframe, where the index indicates the timestamps and the columns are the variables.

[18]:
from omnixai.data.timeseries import Timeseries

df = pd.DataFrame(
    [['2017-12-27', 1263.94091, 394.507, 16.530],
     ['2017-12-28', 1299.86398, 506.424, 14.162],
     ['2017-12-29', 1319.76541, 610.314, 15.173]],
    columns=['Date', 'Consumption', 'Wind', 'Solar']
)
df = df.set_index('Date')
df.index = pd.to_datetime(df.index)

[19]:
ts = Timeseries.from_pd(df)
print(ts)
            Consumption     Wind   Solar
2017-12-27   1263.94091  394.507  16.530
2017-12-28   1299.86398  506.424  14.162
2017-12-29   1319.76541  610.314  15.173

Timeseries offers length and shape attributes, row selection and conversion back to a pandas dataframe:

[20]:
print(f"Length of ts: {len(ts)}")
print(f"Length of ts: {ts.ts_len}")
print(f"Metrics: {ts.columns}")
print(f"Time-series shape: {ts.shape}")
print("Select rows:")
print(ts[[1, 0]])
print("To pandas dataframe:")
print(ts.to_pd())
Length of ts: 3
Length of ts: 3
Metrics: ['Consumption', 'Wind', 'Solar']
Time-series shape: (3, 3)
Select rows:
            Consumption     Wind   Solar
2017-12-28   1299.86398  506.424  14.162
2017-12-27   1263.94091  394.507  16.530
To pandas dataframe:
            Consumption     Wind   Solar
2017-12-27   1263.94091  394.507  16.530
2017-12-28   1299.86398  506.424  14.162
2017-12-29   1319.76541  610.314  15.173
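
The layout described above, one row of values per timestamp and one column per variable, can be sketched with the standard library alone. The class `MiniSeries` below is a hypothetical illustration of that `(timestamps, num_variables)` layout and is not the OmniXAI Timeseries implementation.

```python
from datetime import datetime


class MiniSeries:
    """Toy time series: a list of timestamps plus one row of values per timestamp."""

    def __init__(self, timestamps, values, columns):
        if len(timestamps) != len(values):
            raise ValueError("One row of values is required per timestamp")
        if any(len(row) != len(columns) for row in values):
            raise ValueError("Each row needs one value per variable")
        self.timestamps = list(timestamps)
        self.values = [list(row) for row in values]
        self.columns = list(columns)

    def __len__(self):
        # The length of a series is its number of timestamps.
        return len(self.timestamps)

    @property
    def shape(self):
        return (len(self.timestamps), len(self.columns))


raw = [
    ("2017-12-27", 1263.94091, 394.507, 16.530),
    ("2017-12-28", 1299.86398, 506.424, 14.162),
    ("2017-12-29", 1319.76541, 610.314, 15.173),
]
series = MiniSeries(
    timestamps=[datetime.strptime(r[0], "%Y-%m-%d") for r in raw],
    values=[r[1:] for r in raw],
    columns=["Consumption", "Wind", "Solar"],
)
print(len(series))   # 3
print(series.shape)  # (3, 3)
```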