omnixai.data package

base

The base class for all data types.

tabular

The class for tabular data.

image

The class for image data.

text

The class for text data.

timeseries

The class for time series data.

This package provides classes for representing tabular data, image data, text and time series data, i.e., omnixai.data.tabular, omnixai.data.image, omnixai.data.text and omnixai.data.timeseries, respectively.

Given a pandas dataframe df with a set of categorical column names categorical_columns and a target column target_column (e.g., class labels), we can create a Tabular object as follows:

from omnixai.data.tabular import Tabular

tabular = Tabular(
    data=df,                                  # a pandas dataframe
    categorical_columns=categorical_columns,  # a list of categorical feature names
    target_column=target_column)              # a target column name

If df has no categorical columns or no target column, we can set categorical_columns=None or target_column=None, respectively. We can also create a Tabular object with a numpy array x with a list of feature names feature_columns and categorical feature names categorical_columns:

from omnixai.data.tabular import Tabular

tabular = Tabular(
    data=x,                                   # a numpy array
    feature_columns=feature_columns,          # a list of feature names
    categorical_columns=categorical_columns)  # a list of categorical feature names

If there are no feature names, the default feature names will be the indices in the numpy array, e.g., 0, 1, …

The Image class represents a batch of images. Given a batch of images stored in a numpy array x with shape (batch_size, height, width, channel), we can create an Image instance:

from omnixai.data.image import Image

images = Image(
    data=x,             # a numpy array with shape (batch_size, height, width, channel)
    batched=True,       # if x represents a batch of images
    channel_last=True)  # if the last dimension of x is `channel`

If the last dimension is not channel, namely, x has shape (batch_size, channel, height, width), we need to set channel_last=False instead. If the numpy array x has only one image with shape (height, width, channel), we need to set batched=False because the number of dimensions in x is 3 instead of 4.
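The relationship between these layouts can be illustrated with plain numpy. This is a sketch only, not the library's conversion code: it assumes a channel-first array is rearranged with a transpose and that a single image becomes a batch of one by adding a leading axis. The array sizes are arbitrary.

```python
import numpy as np

# A batch of 2 RGB images in channel-first layout: (batch_size, channel, height, width).
chw_batch = np.zeros((2, 3, 32, 48))

# Moving the channel axis to the end gives (batch_size, height, width, channel).
hwc_batch = np.transpose(chw_batch, (0, 2, 3, 1))
print(hwc_batch.shape)

# A single image has only 3 dimensions; a leading axis turns it into a batch of one.
single_hwc = np.zeros((32, 48, 3))
as_batch = np.expand_dims(single_hwc, axis=0)
print(as_batch.shape)
```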

We can also convert a Pillow image into an Image instance:

from PIL import Image as PilImage
from omnixai.data.image import Image

im = PilImage.open("an_image.jpg")
image = Image(data=im)

The Text class represents a batch of texts or sentences. Given a list of strings texts, we can create a Text instance:

from omnixai.data.text import Text
text = Text(data=["Hello I'm a single sentence",
                  "And another sentence",
                  "And the very very last one"])

The Text class also accepts a tokenizer parameter, a callable that splits each text/sentence into tokens. If tokenizer is set to None, the default tokenizer nltk.word_tokenize is applied.
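As a rough illustration of what a custom tokenizer might look like, the sketch below defines a hypothetical simple_tokenizer function using only the standard library. It assumes the tokenizer is called with one string at a time and returns a list of tokens; under that assumption it could be passed as Text(data=texts, tokenizer=simple_tokenizer).

```python
import re

def simple_tokenizer(text):
    """Lowercase the text and split it into runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = simple_tokenizer("Hello I'm a single sentence")
print(tokens)
```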

The Timeseries class represents a batch of time series. The values of metrics/variables are stored in a numpy array with shape (batch_size, timestamps, num_variables). If there is only one time series, batch_size is 1. We can construct a Timeseries instance from one or a list of pandas dataframes. The index of the dataframe indicates the timestamps and the columns are the variables.

import pandas as pd
from omnixai.data.timeseries import Timeseries

df = pd.DataFrame(
    [['2017-12-27', 1263.94091, 394.507, 16.530],
     ['2017-12-28', 1299.86398, 506.424, 14.162]],
    columns=['Date', 'Consumption', 'Wind', 'Solar']
)
df = df.set_index('Date')             # use the dates as the index
df.index = pd.to_datetime(df.index)   # parse the index into timestamps
ts = Timeseries.from_pd(df)
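To make the (batch_size, timestamps, num_variables) layout concrete, the sketch below stacks several dataframes into one array with pandas and numpy alone. It is not how the library builds its internal array; it assumes every dataframe has the same length and the same columns. The make_df helper and the numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

def make_df(dates, consumption, wind):
    """Build a small dataframe indexed by timestamps (hypothetical helper)."""
    df = pd.DataFrame({"Consumption": consumption, "Wind": wind},
                      index=pd.to_datetime(dates))
    df.index.name = "Date"
    return df

dfs = [
    make_df(["2017-12-27", "2017-12-28", "2017-12-29"], [1.0, 2.0, 3.0], [0.1, 0.2, 0.3]),
    make_df(["2018-01-01", "2018-01-02", "2018-01-03"], [4.0, 5.0, 6.0], [0.4, 0.5, 0.6]),
]

# Stack the per-dataframe value arrays along a new leading batch axis.
batch = np.stack([df.to_numpy() for df in dfs], axis=0)
print(batch.shape)  # (batch_size, timestamps, num_variables)
```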

omnixai.data.base module

The base class for all data types.

class omnixai.data.base.Data

Bases: object

Abstract base class for different data types.

abstract property data_type
Returns

A string indicating the data type, e.g., tabular, image, text or time series.

abstract values()
Returns

The raw values of the data object.

abstract num_samples()
Returns

The number of samples in the dataset.
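The sketch below shows what a subclass has to provide to satisfy an interface of this shape. To run without omnixai installed, it uses a local stand-in base class, DataSketch, which mirrors the three abstract members listed above. Both DataSketch and the ListData subclass are hypothetical names and are not part of the library.

```python
from abc import ABC, abstractmethod

class DataSketch(ABC):
    """Local stand-in that mirrors the abstract interface described above."""

    @property
    @abstractmethod
    def data_type(self):
        """A string naming the data type."""

    @abstractmethod
    def values(self):
        """The raw values of the data object."""

    @abstractmethod
    def num_samples(self):
        """The number of samples in the dataset."""

class ListData(DataSketch):
    """Hypothetical concrete type that wraps a plain Python list."""

    # A class attribute is enough to override the abstract property.
    data_type = "list"

    def __init__(self, items):
        self._items = list(items)

    def values(self):
        return self._items

    def num_samples(self):
        return len(self._items)

data = ListData([3, 1, 4])
print(data.data_type, data.num_samples())
```

Instantiating DataSketch directly raises a TypeError, because its abstract members are not implemented.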

omnixai.data.tabular module

The class for tabular data.

class omnixai.data.tabular.Tabular(data, feature_columns=None, categorical_columns=None, target_column=None)

Bases: Data

The class represents a tabular dataset that may contain categorical features, continuous-valued features and targets/labels (optional).

Parameters
  • data (Union[DataFrame, ndarray]) – A pandas dataframe or a numpy array containing the raw data. data should have the shape (num_samples, num_features).

  • feature_columns (Optional[List]) – The feature column names. When feature_columns is None, feature_columns will be the column names in the pandas dataframe or the indices in the numpy array.

  • categorical_columns (Optional[List]) – A list of categorical feature names, e.g., a subset of feature column names in a pandas dataframe. If data is a numpy array and feature_columns = None, categorical_columns should be the indices of categorical features.

  • target_column (Union[str, int, None]) – The target/label column name. Set target_column to None if there is no target column.

data_type = 'tabular'

iloc(i)

Returns the row(s) given an index or a set of indices.

Parameters

i (Union[int, slice, list]) – An integer index, slice or list.

Returns

A tabular object with the selected rows.

Return type

Tabular

property shape: tuple

Returns the data shape, e.g., (num_samples, num_features).

Returns

A tuple for the data shape.

Return type

tuple

num_samples()

Returns the number of examples.

Returns

The number of examples.

Return type

int

property values: ndarray

Returns the raw values of the data object (without feature column names).

Returns

A numpy array of the data object.

Return type

np.ndarray

property categorical_columns: List

Gets the categorical feature names.

Returns

The list of the categorical feature names.

Return type

Union[List[str], List[int]]

property continuous_columns: List

Gets the continuous-valued feature names.

Returns

The list of the continuous-valued feature names.

Return type

Union[List[str], List[int]]

property feature_columns: List

Gets all feature names.

Returns

The list of all the feature column names except the target column.

Return type

Union[List[str], List[int]]

property target_column: Union[str, int]

Gets the target/label column name.

Returns

The target column name, or None if there is no target column.

Return type

Union[str, int]

property columns: Sequence

Gets all the data columns including both the feature columns and target/label column.

Returns

The list of the column names.

Return type

Sequence

to_pd(copy=True)

Converts Tabular to pd.DataFrame.

Parameters

copy – True if it returns a data copy, or False otherwise.

Returns

A pandas DataFrame representing the tabular data.

Return type

pd.DataFrame

to_numpy(copy=True)

Converts Tabular to np.ndarray.

Parameters

copy – True if it returns a data copy, or False otherwise.

Returns

A numpy ndarray representing the tabular data.

Return type

np.ndarray

copy()

Returns a copy of the tabular data.

Returns

The copied tabular data.

Return type

Tabular

remove_target_column()

Removes the target/label column and returns a new Tabular instance.

Returns

The new tabular data without target/label column.

Return type

Tabular

get_target_column()

Returns the target/label column.

Returns

A list of targets or labels.

Return type

List

get_continuous_medians()

Gets the absolute median values of the continuous-valued features.

Returns

A dict storing the absolute median value for each continuous-valued feature.

Return type

Dict

get_continuous_bounds()

Gets the upper and lower bounds of the continuous-valued features.

Returns

The upper and lower bounds, i.e., a tuple of two numpy arrays.

Return type

tuple
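As a rough picture of what per-feature bounds look like, the sketch below computes them with numpy alone. It assumes the bounds are the minimum and maximum of each continuous-valued column. The (lower, upper) order used here is also an assumption, since the description above does not fix the order of the two arrays.

```python
import numpy as np

# Three samples with two continuous-valued features (arbitrary numbers).
x = np.array([[1.0, 20.0],
              [4.0, 10.0],
              [2.0, 30.0]])

# Assumption: the bounds are the per-feature minimum and maximum.
lower = x.min(axis=0)
upper = x.max(axis=0)
print(lower, upper)
```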

omnixai.data.image module

The class for image data.

class omnixai.data.image.Image(data=None, batched=False, channel_last=True)

Bases: Data

The class represents a batch of images. It supports both grayscale and RGB images. It will convert the input images into the (batch_size, h, w, channel) format. If there is only one input image, batch_size will be 1.

Parameters
  • data (Union[ndarray, Image, None]) – The image data, which is either np.ndarray or PIL.Image. If data is a numpy array, it should have the following format: (h, w, channel), (channel, h, w), (batch_size, h, w, channel) or (batch_size, channel, h, w). If data is a PIL.Image, batched and channel_last are ignored. The images contained in data will be automatically converted into a numpy array with shape (batch_size, h, w, channel). If there is only one image, batch_size will be 1.

  • batched (bool) – True if the first dimension of data is the batch size. False if data has one image only.

  • channel_last (bool) – True if the last dimension of data is the color channel or False if the first or second dimension of data is the color channel. If data has no color channel, e.g., grayscale images, this argument is ignored.

data_type = 'image'

property shape: tuple

Returns the raw data shape.

Returns

A tuple for the raw data shape, e.g., (batch_size, h, w, channel).

Return type

tuple

num_samples()

Returns the number of images.

Returns

The number of images.

Return type

int

property image_shape: tuple

Returns the image shape.

Returns

A tuple for the image shape, e.g., (h, w, channel).

Return type

tuple

property values: ndarray

Returns the raw values.

Returns

A numpy array of the stored images.

Return type

np.ndarray

to_numpy(hwc=True, copy=True, keepdim=False)

Converts Image into a numpy ndarray.

Parameters
  • hwc – The output has format (batch_size, h, w, channel) if hwc is True or (batch_size, channel, h, w) otherwise.

  • copy – True if it returns a data copy, or False otherwise.

  • keepdim – True if the number of dimensions is kept for grayscale images, False if the channel dimension is squeezed.

Returns

A numpy ndarray representing the images.

Return type

np.ndarray

to_pil()

Converts Image into a Pillow image or a list of Pillow images.

Returns

A single Pillow image if batch_size = 1 or a list of Pillow images if batch_size > 1.

Return type

Union[PilImage.Image, List]

copy()

Returns a copy of the image data.

Returns

The copied image data.

Return type

Image

omnixai.data.text module

The class for text data.

class omnixai.data.text.Text(data=None, tokenizer=None)

Bases: Data

The class represents a batch of texts or sentences. The texts or sentences are stored in a list of strings.

Parameters
  • data (Union[List, str, None]) – The text data, either a string or a list of strings.

  • tokenizer (Optional[Callable]) – A tokenizer for splitting texts/sentences into tokens, which should be a callable object. If tokenizer is None, a default nltk tokenizer will be applied.

data_type = 'text'

num_samples()

Returns the number of texts or sentences.

Returns

The number of texts or sentences.

Return type

int

property values

Returns the raw text data.

Returns

A list of the sentences/texts.

Return type

List

to_tokens(**kwargs)

Converts sentences/texts into tokens. If tokenizer is None, a default split function, e.g., nltk.word_tokenize, is called to split each sentence into tokens. For example, ["omnixai library", "explainable AI"] will be split into [["omnixai", "library"], ["explainable", "AI"]].

Parameters

kwargs – Additional parameters for the tokenizer

Returns

A batch of tokens.

Return type

List

to_str(copy=True)

Returns a string if it has only one sentence or a list of strings if it contains multiple sentences.

Parameters

copy – Whether to copy the data.

Returns

A single string or a list of strings.

Return type

Union[List, str]

split(sep=None, maxsplit=-1)

copy()

Returns a copy of the text data.

Returns

The copied text data.

Return type

Text

omnixai.data.timeseries module

The class for time series data.

class omnixai.data.timeseries.Timeseries(data, timestamps=None, variable_names=None)

Bases: Data

This class represents a univariate/multivariate time series dataset. The dataset contains a time series whose metric values are stored in a numpy array with shape (timestamps, num_variables).

Parameters
  • data (ndarray) – A numpy array containing a time series. The shape of data is (timestamps, num_variables).

  • timestamps (Optional[List]) – A list of timestamps.

  • variable_names (Optional[List]) – A list of metric/variable names.

data_type = 'timeseries'

property ts_len: int

Returns the length of the time series.

Return type

int

property shape: tuple

Returns the raw data shape, e.g., (timestamps, num_variables).

Returns

A tuple for the raw data shape.

Return type

tuple

num_samples()

Returns 1 because a Timeseries object only contains one time-series.

Returns

1, the number of time series stored in the object.

Return type

int

property values: ndarray

Returns the raw values of the data object.

Return type

ndarray

Returns

A numpy array of the data object.

property columns: List

Gets the metric/variable names.

Return type

List

Returns

The list of the metric/variable names.

property index: ndarray

Gets the timestamps.

Return type

ndarray

Returns

A list of timestamps.

to_pd()

Converts Timeseries to pd.DataFrame.

Return type

DataFrame

Returns

A pandas dataframe representing the time series.

to_numpy(copy=True)

Converts Timeseries to np.ndarray.

Parameters

copy – True if it returns a data copy, or False otherwise.

Return type

ndarray

Returns

A numpy ndarray representing the time series.

copy()

Returns a copy of the time series instance.

Returns

The copied time series instance.

Return type

Timeseries

classmethod from_pd(df)

Creates a Timeseries instance from one or multiple pandas dataframes. df is either one pandas dataframe or a list of pandas dataframes. The index of each dataframe should contain the timestamps.

Returns

A Timeseries instance.

Return type

Timeseries

static get_timestamp_info(ts)

Returns a dict containing timestamp information, e.g., timestamp index name, timestamp values.

Parameters

ts – A Timeseries instance.

Returns

The timestamp information.

static reset_timestamp_index(ts, timestamp_info)

Moves the timestamp index to a column and converts timestamps into floats.

Parameters
  • ts – A Timeseries instance.

  • timestamp_info – The timestamp information.

Returns

A converted Timeseries instance.

static restore_timestamp_index(ts, timestamp_info)

Moves the timestamp column to the index and converts the floats back to timestamps.

Parameters
  • ts – A Timeseries instance with a @timestamp column.

  • timestamp_info – The timestamp information.

Returns

The original time-series dataframe.
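The round trip described by reset_timestamp_index and restore_timestamp_index can be sketched with pandas alone. This is not the library's implementation. It assumes the float representation is seconds since the Unix epoch, and it reuses the @timestamp column name mentioned above.

```python
import pandas as pd

df = pd.DataFrame(
    {"Consumption": [1263.94091, 1299.86398]},
    index=pd.to_datetime(["2017-12-27", "2017-12-28"]),
)
df.index.name = "Date"

epoch = pd.Timestamp("1970-01-01")

# Reset: move the timestamps into a column as float seconds since the epoch.
seconds = ((df.index - epoch) / pd.Timedelta(seconds=1)).to_numpy()
flat = df.reset_index(drop=True)
flat.insert(0, "@timestamp", seconds)

# Restore: convert the floats back to timestamps and use them as the index again.
restored = flat.drop(columns="@timestamp")
restored.index = pd.to_datetime(flat["@timestamp"].to_numpy(), unit="s")
restored.index.name = "Date"
print(restored)
```

In this sketch the index name has to be restored by hand, which suggests why a separate timestamp_info object is passed between the two calls.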