omnixai.data package

`base`	The base class for all data types.
`tabular`	The class for tabular data.
`image`	The class for image data.
`text`	The class for text data.
`timeseries`	The class for time series data.

This package provides classes for representing tabular data, image data, text and time series data, i.e., omnixai.data.tabular, omnixai.data.image, omnixai.data.text and omnixai.data.timeseries, respectively.

Given a pandas dataframe df with a set of categorical column names categorical_columns and a target column target_column (e.g., class labels), we can create a Tabular object as follows:

from omnixai.data.tabular import Tabular

tabular = Tabular(
    data=df,                                  # a pandas dataframe
    categorical_columns=categorical_columns,  # a list of categorical feature names
    target_column=target_column)              # a target column name
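When the categorical columns are not known in advance, one way to build the list is to inspect the column dtypes. The sketch below is a minimal illustration using only pandas; the toy dataframe and the helper name infer_categorical_columns are hypothetical, and treating every non-numeric feature as categorical is an assumption that may not suit every dataset.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def infer_categorical_columns(df, target_column=None):
    """Return the non-numeric feature columns of df (hypothetical helper)."""
    return [
        col for col in df.columns
        if col != target_column and not is_numeric_dtype(df[col])
    ]


# Toy data, made up for illustration only.
df = pd.DataFrame({
    "age": [25, 38, 52],
    "workclass": ["Private", "State-gov", "Private"],
    "education": ["Bachelors", "HS-grad", "Masters"],
    "label": [0, 1, 1],
})
target_column = "label"
categorical_columns = infer_categorical_columns(df, target_column)
print(categorical_columns)
```

The resulting categorical_columns and target_column can then be passed to Tabular in the same way as in the example above.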

If df has no categorical columns or no target column, we can set categorical_columns=None or target_column=None, respectively. We can also create a Tabular object with a numpy array x with a list of feature names feature_columns and categorical feature names categorical_columns:

from omnixai.data.tabular import Tabular

tabular = Tabular(
    data=x,                                   # a numpy array
    feature_columns=feature_columns,          # a list of feature names
    categorical_columns=categorical_columns)  # a list of categorical feature names

If there are no feature names, the default feature names will be the indices in the numpy array, e.g., 0, 1, …

The Image class represents a batch of images. Given a batch of images stored in a numpy array x with shape (batch_size, height, width, channel), we can create an Image instance:

from omnixai.data.image import Image

images = Image(
    data=x,             # a numpy array with shape (batch_size, height, width, channel)
    batched=True,       # if x represents a batch of images
    channel_last=True)  # if the last dimension of x is `channel`

If the last dimension is not channel, namely, x has shape (batch_size, channel, height, width), we need to set channel_last=False instead. If the numpy array x has only one image with shape (height, width, channel), we need to set batched=False because the number of dimensions in x is 3 instead of 4.
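To make the layout flags concrete, the following numpy-only sketch reproduces by hand the conversion that the text describes: adding a batch dimension and moving the channel axis to the end. It does not call OmniXAI; the helper name to_batched_hwc is hypothetical, and arrays without a channel axis (grayscale) are not handled.

```python
import numpy as np


def to_batched_hwc(x, batched, channel_last):
    """Rearrange an image array into (batch_size, height, width, channel)."""
    x = np.asarray(x)
    if not batched:
        # A single image has 3 dimensions; add a leading batch axis of size 1.
        x = x[np.newaxis, ...]
    if x.ndim != 4:
        raise ValueError(f"expected 4 dimensions after batching, got {x.ndim}")
    if not channel_last:
        # (batch, channel, height, width) -> (batch, height, width, channel)
        x = np.transpose(x, (0, 2, 3, 1))
    return x


batch_chw = np.zeros((8, 3, 32, 64))   # channel-first batch
single_hwc = np.zeros((32, 64, 3))     # one channel-last image

print(to_batched_hwc(batch_chw, batched=True, channel_last=False).shape)
print(to_batched_hwc(single_hwc, batched=False, channel_last=True).shape)
```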

We can also convert a Pillow image into an Image instance:

from PIL import Image as PilImage
from omnixai.data.image import Image

im = PilImage.open("an_image.jpg")
image = Image(data=im)

The Text class represents a batch of texts or sentences. Given a list of strings texts, we can create a Text instance:

from omnixai.data.text import Text
text = Text(data=["Hello I'm a single sentence",
                  "And another sentence",
                  "And the very very last one"])

The Text class also accepts a tokenizer parameter, a callable that splits each text/sentence into tokens. If tokenizer is set to None, the default tokenizer nltk.word_tokenize is applied.
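As an illustration of the tokenizer parameter, the sketch below defines a small regular-expression tokenizer using only the standard library. The function name simple_tokenizer is hypothetical, and it assumes the tokenizer is called with one string at a time and returns a list of tokens.

```python
import re


def simple_tokenizer(text):
    """Split text into lowercase word tokens, keeping apostrophes inside words."""
    return re.findall(r"[a-z0-9']+", text.lower())


print(simple_tokenizer("Hello I'm a single sentence"))
```

Under that assumption, such a callable could be supplied as Text(data=texts, tokenizer=simple_tokenizer).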

The Timeseries class represents a batch of time series. The values of metrics/variables are stored in a numpy array with shape (batch_size, timestamps, num_variables). If there is only one time series, batch_size is 1. We can construct a Timeseries instance from one or a list of pandas dataframes. The index of the dataframe indicates the timestamps and the columns are the variables.

import pandas as pd
from omnixai.data.timeseries import Timeseries

# Two daily observations of three variables
df = pd.DataFrame(
    [['2017-12-27', 1263.94091, 394.507, 16.530],
     ['2017-12-28', 1299.86398, 506.424, 14.162]],
    columns=['Date', 'Consumption', 'Wind', 'Solar']
)
df = df.set_index('Date')            # the index holds the timestamps
df.index = pd.to_datetime(df.index)  # parse the date strings
ts = Timeseries.from_pd(df)
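The layout described above can be checked with pandas alone. The sketch below rebuilds the same dataframe and inspects the pieces that the text says a Timeseries is built from: the index as timestamps, the columns as variable names, and the values as an array with shape (timestamps, num_variables). It does not call OmniXAI.

```python
import pandas as pd

df = pd.DataFrame(
    [["2017-12-27", 1263.94091, 394.507, 16.530],
     ["2017-12-28", 1299.86398, 506.424, 14.162]],
    columns=["Date", "Consumption", "Wind", "Solar"],
)
df = df.set_index("Date")
df.index = pd.to_datetime(df.index)

timestamps = df.index              # one timestamp per row
variable_names = list(df.columns)  # one name per variable
values = df.to_numpy()             # shape (timestamps, num_variables)

print(values.shape)
print(variable_names)
```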

omnixai.data.base module

The base class for all data types.

class omnixai.data.base.Data

Bases: object

Abstract base class for different data types.

abstract property data_type

Returns: A string indicating the data type, e.g., tabular, image, text or time series.

abstract values()

Returns: The raw values of the data object.

abstract num_samples()

Returns: The number of samples in the dataset.

omnixai.data.tabular module

The class for tabular data.

class omnixai.data.tabular.Tabular(data, feature_columns=None, categorical_columns=None, target_column=None)

Bases: Data

The class represents a tabular dataset that may contain categorical features, continuous-valued features and targets/labels (optional).

Parameters

data (Union[DataFrame, ndarray]) – A pandas dataframe or a numpy array containing the raw data. data should have the shape (num_samples, num_features).
feature_columns (Optional[List]) – The feature column names. When feature_columns is None, feature_columns will be the column names in the pandas dataframe or the indices in the numpy array.
categorical_columns (Optional[List]) – A list of categorical feature names, e.g., a subset of feature column names in a pandas dataframe. If data is a numpy array and feature_columns = None, categorical_columns should be the indices of categorical features.
target_column (Union[str, int, None]) – The target/label column name. Set target_column to None if there is no target column.

data_type = 'tabular'

iloc(i)

Returns the row(s) given an index or a set of indices.

Parameters: i (Union[int, slice, list]) – An integer index, slice or list.
Returns: A tabular object with the selected rows.
Return type: Tabular

property shape: tuple

Returns the data shape, e.g., (num_samples, num_features).

Returns: A tuple for the data shape.
Return type: tuple

num_samples()

Returns the number of examples.

Returns: The number of examples.
Return type: int

property values: ndarray

Returns the raw values of the data object (without feature column names).

Returns: A numpy array of the data object.
Return type: np.ndarray

property categorical_columns: List

Gets the categorical feature names.

Returns: The list of the categorical feature names.
Return type: Union[List[str], List[int]]

property continuous_columns: List

Gets the continuous-valued feature names.

Returns: The list of the continuous-valued feature names.
Return type: Union[List[str], List[int]]

property feature_columns: List

Gets all feature names.

Returns: The list of all the feature column names except the target column.
Return type: Union[List[str], List[int]]

property target_column: Union[str, int]

Gets the target/label column name.

Returns: The target column name, or None if there is no target column.
Return type: Union[str, int]

property columns: Sequence

Gets all the data columns including both the feature columns and target/label column.

Returns: The list of the column names.
Return type: Sequence

to_pd(copy=True)

Converts Tabular to pd.DataFrame.

Parameters: copy – True if it returns a data copy, or False otherwise.
Returns: A pandas DataFrame representing the tabular data.
Return type: pd.DataFrame

to_numpy(copy=True)

Converts Tabular to np.ndarray.

Parameters: copy – True if it returns a data copy, or False otherwise.
Returns: A numpy ndarray representing the tabular data.
Return type: np.ndarray

copy()

Returns a copy of the tabular data.

Returns: The copied tabular data.
Return type: Tabular

remove_target_column()

Removes the target/label column and returns a new Tabular instance.

Returns: The new tabular data without target/label column.
Return type: Tabular

get_target_column()

Returns the target/label column.

Returns: A list of targets or labels.
Return type: List

get_continuous_medians()

Gets the absolute median values of the continuous-valued features.

Returns: A dict storing the absolute median value for each continuous-valued feature.
Return type: Dict

get_continuous_bounds()

Gets the upper and lower bounds of the continuous-valued features.

Returns: The upper and lower bounds, i.e., a tuple of two numpy arrays.
Return type: tuple
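The two statistics above can be approximated with numpy and pandas. This is a sketch of one plausible reading, not the library's implementation: it assumes that "absolute median" means the median of the absolute values, that the bounds are the per-feature minimum and maximum, and that they are returned in (lower, upper) order. The helper names and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def continuous_medians(df, continuous_columns):
    """Median of absolute values per column (assumed meaning of 'absolute median')."""
    return {c: float(np.median(np.abs(df[c].to_numpy(dtype=float))))
            for c in continuous_columns}


def continuous_bounds(df, continuous_columns):
    """Per-column minimum and maximum, returned as (lower, upper) arrays."""
    x = df[continuous_columns].to_numpy(dtype=float)
    return x.min(axis=0), x.max(axis=0)


# Toy data, made up for illustration only.
df = pd.DataFrame({"age": [25, 38, 52], "balance": [-10.0, 4.0, 6.0]})
cols = ["age", "balance"]
medians = continuous_medians(df, cols)
lower, upper = continuous_bounds(df, cols)
print(medians)
print(lower, upper)
```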

omnixai.data.image module

The class for image data.

class omnixai.data.image.Image(data=None, batched=False, channel_last=True)

Bases: Data

The class represents a batch of images. It supports both grayscale and RGB images. It will convert the input images into the (batch_size, h, w, channel) format. If there is only one input image, batch_size will be 1.

Parameters

data (Union[ndarray, Image, None]) – The image data, which is either np.ndarray or PIL.Image. If data is a numpy array, it should have the following format: (h, w, channel), (channel, h, w), (batch_size, h, w, channel) or (batch_size, channel, h, w). If data is a PIL.Image, batched and channel_last are ignored. The images contained in data will be automatically converted into a numpy array with shape (batch_size, h, w, channel). If there is only one image, batch_size will be 1.
batched (bool) – True if the first dimension of data is the batch size. False if data has one image only.
channel_last (bool) – True if the last dimension of data is the color channel or False if the first or second dimension of data is the color channel. If data has no color channel, e.g., grayscale images, this argument is ignored.

data_type = 'image'

property shape: tuple

Returns the raw data shape.

Returns: A tuple for the raw data shape, e.g., (batch_size, h, w, channel).
Return type: tuple

num_samples()

Returns the number of images.

Returns: The number of images.
Return type: int

property image_shape: tuple

Returns the image shape.

Returns: A tuple for the image shape, e.g., (h, w, channel).
Return type: tuple

property values: ndarray

Returns the raw values.

Returns: A numpy array of the stored images.
Return type: np.ndarray

to_numpy(hwc=True, copy=True, keepdim=False)

Converts Image into a numpy ndarray.

Parameters

hwc – The output has format (batch_size, h, w, channel) if hwc is True or (batch_size, channel, h, w) otherwise.
copy – True if it returns a data copy, or False otherwise.
keepdim – True if the number of dimensions is kept for grayscale images, False if the channel dimension is squeezed.

Returns

A numpy ndarray representing the images.

Return type

np.ndarray
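The effect of hwc and keepdim can be pictured with plain numpy operations on an array that is already stored as (batch_size, h, w, channel). The function below is a hypothetical stand-in, not the library code; in particular, it assumes that keepdim=False simply drops a channel axis of size 1.

```python
import numpy as np


def export_images(stored, hwc=True, keepdim=False):
    """Mimic the documented output layouts for a (batch, h, w, channel) array."""
    out = stored.copy()
    if out.shape[-1] == 1 and not keepdim:
        # Grayscale: squeeze the channel axis -> (batch, h, w)
        return out[..., 0]
    if not hwc:
        # (batch, h, w, channel) -> (batch, channel, h, w)
        out = np.transpose(out, (0, 3, 1, 2))
    return out


rgb = np.zeros((4, 32, 64, 3))
gray = np.zeros((4, 32, 64, 1))

print(export_images(rgb, hwc=False).shape)
print(export_images(gray).shape)
print(export_images(gray, keepdim=True).shape)
```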

to_pil()

Converts Image into a Pillow image or a list of Pillow images.

Returns: A single Pillow image if batch_size = 1 or a list of Pillow images if batch_size > 1.
Return type: Union[PilImage.Image, List]

copy()

Returns a copy of the image data.

Returns: The copied image data.
Return type: Image

omnixai.data.text module

The class for text data.

class omnixai.data.text.Text(data=None, tokenizer=None)

Bases: Data

The class represents a batch of texts or sentences. The texts or sentences are stored in a list of strings.

Parameters

data (Union[List, str, None]) – The text data, either a string or a list of strings.
tokenizer (Optional[Callable]) – A tokenizer for splitting texts/sentences into tokens, which should be a callable object. If tokenizer is None, a default nltk tokenizer will be applied.

data_type = 'text'

num_samples()

Returns the number of texts or sentences.

Returns: The number of texts or sentences.
Return type: int

property values

Returns the raw text data.

Returns: A list of the sentences/texts.
Return type: List

to_tokens(**kwargs)

Converts sentences/texts into tokens. If tokenizer is None, a default split function, e.g., nltk.word_tokenize, is called to split a sentence into tokens. For example, ["omnixai library", "explainable AI"] will be split into [["omnixai", "library"], ["explainable", "AI"]].

Parameters: kwargs – Additional parameters for the tokenizer
Returns: A batch of tokens.
Return type: List
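The batching behaviour described above amounts to applying the tokenizer to each string and forwarding any extra keyword arguments. The following is a minimal stand-in, using str.split in place of nltk.word_tokenize so that it runs without extra dependencies; the function name to_tokens_sketch is hypothetical.

```python
def to_tokens_sketch(texts, tokenizer=str.split, **kwargs):
    """Tokenize each string in texts, passing kwargs through to the tokenizer."""
    return [tokenizer(text, **kwargs) for text in texts]


print(to_tokens_sketch(["omnixai library", "explainable AI"]))
print(to_tokens_sketch(["a,b c", "d"], sep=","))
```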

to_str(copy=True)

Returns a string if it has only one sentence or a list of strings if it contains multiple sentences.

Parameters: copy – Whether to copy the data.
Returns: A single string or a list of strings.
Return type: Union[List, str]

split(sep=None, maxsplit=-1)

copy()

Returns a copy of the text data.

Returns: The copied text data.
Return type: Text

omnixai.data.timeseries module

The class for time series data.

class omnixai.data.timeseries.Timeseries(data, timestamps=None, variable_names=None)

Bases: Data

This class represents a univariate/multivariate time series dataset. The dataset contains a time series whose metric values are stored in a numpy array with shape (timestamps, num_variables).

Parameters

data (ndarray) – A numpy array containing a time series. The shape of data is (timestamps, num_variables).
timestamps (Optional[List]) – A list of timestamps.
variable_names (Optional[List]) – A list of metric/variable names.

data_type = 'timeseries'

property ts_len: int

Returns the length of the time series.

Return type: int

property shape: tuple

Returns the raw data shape, e.g., (timestamps, num_variables).

Returns: A tuple for the raw data shape.
Return type: tuple

num_samples()

Returns 1 because a Timeseries object contains only one time series.

Return type: int

property values: ndarray

Returns the raw values of the data object.

Return type: ndarray
Returns: A numpy array of the data object.

property columns: List

Gets the metric/variable names.

Return type: List
Returns: The list of the metric/variable names.

property index: ndarray

Gets the timestamps.

Return type: ndarray
Returns: A list of timestamps.

to_pd()

Converts Timeseries to pd.DataFrame.

Return type: DataFrame
Returns: A pandas dataframe representing the time series.

to_numpy(copy=True)

Converts Timeseries to np.ndarray.

Parameters: copy – True if it returns a data copy, or False otherwise.
Return type: ndarray
Returns: A numpy ndarray representing the time series.

copy()

Returns a copy of the time series instance.

Returns: The copied time series instance.
Return type: Timeseries

classmethod from_pd(df)

Creates a Timeseries instance from one or multiple pandas dataframes. df is either one pandas dataframe or a list of pandas dataframes. The index of each dataframe should contain the timestamps.

Returns: A Timeseries instance.
Return type: Timeseries

static get_timestamp_info(ts)

Returns a dict containing timestamp information, e.g., timestamp index name, timestamp values.

Parameters: ts – A Timeseries instance.
Returns: The timestamp information.

static reset_timestamp_index(ts, timestamp_info)

Moves the timestamp index to a column and converts timestamps into floats.

Parameters

ts – A Timeseries instance.
timestamp_info – The timestamp information.

Returns

A converted Timeseries instance.

static restore_timestamp_index(ts, timestamp_info)

Moves the timestamp column to the index and converts the floats back to timestamps.

Parameters

ts – A Timeseries instance with a @timestamp column.
timestamp_info – The timestamp information.

Returns

The original time-series dataframe.
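The round trip performed by reset_timestamp_index and restore_timestamp_index can be pictured on a plain pandas dataframe. The sketch below is only an analogy under stated assumptions: it works on dataframes rather than Timeseries objects, it represents timestamps as float seconds since the Unix epoch (the library's actual float encoding is not specified here), and the function names are hypothetical. The @timestamp column name follows the parameter description above.

```python
import pandas as pd

EPOCH = pd.Timestamp("1970-01-01")


def reset_index_sketch(df, column="@timestamp"):
    """Move a datetime index into a float column of seconds since the epoch."""
    seconds = ((df.index - EPOCH) / pd.Timedelta(seconds=1)).to_numpy()
    out = df.reset_index(drop=True)
    out.insert(0, column, seconds)
    return out


def restore_index_sketch(df, index_name, column="@timestamp"):
    """Turn the float column back into a datetime index."""
    out = df.copy()
    out.index = pd.to_datetime(out.pop(column), unit="s")
    out.index.name = index_name
    return out


df = pd.DataFrame(
    {"Consumption": [1263.94091, 1299.86398]},
    index=pd.to_datetime(["2017-12-27", "2017-12-28"]),
)
df.index.name = "Date"

flat = reset_index_sketch(df)
restored = restore_index_sketch(flat, index_name="Date")
print(flat)
print(restored)
```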