omnixai.preprocessing package

base

The base class for all the transforms.

encode

The pre-processing functions for categorical and continuous-valued features.

fill

The pre-processing functions for filling NaNs and missing values.

normalize

The pre-processing functions for continuous-valued features.

pipeline

The pipeline for multiple pre-processing transforms.

tabular

The pre-processing function for tabular data.

image

The transformations for image data.

text

The pre-processing functions for text data.

This package provides a number of data pre-processing transforms. Each transform inherits from omnixai.preprocessing.base.TransformBase and implements three main methods:

  • fit(self, x): Estimates the parameters of the transform with data x.

  • transform(self, x): Applies the transform to the input data x.

  • invert(self, x): Applies the inverse transform to the input data x.
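
The three-method contract can be illustrated with a small, self-contained sketch. The class names TransformSketch and MinMaxSketch are hypothetical stand-ins written for this example; they do not import OmniXAI and should not be read as the library's implementation.

```python
from abc import ABC, abstractmethod


class TransformSketch(ABC):
    """Hypothetical stand-in for the fit/transform/invert contract."""

    @abstractmethod
    def fit(self, x):
        """Estimates parameters from x and returns self."""

    @abstractmethod
    def transform(self, x):
        """Applies the transform to x."""

    @abstractmethod
    def invert(self, x):
        """Applies the inverse transform to x."""


class MinMaxSketch(TransformSketch):
    """Rescales a list of numbers to [0, 1] (illustration only)."""

    def fit(self, x):
        self.lo, self.hi = min(x), max(x)
        # Guard against a constant column to avoid division by zero.
        self.span = (self.hi - self.lo) or 1.0
        return self

    def transform(self, x):
        return [(v - self.lo) / self.span for v in x]

    def invert(self, x):
        return [v * self.span + self.lo for v in x]


data = [1, 2, 2, 6]
t = MinMaxSketch().fit(data)       # fit returns the instance, so calls can be chained
scaled = t.transform(data)         # values now lie in [0, 1]
restored = t.invert(scaled)        # maps the values back to the original scale
```

Because fit returns the instance, construction and fitting can be written as a single expression, which is the pattern the examples in this section use.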

For example, omnixai.preprocessing.tabular.TabularTransform provides a convenient way to pre-process the features of a tabular dataset:

import pandas as pd
from omnixai.data.tabular import Tabular
from omnixai.preprocessing.normalize import MinMax
from omnixai.preprocessing.encode import OneHot
from omnixai.preprocessing.tabular import TabularTransform

x = Tabular(
    data=pd.DataFrame({
        'A': [1, 2, 2, 6],
        'B': [5, 4, 3, 2],
        'C': ['a', 'b', 'c', 'd']
    }),
    categorical_columns=['C']
)
transform = TabularTransform(
    cate_transform=OneHot(),        # One-hot encoding for categorical features
    cont_transform=MinMax()         # Min-max normalization for continuous-valued features
).fit(x)
y = transform.transform(x)          # Transforms tabular data into a numpy array
z = transform.invert(y)             # Applies the inverse transform

Note that some transforms, such as FillNaN and FillNaNTabular, have only pseudo-inverse transforms that may not recover the original data.

For Image data, one can transform images in a similar way:

from PIL import Image as PilImage
from omnixai.data.image import Image
from omnixai.preprocessing.image import Resize

img = Image(PilImage.open('some_image.jpg'))
transform = Resize(size=(360, 240))             # A transform for resizing images
x = transform.transform(img)                    # Applies the transform
y = transform.invert(x)                         # Applies the inverse transform

For Text data, one can apply a TF-IDF transform as follows:

from omnixai.data.text import Text
from omnixai.preprocessing.text import Tfidf

text = Text(
    data=["Hello I'm a single sentence",
          "And another sentence",
          "And the very very last one"]
)
transform = Tfidf().fit(text)                   # Fit a TF-IDF transform
vectors = transform.transform(text)             # Applies the transform to obtain feature vectors

omnixai.preprocessing.base module

The base class for all the transforms.

class omnixai.preprocessing.base.TransformBase

Bases: object

Abstract base class for a data pre-processing transform.

abstract fit(x)

Estimates the parameters of the transform.

Parameters

x – The data for estimating the parameters.

Returns

The current instance.

abstract transform(x)

Applies the transform to the input data.

Parameters

x – The data on which to apply the transform.

Returns

The transformed data.

abstract invert(x)

Applies the inverse transform to the input data.

Parameters

x – The data on which to apply the inverse transform.

Returns

The inverse transformed data.

class omnixai.preprocessing.base.Identity

Bases: TransformBase

Identity transformation.

fit(x)

Estimates the parameters of the transform.

Parameters

x – The data for estimating the parameters.

Returns

The current instance.

transform(x)

Applies the transform to the input data.

Parameters

x – The data on which to apply the transform.

Returns

The transformed data.

invert(x)

Applies the inverse transform to the input data.

Parameters

x – The data on which to apply the inverse transform.

Returns

The inverse transformed data.

omnixai.preprocessing.encode module

The pre-processing functions for categorical and continuous-valued features.

class omnixai.preprocessing.encode.KBins(n_bins, **kwargs)

Bases: TransformBase

Discretizes continuous values into bins.

fit(x)

Estimates the parameters of the transform.

Parameters

x – The data for estimating the parameters.

Returns

The current instance.

transform(x)

Applies the transform to the input data.

Parameters

x – The data on which to apply the transform.

Returns

The transformed data.

invert(x)

Applies the inverse transform to the input data.

Parameters

x – The data on which to apply the inverse transform.

Returns

The inverse transformed data.

class omnixai.preprocessing.encode.OneHot(drop=None, **kwargs)

Bases: TransformBase

One-hot encoding for categorical values.

fit(x)

Estimates the parameters of the transform.

Parameters

x – The data for estimating the parameters.

Returns

The current instance.

transform(x)

Applies the transform to the input data.

Parameters

x – The data on which to apply the transform.

Returns

The transformed data.

invert(x)

Applies the inverse transform to the input data.

Parameters

x – The data on which to apply the inverse transform.

Returns

The inverse transformed data.

property categories

Returns the categories for each feature.

get_feature_names(input_features=None)

Returns the feature names in the transformed data.
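
As a rough illustration of what one-hot encoding and its inverse do, the sketch below encodes a single categorical column in plain Python. OneHotSketch, its prefix argument, and the "prefix_category" naming scheme are assumptions made for this example, not the behaviour of the OneHot class documented above.

```python
class OneHotSketch:
    """Hypothetical one-hot encoder for a single categorical column."""

    def fit(self, values):
        # Sorted unique categories define the column order of the encoding.
        self.categories = sorted(set(values))
        self.index = {c: i for i, c in enumerate(self.categories)}
        return self

    def transform(self, values):
        rows = []
        for v in values:
            row = [0] * len(self.categories)
            row[self.index[v]] = 1
            rows.append(row)
        return rows

    def invert(self, rows):
        # The position of the largest entry identifies the category.
        return [self.categories[row.index(max(row))] for row in rows]

    def get_feature_names(self, prefix="C"):
        # Hypothetical naming scheme: one name per (column, category) pair.
        return [f"{prefix}_{c}" for c in self.categories]


enc = OneHotSketch().fit(["a", "b", "c", "d"])
encoded = enc.transform(["b", "d"])
decoded = enc.invert(encoded)
names = enc.get_feature_names()
```

Each input value becomes a row with a single 1, so the inverse only needs to look up which position is set.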

class omnixai.preprocessing.encode.Ordinal

Bases: TransformBase

Ordinal encoding for categorical values.

fit(x)

Estimates the parameters of the transform.

Parameters

x – The data for estimating the parameters.

Returns

The current instance.

transform(x)

Applies the transform to the input data.

Parameters

x – The data on which to apply the transform.

Returns

The transformed data.

invert(x)

Applies the inverse transform to the input data.

Parameters

x – The data on which to apply the inverse transform.

Returns

The inverse transformed data.

property categories

Returns the categories for each feature.

class omnixai.preprocessing.encode.LabelEncoder

Bases: TransformBase

Ordinal encoding for targets/labels.

fit(x)

Estimates the parameters of the transform.

Parameters

x – The data for estimating the parameters.

Returns

The current instance.

transform(x)

Applies the transform to the input data.

Parameters

x – The data on which to apply the transform.

Returns

The transformed data.

invert(x)

Applies the inverse transform to the input data.

Parameters

x – The data on which to apply the inverse transform.

Returns

The inverse transformed data.

property categories

Returns the class labels.

omnixai.preprocessing.normalize module

The pre-processing functions for continuous-valued features.

class omnixai.preprocessing.normalize.Standard

Bases: TransformBase

Standard normalization, i.e., zero mean and unit variance.

fit(x)

Estimates the parameters of the transform.

Parameters

x – The data for estimating the parameters.

Returns

The current instance.

transform(x)

Applies the transform to the input data.

Parameters

x – The data on which to apply the transform.

Returns

The transformed data.

invert(x)

Applies the inverse transform to the input data.

Parameters

x – The data on which to apply the inverse transform.

Returns

The inverse transformed data.
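
The arithmetic behind standard normalization can be sketched with the standard library alone. The use of the population standard deviation here is an assumption for illustration; the source does not state which estimator the Standard class uses.

```python
from statistics import fmean, pstdev

values = [5.0, 4.0, 3.0, 2.0]
mean = fmean(values)               # 3.5
std = pstdev(values)               # population standard deviation (assumed)

# Forward transform: subtract the mean, divide by the standard deviation.
standardized = [(v - mean) / std for v in values]

# Inverse transform: multiply by the standard deviation, add the mean back.
restored = [z * std + mean for z in standardized]
```

After the forward step the values have zero mean and unit variance, and the inverse step recovers the original values up to floating-point error.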

class omnixai.preprocessing.normalize.MinMax

Bases: TransformBase

Rescales the values to the range [0, 1].

fit(x)

Estimates the parameters of the transform.

Parameters

x – The data for estimating the parameters.

Returns

The current instance.

transform(x)

Applies the transform to the input data.

Parameters

x – The data on which to apply the transform.

Returns

The transformed data.

invert(x)

Applies the inverse transform to the input data.

Parameters

x – The data on which to apply the inverse transform.

Returns

The inverse transformed data.

class omnixai.preprocessing.normalize.Scale(ratio=1.0)

Bases: TransformBase

Rescales the values to values * ratio.

fit(x)

Estimates the parameters of the transform.

Parameters

x – The data for estimating the parameters.

Returns

The current instance.

transform(x)

Applies the transform to the input data.

Parameters

x – The data on which to apply the transform.

Returns

The transformed data.

invert(x)

Applies the inverse transform to the input data.

Parameters

x – The data on which to apply the inverse transform.

Returns

The inverse transformed data.

omnixai.preprocessing.fill module

The pre-processing functions for filling NaNs and missing values.

class omnixai.preprocessing.fill.FillNaN(value)

Bases: TransformBase

Fill NaNs in a pandas dataframe or a numpy array.

Parameters

value (Union[str, int, float]) – The value to fill NaNs, chosen from [‘mean’, ‘median’] or float values

fit(x)

Estimates the parameters of the transform.

Parameters

x (Union[ndarray, DataFrame]) – The data for estimating the parameters.

Return type

TransformBase

Returns

The current instance.

transform(x)

Applies the transform to the input data.

Parameters

x (Union[ndarray, DataFrame]) – The data on which to apply the transform.

Return type

Union[ndarray, DataFrame]

Returns

The transformed data.

invert(x)

This is a pseudo-inverse transform because the positions of the NaNs in the original data are not stored.

Parameters

x (Union[ndarray, DataFrame]) – The data on which to apply the inverse transform.

Returns

The inverse transformed data.

Return type

Union[np.ndarray, pd.DataFrame]
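
The following sketch shows why filling NaNs admits only a pseudo-inverse. It works on a plain Python list rather than a numpy array or dataframe, and it assumes that the pseudo-inverse simply returns its input; that behaviour is an assumption for illustration, not something the source states.

```python
import math

column = [1.0, float("nan"), 3.0]

# fit: estimate the fill value from the observed (non-NaN) entries.
observed = [v for v in column if not math.isnan(v)]
fill_value = sum(observed) / len(observed)

# transform: replace each NaN with the fill value.
filled = [fill_value if math.isnan(v) else v for v in column]

# A filled entry is indistinguishable from a genuine observation of the same
# value, so without the original NaN positions no inverse can restore them.
# This sketch assumes the pseudo-inverse returns its input unchanged.
pseudo_inverted = list(filled)
```

The middle entry stays at the mean after inversion, which is exactly the information loss the note above describes.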

class omnixai.preprocessing.fill.FillNaNTabular(value)

Bases: TransformBase

Fill NaNs in a Tabular object.

Parameters

value (Union[str, int, float]) – The value to fill NaNs, chosen from [‘mean’, ‘median’] or float values

fit(x)

Fits a FillNaN transformer.

Parameters

x (Tabular) – A Tabular object.

Returns

Itself.

Return type

FillNaNTabular

transform(x)

Fills NaNs in the continuous-valued features.

Parameters

x (Tabular) – A Tabular object.

Returns

The transformed data.

Return type

Tabular

invert(x)

This is a pseudo-inverse transform because the positions of the NaNs in the original data are not stored.

Parameters

x (Tabular) – The data on which to apply the inverse transform.

Returns

The inverse transformed data.

Return type

Tabular

omnixai.preprocessing.pipeline module

The pipeline for multiple pre-processing transforms.

class omnixai.preprocessing.pipeline.Pipeline

Bases: object

The pipeline for multiple pre-processing transforms.

name = 'pipeline'

step(transformer)

Adds a new transform into the pipeline.

Parameters

transformer (TransformBase) – A transformer derived from TransformBase

Returns

The current pipeline instance

fit(x)

Estimates the parameters of all the transforms.

Parameters

x – The data for estimating the parameters.

Returns

The current instance.

transform(x)

Applies all the transforms to the input data.

Parameters

x – The data on which to apply the transform.

Returns

The transformed data.

invert(x)

Applies the inverse transforms to the input data.

Parameters

x – The data on which to apply the inverse transform.

Returns

The inverse transformed data.

dump(directory)

Saves the pipeline to the specified file.

Parameters

directory – The directory to save the pipeline

load(directory)

Loads the pipeline from the specified file.

Parameters

directory – The directory to load the pipeline from
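
A chained pipeline can be sketched as follows. PipelineSketch, ShiftSketch and ScaleSketch are hypothetical classes written for this example. Two design points are assumptions rather than documented behaviour: each step is fitted on the output of the preceding steps, and the inverse applies the steps in reverse order.

```python
class ScaleSketch:
    """Hypothetical transform: multiplies every value by a ratio."""

    def __init__(self, ratio):
        self.ratio = ratio

    def fit(self, x):
        return self

    def transform(self, x):
        return [v * self.ratio for v in x]

    def invert(self, x):
        return [v / self.ratio for v in x]


class ShiftSketch:
    """Hypothetical transform: subtracts the minimum seen during fit."""

    def fit(self, x):
        self.offset = min(x)
        return self

    def transform(self, x):
        return [v - self.offset for v in x]

    def invert(self, x):
        return [v + self.offset for v in x]


class PipelineSketch:
    """Hypothetical pipeline that chains transforms in insertion order."""

    def __init__(self):
        self.steps = []

    def step(self, transformer):
        self.steps.append(transformer)
        return self                      # returning self allows chaining

    def fit(self, x):
        # Each step is fitted on the output of the previous steps.
        for t in self.steps:
            x = t.fit(x).transform(x)
        return self

    def transform(self, x):
        for t in self.steps:
            x = t.transform(x)
        return x

    def invert(self, x):
        # Inverses are applied in reverse order to undo the chain.
        for t in reversed(self.steps):
            x = t.invert(x)
        return x


pipe = PipelineSketch().step(ShiftSketch()).step(ScaleSketch(0.5)).fit([2, 4, 6])
out = pipe.transform([2, 4, 6])
back = pipe.invert(out)
```

Reversing the order matters: undoing the scaling before the shift is the only sequence that returns the original values.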

omnixai.preprocessing.tabular module

The pre-processing function for tabular data.

class omnixai.preprocessing.tabular.TabularTransform(cate_transform=None, cont_transform=None, target_transform=None)

Bases: TransformBase

Transforms for a data.tabular.Tabular instance.

Parameters
  • cate_transform (Optional[TransformBase]) – The transform for the categorical features, e.g., OneHot, Ordinal. Default is OneHot.

  • cont_transform (Optional[TransformBase]) – The transform for the continuous-valued features, e.g., Identity, Standard, MinMax, Scale. Default is Identity.

  • target_transform (Optional[TransformBase]) – The transform for the target column, e.g., Identity for regression, LabelEncoder for classification. Default is LabelEncoder.

fit(x)

Fits a tabular transformer.

Parameters

x (Tabular) – A Tabular object.

Returns

Itself.

Return type

TabularTransform

transform(x)

Transforms the input tabular instance. The output concatenates the transformed categorical features, the continuous-valued features and the targets/labels (if they exist).

Parameters

x (Tabular) – A Tabular object.

Returns

The transformed data.

Return type

np.ndarray

invert(x)

Converts a numpy array into a Tabular object.

Parameters

x (ndarray) – An input numpy array.

Returns

The inverse Tabular object.

Return type

Tabular

decompose(x)

Decomposes the transformed data into its categorical, continuous and target parts.

Parameters

x (ndarray) – An input numpy array.

Returns

A tuple of categorical, continuous and target data.

Return type

tuple
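
The relationship between the concatenated output of transform and the tuple returned by decompose can be sketched with plain lists. The helper functions and the explicit block widths (n_cate, n_cont) are hypothetical; the column order follows the description of transform above (categorical, then continuous, then target).

```python
def concatenate(cate_rows, cont_rows, targets):
    """Joins the three blocks row by row: categorical, continuous, target."""
    return [c + n + [t] for c, n, t in zip(cate_rows, cont_rows, targets)]


def decompose(rows, n_cate, n_cont):
    """Splits each row back into the three blocks using the block widths."""
    cate = [r[:n_cate] for r in rows]
    cont = [r[n_cate:n_cate + n_cont] for r in rows]
    target = [r[n_cate + n_cont] for r in rows]
    return cate, cont, target


cate_rows = [[1, 0], [0, 1]]          # e.g. a one-hot encoded column
cont_rows = [[0.0, 1.0], [1.0, 0.0]]  # e.g. two min-max scaled columns
targets = [0, 1]                      # e.g. label-encoded classes

rows = concatenate(cate_rows, cont_rows, targets)
cate, cont, target = decompose(rows, n_cate=2, n_cont=2)
```

Knowing the width of each block is all that is needed to split a transformed row back into its parts.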

property categories

Gets the categories for all the features.

Returns

A list of categories, i.e., categories[i] holds the categories expected in the ith column, or None.

property class_names

Returns the class names for a classification task.

Returns

A list of class names or None.

get_feature_names()

Returns the feature names in the transformed data.

omnixai.preprocessing.image module

The transformations for image data.

class omnixai.preprocessing.image.Scale(ratio=0.00392156862745098)

Bases: TransformBase

Rescales image pixel values to values * ratio.

fit(x)

Estimates the parameters of the transform.

Parameters

x (Image) – The data for estimating the parameters.

Return type

TransformBase

Returns

The current instance.

transform(x)

Applies the transform to the input data.

Parameters

x (Image) – The data on which to apply the transform.

Return type

Image

Returns

The transformed data.

invert(x)

Applies the inverse transform to the input data.

Parameters

x (Image) – The data on which to apply the inverse transform.

Return type

Image

Returns

The inverse transformed data.
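
The default ratio in the signature above is numerically 1/255, so 8-bit pixel values in [0, 255] are mapped to [0, 1]. The sketch below checks this on a few hypothetical pixel values; dividing by the ratio as the inverse is an assumption that follows from the "values * ratio" description.

```python
import math

ratio = 0.00392156862745098          # default shown in the signature above
pixels = [0, 128, 255]               # hypothetical 8-bit pixel values

scaled = [p * ratio for p in pixels]         # forward transform: values * ratio
restored = [s / ratio for s in scaled]       # assumed inverse: divide by ratio

is_one_over_255 = math.isclose(ratio, 1 / 255)
```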

class omnixai.preprocessing.image.Round2Int

Bases: TransformBase

Rounds float values to integer values.

fit(x)

Estimates the parameters of the transform.

Parameters

x (Image) – The data for estimating the parameters.

Return type

TransformBase

Returns

The current instance.

transform(x)

Applies the transform to the input data.

Parameters

x (Image) – The data on which to apply the transform.

Return type

Image

Returns

The transformed data.

invert(x)

Applies the inverse transform to the input data.

Parameters

x (Image) – The data on which to apply the inverse transform.

Return type

Image

Returns

The inverse transformed data.

class omnixai.preprocessing.image.Normalize(mean, std)

Bases: TransformBase

Normalizes an image with mean and standard deviation.

Parameters
  • mean – A mean for all the channels or a sequence of means for each channel.

  • std – A std for all the channels or a sequence of stds for each channel.

fit(x)

Estimates the parameters of the transform.

Parameters

x (Image) – The data for estimating the parameters.

Return type

TransformBase

Returns

The current instance.

transform(x)

Applies the transform to the input data.

Parameters

x (Image) – The data on which to apply the transform.

Return type

Image

Returns

The transformed data.

invert(x)

Applies the inverse transform to the input data.

Parameters

x (Image) – The data on which to apply the inverse transform.

Return type

Image

Returns

The inverse transformed data.

class omnixai.preprocessing.image.Resize(size, resample=2)

Bases: TransformBase

Resizes the input image to a given size.

Parameters
  • size (Union[Sequence, int]) – The desired output size. If size is a sequence (h, w), the output size will be (h, w). If size is an int, the smaller edge will match this number.

  • resample – The desired resampling strategy.

fit(x)

Estimates the parameters of the transform.

Parameters

x – The data for estimating the parameters.

Return type

TransformBase

Returns

The current instance.

transform(x)

Applies the transform to the input data.

Parameters

x (Image) – The data on which to apply the transform.

Return type

Image

Returns

The transformed data.

invert(x)

Applies the inverse transform to the input data.

Parameters

x (Image) – The data on which to apply the inverse transform.

Return type

Image

Returns

The inverse transformed data.
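
The two forms of the size argument can be illustrated with a small helper that computes the output dimensions. The function target_size is hypothetical, and two details are assumptions: that the aspect ratio is preserved when size is an int, and that the other edge is rounded to the nearest integer.

```python
def target_size(height, width, size):
    """Returns the (h, w) output size for a hypothetical resize rule."""
    if isinstance(size, int):
        # Scale so that the smaller edge equals `size`, keeping aspect ratio.
        scale = size / min(height, width)
        return round(height * scale), round(width * scale)
    h, w = size
    return h, w


a = target_size(480, 640, 240)         # smaller edge (height) becomes 240
b = target_size(480, 640, (360, 240))  # explicit (h, w) is used as given
```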

omnixai.preprocessing.text module

The pre-processing functions for text data.

class omnixai.preprocessing.text.Tfidf(**kwargs)

Bases: TransformBase

The TF-IDF transformation.

fit(x, **kwargs)

Estimates the parameters of the transform.

Parameters

x (Text) – The data for estimating the parameters.

Returns

The current instance.

transform(x)

Applies the transform to the input data.

Parameters

x (Text) – The data on which to apply the transform.

Returns

The transformed data.

invert(x)

Applies the inverse transform to the input data.

Parameters

x – The data on which to apply the inverse transform.

Returns

The inverse transformed data.

get_feature_names()

Returns the feature names in the transformed data.

class omnixai.preprocessing.text.Word2Id(remove_punctuation=True, **kwargs)

Bases: TransformBase

The class for converting words into IDs.

PAD = 0

START = 1

UNK = 2

fit(x, **kwargs)

Estimates the parameters of the transform.

Parameters

x (Text) – The data for estimating the parameters.

Returns

The current instance.

transform(x, **kwargs)

Applies the transform to the input data.

Parameters

x (Text) – The data on which to apply the transform.

Returns

The transformed data.

invert(x)

Applies the inverse transform to the input data.

Parameters

x – The data on which to apply the inverse transform.

Returns

The inverse transformed data.

property vocab_size
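
One way the reserved IDs PAD, START and UNK could be used is sketched below. Everything beyond the three constant values is an assumption made for illustration: the whitespace tokenization, the lower-casing, the placement of START at the beginning of each sequence, and the fixed-length padding are not taken from the source.

```python
PAD, START, UNK = 0, 1, 2            # special IDs listed in the class above


def build_vocab(sentences):
    """Assigns IDs to words, starting after the reserved special IDs."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab) + 3)
    return vocab


def encode(sentence, vocab, max_len):
    """Maps words to IDs, uses UNK for unseen words and PAD to fill."""
    ids = [START] + [vocab.get(w, UNK) for w in sentence.lower().split()]
    ids = ids[:max_len]
    return ids + [PAD] * (max_len - len(ids))


vocab = build_vocab(["and another sentence", "and the last one"])
ids = encode("and a sentence", vocab, max_len=6)
```

In this sketch the word "a" was never seen during fitting, so it maps to UNK, and the sequence is padded with PAD up to the requested length.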

omnixai.sampler.tabular module

The class for re-sampling training data.

class omnixai.sampler.tabular.Sampler

Bases: object

The class for re-sampling training data. It includes sub-sampling, under-sampling and over-sampling.

static subsample(tabular_data, fraction, random_state=None)

Samples a subset of the input dataset. It guarantees that all the categorical values are included in the sampled dataframe, i.e., there will be no missing categorical values.

Parameters
  • tabular_data (Tabular) – The input tabular data.

  • fraction (float) – The fraction of instances to sample.

  • random_state – The random seed.

Returns

A subset extracted from tabular_data.

Return type

Tabular

static undersample(tabular_data, random_state=None)

Undersamples a class-imbalanced dataset to make it more balanced, i.e., keeps all of the data in the minority class and decreases the size of the majority class. It guarantees that all the categorical values are included in the sampled dataframe, i.e., there will be no missing categorical values.

Parameters
  • tabular_data (Tabular) – The input tabular data.

  • random_state – The random seed.

Returns

A subset extracted from tabular_data.

Return type

Tabular

static oversample(tabular_data, random_state=None)

Oversamples a class-imbalanced dataset to make it more balanced, i.e., keeps all of the data in the majority class and increases the size of the minority class. It guarantees that all the categorical values are included in the sampled dataframe, i.e., there will be no missing categorical values.

Parameters
  • tabular_data (Tabular) – The input tabular data.

  • random_state – The random seed.

Returns

An oversampled dataset.

Return type

Tabular
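
The difference between under-sampling and over-sampling can be sketched on plain lists. The functions below are hypothetical and operate on rows and labels rather than Tabular objects; they balance class sizes only and do not implement the guarantee that every categorical value is retained.

```python
import random
from collections import defaultdict


def group_by_label(rows, labels):
    """Collects the rows belonging to each class label."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return groups


def undersample(rows, labels, seed=None):
    """Shrinks every class to the size of the smallest class."""
    rng = random.Random(seed)
    groups = group_by_label(rows, labels)
    n = min(len(g) for g in groups.values())
    return {label: rng.sample(g, n) for label, g in groups.items()}


def oversample(rows, labels, seed=None):
    """Grows every class to the size of the largest class by resampling."""
    rng = random.Random(seed)
    groups = group_by_label(rows, labels)
    n = max(len(g) for g in groups.values())
    # Keep all original rows and draw the missing ones with replacement.
    return {label: g + rng.choices(g, k=n - len(g)) for label, g in groups.items()}


rows = list(range(10))
labels = [0] * 8 + [1] * 2            # 8 majority rows, 2 minority rows

under = undersample(rows, labels, seed=0)
over = oversample(rows, labels, seed=0)
```

Under-sampling discards majority rows and keeps the minority class intact, while over-sampling keeps every original row and repeats minority rows until the classes match in size.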