omnixai.preprocessing package

`base`	The base class for all the transforms.
`encode`	The pre-processing functions for categorical and continuous-valued features.
`fill`	The pre-processing functions for filling NaNs and missing values.
`normalize`	The pre-processing functions for continuous-valued features.
`pipeline`	The pipeline for multiple pre-processing transforms.
`tabular`	The pre-processing function for tabular data.
`image`	The transformations for image data.
`text`	The pre-processing functions for text data.

This package provides data pre-processing transforms. Each transform inherits from omnixai.preprocessing.base.TransformBase and implements three main methods:

fit(self, x): Estimates the parameters of the transform with data x.
transform(self, x): Applies the transform to the input data x.
invert(self, x): Applies the inverse transform to the input data x.
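
As a rough illustration of this contract, the sketch below defines a toy transform with the same three methods. The class `MeanCenter` is hypothetical and deliberately does not subclass `TransformBase`, so the snippet runs without omnixai installed; it only mirrors the documented fit/transform/invert shape.

```python
class MeanCenter:
    """Hypothetical toy transform: subtracts the mean learned in fit()."""

    def __init__(self):
        self.mean = None

    def fit(self, x):
        # Estimate the single parameter (the mean) from the data.
        self.mean = sum(x) / len(x)
        return self  # Returning self allows MeanCenter().fit(x) chaining.

    def transform(self, x):
        # Apply the transform using the fitted parameter.
        return [v - self.mean for v in x]

    def invert(self, x):
        # Undo the transform by adding the mean back.
        return [v + self.mean for v in x]


data = [1.0, 2.0, 2.0, 6.0]
t = MeanCenter().fit(data)
centered = t.transform(data)
restored = t.invert(centered)
```

Because `fit` returns the instance, fitting and construction can be written on one line, which is the same pattern the examples in this package use.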

For example, omnixai.preprocessing.tabular.TabularTransform provides a convenient way to pre-process the features of a tabular dataset:

import pandas as pd
from omnixai.data.tabular import Tabular
from omnixai.preprocessing.normalize import MinMax
from omnixai.preprocessing.encode import OneHot
from omnixai.preprocessing.tabular import TabularTransform

x = Tabular(
    data=pd.DataFrame({
        'A': [1, 2, 2, 6],
        'B': [5, 4, 3, 2],
        'C': ['a', 'b', 'c', 'd']
    }),
    categorical_columns=['C']
)
transform = TabularTransform(
    cate_transform=OneHot(),        # One-hot encoding for categorical features
    cont_transform=MinMax()         # Min-max normalization for continuous-valued features
).fit(x)
y = transform.transform(x)          # Transforms tabular data into a numpy array
z = transform.invert(y)             # Applies the inverse transform

Note that some transforms, such as FillNaN and FillNaNTabular, only have pseudo-inverse transforms that may not recover the original data.

For Image data, one can transform images in a similar way:

from PIL import Image as PilImage
from omnixai.data.image import Image
from omnixai.preprocessing.image import Resize

img = Image(PilImage.open('some_image.jpg'))
transform = Resize(size=(360, 240))             # A transform for resizing images
x = transform.transform(img)                    # Applies the transform
y = transform.invert(x)                         # Applies the inverse transform

For Text data, one can apply a TF-IDF transform as follows:

from omnixai.data.text import Text
from omnixai.preprocessing.text import Tfidf

text = Text(
    data=["Hello I'm a single sentence",
          "And another sentence",
          "And the very very last one"]
)
transform = Tfidf().fit(text)                   # Fit a TF-IDF transform
vectors = transform.transform(text)             # Applies the transform to obtain feature vectors

omnixai.preprocessing.base module

The base class for all the transforms.

class omnixai.preprocessing.base.TransformBase

Bases: object

Abstract base class for a data pre-processing transform.

abstract fit(x)

Estimates the parameters of the transform.

Parameters: x – The data for estimating the parameters.
Returns: The current instance.

abstract transform(x)

Applies the transform to the input data.

Parameters: x – The data on which to apply the transform.
Returns: The transformed data.

abstract invert(x)

Applies the inverse transform to the input data.

Parameters: x – The data on which to apply the inverse transform.
Returns: The inverse transformed data.

class omnixai.preprocessing.base.Identity

Bases: TransformBase

Identity transformation.

fit(x)

Estimates the parameters of the transform.

Parameters: x – The data for estimating the parameters.
Returns: The current instance.

transform(x)

Applies the transform to the input data.

Parameters: x – The data on which to apply the transform.
Returns: The transformed data.

invert(x)

Applies the inverse transform to the input data.

Parameters: x – The data on which to apply the inverse transform.
Returns: The inverse transformed data.

omnixai.preprocessing.encode module

The pre-processing functions for categorical and continuous-valued features.

class omnixai.preprocessing.encode.KBins(n_bins, **kwargs)

Bases: TransformBase

Discretizes continuous values into bins.

fit(x)

Estimates the parameters of the transform.

Parameters: x – The data for estimating the parameters.
Returns: The current instance.

transform(x)

Applies the transform to the input data.

Parameters: x – The data on which to apply the transform.
Returns: The transformed data.

invert(x)

Applies the inverse transform to the input data.

Parameters: x – The data on which to apply the inverse transform.
Returns: The inverse transformed data.
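
To make the idea of discretization concrete, here is a minimal sketch assuming equal-width bins. The binning strategy that `KBins` actually uses is not stated here and may differ; the helper names `fit_bins`, `to_bin` and `bin_center` are hypothetical.

```python
def fit_bins(values, n_bins):
    # Learn the lower edge and the width of each equal-width bin.
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return lo, width


def to_bin(value, lo, width, n_bins):
    # Map a value to a bin index, clamping to the valid range.
    idx = int((value - lo) / width)
    return min(max(idx, 0), n_bins - 1)


def bin_center(idx, lo, width):
    # An approximate inverse: represent each bin by its midpoint.
    return lo + (idx + 0.5) * width


values = [0.0, 1.0, 2.5, 4.0, 8.0]
lo, width = fit_bins(values, n_bins=4)
bins = [to_bin(v, lo, width, 4) for v in values]
centers = [bin_center(b, lo, width) for b in bins]
```

Mapping a bin back to its midpoint cannot recover the original values exactly, so any inverse of a discretization step is approximate.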

class omnixai.preprocessing.encode.OneHot(drop=None, **kwargs)

Bases: TransformBase

One-hot encoding for categorical values.

fit(x)

Estimates the parameters of the transform.

Parameters: x – The data for estimating the parameters.
Returns: The current instance.

transform(x)

Applies the transform to the input data.

Parameters: x – The data on which to apply the transform.
Returns: The transformed data.

invert(x)

Applies the inverse transform to the input data.

Parameters: x – The data on which to apply the inverse transform.
Returns: The inverse transformed data.

property categories: Returns the categories for each feature.

get_feature_names(input_features=None): Returns the feature names in the transformed data.
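
The following sketch shows what one-hot encoding and its inverse compute for a single categorical column. It is a plain-Python illustration rather than the `OneHot` implementation; the helper names and the sorted category order are assumptions.

```python
def fit_categories(column):
    # Learn the distinct categories (sorted order is an assumption).
    return sorted(set(column))


def one_hot(column, categories):
    # Each value becomes a 0/1 vector with a single 1 at its category index.
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for value in column:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return rows


def invert_one_hot(rows, categories):
    # The position of the 1 identifies the original category.
    return [categories[row.index(1)] for row in rows]


column = ['b', 'a', 'c', 'a']
categories = fit_categories(column)
encoded = one_hot(column, categories)
decoded = invert_one_hot(encoded, categories)
```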

class omnixai.preprocessing.encode.Ordinal

Bases: TransformBase

Ordinal encoding for categorical values.

fit(x)

Estimates the parameters of the transform.

Parameters: x – The data for estimating the parameters.
Returns: The current instance.

transform(x)

Applies the transform to the input data.

Parameters: x – The data on which to apply the transform.
Returns: The transformed data.

invert(x)

Applies the inverse transform to the input data.

Parameters: x – The data on which to apply the inverse transform.
Returns: The inverse transformed data.

property categories: Returns the categories for each feature.

class omnixai.preprocessing.encode.LabelEncoder

Bases: TransformBase

Ordinal encoding for targets/labels.

fit(x)

Estimates the parameters of the transform.

Parameters: x – The data for estimating the parameters.
Returns: The current instance.

transform(x)

Applies the transform to the input data.

Parameters: x – The data on which to apply the transform.
Returns: The transformed data.

invert(x)

Applies the inverse transform to the input data.

Parameters: x – The data on which to apply the inverse transform.
Returns: The inverse transformed data.

property categories: Returns the class labels.

omnixai.preprocessing.normalize module

The pre-processing functions for continuous-valued features.

class omnixai.preprocessing.normalize.Standard

Bases: TransformBase

Standard normalization, i.e., zero mean and unit variance.

fit(x)

Estimates the parameters of the transform.

Parameters: x – The data for estimating the parameters.
Returns: The current instance.

transform(x)

Applies the transform to the input data.

Parameters: x – The data on which to apply the transform.
Returns: The transformed data.

invert(x)

Applies the inverse transform to the input data.

Parameters: x – The data on which to apply the inverse transform.
Returns: The inverse transformed data.
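
As a sketch of what standard normalization computes, the snippet below standardizes one feature and inverts the result. It assumes the population standard deviation is used; whether `Standard` uses the population or the sample estimate is not stated here.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# "Fit": estimate mean and (population) standard deviation.
mean = statistics.mean(data)
std = statistics.pstdev(data)

# "Transform": zero mean and unit variance.
z = [(v - mean) / std for v in data]

# "Invert": scale back and add the mean.
restored = [v * std + mean for v in z]
```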

class omnixai.preprocessing.normalize.MinMax

Bases: TransformBase

Rescales the values to the range [0, 1].

fit(x)

Estimates the parameters of the transform.

Parameters: x – The data for estimating the parameters.
Returns: The current instance.

transform(x)

Applies the transform to the input data.

Parameters: x – The data on which to apply the transform.
Returns: The transformed data.

invert(x)

Applies the inverse transform to the input data.

Parameters: x – The data on which to apply the inverse transform.
Returns: The inverse transformed data.
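
Min-max rescaling can be sketched in a few lines. The values below are column `A` from the tabular example at the top of this page; the code is an illustration of the arithmetic, not the `MinMax` implementation.

```python
data = [1.0, 2.0, 2.0, 6.0]

# "Fit": record the minimum and maximum of the feature.
lo, hi = min(data), max(data)

# "Transform": map [lo, hi] linearly onto [0, 1].
scaled = [(v - lo) / (hi - lo) for v in data]

# "Invert": map [0, 1] back onto [lo, hi].
restored = [s * (hi - lo) + lo for s in scaled]
```

A constant feature has `hi == lo`, which would divide by zero in this sketch; a real implementation needs to handle that case.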

class omnixai.preprocessing.normalize.Scale(ratio=1.0)

Bases: TransformBase

Rescales the values to values * ratio.

fit(x)

Estimates the parameters of the transform.

Parameters: x – The data for estimating the parameters.
Returns: The current instance.

transform(x)

Applies the transform to the input data.

Parameters: x – The data on which to apply the transform.
Returns: The transformed data.

invert(x)

Applies the inverse transform to the input data.

Parameters: x – The data on which to apply the inverse transform.
Returns: The inverse transformed data.

omnixai.preprocessing.fill module

The pre-processing functions for filling NaNs and missing values.

class omnixai.preprocessing.fill.FillNaN(value)

Bases: TransformBase

Fill NaNs in a pandas dataframe or a numpy array.

Parameters: value (Union[str, int, float]) – The value used to fill NaNs: either 'mean' or 'median', or a numeric value.

fit(x)

Estimates the parameters of the transform.

Parameters: x (Union[ndarray, DataFrame]) – The data for estimating the parameters.
Return type: TransformBase
Returns: The current instance.

transform(x)

Applies the transform to the input data.

Parameters: x (Union[ndarray, DataFrame]) – The data on which to apply the transform.
Return type: Union[ndarray, DataFrame]
Returns: The transformed data.

invert(x)

This is a pseudo-inverse transform because the positions of the NaNs in the original data are not stored.

Parameters: x (Union[ndarray, DataFrame]) – The data on which to apply the inverse transform.
Returns: The inverse transformed data.
Return type: Union[np.ndarray, pd.DataFrame]
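
The reason the inverse can only be a pseudo-inverse is easy to see in a small sketch. The helpers `fit_mean` and `fill_nan` are hypothetical and only mimic mean filling on a plain list.

```python
import math


def fit_mean(values):
    # Estimate the fill value from the entries that are present.
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present)


def fill_nan(values, fill_value):
    # Replace every NaN with the fill value.
    return [fill_value if math.isnan(v) else v for v in values]


original = [1.0, math.nan, 3.0, 2.0]
mean = fit_mean(original)
filled = fill_nan(original, mean)
```

After filling, the list contains the value 2.0 twice: once as a genuine observation and once as a filled entry. Nothing in the output marks which one was a NaN, so the original positions cannot be restored from the filled data alone.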

class omnixai.preprocessing.fill.FillNaNTabular(value)

Bases: TransformBase

Fill NaNs in a Tabular object.

Parameters: value (Union[str, int, float]) – The value used to fill NaNs: either 'mean' or 'median', or a numeric value.

fit(x)

Fits a FillNaN transformer.

Parameters: x (Tabular) – A Tabular object.
Returns: Itself.
Return type: FillNaNTabular

transform(x)

Fills NaNs in the continuous-valued features.

Parameters: x (Tabular) – A Tabular object.
Returns: The transformed data.
Return type: Tabular

invert(x)

This is a pseudo-inverse transform because the positions of the NaNs in the original data are not stored.

Parameters: x (Tabular) – The data on which to apply the inverse transform.
Returns: The inverse transformed data.
Return type: Tabular

omnixai.preprocessing.pipeline module

The pipeline for multiple pre-processing transforms.

class omnixai.preprocessing.pipeline.Pipeline

Bases: object

The pipeline for multiple pre-processing transforms.

name = 'pipeline'

step(transformer)

Adds a new transform into the pipeline.

Parameters: transformer (TransformBase) – A transformer derived from TransformBase
Returns: The current pipeline instance

fit(x)

Estimates the parameters of all the transforms.

Parameters: x – The data for estimating the parameters.
Returns: The current instance.

transform(x)

Applies all the transforms to the input data.

Parameters: x – The data on which to apply the transform.
Returns: The transformed data.

invert(x)

Applies the inverse transforms to the input data.

Parameters: x – The data on which to apply the inverse transform.
Returns: The inverse transformed data.

dump(directory)

Saves the pipeline to the specified directory.

Parameters: directory – The directory to save the pipeline

load(directory)

Loads the pipeline from the specified directory.

Parameters: directory – The directory to load the pipeline from
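
The chaining behaviour of a pipeline can be sketched with a self-contained toy version. `MiniPipeline`, `Shift` and `Multiply` are hypothetical classes written for this illustration; the sketch assumes that inversion applies the inverse transforms in reverse order, which is what composing inverses requires mathematically.

```python
class Shift:
    """Hypothetical transform: adds a constant offset."""

    def __init__(self, offset):
        self.offset = offset

    def fit(self, x):
        return self

    def transform(self, x):
        return [v + self.offset for v in x]

    def invert(self, x):
        return [v - self.offset for v in x]


class Multiply:
    """Hypothetical transform: multiplies by a constant ratio."""

    def __init__(self, ratio):
        self.ratio = ratio

    def fit(self, x):
        return self

    def transform(self, x):
        return [v * self.ratio for v in x]

    def invert(self, x):
        return [v / self.ratio for v in x]


class MiniPipeline:
    def __init__(self):
        self.steps = []

    def step(self, transformer):
        # Returning self allows .step(...).step(...) chaining.
        self.steps.append(transformer)
        return self

    def fit(self, x):
        # Each step is fitted on the output of the previous steps.
        for t in self.steps:
            x = t.fit(x).transform(x)
        return self

    def transform(self, x):
        for t in self.steps:
            x = t.transform(x)
        return x

    def invert(self, x):
        # Undo the steps last-to-first.
        for t in reversed(self.steps):
            x = t.invert(x)
        return x


data = [0.0, 1.0, 2.0]
pipe = MiniPipeline().step(Shift(1.0)).step(Multiply(2.0)).fit(data)
y = pipe.transform(data)
x_back = pipe.invert(y)
```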

omnixai.preprocessing.tabular module

The pre-processing function for tabular data.

class omnixai.preprocessing.tabular.TabularTransform(cate_transform=None, cont_transform=None, target_transform=None)

Bases: TransformBase

Transforms for a data.tabular.Tabular instance.

Parameters

cate_transform (Optional[TransformBase]) – The transform for the categorical features, e.g., OneHot, Ordinal. Default is OneHot.
cont_transform (Optional[TransformBase]) – The transform for the continuous-valued features, e.g., Identity, Standard, MinMax, Scale. Default is Identity.
target_transform (Optional[TransformBase]) – The transform for the target column, e.g., Identity for regression, LabelEncoder for classification. Default is LabelEncoder.

fit(x)

Fits a tabular transformer.

Parameters: x (Tabular) – A Tabular object.
Returns: Itself.
Return type: TabularTransform

transform(x)

Transforms the input tabular instance. The output concatenates the transformed categorical features, the continuous-valued features and the targets/labels (if they exist).

Parameters: x (Tabular) – A Tabular object.
Returns: The transformed data.
Return type: np.ndarray

invert(x)

Converts a numpy array into a Tabular object.

Parameters: x (ndarray) – An input numpy array.
Returns: The inverse Tabular object.
Return type: Tabular

decompose(x)

Decomposes the transformed data into categorical, continuous and target.

Parameters: x (ndarray) – An input numpy array.
Returns: A tuple of categorical, continuous and target data.
Return type: tuple

property categories

Gets the categories for all the features.

Returns: A list of categories, i.e., categories[i] holds the categories expected in the ith column, or None.

property class_names

Returns the class names for a classification task.

Returns: A list of class names or None.

get_feature_names(): Returns the feature names in the transformed data.

omnixai.preprocessing.image module

The transformations for image data.

class omnixai.preprocessing.image.Scale(ratio=0.00392156862745098)

Bases: TransformBase

Rescales image pixel values to values * ratio.

fit(x)

Estimates the parameters of the transform.

Parameters: x (Image) – The data for estimating the parameters.
Return type: TransformBase
Returns: The current instance.

transform(x)

Applies the transform to the input data.

Parameters: x (Image) – The data on which to apply the transform.
Return type: Image
Returns: The transformed data.

invert(x)

Applies the inverse transform to the input data.

Parameters: x (Image) – The data on which to apply the inverse transform.
Return type: Image
Returns: The inverse transformed data.
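
The default ratio shown in the signature is the floating-point value of 1/255, which maps 8-bit pixel values onto [0, 1]. The sketch below checks that arithmetic on a few pixel values; dividing by the ratio as the inverse is an assumption for illustration.

```python
ratio = 1 / 255  # The default shown in the signature above.

pixels = [0, 51, 255]

# Rescale 8-bit intensities to the range [0, 1].
scaled = [p * ratio for p in pixels]

# Undo the scaling and round back to integer intensities.
restored = [round(s / ratio) for s in scaled]
```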

class omnixai.preprocessing.image.Round2Int

Bases: TransformBase

Rounds float values to integer values.

fit(x)

Estimates the parameters of the transform.

Parameters: x (Image) – The data for estimating the parameters.
Return type: TransformBase
Returns: The current instance.

transform(x)

Applies the transform to the input data.

Parameters: x (Image) – The data on which to apply the transform.
Return type: Image
Returns: The transformed data.

invert(x)

Applies the inverse transform to the input data.

Parameters: x (Image) – The data on which to apply the inverse transform.
Return type: Image
Returns: The inverse transformed data.

class omnixai.preprocessing.image.Normalize(mean, std)

Bases: TransformBase

Normalizes an image with mean and standard deviation.

Parameters

mean – A mean for all the channels or a sequence of means for each channel.
std – A std for all the channels or a sequence of stds for each channel.

fit(x)

Estimates the parameters of the transform.

Parameters: x (Image) – The data for estimating the parameters.
Return type: TransformBase
Returns: The current instance.

transform(x)

Applies the transform to the input data.

Parameters: x (Image) – The data on which to apply the transform.
Return type: Image
Returns: The transformed data.

invert(x)

Applies the inverse transform to the input data.

Parameters: x (Image) – The data on which to apply the inverse transform.
Return type: Image
Returns: The inverse transformed data.
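
Per-channel normalization is usually computed as `(value - mean) / std`. The sketch below applies that formula to a single RGB pixel with one mean and one std per channel; it assumes this conventional formula and is not taken from the `Normalize` source.

```python
def normalize_pixel(pixel, mean, std):
    # Apply (value - mean) / std independently to each channel.
    return tuple((v - m) / s for v, m, s in zip(pixel, mean, std))


def denormalize_pixel(pixel, mean, std):
    # Inverse: value * std + mean for each channel.
    return tuple(v * s + m for v, m, s in zip(pixel, mean, std))


mean = (0.5, 0.5, 0.5)
std = (0.25, 0.25, 0.25)
pixel = (0.75, 0.5, 0.25)

normalized = normalize_pixel(pixel, mean, std)
restored = denormalize_pixel(normalized, mean, std)
```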

class omnixai.preprocessing.image.Resize(size, resample=2)

Bases: TransformBase

Resizes the input image to a given size.

Parameters

size (Union[Sequence, int]) – The desired output size. If size is a sequence (h, w), the output size will be (h, w). If size is an int, the smaller edge will match this number.
resample – The desired resampling strategy.

fit(x)

Estimates the parameters of the transform.

Parameters: x – The data for estimating the parameters.
Return type: TransformBase
Returns: The current instance.

transform(x)

Applies the transform to the input data.

Parameters: x (Image) – The data on which to apply the transform.
Return type: Image
Returns: The transformed data.

invert(x)

Applies the inverse transform to the input data.

Parameters: x (Image) – The data on which to apply the inverse transform.
Return type: Image
Returns: The inverse transformed data.
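
The two forms of the `size` parameter can be sketched as a small function that computes the output shape. The function `output_size` is hypothetical; it assumes that the aspect ratio is preserved when `size` is an int and that the scaled edge is rounded to the nearest integer, neither of which is stated above.

```python
def output_size(input_hw, size):
    # A sequence (h, w) is used as the output size directly.
    if not isinstance(size, int):
        return tuple(size)
    # An int sets the smaller edge; the other edge is scaled
    # by the same factor (assumed aspect-ratio preservation).
    h, w = input_hw
    scale = size / min(h, w)
    return (round(h * scale), round(w * scale))


landscape = output_size((480, 640), 240)
portrait = output_size((640, 480), 240)
explicit = output_size((480, 640), (360, 240))
```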

omnixai.preprocessing.text module

The pre-processing functions for text data.

class omnixai.preprocessing.text.Tfidf(**kwargs)

Bases: TransformBase

The TF-IDF transformation.

fit(x, **kwargs)

Estimates the parameters of the transform.

Parameters: x (Text) – The data for estimating the parameters.
Returns: The current instance.

transform(x)

Applies the transform to the input data.

Parameters: x (Text) – The data on which to apply the transform.
Returns: The transformed data.

invert(x)

Applies the inverse transform to the input data.

Parameters: x – The data on which to apply the inverse transform.
Returns: The inverse transformed data.

get_feature_names(): Returns the feature names in the transformed data.

class omnixai.preprocessing.text.Word2Id(remove_punctuation=True, **kwargs)

Bases: TransformBase

The class for converting words into IDs.

PAD = 0

START = 1

UNK = 2

fit(x, **kwargs)

Estimates the parameters of the transform.

Parameters: x (Text) – The data for estimating the parameters.
Returns: The current instance.

transform(x, **kwargs)

Applies the transform to the input data.

Parameters: x (Text) – The data on which to apply the transform.
Returns: The transformed data.

invert(x)

Applies the inverse transform to the input data.

Parameters: x – The data on which to apply the inverse transform.
Returns: The inverse transformed data.

property vocab_size
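
The reserved IDs `PAD`, `START` and `UNK` suggest the usual layout for a word-to-ID vocabulary. The sketch below shows one plausible arrangement: ordinary words receive IDs from 3 upward, unseen words map to `UNK`, and short sequences are padded with `PAD`. Whether `Word2Id` prepends `START` or pads in this way is an assumption, and the helper names are hypothetical.

```python
PAD, START, UNK = 0, 1, 2  # Reserved IDs as listed above.


def build_vocab(sentences):
    # Assign IDs to words in order of first appearance, after the reserved IDs.
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 3
    return vocab


def encode(sentence, vocab, max_len):
    # Prepend START, map unknown words to UNK, then truncate or pad to max_len.
    ids = [START] + [vocab.get(w, UNK) for w in sentence.lower().split()]
    ids = ids[:max_len]
    return ids + [PAD] * (max_len - len(ids))


vocab = build_vocab(["and another sentence", "and the last one"])
ids = encode("and one unseen sentence", vocab, max_len=7)
```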

omnixai.sampler.tabular module

The class for re-sampling training data.

class omnixai.sampler.tabular.Sampler

Bases: object

The class for re-sampling training data. It includes sub-sampling, under-sampling and over-sampling.

static subsample(tabular_data, fraction, random_state=None)

Samples a subset of the input dataset. It guarantees that all the categorical values are included in the sampled dataframe, i.e., there will be no missing categorical values.

Parameters

tabular_data (Tabular) – The input tabular data.
fraction (float) – The fraction of instances to sample.
random_state – The random seed.

Returns

A subset extracted from tabular_data.

Return type

Tabular

static undersample(tabular_data, random_state=None)

Undersamples a class-imbalanced dataset to make it more balanced, i.e., keeps all of the data in the minority class and decreases the size of the majority class. It guarantees that all the categorical values are included in the sampled dataframe, i.e., there will be no missing categorical values.

Parameters

tabular_data (Tabular) – The input tabular data.
random_state – The random seed.

Returns

A subset extracted from tabular_data.

Return type

Tabular

static oversample(tabular_data, random_state=None)

Oversamples a class-imbalanced dataset to make it more balanced, i.e., keeps all of the data in the majority class and increases the size of the minority class. It guarantees that all the categorical values are included in the sampled dataframe, i.e., there will be no missing categorical values.

Parameters

tabular_data (Tabular) – The input tabular data.
random_state – The random seed.

Returns

An oversampled dataset.

Return type

Tabular
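
The difference between under-sampling and over-sampling can be sketched on a list of class labels. The functions below return row indices and are hypothetical stand-ins, not the `Sampler` implementation: they assume classes are balanced to exactly the same size, and they do not implement the guarantee about categorical values described above.

```python
import random
from collections import defaultdict


def group_by_label(labels):
    # Collect the row indices belonging to each class.
    groups = defaultdict(list)
    for i, y in enumerate(labels):
        groups[y].append(i)
    return groups


def undersample(labels, random_state=None):
    # Keep the minority class and shrink every other class to its size.
    rng = random.Random(random_state)
    groups = group_by_label(labels)
    n = min(len(idx) for idx in groups.values())
    picked = []
    for idx in groups.values():
        picked.extend(rng.sample(idx, n))
    return sorted(picked)


def oversample(labels, random_state=None):
    # Keep the majority class and grow every other class by
    # drawing extra rows with replacement.
    rng = random.Random(random_state)
    groups = group_by_label(labels)
    n = max(len(idx) for idx in groups.values())
    picked = []
    for idx in groups.values():
        extra = [rng.choice(idx) for _ in range(n - len(idx))]
        picked.extend(idx + extra)
    return sorted(picked)


labels = [0, 0, 0, 0, 0, 0, 1, 1]
under = undersample(labels, random_state=0)
over = oversample(labels, random_state=0)
```

Passing the same `random_state` reproduces the same sample, which mirrors the role of the `random_state` parameter documented above.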