omnixai.preprocessing package
- omnixai.preprocessing.base: The base class for all the transforms.
- omnixai.preprocessing.encode: The pre-processing functions for categorical and continuous-valued features.
- omnixai.preprocessing.fill: The pre-processing functions for filling NaNs and missing values.
- omnixai.preprocessing.normalize: The pre-processing functions for continuous-valued features.
- omnixai.preprocessing.pipeline: The pipeline for multiple pre-processing transforms.
- omnixai.preprocessing.tabular: The pre-processing function for tabular data.
- omnixai.preprocessing.image: The transformations for image data.
- omnixai.preprocessing.text: The pre-processing functions for text data.
This package provides a number of useful data pre-processing transforms. Each transform inherits from omnixai.preprocessing.base.TransformBase with three main methods:
- fit(self, x): Estimates the parameters of the transform with data x.
- transform(self, x): Applies the transform to the input data x.
- invert(self, x): Applies the inverse transform to the input data x.
For example, omnixai.preprocessing.tabular.TabularTransform provides a convenient way for feature pre-processing on tabular datasets:
import pandas as pd
from omnixai.data.tabular import Tabular
from omnixai.preprocessing.normalize import MinMax
from omnixai.preprocessing.encode import OneHot
from omnixai.preprocessing.tabular import TabularTransform
x = Tabular(
    data=pd.DataFrame({
        'A': [1, 2, 2, 6],
        'B': [5, 4, 3, 2],
        'C': ['a', 'b', 'c', 'd']
    }),
    categorical_columns=['C']
)
transform = TabularTransform(
    cate_transform=OneHot(),  # One-hot encoding for categorical features
    cont_transform=MinMax()   # Min-max normalization for continuous-valued features
).fit(x)
y = transform.transform(x)  # Transforms tabular data into a numpy array
z = transform.invert(y)     # Applies the inverse transform
Note that some transforms, such as FillNaN and FillNaNTabular, only have pseudo-inverse transforms that may not recover the original data.
For Image data, one can transform images in a similar way:
from PIL import Image as PilImage
from omnixai.data.image import Image
from omnixai.preprocessing.image import Resize
img = Image(PilImage.open('some_image.jpg'))
transform = Resize(size=(360, 240)) # A transform for resizing images
x = transform.transform(img) # Applies the transform
y = transform.invert(x) # Applies the inverse transform
For Text data, one can apply a TF-IDF transform as follows:
from omnixai.data.text import Text
from omnixai.preprocessing.text import Tfidf
text = Text(
data=["Hello I'm a single sentence",
"And another sentence",
"And the very very last one"]
)
transform = Tfidf().fit(text) # Fit a TF-IDF transform
vectors = transform.transform(text) # Applies the transform for feature vectors
omnixai.preprocessing.base module
The base class for all the transforms.
- class omnixai.preprocessing.base.TransformBase
Bases:
object
Abstract base class for a data pre-processing transform.
- abstract fit(x)
Estimates the parameters of the transform.
- Parameters
x – The data for estimating the parameters.
- Returns
The current instance.
- abstract transform(x)
Applies the transform to the input data.
- Parameters
x – The data on which to apply the transform.
- Returns
The transformed data.
- abstract invert(x)
Applies the inverse transform to the input data.
- Parameters
x – The data on which to apply the inverse transform.
- Returns
The inverse transformed data.
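The contract described above can be illustrated without the library. The following is a minimal sketch in plain Python: `SketchTransformBase` and `ShiftByFirst` are hypothetical names that only mirror the three abstract methods and are not part of omnixai.

```python
from abc import ABC, abstractmethod


class SketchTransformBase(ABC):
    """Hypothetical stand-in mirroring the fit/transform/invert contract."""

    @abstractmethod
    def fit(self, x):
        """Estimates parameters from x and returns the current instance."""

    @abstractmethod
    def transform(self, x):
        """Returns the transformed data."""

    @abstractmethod
    def invert(self, x):
        """Returns the inverse transformed data."""


class ShiftByFirst(SketchTransformBase):
    """Toy transform: subtracts the first value seen during fit."""

    def __init__(self):
        self.offset = 0.0

    def fit(self, x):
        self.offset = x[0]
        return self  # returning self allows ShiftByFirst().fit(x).transform(x)

    def transform(self, x):
        return [v - self.offset for v in x]

    def invert(self, x):
        return [v + self.offset for v in x]


data = [3.0, 5.0, 8.0]
t = ShiftByFirst().fit(data)
shifted = t.transform(data)    # [0.0, 2.0, 5.0]
restored = t.invert(shifted)   # [3.0, 5.0, 8.0]
```

Returning the instance from fit is what makes the chained `.fit(x)` calls in the package examples possible.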
- class omnixai.preprocessing.base.Identity
Bases:
TransformBase
Identity transformation.
- fit(x)
Estimates the parameters of the transform.
- Parameters
x – The data for estimating the parameters.
- Returns
The current instance.
- transform(x)
Applies the transform to the input data.
- Parameters
x – The data on which to apply the transform.
- Returns
The transformed data.
- invert(x)
Applies the inverse transform to the input data.
- Parameters
x – The data on which to apply the inverse transform.
- Returns
The inverse transformed data.
omnixai.preprocessing.encode module
The pre-processing functions for categorical and continuous-valued features.
- class omnixai.preprocessing.encode.KBins(n_bins, **kwargs)
Bases:
TransformBase
Discretizes continuous values into bins.
- fit(x)
Estimates the parameters of the transform.
- Parameters
x – The data for estimating the parameters.
- Returns
The current instance.
- transform(x)
Applies the transform to the input data.
- Parameters
x – The data on which to apply the transform.
- Returns
The transformed data.
- invert(x)
Applies the inverse transform to the input data.
- Parameters
x – The data on which to apply the inverse transform.
- Returns
The inverse transformed data.
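To make the idea of discretization concrete, here is a minimal sketch in plain Python. It assumes equal-width bins, which the documentation above does not state; `fit_bin_edges` and `to_bin` are hypothetical helpers, not KBins methods.

```python
def fit_bin_edges(values, n_bins):
    """Computes n_bins + 1 equally spaced edges between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def to_bin(value, edges):
    """Returns the index of the bin whose right edge exceeds the value."""
    n_bins = len(edges) - 1
    for i in range(n_bins):
        if value < edges[i + 1]:
            return i
    return n_bins - 1  # the maximum value falls into the last bin


values = [0.0, 1.0, 2.5, 4.0, 9.0, 10.0]
edges = fit_bin_edges(values, n_bins=5)     # [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
bins = [to_bin(v, edges) for v in values]   # [0, 0, 1, 2, 4, 4]
```

Any inverse of such a mapping can only return a representative value per bin (for example its midpoint), so the original values are not recovered exactly.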
- class omnixai.preprocessing.encode.OneHot(drop=None, **kwargs)
Bases:
TransformBase
One-hot encoding for categorical values.
- fit(x)
Estimates the parameters of the transform.
- Parameters
x – The data for estimating the parameters.
- Returns
The current instance.
- transform(x)
Applies the transform to the input data.
- Parameters
x – The data on which to apply the transform.
- Returns
The transformed data.
- invert(x)
Applies the inverse transform to the input data.
- Parameters
x – The data on which to apply the inverse transform.
- Returns
The inverse transformed data.
- property categories
Returns the categories for each feature.
- get_feature_names(input_features=None)
Returns the feature names in the transformed data.
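As an illustration of what one-hot encoding and its inverse do, the following plain-Python sketch encodes a single categorical column. The helper names are hypothetical, and ordering the categories alphabetically is an assumption made for the example only.

```python
def fit_categories(column):
    """Collects the distinct values of a column in sorted order."""
    return sorted(set(column))


def one_hot(column, categories):
    """Encodes each value as a 0/1 row with a single 1 at its category index."""
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for value in column:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return rows


def invert_one_hot(rows, categories):
    """Maps each 0/1 row back to the category at the position of the 1."""
    return [categories[row.index(1)] for row in rows]


column = ["a", "b", "c", "a"]
categories = fit_categories(column)            # ['a', 'b', 'c']
encoded = one_hot(column, categories)
decoded = invert_one_hot(encoded, categories)  # ['a', 'b', 'c', 'a']
```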
- class omnixai.preprocessing.encode.Ordinal
Bases:
TransformBase
Ordinal encoding for categorical values.
- fit(x)
Estimates the parameters of the transform.
- Parameters
x – The data for estimating the parameters.
- Returns
The current instance.
- transform(x)
Applies the transform to the input data.
- Parameters
x – The data on which to apply the transform.
- Returns
The transformed data.
- invert(x)
Applies the inverse transform to the input data.
- Parameters
x – The data on which to apply the inverse transform.
- Returns
The inverse transformed data.
- property categories
Returns the categories for each feature.
- class omnixai.preprocessing.encode.LabelEncoder
Bases:
TransformBase
Ordinal encoding for targets/labels.
- fit(x)
Estimates the parameters of the transform.
- Parameters
x – The data for estimating the parameters.
- Returns
The current instance.
- transform(x)
Applies the transform to the input data.
- Parameters
x – The data on which to apply the transform.
- Returns
The transformed data.
- invert(x)
Applies the inverse transform to the input data.
- Parameters
x – The data on which to apply the inverse transform.
- Returns
The inverse transformed data.
- property categories
Returns the class labels.
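Ordinal encoding of labels amounts to a lookup table between class names and integer ids. A minimal sketch in plain Python follows; assigning ids in sorted order is an assumption of the example, not a documented property of LabelEncoder.

```python
labels = ["cat", "dog", "cat", "bird"]

# "fit": learn the class list and the class -> id mapping
classes = sorted(set(labels))                  # ['bird', 'cat', 'dog']
to_id = {c: i for i, c in enumerate(classes)}

# "transform": replace each label by its integer id
encoded = [to_id[label] for label in labels]   # [1, 2, 1, 0]

# "invert": look the ids up in the class list again
decoded = [classes[i] for i in encoded]
```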
omnixai.preprocessing.normalize module
The pre-processing functions for continuous-valued features.
- class omnixai.preprocessing.normalize.Standard
Bases:
TransformBase
Standard normalization, i.e., zero mean and unit variance.
- fit(x)
Estimates the parameters of the transform.
- Parameters
x – The data for estimating the parameters.
- Returns
The current instance.
- transform(x)
Applies the transform to the input data.
- Parameters
x – The data on which to apply the transform.
- Returns
The transformed data.
- invert(x)
Applies the inverse transform to the input data.
- Parameters
x – The data on which to apply the inverse transform.
- Returns
The inverse transformed data.
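The arithmetic behind standard normalization can be sketched in a few lines of plain Python. The example uses the population standard deviation; which estimator the library uses is not stated above, so treat that choice as an assumption.

```python
from statistics import fmean, pstdev

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# "fit": estimate the mean and the (population) standard deviation
mean = fmean(values)
std = pstdev(values)

# "transform": subtract the mean and divide by the standard deviation
scaled = [(v - mean) / std for v in values]

# "invert": undo the two steps in reverse order
restored = [s * std + mean for s in scaled]
```

For this input the mean is 5 and the standard deviation is 2, so the first value 2.0 maps to -1.5.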
- class omnixai.preprocessing.normalize.MinMax
Bases:
TransformBase
Rescales the values to the range [0, 1].
- fit(x)
Estimates the parameters of the transform.
- Parameters
x – The data for estimating the parameters.
- Returns
The current instance.
- transform(x)
Applies the transform to the input data.
- Parameters
x – The data on which to apply the transform.
- Returns
The transformed data.
- invert(x)
Applies the inverse transform to the input data.
- Parameters
x – The data on which to apply the inverse transform.
- Returns
The inverse transformed data.
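Min-max rescaling maps the smallest fitted value to 0 and the largest to 1. The sketch below applies it in plain Python to column 'A' from the package example; it assumes the column is not constant, since otherwise the denominator would be zero.

```python
values = [1, 2, 2, 6]

# "fit": remember the minimum and maximum of the data
lo, hi = min(values), max(values)

# "transform": (v - min) / (max - min)
scaled = [(v - lo) / (hi - lo) for v in values]

# "invert": s * (max - min) + min
restored = [s * (hi - lo) + lo for s in scaled]
```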
- class omnixai.preprocessing.normalize.Scale(ratio=1.0)
Bases:
TransformBase
Rescales the values to values * ratio.
- fit(x)
Estimates the parameters of the transform.
- Parameters
x – The data for estimating the parameters.
- Returns
The current instance.
- transform(x)
Applies the transform to the input data.
- Parameters
x – The data on which to apply the transform.
- Returns
The transformed data.
- invert(x)
Applies the inverse transform to the input data.
- Parameters
x – The data on which to apply the inverse transform.
- Returns
The inverse transformed data.
omnixai.preprocessing.fill module
The pre-processing functions for filling NaNs and missing values.
- class omnixai.preprocessing.fill.FillNaN(value)
Bases:
TransformBase
Fill NaNs in a pandas dataframe or a numpy array.
- Parameters
value (Union[str, int, float]) – The value to fill NaNs, chosen from ['mean', 'median'] or float values
- fit(x)
Estimates the parameters of the transform.
- Parameters
x (Union[ndarray, DataFrame]) – The data for estimating the parameters.
- Returns
The current instance.
- transform(x)
Applies the transform to the input data.
- Parameters
x (Union[ndarray, DataFrame]) – The data on which to apply the transform.
- Return type
Union[ndarray, DataFrame]
- Returns
The transformed data.
- invert(x)
This is a pseudo-inverse transform because the positions of the NaNs in the original data are not stored.
- Parameters
x (Union[ndarray, DataFrame]) – The data on which to apply the inverse transform.
- Returns
The inverse transformed data.
- Return type
Union[np.ndarray, pd.DataFrame]
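Why the inverse can only be a pseudo-inverse is easy to see on a small column. The following plain-Python sketch fills NaNs with the mean of the observed values; it is an illustration of the idea, not the FillNaN implementation.

```python
import math
from statistics import fmean

column = [1.0, float("nan"), 3.0, float("nan"), 5.0]

# "fit": the fill value is the mean of the non-NaN entries
observed = [v for v in column if not math.isnan(v)]
fill_value = fmean(observed)

# "transform": replace every NaN by the fill value
filled = [fill_value if math.isnan(v) else v for v in column]

# After filling, the genuine 3.0 at index 2 and the filled values at
# indices 1 and 3 are indistinguishable, so the NaN positions are lost.
candidates = [i for i, v in enumerate(filled) if v == fill_value]
```

Here `candidates` contains three indices although only two of them held NaNs originally, which is exactly the information an exact inverse would need.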
- class omnixai.preprocessing.fill.FillNaNTabular(value)
Bases:
TransformBase
Fill NaNs in a Tabular object.
- Parameters
value (Union[str, int, float]) – The value to fill NaNs, chosen from ['mean', 'median'] or float values
- fit(x)
Fits a FillNaN transformer.
- Parameters
x (Tabular) – A Tabular object.
- Returns
Itself.
- Return type
FillNaNTabular
- transform(x)
Fills NaNs in the continuous-valued features.
omnixai.preprocessing.pipeline module
The pipeline for multiple pre-processing transforms.
- class omnixai.preprocessing.pipeline.Pipeline
Bases:
object
The pipeline for multiple pre-processing transforms.
- name = 'pipeline'
- step(transformer)
Adds a new transform into the pipeline.
- Parameters
transformer (TransformBase) – A transformer derived from TransformBase.
- Returns
The current pipeline instance.
- fit(x)
Estimates the parameters of all the transforms.
- Parameters
x – The data for estimating the parameters.
- Returns
The current instance.
- transform(x)
Applies all the transforms to the input data.
- Parameters
x – The data on which to apply the transform.
- Returns
The transformed data.
- invert(x)
Applies the inverse transforms to the input data.
- Parameters
x – The data on which to apply the inverse transform.
- Returns
The inverse transformed data.
- dump(directory)
Saves the pipeline to a file in the specified directory.
- Parameters
directory – The directory to save the pipeline
- load(directory)
Loads the pipeline from a file in the specified directory.
- Parameters
directory – The directory to load the pipeline from
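The way a pipeline composes transforms can be sketched in plain Python. All class names below are hypothetical stand-ins. The sketch assumes that each step is fitted on the output of the previous step and that inversion runs through the steps in reverse order; both are natural choices but are not confirmed by the text above.

```python
class ShiftToZero:
    """Toy step: subtracts the minimum seen during fit."""

    def __init__(self):
        self.minimum = 0.0

    def fit(self, x):
        self.minimum = min(x)
        return self

    def transform(self, x):
        return [v - self.minimum for v in x]

    def invert(self, x):
        return [v + self.minimum for v in x]


class ScaleBy:
    """Toy step: multiplies by a fixed ratio, nothing to fit."""

    def __init__(self, ratio):
        self.ratio = ratio

    def fit(self, x):
        return self

    def transform(self, x):
        return [v * self.ratio for v in x]

    def invert(self, x):
        return [v / self.ratio for v in x]


class SketchPipeline:
    def __init__(self):
        self.steps = []

    def step(self, transformer):
        self.steps.append(transformer)
        return self  # allows chaining .step(...).step(...)

    def fit(self, x):
        for t in self.steps:
            x = t.fit(x).transform(x)  # later steps see transformed data
        return self

    def transform(self, x):
        for t in self.steps:
            x = t.transform(x)
        return x

    def invert(self, x):
        for t in reversed(self.steps):  # undo the last step first
            x = t.invert(x)
        return x


data = [2.0, 4.0, 6.0]
pipeline = SketchPipeline().step(ShiftToZero()).step(ScaleBy(0.5)).fit(data)
out = pipeline.transform(data)     # [0.0, 1.0, 2.0]
restored = pipeline.invert(out)    # [2.0, 4.0, 6.0]
```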
omnixai.preprocessing.tabular module
The pre-processing function for tabular data.
- class omnixai.preprocessing.tabular.TabularTransform(cate_transform=None, cont_transform=None, target_transform=None)
Bases:
TransformBase
Transforms for a data.tabular.Tabular instance.
- Parameters
cate_transform (Optional[TransformBase]) – The transform for the categorical features, e.g., OneHot, Ordinal. Default is OneHot.
cont_transform (Optional[TransformBase]) – The transform for the continuous-valued features, e.g., Identity, Standard, MinMax, Scale. Default is Identity.
target_transform (Optional[TransformBase]) – The transform for the target column, e.g., Identity for regression, LabelEncoder for classification. Default is LabelEncoder.
- fit(x)
Fits a tabular transformer.
- Parameters
x (Tabular) – A Tabular object.
- Returns
Itself.
- Return type
TabularTransform
- transform(x)
Transforms the input tabular instance. The output data concatenates the transformed categorical features, continuous-valued features and targets/labels (if they exist) together.
- Parameters
x (Tabular) – A Tabular object.
- Returns
The transformed data.
- Return type
np.ndarray
- invert(x)
Converts a numpy array into a Tabular object.
- Parameters
x (ndarray) – An input numpy array.
- Returns
The inverse Tabular object.
- Return type
Tabular
- decompose(x)
Decomposes the transformed data into categorical, continuous and target.
- Parameters
x (ndarray) – An input numpy array.
- Returns
A tuple of categorical, continuous and target data.
- Return type
tuple
- property categories
Gets the categories for all the features.
- Returns
A list of categories, i.e., categories[i] holds the categories expected in the ith column, or None.
- property class_names
Returns the class names for a classification task.
- Returns
A list of class names or None.
- get_feature_names()
Returns the feature names in the transformed data.
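The relationship between transform and decompose can be pictured as concatenating and then slicing a row. The sketch below is plain Python with hypothetical helper names; it assumes the column order given in the description of transform (categorical, then continuous, then target) and is not the library's code.

```python
def concat_row(cate, cont, target=None):
    """Joins the encoded categorical part, the continuous part and the target."""
    row = list(cate) + list(cont)
    if target is not None:
        row.append(target)
    return row


def decompose_row(row, n_cate, n_cont):
    """Splits a row back into its three parts using the known widths."""
    cate = row[:n_cate]
    cont = row[n_cate:n_cate + n_cont]
    target = row[n_cate + n_cont:] or None  # None when no target column exists
    return cate, cont, target


# First row of the package example: C='a' one-hot encoded over four
# categories, A=1 and B=5 min-max scaled to 0.0 and 1.0.
row = concat_row(cate=[1, 0, 0, 0], cont=[0.0, 1.0])
parts = decompose_row(row, n_cate=4, n_cont=2)

# The same row with a class label appended as the target
labelled = concat_row(cate=[1, 0, 0, 0], cont=[0.0, 1.0], target=1)
labelled_parts = decompose_row(labelled, n_cate=4, n_cont=2)
```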
omnixai.preprocessing.image module
The transformations for image data.
- class omnixai.preprocessing.image.Scale(ratio=0.00392156862745098)
Bases:
TransformBase
Rescales image pixel values to values * ratio.
- fit(x)
Estimates the parameters of the transform.
- Parameters
x (Image) – The data for estimating the parameters.
- Returns
The current instance.
- transform(x)
Applies the transform to the input data.
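The default ratio in the signature is numerically 1/255, so with the default setting 8-bit pixel values in [0, 255] are mapped to [0, 1]. A small plain-Python check of that arithmetic, using a hypothetical list of pixel values:

```python
DEFAULT_RATIO = 0.00392156862745098  # default shown in the class signature

pixels = [0, 51, 255]                        # hypothetical 8-bit pixel values
scaled = [p * DEFAULT_RATIO for p in pixels]
restored = [s / DEFAULT_RATIO for s in scaled]
```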
- class omnixai.preprocessing.image.Round2Int
Bases:
TransformBase
Rounds float values to integer values.
- fit(x)
Estimates the parameters of the transform.
- Parameters
x (Image) – The data for estimating the parameters.
- Returns
The current instance.
- transform(x)
Applies the transform to the input data.
- class omnixai.preprocessing.image.Normalize(mean, std)
Bases:
TransformBase
Normalizes an image with mean and standard deviation.
- Parameters
mean – A mean for all the channels or a sequence of means for each channel.
std – A std for all the channels or a sequence of stds for each channel.
- fit(x)
Estimates the parameters of the transform.
- Parameters
x (Image) – The data for estimating the parameters.
- Returns
The current instance.
- transform(x)
Applies the transform to the input data.
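The mean and std parameters accept either one number for all channels or one number per channel. The following plain-Python sketch shows that broadcasting on a single pixel; `as_per_channel` and `normalize_pixel` are hypothetical helpers, and the formula (value - mean) / std is the usual definition, assumed here rather than taken from the library.

```python
def as_per_channel(value, n_channels):
    """Expands a scalar to one entry per channel; passes sequences through."""
    if isinstance(value, (int, float)):
        return [float(value)] * n_channels
    return [float(v) for v in value]


def normalize_pixel(pixel, mean, std):
    """Applies (value - mean) / std channel by channel."""
    means = as_per_channel(mean, len(pixel))
    stds = as_per_channel(std, len(pixel))
    return [(v - m) / s for v, m, s in zip(pixel, means, stds)]


pixel = [0.5, 0.25, 1.0]  # one RGB pixel, already scaled to [0, 1]
normalized = normalize_pixel(pixel, mean=0.5, std=[0.5, 0.25, 0.5])
```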
- class omnixai.preprocessing.image.Resize(size, resample=2)
Bases:
TransformBase
Resizes the input image to a given size.
- Parameters
size (Union[Sequence, int]) – The desired output size. If size is a sequence (h, w), the output size will be (h, w). If size is an int, the smaller edge will match this number.
resample – The desired resampling strategy.
- fit(x)
Estimates the parameters of the transform.
- Parameters
x – The data for estimating the parameters.
- Returns
The current instance.
- transform(x)
Applies the transform to the input data.
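The two forms of the size parameter can be made concrete with a small calculation. The helper below is a hypothetical sketch in plain Python; for an int size it assumes the aspect ratio is preserved, which is the common convention but is not stated explicitly above.

```python
def output_size(height, width, size):
    """Returns the (h, w) an image of the given shape would be resized to."""
    if isinstance(size, int):
        # Assumption: the smaller edge becomes `size`, the other edge is
        # scaled by the same factor to keep the aspect ratio.
        if height <= width:
            return size, round(width * size / height)
        return round(height * size / width), size
    h, w = size
    return h, w


landscape = output_size(480, 640, 240)         # (240, 320)
portrait = output_size(640, 480, 240)          # (320, 240)
explicit = output_size(480, 640, (360, 240))   # (360, 240)
```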
omnixai.preprocessing.text module
The pre-processing functions for text data.
- class omnixai.preprocessing.text.Tfidf(**kwargs)
Bases:
TransformBase
The TF-IDF transformation.
- fit(x, **kwargs)
Estimates the parameters of the transform.
- Parameters
x (Text) – The data for estimating the parameters.
- Returns
The current instance.
- transform(x)
Applies the transform to the input data.
- Parameters
x (Text) – The data on which to apply the transform.
- Returns
The transformed data.
- invert(x)
Applies the inverse transform to the input data.
- Parameters
x – The data on which to apply the inverse transform.
- Returns
The inverse transformed data.
- get_feature_names()
Returns the feature names in the transformed data.
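For readers unfamiliar with TF-IDF, the sketch below computes one textbook variant in plain Python on the three sentences from the package example: term frequency times log(N / document frequency). The tokenization (lowercasing and splitting on whitespace) and the exact weighting are assumptions of the example; the library's Tfidf may use a different formula and tokenizer.

```python
import math
from collections import Counter

docs = [
    "Hello I'm a single sentence",
    "And another sentence",
    "And the very very last one",
]

# "fit": build the vocabulary and count in how many documents each word occurs
tokenized = [d.lower().split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
doc_freq = Counter(w for doc in tokenized for w in set(doc))
n_docs = len(docs)


def tfidf_vector(tokens):
    """Returns one weight per vocabulary word for a tokenized document."""
    counts = Counter(tokens)
    return [
        counts[w] / len(tokens) * math.log(n_docs / doc_freq[w])
        for w in vocab
    ]


# "transform": one feature vector per document
vectors = [tfidf_vector(tokens) for tokens in tokenized]
```

Words that occur in few documents, such as "very", receive a larger weight than words shared across documents, such as "sentence".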
- class omnixai.preprocessing.text.Word2Id(remove_punctuation=True, **kwargs)
Bases:
TransformBase
The class for converting words into IDs.
- PAD = 0
- START = 1
- UNK = 2
- fit(x, **kwargs)
Estimates the parameters of the transform.
- Parameters
x (Text) – The data for estimating the parameters.
- Returns
The current instance.
- transform(x, **kwargs)
Applies the transform to the input data.
- Parameters
x (Text) – The data on which to apply the transform.
- Returns
The transformed data.
- invert(x)
Applies the inverse transform to the input data.
- Parameters
x – The data on which to apply the inverse transform.
- Returns
The inverse transformed data.
- property vocab_size
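The three reserved ids listed above (PAD = 0, START = 1, UNK = 2) suggest the following kind of mapping. This is a plain-Python sketch with hypothetical helpers; prepending START, padding to a fixed length and assigning word ids from 3 upward are assumptions of the example, not documented behaviour of Word2Id.

```python
PAD, START, UNK = 0, 1, 2  # reserved ids, as listed for Word2Id


def build_vocab(sentences):
    """Assigns ids 3, 4, ... to words in order of first appearance."""
    word_to_id = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in word_to_id:
                word_to_id[word] = len(word_to_id) + 3
    return word_to_id


def encode(sentence, word_to_id, max_len):
    """Maps words to ids, using UNK for unseen words and PAD to fill up."""
    ids = [START] + [word_to_id.get(w, UNK) for w in sentence.lower().split()]
    ids = ids[:max_len]
    return ids + [PAD] * (max_len - len(ids))


vocab = build_vocab(["And another sentence"])
ids = encode("And a new sentence", vocab, max_len=6)   # [1, 3, 2, 2, 5, 0]
```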
omnixai.sampler.tabular module
The class for re-sampling training data.
- class omnixai.sampler.tabular.Sampler
Bases:
object
The class for re-sampling training data. It includes sub-sampling, under-sampling and over-sampling.
- static subsample(tabular_data, fraction, random_state=None)
Samples a subset of the input dataset. It guarantees that all the categorical values are included in the sampled dataframe, i.e., there will be no missing categorical values.
- static undersample(tabular_data, random_state=None)
Undersamples a class-imbalanced dataset to make it more balanced, i.e., keeping all of the data in the minority class and decreasing the size of the majority class. It guarantees that all the categorical values are included in the sampled dataframe, i.e., there will be no missing categorical values.
- static oversample(tabular_data, random_state=None)
Oversamples a class-imbalanced dataset to make it more balanced, i.e., keeping all of the data in the majority class and increasing the size of the minority class. It guarantees that all the categorical values are included in the sampled dataframe, i.e., there will be no missing categorical values.
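The idea behind oversampling can be sketched in plain Python: keep every original row and draw additional rows, with replacement, from each smaller class until all classes are as large as the largest one. The function below is a hypothetical illustration working on lists, not the Sampler implementation, which operates on Tabular objects.

```python
import random
from collections import Counter


def oversample(rows, labels, random_state=None):
    """Duplicates random minority-class rows until all classes are equal in size."""
    rng = random.Random(random_state)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)  # all original rows are kept
    for label, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels


rows = [[1], [2], [3], [4], [5]]
labels = [0, 0, 0, 0, 1]
new_rows, new_labels = oversample(rows, labels, random_state=0)
balance = Counter(new_labels)
```

Because no original row is dropped, every categorical value present in the input is still present in the output.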