omnixai.data package
omnixai.data.base: The base class for all data types.
omnixai.data.tabular: The class for tabular data.
omnixai.data.image: The class for image data.
omnixai.data.text: The class for text data.
omnixai.data.timeseries: The class for time series data.
This package provides classes for representing tabular data, image data, text data and time series data, i.e., omnixai.data.tabular, omnixai.data.image, omnixai.data.text and omnixai.data.timeseries, respectively.
Given a pandas dataframe df with a set of categorical column names categorical_columns and a target column target_column (e.g., class labels), we can create a Tabular object as follows:
from omnixai.data.tabular import Tabular
tabular = Tabular(
data=df, # a pandas dataframe
categorical_columns=categorical_columns, # a list of categorical feature names
target_column=target_column) # a target column name
If df has no categorical columns or no target column, we can set categorical_columns=None or target_column=None, respectively. We can also create a Tabular object from a numpy array x, together with a list of feature names feature_columns and a list of categorical feature names categorical_columns:
from omnixai.data.tabular import Tabular
tabular = Tabular(
data=x, # a numpy array
feature_columns=feature_columns, # a list of feature names
categorical_columns=categorical_columns) # a list of categorical feature names
If there are no feature names, the default feature names will be the indices in the numpy array, e.g., 0, 1, …
The Image class represents a batch of images. Given a batch of images stored in a numpy array x with shape (batch_size, height, width, channel), we can create an Image instance:
from omnixai.data.image import Image
images = Image(
data=x, # a numpy array with shape (batch_size, height, width, channel)
batched=True, # if x represents a batch of images
channel_last=True) # if the last dimension of x is `channel`
If the last dimension is not channel, namely, x has shape (batch_size, channel, height, width), we need to set channel_last=False instead. If the numpy array x holds only one image with shape (height, width, channel), we need to set batched=False because the number of dimensions in x is 3 instead of 4.
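The layout rules above can be checked on shape tuples alone. The helper below is a hypothetical illustration and not part of omnixai: it takes the shape of a color-image array together with the batched and channel_last flags and returns the (batch_size, height, width, channel) shape that the text describes. Grayscale inputs without a channel dimension are deliberately left out of this sketch.

```python
def normalized_shape(shape, batched, channel_last):
    """Return the (batch_size, height, width, channel) shape for a color-image array.

    Hypothetical helper for illustration only; it is not an omnixai function.
    """
    expected_ndim = 4 if batched else 3
    if len(shape) != expected_ndim:
        raise ValueError(
            f"batched={batched} expects {expected_ndim} dimensions, got {len(shape)}"
        )
    # A single image is treated as a batch of size 1.
    if not batched:
        shape = (1,) + tuple(shape)
    batch_size, a, b, c = shape
    if channel_last:
        # Already laid out as (batch, height, width, channel).
        return (batch_size, a, b, c)
    # Laid out as (batch, channel, height, width): move the channel to the end.
    return (batch_size, b, c, a)


print(normalized_shape((8, 3, 32, 32), batched=True, channel_last=False))
print(normalized_shape((32, 32, 3), batched=False, channel_last=True))
```

Passing a 3-dimensional shape with batched=True raises a ValueError here, which mirrors the reason the text gives for setting batched=False on a single image.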
We can also convert a Pillow image into an Image instance:
from PIL import Image as PilImage
from omnixai.data.image import Image
im = PilImage.open("an_image.jpg")
image = Image(data=im)
The Text class represents a batch of texts or sentences. Given a list of strings texts, we can create a Text instance:
from omnixai.data.text import Text
text = Text(data=["Hello I'm a single sentence",
"And another sentence",
"And the very very last one"])
The Text class also lets you specify, via the tokenizer parameter, the tokenizer used to split each text/sentence into tokens. If tokenizer is set to None, a default tokenizer, nltk.word_tokenize, is applied.
The Timeseries class represents a batch of time series. The values of metrics/variables are stored in a numpy array with shape (batch_size, timestamps, num_variables). If there is only one time series, batch_size is 1. We can construct a Timeseries instance from one pandas dataframe or a list of pandas dataframes. The index of each dataframe holds the timestamps and the columns are the variables.
import pandas as pd
from omnixai.data.timeseries import Timeseries

# Two daily observations of three variables
df = pd.DataFrame(
    [['2017-12-27', 1263.94091, 394.507, 16.530],
     ['2017-12-28', 1299.86398, 506.424, 14.162]],
    columns=['Date', 'Consumption', 'Wind', 'Solar']
)
# Use the dates as a datetime index so that they become the timestamps
df = df.set_index('Date')
df.index = pd.to_datetime(df.index)
ts = Timeseries.from_pd(df)
omnixai.data.base module
The base class for all data types.
- class omnixai.data.base.Data
Bases:
object
Abstract base class for different data types.
- abstract property data_type
- Returns
A string indicating the data type, e.g., tabular, image, text or time series.
- abstract values()
- Returns
The raw values of the data object.
- abstract num_samples()
- Returns
The number of samples in the dataset.
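As a rough sketch of what implementing this interface involves, the snippet below defines a stand-in Data class with the three abstract members listed above and a hypothetical ListData subclass. Neither class is taken from omnixai; the stand-in only mirrors the member names shown in this reference so that the example runs without the library installed.

```python
from abc import ABC, abstractmethod


class Data(ABC):
    """Stand-in mirroring the abstract interface listed above (not the omnixai class)."""

    @property
    @abstractmethod
    def data_type(self):
        """A string naming the data type."""

    @abstractmethod
    def values(self):
        """The raw values of the data object."""

    @abstractmethod
    def num_samples(self):
        """The number of samples in the dataset."""


class ListData(Data):
    """Hypothetical concrete type that stores its samples in a plain list."""

    # A plain class attribute is enough to satisfy the abstract property.
    data_type = "list"

    def __init__(self, items):
        self._items = list(items)

    def values(self):
        return self._items

    def num_samples(self):
        return len(self._items)


samples = ListData(["a", "b", "c"])
print(samples.data_type, samples.num_samples())
```

Instantiating the stand-in Data class directly raises a TypeError, because its abstract members have no implementation. The concrete classes documented below declare data_type as a class attribute in the same way.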
omnixai.data.tabular module
The class for tabular data.
- class omnixai.data.tabular.Tabular(data, feature_columns=None, categorical_columns=None, target_column=None)
Bases:
Data
The class represents a tabular dataset that may contain categorical features, continuous-valued features and targets/labels (optional).
- Parameters
data (Union[DataFrame, ndarray]) – A pandas dataframe or a numpy array containing the raw data. data should have the shape (num_samples, num_features).
feature_columns (Optional[List]) – The feature column names. When feature_columns is None, feature_columns will be the column names in the pandas dataframe or the indices in the numpy array.
categorical_columns (Optional[List]) – A list of categorical feature names, e.g., a subset of feature column names in a pandas dataframe. If data is a numpy array and feature_columns = None, categorical_columns should be the indices of categorical features.
target_column (Union[str, int, None]) – The target/label column name. Set target_column to None if there is no target column.
- data_type = 'tabular'
- iloc(i)
Returns the row(s) given an index or a set of indices.
- Parameters
i (Union[int, slice, list]) – An integer index, slice or list.
- Returns
A tabular object with the selected rows.
- Return type
Tabular
- property shape: tuple
Returns the data shape, e.g., (num_samples, num_features).
- Returns
A tuple for the data shape.
- Return type
tuple
- num_samples()
Returns the number of the examples.
- Returns
The number of the examples.
- Return type
int
- property values: ndarray
Returns the raw values of the data object (without feature column names).
- Returns
A numpy array of the data object.
- Return type
np.ndarray
- property categorical_columns: List
Gets the categorical feature names.
- Returns
The list of the categorical feature names.
- Return type
Union[List[str], List[int]]
- property continuous_columns: List
Gets the continuous-valued feature names.
- Returns
The list of the continuous-valued feature names.
- Return type
Union[List[str], List[int]]
- property feature_columns: List
Gets all feature names.
- Returns
The list of all the feature column names except the target column.
- Return type
Union[List[str], List[int]]
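The relationship between feature_columns, categorical_columns and continuous_columns can be sketched in plain Python. The helper below is hypothetical and not part of omnixai. It assumes that every feature not listed as categorical counts as continuous-valued, and it uses positional indices as names when no feature names are given, as described for numpy input.

```python
def resolve_columns(num_features, feature_columns=None, categorical_columns=None):
    """Return (feature_columns, categorical_columns, continuous_columns).

    Hypothetical helper for illustration; it is not part of omnixai.
    """
    # Without explicit names, fall back to positional indices 0, 1, ...
    if feature_columns is None:
        feature_columns = list(range(num_features))
    if len(feature_columns) != num_features:
        raise ValueError("feature_columns must name every feature")
    categorical = list(categorical_columns or [])
    unknown = [c for c in categorical if c not in feature_columns]
    if unknown:
        raise ValueError(f"unknown categorical columns: {unknown}")
    # Assumption: every feature not marked categorical is continuous-valued.
    continuous = [c for c in feature_columns if c not in categorical]
    return list(feature_columns), categorical, continuous


print(resolve_columns(4, categorical_columns=[1, 3]))
print(resolve_columns(3, ["age", "job", "income"], ["job"]))
```

The first call shows why categorical_columns must hold indices when no feature names are supplied: the indices are the only names available.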
- property target_column: Union[str, int]
Gets the target/label column name.
- Returns
The target column name, or None if there is no target column.
- Return type
Union[str, int]
- property columns: Sequence
Gets all the data columns including both the feature columns and target/label column.
- Returns
The list of the column names.
- Return type
Sequence
- to_pd(copy=True)
Converts Tabular to pd.DataFrame.
- Parameters
copy – True if it returns a data copy, or False otherwise.
- Returns
A pandas DataFrame representing the tabular data.
- Return type
pd.DataFrame
- to_numpy(copy=True)
Converts Tabular to np.ndarray.
- Parameters
copy – True if it returns a data copy, or False otherwise.
- Returns
A numpy ndarray representing the tabular data.
- Return type
np.ndarray
- remove_target_column()
Removes the target/label column and returns a new Tabular instance.
- Returns
The new tabular data without target/label column.
- Return type
Tabular
- get_target_column()
Returns the target/label column.
- Returns
A list of targets or labels.
- Return type
List
- get_continuous_medians()
Gets the absolute median values of the continuous-valued features.
- Returns
A dict storing the absolute median value for each continuous-valued feature.
- Return type
Dict
- get_continuous_bounds()
Gets the upper and lower bounds of the continuous-valued features.
- Returns
The upper and lower bounds, i.e., a tuple of two numpy arrays.
- Return type
tuple
omnixai.data.image module
The class for image data.
- class omnixai.data.image.Image(data=None, batched=False, channel_last=True)
Bases:
Data
The class represents a batch of images. It supports both grayscale and RGB images. It will convert the input images into the (batch_size, h, w, channel) format. If there is only one input image, batch_size will be 1.
- Parameters
data (Union[ndarray, Image, None]) – The image data, which is either np.ndarray or PIL.Image. If data is a numpy array, it should have one of the following formats: (h, w, channel), (channel, h, w), (batch_size, h, w, channel) or (batch_size, channel, h, w). If data is a PIL.Image, batched and channel_last are ignored. The images contained in data will be automatically converted into a numpy array with shape (batch_size, h, w, channel). If there is only one image, batch_size will be 1.
batched (bool) – True if the first dimension of data is the batch size. False if data has one image only.
channel_last (bool) – True if the last dimension of data is the color channel or False if the first or second dimension of data is the color channel. If data has no color channel, e.g., grayscale images, this argument is ignored.
- data_type = 'image'
- property shape: tuple
Returns the raw data shape.
- Returns
A tuple for the raw data shape, e.g., (batch_size, h, w, channel).
- Return type
tuple
- num_samples()
Returns the number of the images.
- Returns
The number of the images.
- Return type
int
- property image_shape: tuple
Returns the image shape.
- Returns
A tuple for the image shape, e.g., (h, w, channel).
- Return type
tuple
- property values: ndarray
Returns the raw values.
- Returns
A numpy array of the stored images.
- Return type
np.ndarray
- to_numpy(hwc=True, copy=True, keepdim=False)
Converts Image into a numpy ndarray.
- Parameters
hwc – The output has format (batch_size, h, w, channel) if hwc is True or (batch_size, channel, h, w) otherwise.
copy – True if it returns a data copy, or False otherwise.
keepdim – True if the number of dimensions is kept for grayscale images, False if the channel dimension is squeezed.
- Returns
A numpy ndarray representing the images.
- Return type
np.ndarray
- to_pil()
Converts Image into a Pillow image or a list of Pillow images.
- Returns
A single Pillow image if batch_size = 1 or a list of Pillow images if batch_size > 1.
- Return type
Union[PilImage.Image, List]
omnixai.data.text module
The class for text data.
- class omnixai.data.text.Text(data=None, tokenizer=None)
Bases:
Data
The class represents a batch of texts or sentences. The texts or sentences are stored in a list of strings.
- Parameters
data (Union[List, str, None]) – The text data, either a string or a list of strings.
tokenizer (Optional[Callable]) – A tokenizer for splitting texts/sentences into tokens, which should be a callable object. If tokenizer is None, a default nltk tokenizer will be applied.
- data_type = 'text'
- num_samples()
Returns the number of the texts or sentences.
- Returns
The number of the texts or sentences.
- Return type
int
- property values
Returns the raw text data.
- Returns
A list of the sentences/texts.
- Return type
List
- to_tokens(**kwargs)
Converts sentences/texts into tokens. If tokenizer is None, a default split function, e.g., nltk.word_tokenize, is called to split a sentence into tokens. For example, ["omnixai library", "explainable AI"] will be split into [["omnixai", "library"], ["explainable", "AI"]].
- Parameters
kwargs – Additional parameters for the tokenizer
- Returns
A batch of tokens.
- Return type
List
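A custom tokenizer only needs to be a callable. As a minimal sketch, the regex-based function below is a hypothetical stand-in for such a callable; it is not part of omnixai or nltk, and its output can differ from nltk.word_tokenize on contractions and punctuation. This reference does not say whether the callable receives one string or the whole batch, so the function accepts both.

```python
import re

# Words, or single punctuation characters, separated by whitespace.
_TOKEN = re.compile(r"\w+|[^\w\s]")


def regex_tokenizer(texts, lowercase=False):
    """Split a string, or each string in a list, into word and punctuation tokens.

    Hypothetical stand-in for a custom tokenizer; not part of omnixai or nltk.
    """
    if isinstance(texts, str):
        return _TOKEN.findall(texts.lower() if lowercase else texts)
    return [regex_tokenizer(t, lowercase=lowercase) for t in texts]


print(regex_tokenizer(["omnixai library", "explainable AI"]))
```

The lowercase keyword stands in for the kind of extra option that to_tokens(**kwargs) is described as forwarding to the tokenizer.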
- to_str(copy=True)
Returns a string if it has only one sentence or a list of strings if it contains multiple sentences.
- Parameters
copy – Whether to copy the data.
- Returns
A single string or a list of strings.
- Return type
Union[List, str]
- split(sep=None, maxsplit=-1)
omnixai.data.timeseries module
The class for time series data.
- class omnixai.data.timeseries.Timeseries(data, timestamps=None, variable_names=None)
Bases:
Data
This class represents a univariate/multivariate time series dataset. The dataset contains a time series whose metric values are stored in a numpy array with shape (timestamps, num_variables).
- Parameters
data (ndarray) – A numpy array containing a time series. The shape of data is (timestamps, num_variables).
timestamps (Optional[List]) – A list of timestamps.
variable_names (Optional[List]) – A list of metric/variable names.
- data_type = 'timeseries'
- property ts_len: int
Returns the length of the time series.
- Return type
int
- property shape: tuple
Returns the raw data shape, e.g., (timestamps, num_variables).
- Returns
A tuple for the raw data shape.
- Return type
tuple
- num_samples()
Returns 1 because a Timeseries object only contains one time-series.
- Returns
1, since a Timeseries object holds a single time series.
- Return type
int
- property values: ndarray
Returns the raw values of the data object.
- Return type
ndarray
- Returns
A numpy array of the data object.
- property columns: List
Gets the metric/variable names.
- Return type
List
- Returns
The list of the metric/variable names.
- property index: ndarray
Gets the timestamps.
- Return type
ndarray
- Returns
A list of timestamps.
- to_pd()
Converts Timeseries to pd.DataFrame.
- Return type
DataFrame
- Returns
A pandas dataframe representing the time series.
- to_numpy(copy=True)
Converts Timeseries to np.ndarray.
- Parameters
copy – True if it returns a data copy, or False otherwise.
- Return type
ndarray
- Returns
A numpy ndarray representing the time series.
- copy()
Returns a copy of the time series instance.
- Returns
The copied time series instance.
- Return type
Timeseries
- classmethod from_pd(df)
Creates a Timeseries instance from one or multiple pandas dataframes. df is either one pandas dataframe or a list of pandas dataframes. The index of each dataframe should contain the timestamps.
- Returns
A Timeseries instance.
- Return type
Timeseries
- static get_timestamp_info(ts)
Returns a dict containing timestamp information, e.g., timestamp index name, timestamp values.
- Parameters
ts – A Timeseries instance.
- Returns
The timestamp information.
- static reset_timestamp_index(ts, timestamp_info)
Moves the timestamp index to a column and converts timestamps into floats.
- Parameters
ts – A Timeseries instance.
timestamp_info – The timestamp information.
- Returns
A converted Timeseries instance.
- static restore_timestamp_index(ts, timestamp_info)
Moves the timestamp column to the index and converts the floats back to timestamps.
- Parameters
ts – A Timeseries instance with a @timestamp column.
timestamp_info – The timestamp information.
- Returns
The original time-series dataframe.
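reset_timestamp_index and restore_timestamp_index together describe a round trip from timestamps to floats and back. This reference does not state which float encoding the library uses, so the sketch below assumes POSIX seconds in UTC. Both function names are hypothetical and the code does not call omnixai.

```python
from datetime import datetime, timezone


def timestamps_to_floats(timestamps):
    """Encode timezone-aware datetimes as POSIX seconds (an assumed encoding)."""
    return [ts.timestamp() for ts in timestamps]


def floats_to_timestamps(values):
    """Decode POSIX seconds back into UTC datetimes."""
    return [datetime.fromtimestamp(v, tz=timezone.utc) for v in values]


index = [
    datetime(2017, 12, 27, tzinfo=timezone.utc),
    datetime(2017, 12, 28, tzinfo=timezone.utc),
]
encoded = timestamps_to_floats(index)
restored = floats_to_timestamps(encoded)
print(encoded[1] - encoded[0])  # gap between the two dates, in seconds
print(restored == index)
```

Keeping the timestamps timezone-aware makes the round trip independent of the machine's local time zone.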