logai.algorithms.vectorization_algo package

Submodules

logai.algorithms.vectorization_algo.fasttext module

class logai.algorithms.vectorization_algo.fasttext.FastText(params: FastTextParams)

Bases: VectorizationAlgo

This is a wrapper for the FastText algorithm from the gensim library. For details see https://radimrehurek.com/gensim/models/fasttext.html.

fit(loglines: Series)

Fits a FastText model.

Parameters:

loglines – The parsed loglines.

summary()

Generates model summary.

transform(loglines: Series) → Series

Transforms input loglines to log vectors.

Parameters:

loglines – The input loglines.

Returns:

The transformed log vectors.

class logai.algorithms.vectorization_algo.fasttext.FastTextParams(vector_size: int = 100, window: int = 100, min_count: int = 1, sample: float = 0.01, workers: int = 4, sg: int = 1, epochs: int = 100, max_token_len: int = 100)

Bases: Config

Configuration for FastText vectorizer. For more details on the parameters see https://radimrehurek.com/gensim/models/fasttext.html.

Parameters:
  • vector_size – The dimensionality of the feature vectors.

  • window – The maximum distance between the current and predicted word within a sentence.

  • min_count – Ignores all words with total frequency lower than this.

  • sample – The threshold for configuring which higher-frequency words are randomly downsampled.

  • workers – The number of worker threads used to train the model.

  • sg – Training algorithm: skip-gram if sg=1, otherwise CBOW.

  • epochs – The number of epochs.

  • max_token_len – The max token length.

epochs: int
max_token_len: int
min_count: int
sample: float
sg: int
vector_size: int
window: int
workers: int
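
Example (illustrative). The snippet below builds the vectorizer from its config and runs it on a toy pandas Series of parsed loglines; only the constructor and the fit/transform signatures documented above are assumed, and the data and parameter values are made up.

    import pandas as pd

    from logai.algorithms.vectorization_algo.fasttext import FastText, FastTextParams

    # Toy parsed loglines; in practice these come from a LogAI preprocessing/parsing step.
    loglines = pd.Series([
        "connection from * closed",
        "connection from * opened",
        "failed to open file *",
    ])

    params = FastTextParams(vector_size=50, window=5, min_count=1, epochs=10)
    vectorizer = FastText(params)
    vectorizer.fit(loglines)                  # trains the underlying gensim FastText model
    vectors = vectorizer.transform(loglines)  # pandas Series of log vectors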

logai.algorithms.vectorization_algo.forecast_nn module

class logai.algorithms.vectorization_algo.forecast_nn.ForecastNN(config: ForecastNNVectorizerParams)

Bases: VectorizationAlgo

Vectorizer class for forecast-based neural models for log representation learning.

Parameters:

config – A config object specifying parameters of the forecast-based neural log representation learning model.

fit(logrecord: LogRecordObject)

Fit method to train vectorizer.

Parameters:

logrecord – A log record object to train the vectorizer on.

transform(logrecord: LogRecordObject)

Transform method to run vectorizer on logrecord object.

Parameters:

logrecord – A log record object to be vectorized.

Returns:

ForecastNNVectorizedDataset object containing the vectorized dataset.

class logai.algorithms.vectorization_algo.forecast_nn.ForecastNNVectorizedDataset(logline_features, labels, nextlogline_ids, span_ids)

Bases: object

Class for storing the vectorized dataset for forecasting-based neural models.

Parameters:
  • logline_features – (np.array): list of vectorized log-sequences.

  • labels – (list or pd.Series or np.array): list of labels (anomalous or non-anomalous) for each log sequence.

  • nextlogline_ids – (list or pd.Series or np.array): list of ids of the next loglines, for each log sequence.

  • span_ids – (list or pd.Series or np.array): list of ids of the log sequences.

features: str = 'features'
session_idx: str = 'session_idx'
window_anomalies: str = 'window_anomalies'
window_labels: str = 'window_labels'

class logai.algorithms.vectorization_algo.forecast_nn.ForecastNNVectorizerParams(feature_type: str | None = None, label_type: str | None = None, sep_token: str = '[SEP]', max_token_len: int | None = None, min_token_count: int | None = None, embedding_dim: int | None = None, output_dir: str = '', vectorizer_metadata_filepath: str = '', vectorizer_model_dirpath: str = '', sequentialvec_config: object | None = None, semanticvec_config: object | None = None)

Bases: Config

Config class for the vectorizer for forecast-based neural models for log representation learning.

Parameters:
  • feature_type – The type of log feature representation; the supported types are “semantics” and “sequential”.

  • label_type – The type of label, “anomaly” or “next_log”, which corresponds to the supervised and the forecasting-based unsupervised settings respectively.

  • sep_token – The separator token used when constructing the log sequences during log grouping/partitioning. (default = “[SEP]”)

  • max_token_len – The maximum token length of the input.

  • min_token_count – The minimum number of occurrences of a token in the training data, for it to be considered in the vocab.

  • embedding_dim – The embedding dimension of the tokens.

  • output_dir – The path to output directory where the vectorizer model directory and metadata file would be created.

  • vectorizer_metadata_filepath – The path to file where the vectorizer metadata would be saved. This would be read by the anomaly detection model and should be set in the metadata_filepath of the forecast_nn based anomaly detector.

  • vectorizer_model_dirpath – The path to directory containing the vectorizer model.

embedding_dim: int
feature_type: str
label_type: str
max_token_len: int
min_token_count: int
output_dir: str
semanticvec_config: object
sep_token: str
sequentialvec_config: object
vectorizer_metadata_filepath: str
vectorizer_model_dirpath: str
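
Example (illustrative). A minimal sketch of wiring up the forecast-nn vectorizer; train_logrecord and test_logrecord are assumed to be LogRecordObject instances produced elsewhere (e.g. by a LogAI data loader), and all parameter values and paths are made up. The nested sequentialvec_config/semanticvec_config fields are left at their defaults here.

    from logai.algorithms.vectorization_algo.forecast_nn import (
        ForecastNN,
        ForecastNNVectorizerParams,
    )

    config = ForecastNNVectorizerParams(
        feature_type="sequential",              # or "semantics"
        label_type="anomaly",                   # or "next_log" for the forecasting setting
        sep_token="[SEP]",
        max_token_len=10,
        min_token_count=1,
        embedding_dim=100,
        output_dir="./forecast_nn_vectorizer",  # hypothetical output location
    )

    vectorizer = ForecastNN(config)
    vectorizer.fit(train_logrecord)                    # train_logrecord: LogRecordObject with the training data
    vectorized = vectorizer.transform(test_logrecord)  # returns a ForecastNNVectorizedDataset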

logai.algorithms.vectorization_algo.logbert module

class logai.algorithms.vectorization_algo.logbert.LogBERT(config: LogBERTVectorizerParams)

Bases: VectorizationAlgo

Vectorizer class for LogBERT.

Parameters:

config – A config object specifying parameters of the LogBERT vectorizer.

fit(logrecord: LogRecordObject)

Fit method for training vectorizer for logbert.

Parameters:

logrecord – A log record object containing the training dataset over which vectorizer is trained.

transform(logrecord: LogRecordObject)

Transform method for running vectorizer over logrecord object.

Parameters:

logrecord – A log record object containing the dataset to be vectorized.

Returns:

HuggingFace dataset object.

class logai.algorithms.vectorization_algo.logbert.LogBERTVectorizerParams(model_name: str = '', use_fast: bool = True, truncation: bool = True, max_token_len: int = 384, max_vocab_size: int = 5000, train_batch_size: int = 1000, output_dir: str | None = None, tokenizer_dirpath: str | None = None, num_proc: int = 4)

Bases: Config

Config class for the LogBERT vectorizer.

Parameters:
  • model_name – The name of the model, using the HuggingFace standardized naming.

  • use_fast – whether to use fast tokenization or not.

  • truncation – whether to truncate the input to max_token_len.

  • max_token_len – maximum token length of input, if truncation is set to true.

  • max_vocab_size – maximum size of the vocabulary.

  • custom_tokens – list of custom tokens.

  • train_batch_size – batch size during training the vectorizer.

  • output_dir – path to directory where the output would be saved.

  • tokenizer_dirpath – path to the directory where the vectorizer's tokenizer (the logbert tokenizer) would be saved.

  • num_proc – number of processes to be used when tokenizing.

custom_tokens = []
max_token_len: int
max_vocab_size: int
model_name: str
num_proc: int
output_dir: str
tokenizer_dirpath: str
train_batch_size: int
truncation: bool
use_fast: bool
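
Example (illustrative). One plausible way to run the LogBERT vectorizer, assuming train_logrecord and test_logrecord are LogRecordObject instances coming from a data loader; the model name and paths are placeholders, and only the documented constructor and fit/transform signatures are assumed.

    from logai.algorithms.vectorization_algo.logbert import LogBERT, LogBERTVectorizerParams

    config = LogBERTVectorizerParams(
        model_name="bert-base-cased",            # any HuggingFace model name
        max_token_len=384,
        max_vocab_size=5000,
        output_dir="./logbert_vectorizer",       # hypothetical output location
        tokenizer_dirpath="./logbert_vectorizer/tokenizer",
    )

    vectorizer = LogBERT(config)
    vectorizer.fit(train_logrecord)                    # trains the custom tokenizer on the training logs
    hf_dataset = vectorizer.transform(test_logrecord)  # HuggingFace dataset of tokenized log sequences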

logai.algorithms.vectorization_algo.semantic module

class logai.algorithms.vectorization_algo.semantic.Semantic(params: SemanticVectorizerParams)

Bases: VectorizationAlgo

Semantic vectorizer to convert loglines into token ids based on an embedding model and vocabulary (like word2vec, GloVe, and fastText). It supports either pretrained models with a pretrained vocabulary, or training word embedding models like Word2Vec or FastText on the given training data.

Parameters:

params – A config object for semantic vectorizer.

fit(loglines: Series)

Fit method to train semantic vectorizer.

Parameters:

loglines – A pandas Series object containing the dataset on which the semantic vectorizer is trained (and the vocab is built). Each data instance should be a logline or a sequence of loglines concatenated by the separator token.

summary()

Generate model summary.

transform(loglines: Series) → Series

Transform method to run semantic vectorizer on loglines.

Parameters:

loglines – The pandas Series containing the data to be vectorized. Each data instance should be a logline or a sequence of loglines concatenated by the separator token.

Returns:

The vectorized log data.

class logai.algorithms.vectorization_algo.semantic.SemanticVectorizerParams(max_token_len: int = 10, min_token_count: int = 1, sep_token: str = '[SEP]', embedding_dim: int = 300, window: int = 3, embedding_type: str = 'fasttext', model_save_dir: str | None = None)

Bases: Config

Configuration of semantic vectorization of loglines (or sequences of log lines) using models like word2vec, GloVe, and fastText.

Parameters:
  • max_token_len – maximum token length of the input.

  • min_token_count – minimum count of occurrences of a token in training data for it to be considered in the vocab.

  • sep_token – separator token used to separate log lines in input log sequence. Default is “[SEP]”.

  • embedding_dim – embedding dimension of the learnt token embeddings.

  • window – window size parameter for word2vec and fastText models.

  • embedding_type – type of embedding, currently supports glove, word2vec and fastText. Default is “fasttext”.

  • model_save_dir – path to directory where vectorizer models would be saved.

embedding_dim: int
embedding_type: str
max_token_len: int
min_token_count: int
model_save_dir: str
sep_token: str
window: int
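
Example (illustrative). The sketch below trains the semantic vectorizer on two toy loglines and maps them to token-id sequences; data, parameter values and paths are made up, and only the documented constructor and fit/transform signatures are assumed.

    import pandas as pd

    from logai.algorithms.vectorization_algo.semantic import Semantic, SemanticVectorizerParams

    loglines = pd.Series([
        "user login failed",
        "user login succeeded [SEP] session started",
    ])

    params = SemanticVectorizerParams(
        max_token_len=10,
        embedding_dim=100,
        embedding_type="fasttext",               # or "word2vec" / "glove"
        model_save_dir="./semantic_vectorizer",  # hypothetical save location
    )

    vectorizer = Semantic(params)
    vectorizer.fit(loglines)                    # builds the vocab and trains/loads the embedding model
    token_ids = vectorizer.transform(loglines)  # pandas Series of token-id sequences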

logai.algorithms.vectorization_algo.sequential module

class logai.algorithms.vectorization_algo.sequential.Sequential(params: SequentialVectorizerParams)

Bases: VectorizationAlgo

Sequential vectorizer to convert a sequence of loglines to a sequence of log ids.

Parameters:

params – A config object for storing parameters of Sequential Vectorizer.

fit(loglines: Series)

Fit method for training the sequential vectorizer.

Parameters:

loglines – A pandas Series object containing the dataset on which the sequential vectorizer is trained (and the vocab is built). Each data instance should be a logline or a sequence of loglines concatenated by the separator token.

transform(loglines: Series) → Series

Transform method for applying sequential vectorizer to loglines.

Parameters:

loglines – A pandas Series containing the data to be vectorized. Each data instance should be a logline or a sequence of loglines concatenated by the separator token.

Returns:

The vectorized loglines.

class logai.algorithms.vectorization_algo.sequential.SequentialVectorizerParams(sep_token: str | None = None, model_save_dir: str | None = None, max_token_len: int | None = None)

Bases: Config

Config for the Sequential vectorizer, which converts a sequence of loglines to a sequence of log ids.

Parameters:
  • sep_token – The separator token used to separate log lines in an input log sequence.

  • model_save_dir – The path to directory where models related to sequential vectorizer would be stored.

  • max_token_len – The maximum token length of input.

max_token_len: int
model_save_dir: str
sep_token: str
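
Example (illustrative). A minimal sketch, assuming each input entry is a sequence of loglines joined by the separator token; data and parameter values are made up.

    import pandas as pd

    from logai.algorithms.vectorization_algo.sequential import Sequential, SequentialVectorizerParams

    sequences = pd.Series([
        "template A [SEP] template B [SEP] template C",
        "template B [SEP] template D",
    ])

    params = SequentialVectorizerParams(sep_token="[SEP]", max_token_len=10)
    vectorizer = Sequential(params)
    vectorizer.fit(sequences)                       # builds the logline-to-id vocabulary
    id_sequences = vectorizer.transform(sequences)  # pandas Series of log-id sequences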

logai.algorithms.vectorization_algo.tfidf module

class logai.algorithms.vectorization_algo.tfidf.TfIdf(params: TfIdfParams, **kwargs)

Bases: VectorizationAlgo

TF-IDF based vectorizer for log data. This is a wrapper class for the TF-IDF vectorizer from scikit-learn. For details see https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html.

fit(loglines: Series)

Trains a TF-IDF model.

Parameters:

loglines – The input training dataset.

summary()

Generates model summary.

transform(loglines: Series) → Series

Transforms loglines into log vectors.

Parameters:

loglines – The input test dataset.

Returns:

The transformed log vectors.

class logai.algorithms.vectorization_algo.tfidf.TfIdfParams(input: str = 'content', encoding: str = 'utf-8', decode_error: str = 'strict', strip_accents: object | None = None, lowercase: bool = True, preprocessor: object | None = None, tokenizer: object | None = None, analyzer: str = 'word', stop_words: object | None = None, token_pattern: str = '(?u)\\b\\w\\w+\\b', ngram_range: tuple = (1, 1), max_df: float = 1.0, min_df: int = 1, max_features: object | None = None, vocabulary: object | None = None, binary: bool = False, dtype: object = <class 'numpy.float64'>, norm: str = 'l2', use_idf: bool = True, smooth_idf: bool = True, sublinear_tf: bool = False)

Bases: Config

Configuration of TF-IDF vectorizer. For more details of parameters see https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html.

Parameters:
  • input – {'filename', 'file', 'content'}. If ‘filename’, the sequence passed as an argument to fit is expected to be a list of filenames that need reading to fetch the raw content to analyze.

  • encoding – If bytes or files are given to analyze, this encoding is used to decode.

  • decode_error – {'strict', 'ignore', 'replace'}. Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given encoding. By default, it is ‘strict’, meaning that a UnicodeDecodeError will be raised. Other values are ‘ignore’ and ‘replace’.

  • strip_accents – Remove accents and perform other character normalization during the preprocessing step.

  • lowercase – Convert all characters to lowercase before tokenizing.

  • preprocessor – Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.

  • tokenizer – Override the string tokenization step while preserving the preprocessing and n-grams generation steps.

  • analyzer – Whether the feature should be made of word or character n-grams.

  • stop_words – If a string, it is passed to _check_stop_list and the appropriate stop list is returned. ‘english’ is currently the only supported string value.

  • token_pattern – Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'.

  • ngram_range – The lower and upper boundary of the range of n-values for different n-grams to be extracted.

  • max_df – When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold.

  • min_df – When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold.

  • max_features – If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.

  • vocabulary – Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents.

  • binary – If True, all non-zero term counts are set to 1. This does not mean outputs will have only 0/1 values, only that the tf term in tf-idf is binary.

  • dtype – Type of the matrix returned by fit_transform() or transform().

  • norm – {'l1', 'l2'}; each output row will have unit norm.

  • use_idf – Enable inverse-document-frequency reweighting. If False, idf(t) = 1.

  • smooth_idf – Smooth idf weights by adding one to document frequencies.

  • sublinear_tf – Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

analyzer: str
binary: bool
decode_error: str
dtype: object
encoding: str
input: str
lowercase: bool
max_df: float
max_features: object
min_df: int
ngram_range: tuple
norm: str
preprocessor: object
smooth_idf: bool
stop_words: object
strip_accents: object
sublinear_tf: bool
token_pattern: str
tokenizer: object
use_idf: bool
vocabulary: object
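
Example (illustrative). Because this class wraps scikit-learn's TfidfVectorizer, usage follows the same fit/transform pattern as the other vectorizers; the data and parameter values below are made up.

    import pandas as pd

    from logai.algorithms.vectorization_algo.tfidf import TfIdf, TfIdfParams

    train_loglines = pd.Series([
        "disk * is full",
        "disk * back to normal",
    ])
    test_loglines = pd.Series(["disk * is full"])

    params = TfIdfParams(ngram_range=(1, 2), max_features=1000)
    vectorizer = TfIdf(params)
    vectorizer.fit(train_loglines)                 # fits the underlying TfidfVectorizer
    vectors = vectorizer.transform(test_loglines)  # TF-IDF log vectors for the test loglines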

logai.algorithms.vectorization_algo.word2vec module

class logai.algorithms.vectorization_algo.word2vec.Word2Vec(params: Word2VecParams)

Bases: VectorizationAlgo

Word2Vec algorithm for converting raw log data into word2vec vectors. This is a wrapper class for the Word2Vec model from the gensim library. For details see https://radimrehurek.com/gensim/models/word2vec.html.

Parameters:

max_token_len – The max token length to vectorize; longer sentences will be truncated.

fit(loglines: Series)

Fits a Word2Vec model.

Parameters:

loglines – Parsed loglines.

summary()

Generates model summary.

transform(loglines: Series) → Series

Transforms input loglines to log vectors.

Parameters:

loglines – The input loglines.

Returns:

The transformed log vectors.

class logai.algorithms.vectorization_algo.word2vec.Word2VecParams(max_token_len: int = 100, min_count: int = 1, vector_size: int = 3, window: int = 3)

Bases: Config

Configuration of Word2Vec vectorization parameters. For more details on the parameters see https://radimrehurek.com/gensim/models/word2vec.html.

Parameters:
  • max_token_len – The maximum length of tokens.

  • min_count – Ignores all words with total frequency lower than this.

  • vector_size – Dimensionality of the feature vectors.

  • window – The maximum distance between the current and predicted word within a sentence.

max_token_len: int
min_count: int
vector_size: int
window: int
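
Example (illustrative). A short usage sketch in the same style as the other vectorizers; data and parameter values are made up, and only the documented constructor and fit/transform signatures are assumed.

    import pandas as pd

    from logai.algorithms.vectorization_algo.word2vec import Word2Vec, Word2VecParams

    loglines = pd.Series([
        "job * started on node *",
        "job * finished on node *",
    ])

    params = Word2VecParams(max_token_len=100, min_count=1, vector_size=50, window=3)
    vectorizer = Word2Vec(params)
    vectorizer.fit(loglines)                  # trains the underlying gensim Word2Vec model
    vectors = vectorizer.transform(loglines)  # pandas Series of log vectors (longer lines truncated to max_token_len)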

Module contents

class logai.algorithms.vectorization_algo.FastText(params: FastTextParams)

Bases: VectorizationAlgo

This is a wrapper for the FastText algorithm from the gensim library. For details see https://radimrehurek.com/gensim/models/fasttext.html.

fit(loglines: Series)

Fits a FastText model.

Parameters:

loglines – The parsed loglines.

summary()

Generates model summary.

transform(loglines: Series) → Series

Transforms input loglines to log vectors.

Parameters:

loglines – The input loglines.

Returns:

The transformed log vectors.

class logai.algorithms.vectorization_algo.ForecastNN(config: ForecastNNVectorizerParams)

Bases: VectorizationAlgo

Vectorizer class for forecast-based neural models for log representation learning.

Parameters:

config – A config object specifying parameters of the forecast-based neural log representation learning model.

fit(logrecord: LogRecordObject)

Fit method to train vectorizer.

Parameters:

logrecord – A log record object to train the vectorizer on.

transform(logrecord: LogRecordObject)

Transform method to run vectorizer on logrecord object.

Parameters:

logrecord – A log record object to be vectorized.

Returns:

ForecastNNVectorizedDataset object containing the vectorized dataset.

class logai.algorithms.vectorization_algo.LogBERT(config: LogBERTVectorizerParams)

Bases: VectorizationAlgo

Vectorizer class for LogBERT.

Parameters:

config – A config object specifying parameters of the LogBERT vectorizer.

fit(logrecord: LogRecordObject)

Fit method for training vectorizer for logbert.

Parameters:

logrecord – A log record object containing the training dataset over which vectorizer is trained.

transform(logrecord: LogRecordObject)

Transform method for running vectorizer over logrecord object.

Parameters:

logrecord – A log record object containing the dataset to be vectorized.

Returns:

HuggingFace dataset object.

class logai.algorithms.vectorization_algo.Semantic(params: SemanticVectorizerParams)

Bases: VectorizationAlgo

Semantic vectorizer to convert loglines into token ids based on an embedding model and vocabulary (like word2vec, GloVe, and fastText). It supports either pretrained models with a pretrained vocabulary, or training word embedding models like Word2Vec or FastText on the given training data.

Parameters:

params – A config object for semantic vectorizer.

fit(loglines: Series)

Fit method to train semantic vectorizer.

Parameters:

loglines – A pandas Series object containing the dataset on which the semantic vectorizer is trained (and the vocab is built). Each data instance should be a logline or a sequence of loglines concatenated by the separator token.

summary()

Generate model summary.

transform(loglines: Series) → Series

Transform method to run semantic vectorizer on loglines.

Parameters:

loglines – The pandas Series containing the data to be vectorized. Each data instance should be a logline or a sequence of loglines concatenated by the separator token.

Returns:

The vectorized log data.

class logai.algorithms.vectorization_algo.Sequential(params: SequentialVectorizerParams)

Bases: VectorizationAlgo

Sequential vectorizer to convert a sequence of loglines to a sequence of log ids.

Parameters:

params – A config object for storing parameters of Sequential Vectorizer.

fit(loglines: Series)

Fit method for training the sequential vectorizer.

Parameters:

loglines – A pandas Series object containing the dataset on which the sequential vectorizer is trained (and the vocab is built). Each data instance should be a logline or a sequence of loglines concatenated by the separator token.

transform(loglines: Series) → Series

Transform method for applying sequential vectorizer to loglines.

Parameters:

loglines – A pandas Series containing the data to be vectorized. Each data instance should be a logline or a sequence of loglines concatenated by the separator token.

Returns:

The vectorized loglines.

class logai.algorithms.vectorization_algo.TfIdf(params: TfIdfParams, **kwargs)

Bases: VectorizationAlgo

TF-IDF based vectorizer for log data. This is a wrapper class for the TF-IDF vectorizer from scikit-learn. For details see https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html.

fit(loglines: Series)

Trains a TF-IDF model.

Parameters:

loglines – The input training dataset.

summary()

Generates model summary.

transform(loglines: Series) → Series

Transforms loglines into log vectors.

Parameters:

loglines – The input test dataset.

Returns:

The transformed log vectors.

class logai.algorithms.vectorization_algo.Word2Vec(params: Word2VecParams)

Bases: VectorizationAlgo

Word2Vec algorithm for converting raw log data into word2vec vectors. This is a wrapper class for the Word2Vec model from the gensim library. For details see https://radimrehurek.com/gensim/models/word2vec.html.

Parameters:

max_token_len – The max token length to vectorize; longer sentences will be truncated.

fit(loglines: Series)

Fits a Word2Vec model.

Parameters:

loglines – Parsed loglines.

summary()

Generates model summary.

transform(loglines: Series) → Series

Transforms input loglines to log vectors.

Parameters:

loglines – The input loglines.

Returns:

The transformed log vectors.