logai.algorithms.nn_model.logbert package


logai.algorithms.nn_model.logbert.configs module

class logai.algorithms.nn_model.logbert.configs.LogBERTConfig(pretrain_from_scratch: bool = True, model_name: str = 'bert-base-cased', model_dirname: str | None = None, mlm_probability: float = 0.15, mask_ngram: int = 1, max_token_len: int = 384, evaluation_strategy: str = 'steps', num_train_epochs: int = 20, learning_rate: float = 1e-05, logging_steps: int = 10, per_device_train_batch_size: int = 50, per_device_eval_batch_size: int = 256, eval_accumulation_steps: int = 1000, num_eval_shards: int = 10, weight_decay: float = 0.0001, save_steps: int = 50, eval_steps: int = 50, resume_from_checkpoint: bool = True, output_dir: str | None = None, tokenizer_dirpath: str | None = None)

Bases: Config

Config for logBERT model.

  • pretrain_from_scratch – bool = True : whether to pretrain from scratch or initialize with the HuggingFace pretrained LM.

  • model_name – str = “bert-base-cased” : name of the model using HuggingFace standardized naming.

  • model_dirname – str = None : name of the directory in which the model is saved. It is created inside output_dir if it does not already exist.

  • mlm_probability – float = 0.15 : probability with which each token is masked during MLM training.

  • mask_ngram – int = 1 : length of ngrams that are masked during inference.

  • max_token_len – int = 384 : maximum token length of the input.

  • learning_rate – float = 1e-5 : learning rate.

  • weight_decay – float = 0.0001 : weight decay coefficient used during optimization.

  • per_device_train_batch_size – int = 50 : training batch size per gpu device.

  • per_device_eval_batch_size – int = 256 : evaluation batch size per gpu device.

  • eval_accumulation_steps – int = 1000 : parameter to accumulate the evaluation results over the steps.

  • num_eval_shards – int = 10 : parameter to shard the evaluation data (to avoid any OOM issue).

  • evaluation_strategy – str = “steps” : either “steps” or “epoch”; selects whether evaluation is scheduled by step count (every eval_steps steps) or by epoch.

  • num_train_epochs – int = 20 : number of training epochs.

  • logging_steps – int = 10 : number of steps after which the output is logged.

  • save_steps – int = 50 : number of steps after which the model is saved.

  • eval_steps – int = 50 : number of steps after which evaluation is run.

  • resume_from_checkpoint – bool = True : whether to resume from a saved model checkpoint. If True, the latest checkpoint saved in the model directory is located and used to load the model.

  • output_dir – str = None : output directory where the model would be saved.

  • tokenizer_dirpath – str = None : path to directory containing the tokenizer.

eval_accumulation_steps: int
eval_steps: int
evaluation_strategy: str
learning_rate: float
logging_steps: int
mask_ngram: int
max_token_len: int
mlm_probability: float
model_dirname: str
model_name: str
num_eval_shards: int
num_train_epochs: int
output_dir: str
per_device_eval_batch_size: int
per_device_train_batch_size: int
pretrain_from_scratch: bool
resume_from_checkpoint: bool
save_steps: int
tokenizer_dirpath: str
weight_decay: float
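
The relationship between output_dir and model_dirname described above can be sketched as follows. This is a minimal illustration, not the library's implementation: LogBERTConfigSketch and its model_dir() helper are hypothetical stand-ins that mirror a subset of the documented fields and defaults, and the directory names are made up.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogBERTConfigSketch:
    """Hypothetical stand-in mirroring a subset of the documented fields."""

    pretrain_from_scratch: bool = True
    model_name: str = "bert-base-cased"
    model_dirname: Optional[str] = None
    output_dir: Optional[str] = None
    tokenizer_dirpath: Optional[str] = None
    mlm_probability: float = 0.15
    max_token_len: int = 384
    evaluation_strategy: str = "steps"
    eval_steps: int = 50
    save_steps: int = 50

    def model_dir(self) -> str:
        # Assumption: the model directory is output_dir/model_dirname,
        # following the description of model_dirname above.
        if self.output_dir is None or self.model_dirname is None:
            raise ValueError("output_dir and model_dirname must both be set")
        return os.path.join(self.output_dir, self.model_dirname)


# Illustrative values only; these paths are not defined by the library.
config = LogBERTConfigSketch(
    output_dir="temp_output",
    model_dirname="logbert_model",
    tokenizer_dirpath=os.path.join("temp_output", "tokenizer"),
)
print(config.model_dir())
```

Leaving both directory fields at None is rejected in this sketch, since the documented defaults give no location to save to.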

logai.algorithms.nn_model.logbert.eval_metric_utils module

logai.algorithms.nn_model.logbert.eval_metric_utils.compute_metrics(eval_metrics_per_instance_series, test_labels, test_counts=None)

Computes evaluation metric scores for anomaly detection.

  • eval_metrics_per_instance_series – (dict): dict object consisting of eval metrics for each instance index.

  • test_labels – (dict): gold labels for each instance index.

  • test_counts – (dict): counts of each instance index.


Exception: IndexError if the indices of eval_metrics_per_instance_series do not match with indices of test_labels.


list of tuples containing labels and scores computed for each index:

  • y: list of anomaly labels, one per instance.

  • loss_mean: list of mean loss (over all masked non-padded tokens) for each instance.

  • loss_max: list of max loss (over all masked non-padded tokens) for each instance.

  • loss_top6_mean: list of mean loss (averaged over the top-k masked non-padded tokens) for each instance, k = 6 (following the LanoBERT paper, https://arxiv.org/pdf/2111.09564.pdf).

  • scores_top6_max_prob: for each instance, the max probability score obtained, averaged over the top-k masked (non-padded) token predictions, k = 6.

  • scores_top6_min_logprob: for each instance, the min logprob score obtained, averaged over the top-k masked (non-padded) token predictions, k = 6.

  • scores_top6_max_entropy: for each instance, the max entropy score obtained, averaged over the top-k masked (non-padded) token predictions, k = 6.
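
The loss-based scores listed above can be illustrated with a small sketch. It assumes the per-token losses for one instance have already been restricted to masked, non-padded tokens; the function name summarize_losses and the sample values are hypothetical, and the code does not reproduce how compute_metrics itself is implemented.

```python
from statistics import mean


def summarize_losses(token_losses, k=6):
    """Aggregate per-token losses of one instance into three scores.

    token_losses: losses of the masked, non-padded tokens (assumed input).
    k: number of highest-loss tokens averaged for the top-k score.
    """
    if not token_losses:
        raise ValueError("token_losses must not be empty")
    # Highest losses first, so the slice keeps the k largest values.
    top_k = sorted(token_losses, reverse=True)[:k]
    return {
        "loss_mean": mean(token_losses),
        "loss_max": max(token_losses),
        "loss_topk_mean": mean(top_k),
    }


# Made-up losses for a single log line with eight masked tokens.
example = summarize_losses([0.5, 2.0, 0.25, 4.0, 1.0, 0.75, 0.5, 3.0])
print(example)
```

Averaging only the k largest losses keeps a few highly surprising tokens from being diluted by the many tokens the model predicts easily, which is the motivation for a top-k score in this setting.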

logai.algorithms.nn_model.logbert.predict module

class logai.algorithms.nn_model.logbert.predict.LogBERTPredict(config: LogBERTConfig)

Bases: object

Class for running inference on logBERT model for unsupervised log anomaly detection.


config – config object describing the parameters of logbert model.


Loads the logBERT model from the model directory path specified in the LogBERTConfig object.

predict(test_dataset: Dataset)

Method for running inference on logbert to predict anomalous loglines in test dataset.


test_dataset – test dataset of type huggingface Dataset object.


dict containing instance-wise loss and scores.

logai.algorithms.nn_model.logbert.predict_utils module

class logai.algorithms.nn_model.logbert.predict_utils.PredictionLabelSmoother(epsilon: float = 0.1, ignore_index: int = -100)

Bases: LabelSmoother

Adds label-smoothing on a pre-computed output from a Transformers model.

  • epsilon – (float, optional, defaults to 0.1): The label smoothing factor.

  • ignore_index – (int, optional, defaults to -100): The index in the labels to ignore when computing the loss.

epsilon: float = 0.0
eval_metrics_per_instance = [[], [], [], [], [], [], [], []]
ignore_index: int = -100
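
The roles of epsilon and ignore_index can be sketched in plain Python. This follows the standard label-smoothing formula, (1 - epsilon) * NLL + epsilon * (mean negative log-probability over the vocabulary), averaged over the positions whose label is not ignore_index. It is an illustrative assumption about the computation, not the class's code, and it omits the per-instance metric bookkeeping that PredictionLabelSmoother maintains; the function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def label_smoothed_loss(logits_per_token, labels, epsilon=0.1, ignore_index=-100):
    """Label-smoothed cross entropy averaged over non-ignored positions."""
    nll_total = 0.0
    smooth_total = 0.0
    active = 0
    for logits, label in zip(logits_per_token, labels):
        if label == ignore_index:
            continue  # position does not contribute to the loss
        log_probs = log_softmax(logits)
        nll_total += -log_probs[label]
        # Uniform-target term: mean negative log-probability over the vocabulary.
        smooth_total += -sum(log_probs) / len(log_probs)
        active += 1
    if active == 0:
        return 0.0
    nll = nll_total / active
    smooth = smooth_total / active
    return (1 - epsilon) * nll + epsilon * smooth


# Two positions over a 4-token vocabulary; the second position is ignored.
loss = label_smoothed_loss([[0.0, 0.0, 0.0, 0.0], [5.0, 1.0, 1.0, 1.0]], [2, -100])
print(loss)
```

With epsilon set to 0 the expression reduces to the ordinary negative log-likelihood of the target tokens.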
class logai.algorithms.nn_model.logbert.predict_utils.Predictor(model: PreTrainedModel | Module | None = None, args: TrainingArguments | None = None, data_collator: DataCollator | None = None, train_dataset: Dataset | None = None, eval_dataset: Dataset | None = None, tokenizer: PreTrainedTokenizerBase | None = None, model_init: Callable[[], PreTrainedModel] | None = None, compute_metrics: Callable[[EvalPrediction], Dict] | None = None, callbacks: List[TrainerCallback] | None = None, optimizers: Tuple[Optimizer, LambdaLR] = (None, None), preprocess_logits_for_metrics: Callable[[Tensor, Tensor], Tensor] | None = None)

Bases: Trainer

Custom Trainer object for running the inference of logBERT model for unsupervised anomaly detection.

compute_loss(model, inputs, return_outputs=False)

How the loss is computed by Trainer. By default, all models return the loss in the first element. Subclass and override for custom behavior.

get_test_dataloader(test_dataset: Dataset) → DataLoader

Returns the test torch.utils.data.DataLoader. Subclass and override this method if you want to inject some custom behavior.


test_dataset – (torch.utils.data.Dataset, optional): The test dataset to use. If it is a datasets.Dataset, columns not accepted by the model.forward() method are automatically removed. It must implement __len__.

logai.algorithms.nn_model.logbert.tokenizer_utils module


Get id of mask token, given a tokenizer object.


tokenizer – (AutoTokenizer): tokenizer object.


id of mask token.


Get ids of special tokens, given a tokenizer object.


tokenizer – (AutoTokenizer): tokenizer object.


list of token ids of special tokens.


Get special tokens.


list of special tokens.


Get huggingface tokenizer object from a given directory path.


tokenizer_dirpath – (str): absolute path to directory containing pretrained tokenizer.


AutoTokenizer: tokenizer object.


Get vocabulary from a given tokenizer directory path.


tokenizer_dirpath – (str): absolute path to directory containing pretrained tokenizer.


list of vocabulary words.

logai.algorithms.nn_model.logbert.train module

class logai.algorithms.nn_model.logbert.train.LogBERTTrain(config: LogBERTConfig)

Bases: object

Class for training the logBERT model to learn log representations.


Evaluate method for evaluating the logBERT model on dev data using the perplexity metric.

fit(train_dataset: Dataset, dev_dataset: Dataset)

Fit method for training logbert model.

  • train_dataset – training dataset of type huggingface Dataset object.

  • dev_dataset – development dataset of type huggingface Dataset object.


Get the latest dumped checkpoint from the model directory path mentioned in logBERTConfig.


path to model checkpoint (or name of model in case of a pretrained model from hugging face).
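
The checkpoint lookup described above can be sketched as follows. The sketch assumes checkpoints are saved as subdirectories named checkpoint-<step>, the naming used by the HuggingFace Trainer, and that the model name is returned when no checkpoint exists. The function name latest_checkpoint is hypothetical and does not correspond to the method's actual name or signature.

```python
import os
import re
import tempfile

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(model_dir, fallback_model_name):
    """Return the checkpoint directory with the highest step number.

    Falls back to fallback_model_name when model_dir is missing or
    contains no checkpoint-<step> subdirectory.
    """
    if not os.path.isdir(model_dir):
        return fallback_model_name
    best_step, best_path = -1, None
    for name in os.listdir(model_dir):
        match = _CHECKPOINT_RE.match(name)
        path = os.path.join(model_dir, name)
        if match and os.path.isdir(path):
            # Compare numerically so checkpoint-1000 beats checkpoint-50.
            step = int(match.group(1))
            if step > best_step:
                best_step, best_path = step, path
    return best_path if best_path is not None else fallback_model_name


# Demonstration on a temporary directory with made-up checkpoints.
with tempfile.TemporaryDirectory() as demo_dir:
    for step in (50, 100, 1000):
        os.mkdir(os.path.join(demo_dir, f"checkpoint-{step}"))
    print(latest_checkpoint(demo_dir, "bert-base-cased"))
```

Parsing the step as an integer matters here: a plain string sort would rank checkpoint-50 above checkpoint-1000.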
