logai.algorithms.nn_model.logbert package

Submodules

logai.algorithms.nn_model.logbert.configs module

class logai.algorithms.nn_model.logbert.configs.LogBERTConfig(pretrain_from_scratch: bool = True, model_name: str = 'bert-base-cased', model_dirname: str | None = None, mlm_probability: float = 0.15, mask_ngram: int = 1, max_token_len: int = 384, evaluation_strategy: str = 'steps', num_train_epochs: int = 20, learning_rate: float = 1e-05, logging_steps: int = 10, per_device_train_batch_size: int = 50, per_device_eval_batch_size: int = 256, eval_accumulation_steps: int = 1000, num_eval_shards: int = 10, weight_decay: float = 0.0001, save_steps: int = 50, eval_steps: int = 50, resume_from_checkpoint: bool = True, output_dir: str | None = None, tokenizer_dirpath: str | None = None)

Bases: Config

Config for logBERT model.

Parameters:
  • pretrain_from_scratch – bool = True : whether to do pretraining from scratch or initialize with the HuggingFace pretrained LM.

  • model_name – str = “bert-base-cased” : name of the model using HuggingFace standardized naming.

  • model_dirname – str = None : name of the directory where the model is saved. A directory of this name is created inside output_dir if it does not exist.

  • mlm_probability – float = 0.15 : probability of a token being masked during MLM training.

  • mask_ngram – int = 1 : length of ngrams that are masked during inference.

  • max_token_len – int = 384 : maximum token length of the input.

  • learning_rate – float = 1e-5 : learning rate.

  • weight_decay – float = 0.0001 : weight decay applied by the optimizer during training.

  • per_device_train_batch_size – int = 50 : training batch size per gpu device.

  • per_device_eval_batch_size – int = 256 : evaluation batch size per gpu device.

  • eval_accumulation_steps – int = 1000 : number of evaluation steps for which the results are accumulated before being moved to the CPU.

  • num_eval_shards – int = 10 : parameter to shard the evaluation data (to avoid any OOM issue).

  • evaluation_strategy – str = “steps” : either steps or epoch, based on whether the unit of the eval_steps parameter is “steps” or “epoch”.

  • num_train_epochs – int = 20 : number of training epochs.

  • logging_steps – int = 10 : number of steps after which the output is logged.

  • save_steps – int = 50 : number of steps after which the model is saved.

  • eval_steps – int = 50 : number of steps after which evaluation is run.

  • resume_from_checkpoint – bool = True : whether to resume from a given model checkpoint. If set to true, the latest checkpoint saved in the model directory is found and used to load the model.

  • output_dir – str = None : output directory where the model would be saved.

  • tokenizer_dirpath – str = None : path to directory containing the tokenizer.

eval_accumulation_steps: int
eval_steps: int
evaluation_strategy: str
learning_rate: float
logging_steps: int
mask_ngram: int
max_token_len: int
mlm_probability: float
model_dirname: str
model_name: str
num_eval_shards: int
num_train_epochs: int
output_dir: str
per_device_eval_batch_size: int
per_device_train_batch_size: int
pretrain_from_scratch: bool
resume_from_checkpoint: bool
save_steps: int
tokenizer_dirpath: str
weight_decay: float
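
A minimal sketch of constructing a LogBERTConfig; the directory paths below are placeholder values for illustration, not part of the library:

    from logai.algorithms.nn_model.logbert.configs import LogBERTConfig

    config = LogBERTConfig(
        pretrain_from_scratch=True,       # train the masked LM from scratch
        model_name="bert-base-cased",     # HuggingFace model name
        mlm_probability=0.15,             # fraction of tokens masked during MLM training
        max_token_len=384,                # maximum token length of the input
        num_train_epochs=20,
        learning_rate=1e-5,
        per_device_train_batch_size=50,
        per_device_eval_batch_size=256,
        output_dir="experiments/logbert",                   # placeholder output directory
        tokenizer_dirpath="experiments/logbert/tokenizer",  # placeholder tokenizer directory
    )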

logai.algorithms.nn_model.logbert.eval_metric_utils module

logai.algorithms.nn_model.logbert.eval_metric_utils.compute_metrics(eval_metrics_per_instance_series, test_labels, test_counts=None)

Computes evaluation metric scores for anomaly detection.

Parameters:
  • eval_metrics_per_instance_series – (dict): dict object consisting of eval metrics for each instance index.

  • test_labels – (dict): gold labels for each instance index.

  • test_counts – (dict): counts of each instance index.

Raises:

Exception – IndexError if the indices of eval_metrics_per_instance_series do not match the indices of test_labels.

Returns:

List of tuples containing labels and scores computed for each index:

  • y: anomaly label for each instance.

  • loss_mean: mean loss (over all masked non-padded tokens) for each instance.

  • loss_max: max loss (over all masked non-padded tokens) for each instance.

  • loss_top6_mean: mean loss (averaged over the top-k masked non-padded tokens) for each instance, k = 6, following the LanoBERT paper (https://arxiv.org/pdf/2111.09564.pdf).

  • scores_top6_max_prob: for each instance, the max probability score, averaged over the top-k masked (non-padded) token predictions, k = 6.

  • scores_top6_min_logprob: for each instance, the min log-probability score, averaged over the top-k masked (non-padded) token predictions, k = 6.

  • scores_top6_max_entropy: for each instance, the max entropy score, averaged over the top-k masked (non-padded) token predictions, k = 6.

logai.algorithms.nn_model.logbert.predict module

class logai.algorithms.nn_model.logbert.predict.LogBERTPredict(config: LogBERTConfig)

Bases: object

Class for running inference with the logBERT model for unsupervised log anomaly detection.

Parameters:

config – config object describing the parameters of the logBERT model.

load_model()

Loads the logBERT model from the model directory path specified in the LogBERTConfig config.

predict(test_dataset: Dataset)

Method for running inference on logBERT to predict anomalous loglines in the test dataset.

Parameters:

test_dataset – test dataset of type huggingface Dataset object.

Returns:

dict containing instance-wise loss and scores.
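
A minimal usage sketch, assuming config is a LogBERTConfig whose output directory contains a trained model and test_dataset is a tokenized HuggingFace Dataset (both names are assumptions for illustration):

    from logai.algorithms.nn_model.logbert.predict import LogBERTPredict

    predictor = LogBERTPredict(config=config)
    predictor.load_model()                     # load the trained checkpoint from the model dir
    results = predictor.predict(test_dataset)  # dict of instance-wise losses and scores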

logai.algorithms.nn_model.logbert.predict_utils module

class logai.algorithms.nn_model.logbert.predict_utils.PredictionLabelSmoother(epsilon: float = 0.1, ignore_index: int = -100)

Bases: LabelSmoother

Adds label-smoothing on a pre-computed output from a Transformers model.

Parameters:
  • epsilon – (float, optional, defaults to 0.1): The label smoothing factor.

  • ignore_index – (int, optional, defaults to -100): The index in the labels to ignore when computing the loss.

epsilon: float = 0.0
eval_metrics_per_instance = [[], [], [], [], [], [], [], []]
ignore_index: int = -100

class logai.algorithms.nn_model.logbert.predict_utils.Predictor(model: PreTrainedModel | Module | None = None, args: TrainingArguments | None = None, data_collator: DataCollator | None = None, train_dataset: Dataset | None = None, eval_dataset: Dataset | None = None, tokenizer: PreTrainedTokenizerBase | None = None, model_init: Callable[[], PreTrainedModel] | None = None, compute_metrics: Callable[[EvalPrediction], Dict] | None = None, callbacks: List[TrainerCallback] | None = None, optimizers: Tuple[Optimizer, LambdaLR] = (None, None), preprocess_logits_for_metrics: Callable[[Tensor, Tensor], Tensor] | None = None)

Bases: Trainer

Custom Trainer object for running the inference of logBERT model for unsupervised anomaly detection.

compute_loss(model, inputs, return_outputs=False)

How the loss is computed by Trainer. By default, all models return the loss in the first element. Subclass and override for custom behavior.

get_test_dataloader(test_dataset: Dataset) → DataLoader

Returns the test torch.utils.data.DataLoader. Subclass and override this method if you want to inject some custom behavior.

Parameters:

test_dataset – (torch.utils.data.Dataset, optional): The test dataset to use. If it is a datasets.Dataset, columns not accepted by the model.forward() method are automatically removed. It must implement __len__.

logai.algorithms.nn_model.logbert.tokenizer_utils module

logai.algorithms.nn_model.logbert.tokenizer_utils.get_mask_id(tokenizer)

Get id of mask token, given a tokenizer object.

Parameters:

tokenizer – (AutoTokenizer): tokenizer object.

Returns:

id of mask token.

logai.algorithms.nn_model.logbert.tokenizer_utils.get_special_token_ids(tokenizer)

Get ids of special tokens, given a tokenizer object.

Parameters:

tokenizer – (AutoTokenizer): tokenizer object.

Returns:

list of token ids of special tokens.

logai.algorithms.nn_model.logbert.tokenizer_utils.get_special_tokens()

Gets the list of special tokens.

Returns:

list of special tokens.

logai.algorithms.nn_model.logbert.tokenizer_utils.get_tokenizer(tokenizer_dirpath)

Get huggingface tokenizer object from a given directory path.

Parameters:

tokenizer_dirpath – (str): absolute path to directory containing pretrained tokenizer.

Returns:

AutoTokenizer: tokenizer object.

logai.algorithms.nn_model.logbert.tokenizer_utils.get_tokenizer_vocab(tokenizer_dirpath)

Get vocabulary from a given tokenizer directory path.

Parameters:

tokenizer_dirpath – (str): absolute path to directory containing pretrained tokenizer.

Returns:

list of vocabulary words.
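
A minimal sketch tying the tokenizer helpers together; the tokenizer directory path is a placeholder for wherever a pretrained tokenizer was saved:

    from logai.algorithms.nn_model.logbert import tokenizer_utils

    tokenizer_dirpath = "experiments/logbert/tokenizer"  # placeholder path
    tokenizer = tokenizer_utils.get_tokenizer(tokenizer_dirpath)
    mask_id = tokenizer_utils.get_mask_id(tokenizer)                 # id of the mask token
    special_ids = tokenizer_utils.get_special_token_ids(tokenizer)   # ids of all special tokens
    vocab = tokenizer_utils.get_tokenizer_vocab(tokenizer_dirpath)   # list of vocabulary words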

logai.algorithms.nn_model.logbert.train module

class logai.algorithms.nn_model.logbert.train.LogBERTTrain(config: LogBERTConfig)

Bases: object

Class for training the logBERT model to learn log representations.

evaluate()

Evaluate method for evaluating the logBERT model on dev data using the perplexity metric.

fit(train_dataset: Dataset, dev_dataset: Dataset)

Fit method for training the logBERT model.

Parameters:
  • train_dataset – training dataset of type huggingface Dataset object.

  • dev_dataset – development dataset of type huggingface Dataset object.

get_model_checkpoint()

Get the latest saved checkpoint from the model directory path specified in LogBERTConfig.

Returns:

path to the model checkpoint (or the model name in case of a pretrained model from HuggingFace).
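
A minimal training sketch, assuming config is a LogBERTConfig and train_dataset / dev_dataset are tokenized HuggingFace Dataset objects (these names are assumptions for illustration):

    from logai.algorithms.nn_model.logbert.train import LogBERTTrain

    trainer = LogBERTTrain(config=config)
    trainer.fit(train_dataset, dev_dataset)      # pretrain / fine-tune the masked LM
    trainer.evaluate()                           # evaluate on dev data using perplexity
    checkpoint = trainer.get_model_checkpoint()  # latest saved checkpoint (or HF model name)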

Module contents