logai.algorithms.nn_model.logbert package
Submodules
logai.algorithms.nn_model.logbert.configs module
- class logai.algorithms.nn_model.logbert.configs.LogBERTConfig(pretrain_from_scratch: bool = True, model_name: str = 'bert-base-cased', model_dirname: str | None = None, mlm_probability: float = 0.15, mask_ngram: int = 1, max_token_len: int = 384, evaluation_strategy: str = 'steps', num_train_epochs: int = 20, learning_rate: float = 1e-05, logging_steps: int = 10, per_device_train_batch_size: int = 50, per_device_eval_batch_size: int = 256, eval_accumulation_steps: int = 1000, num_eval_shards: int = 10, weight_decay: float = 0.0001, save_steps: int = 50, eval_steps: int = 50, resume_from_checkpoint: bool = True, output_dir: str | None = None, tokenizer_dirpath: str | None = None)
Bases: Config
Config for logBERT model.
- Parameters:
pretrain_from_scratch – bool = True : whether to do pretraining from scratch or initialize with the HuggingFace pretrained LM.
model_name – str = “bert-base-cased” : name of the model using HuggingFace standardized naming.
model_dirname – str = None : name of the directory where the model will be saved. A directory of this name will be created inside output_dir if it does not exist.
mlm_probability – float = 0.15 : probability of the tokens to be masked during MLM training.
mask_ngram – int = 1 : length of ngrams that are masked during inference.
max_token_len – int = 384 : maximum token length of the input.
learning_rate – float = 1e-5 : learning rate.
weight_decay – float = 0.0001 : weight decay rate to apply to the model weights during training.
per_device_train_batch_size – int = 50 : training batch size per GPU device.
per_device_eval_batch_size – int = 256 : evaluation batch size per GPU device.
eval_accumulation_steps – int = 1000 : number of evaluation steps over which prediction results are accumulated before being moved off the GPU.
num_eval_shards – int = 10 : number of shards to split the evaluation data into (to avoid OOM issues).
evaluation_strategy – str = “steps” : either “steps” or “epoch”, determining whether evaluation is run every eval_steps steps or at the end of each epoch.
num_train_epochs – int = 20 : number of training epochs.
logging_steps – int = 10 : number of steps after which the output is logged.
save_steps – int = 50 : number of steps after which the model is saved.
eval_steps – int = 50 : number of steps after which evaluation is run.
resume_from_checkpoint – bool = True : whether to resume from a given model checkpoint. If set to True, the latest checkpoint saved in the model directory is used to load the model.
output_dir – str = None : output directory where the model would be saved.
tokenizer_dirpath – str = None : path to directory containing the tokenizer.
- eval_accumulation_steps: int
- eval_steps: int
- evaluation_strategy: str
- learning_rate: float
- logging_steps: int
- mask_ngram: int
- max_token_len: int
- mlm_probability: float
- model_dirname: str
- model_name: str
- num_eval_shards: int
- num_train_epochs: int
- output_dir: str
- per_device_eval_batch_size: int
- per_device_train_batch_size: int
- pretrain_from_scratch: bool
- resume_from_checkpoint: bool
- save_steps: int
- tokenizer_dirpath: str
- weight_decay: float
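A minimal usage sketch constructing this config; the paths below are placeholders, and all other fields fall back to the defaults shown in the signature above.

```python
from logai.algorithms.nn_model.logbert.configs import LogBERTConfig

# Placeholder paths; substitute your own output and tokenizer directories.
config = LogBERTConfig(
    pretrain_from_scratch=False,       # initialize from the HuggingFace pretrained LM
    model_name="bert-base-cased",      # HuggingFace model identifier
    max_token_len=384,
    num_train_epochs=20,
    per_device_train_batch_size=50,
    output_dir="/tmp/logbert_output",           # model_dirname is created inside this directory
    tokenizer_dirpath="/tmp/logbert_tokenizer",
)
```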
logai.algorithms.nn_model.logbert.eval_metric_utils module
- logai.algorithms.nn_model.logbert.eval_metric_utils.compute_metrics(eval_metrics_per_instance_series, test_labels, test_counts=None)
Computing evaluation metric scores for anomaly detection.
- Parameters:
eval_metrics_per_instance_series – (dict): eval metrics for each instance index.
test_labels – (dict): gold labels for each instance index.
test_counts – (dict): counts of each instance index.
- Raises:
IndexError – if the indices of eval_metrics_per_instance_series do not match the indices of test_labels.
- Returns:
list of tuples containing labels and scores computed for each index:
- y: list of anomaly labels for each instance.
- loss_mean: list of mean loss (over all masked non-padded tokens) for each instance.
- loss_max: list of max loss (over all masked non-padded tokens) for each instance.
- loss_top6_mean: list of mean loss (averaged over the top-k masked non-padded tokens) for each instance, k = 6 (following the LanoBERT paper, https://arxiv.org/pdf/2111.09564.pdf).
- scores_top6_max_prob: for each instance, the max probability score, averaged over the top-k masked (non-padded) token predictions, k = 6.
- scores_top6_min_logprob: for each instance, the min log-probability score, averaged over the top-k masked (non-padded) token predictions, k = 6.
- scores_top6_max_entropy: for each instance, the max entropy score, averaged over the top-k masked (non-padded) token predictions, k = 6.
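The top-k loss scores above aggregate per-token losses per instance; an illustrative sketch of that aggregation (independent of the library's internal data structures, with made-up loss values) might look like:

```python
import numpy as np

# Illustrative only: per-token losses for one instance's masked, non-padded tokens.
token_losses = np.array([0.2, 1.5, 0.7, 3.1, 0.1, 2.4, 0.9, 1.1])

k = 6
loss_mean = token_losses.mean()                     # mean over all masked tokens
loss_max = token_losses.max()                       # max over all masked tokens
loss_topk_mean = np.sort(token_losses)[-k:].mean()  # mean over the top-k largest losses
```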
logai.algorithms.nn_model.logbert.predict module
- class logai.algorithms.nn_model.logbert.predict.LogBERTPredict(config: LogBERTConfig)
Bases: object
Class for running inference on logBERT model for unsupervised log anomaly detection.
- Parameters:
config – config object describing the parameters of the logBERT model.
- load_model()
Load the logBERT model from the model directory path specified in LogBERTConfig.
- predict(test_dataset: Dataset)
Method for running inference on logBERT to predict anomalous log lines in the test dataset.
- Parameters:
test_dataset – test dataset as a HuggingFace Dataset object.
- Returns:
dict containing instance-wise loss and scores.
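A hedged usage sketch, assuming config is a LogBERTConfig pointing at a trained model directory and test_dataset is a tokenized HuggingFace Dataset prepared elsewhere:

```python
from logai.algorithms.nn_model.logbert.predict import LogBERTPredict

# config and test_dataset are assumed to be prepared elsewhere (see LogBERTConfig above).
predictor = LogBERTPredict(config=config)
predictor.load_model()                          # load the model from the configured directory
eval_metrics = predictor.predict(test_dataset)  # dict of instance-wise losses and scores
```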
logai.algorithms.nn_model.logbert.predict_utils module
- class logai.algorithms.nn_model.logbert.predict_utils.PredictionLabelSmoother(epsilon: float = 0.1, ignore_index: int = -100)
Bases: LabelSmoother
Adds label-smoothing on a pre-computed output from a Transformers model.
- Parameters:
epsilon – (float, optional, defaults to 0.1): The label smoothing factor.
ignore_index – (int, optional, defaults to -100): The index in the labels to ignore when computing the loss.
- epsilon: float = 0.0
- eval_metrics_per_instance = [[], [], [], [], [], [], [], []]
- ignore_index: int = -100
- class logai.algorithms.nn_model.logbert.predict_utils.Predictor(model: PreTrainedModel | Module | None = None, args: TrainingArguments | None = None, data_collator: DataCollator | None = None, train_dataset: Dataset | None = None, eval_dataset: Dataset | None = None, tokenizer: PreTrainedTokenizerBase | None = None, model_init: Callable[[], PreTrainedModel] | None = None, compute_metrics: Callable[[EvalPrediction], Dict] | None = None, callbacks: List[TrainerCallback] | None = None, optimizers: Tuple[Optimizer, LambdaLR] = (None, None), preprocess_logits_for_metrics: Callable[[Tensor, Tensor], Tensor] | None = None)
Bases: Trainer
Custom Trainer object for running the inference of logBERT model for unsupervised anomaly detection.
- compute_loss(model, inputs, return_outputs=False)
How the loss is computed by Trainer. By default, all models return the loss in the first element. Subclass and override for custom behavior.
- get_test_dataloader(test_dataset: Dataset) → DataLoader
Returns the test torch.utils.data.DataLoader. Subclass and override this method if you want to inject some custom behavior.
- Parameters:
test_dataset – (torch.utils.data.Dataset, optional): The test dataset to use. If it is a datasets.Dataset, columns not accepted by the model.forward() method are automatically removed. It must implement __len__.
logai.algorithms.nn_model.logbert.tokenizer_utils module
- logai.algorithms.nn_model.logbert.tokenizer_utils.get_mask_id(tokenizer)
Get id of mask token, given a tokenizer object.
- Parameters:
tokenizer – (AutoTokenizer): tokenizer object.
- Returns:
id of mask token.
- logai.algorithms.nn_model.logbert.tokenizer_utils.get_special_token_ids(tokenizer)
Get ids of special tokens, given a tokenizer object.
- Parameters:
tokenizer – (AutoTokenizer): tokenizer object.
- Returns:
list of token ids of special tokens.
- logai.algorithms.nn_model.logbert.tokenizer_utils.get_special_tokens()
Get the list of special tokens.
- Returns:
list of special tokens.
- logai.algorithms.nn_model.logbert.tokenizer_utils.get_tokenizer(tokenizer_dirpath)
Get a HuggingFace tokenizer object from a given directory path.
- Parameters:
tokenizer_dirpath – (str): absolute path to directory containing pretrained tokenizer.
- Returns:
AutoTokenizer: tokenizer object.
- logai.algorithms.nn_model.logbert.tokenizer_utils.get_tokenizer_vocab(tokenizer_dirpath)
Get vocabulary from a given tokenizer directory path.
- Parameters:
tokenizer_dirpath – (str): absolute path to directory containing pretrained tokenizer.
- Returns:
list of vocabulary words.
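A minimal sketch combining these helpers; the tokenizer directory path is a placeholder:

```python
from logai.algorithms.nn_model.logbert import tokenizer_utils

tokenizer_dirpath = "/tmp/logbert_tokenizer"  # placeholder path

tokenizer = tokenizer_utils.get_tokenizer(tokenizer_dirpath)
mask_id = tokenizer_utils.get_mask_id(tokenizer)                # id of the mask token
special_ids = tokenizer_utils.get_special_token_ids(tokenizer)  # ids of all special tokens
vocab = tokenizer_utils.get_tokenizer_vocab(tokenizer_dirpath)  # vocabulary words
```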
logai.algorithms.nn_model.logbert.train module
- class logai.algorithms.nn_model.logbert.train.LogBERTTrain(config: LogBERTConfig)
Bases:
object
Class for training the logBERT model to learn log representations.
- evaluate()
Evaluate method for evaluating the logBERT model on dev data using the perplexity metric.
- fit(train_dataset: Dataset, dev_dataset: Dataset)
Fit method for training the logBERT model.
- Parameters:
train_dataset – training dataset as a HuggingFace Dataset object.
dev_dataset – development dataset as a HuggingFace Dataset object.
- get_model_checkpoint()
Get the latest saved checkpoint from the model directory path specified in LogBERTConfig.
- Returns:
path to the model checkpoint (or the model name, in the case of a pretrained model from HuggingFace).
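A hedged end-to-end training sketch, assuming config is a LogBERTConfig and train_dataset / dev_dataset are tokenized HuggingFace Dataset objects prepared elsewhere:

```python
from logai.algorithms.nn_model.logbert.train import LogBERTTrain

trainer = LogBERTTrain(config=config)        # config is a LogBERTConfig (see above)
trainer.fit(train_dataset, dev_dataset)      # train the masked LM on the training data
trainer.evaluate()                           # evaluate on dev data using perplexity
checkpoint = trainer.get_model_checkpoint()  # latest checkpoint path (or pretrained model name)
```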