Repository for the analysis of log files of failing workflows of the CMS Experiment. The goal is to predict the operator's actions for failing workflows stored in the WTC Console. The inputs to the machine learning models are the error logs of the failing jobs and information about the frequency of each error per site.
To run the analysis of the WMArchive entries with Apache Spark on SWAN and to filter the error log snippets, run the notebook filter_wm.ipynb. The recommended options for SWAN are:
Software Stack: Bleeding Edge
Memory: 10Gb
Spark cluster: General Purpose (Analytix)
The output is saved in pandas DataFrames.
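The sketch below illustrates, under assumptions, what this filtering step can look like. The input path, the storage format, and the record fields (job state, task, exit code, error snippet, site) are placeholders and will differ from the actual WMArchive schema used in filter_wm.ipynb.

```python
# Minimal sketch of the kind of filtering done in filter_wm.ipynb.
# The path and all field names below are placeholders, not the real WMArchive schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("filter_wm").getOrCreate()

wmarchive = spark.read.parquet("/path/to/wmarchive")  # placeholder path and format

errors = (
    wmarchive
    .filter(F.col("meta_data.jobstate") == "jobfailed")        # keep failing jobs only
    .select(
        F.col("task").alias("task_name"),
        F.col("steps.errors.exitCode").alias("exit_code"),
        F.col("steps.errors.details").alias("error_snippet"),
        F.col("steps.site").alias("site"),
    )
)

# Collect the (comparatively small) filtered result into a pandas frame,
# matching the "output is saved in pandas DataFrames" step above.
errors_pdf = errors.toPandas()
errors_pdf.to_pickle("wm_errors.pkl")
```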
The preprocessing can be run with input.py. It consists of the following steps:
- Tokenization: uses the NLTK Treebank word tokenizer, considers only unique messages, and saves them in chunks.
- Cleaning: filters out low-frequency words and special characters, cleans the tokens, and saves the result in chunks (a rough sketch of the tokenization and cleaning steps follows this list).
- Message selection: since there are multiple messages per key, only one message out of the sequence is chosen.
- word2vec: running word2vec on the error messages creates an embedding matrix that can be used in the Keras embedding layer, either as initial weights or as fixed weights. Additionally, a harder filtering / maximum number of words can be specified to further reduce the vocabulary (see the embedding sketch after this list).
- Final input: in the final step the counts and labels from the actionshist are merged with the error messages. Both the averaged word2vec vector per message and the indexed message for supervised training with RNNs are stored (both representations are sketched after this list). The output is a sparse pandas DataFrame.
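A rough sketch of the tokenization and cleaning steps (function names, chunk size, and frequency threshold are placeholders, not the actual input.py implementation):

```python
# Sketch of the tokenization and cleaning steps with placeholder names.
import re
from collections import Counter
from nltk.tokenize import word_tokenize  # requires the NLTK "punkt" data

def tokenize_unique(messages, chunk_size=10000):
    """Tokenize only the unique error messages and yield them in chunks."""
    unique = list(dict.fromkeys(messages))            # drop duplicate messages, keep order
    for start in range(0, len(unique), chunk_size):
        chunk = unique[start:start + chunk_size]
        yield [word_tokenize(msg.lower()) for msg in chunk]

def clean_tokens(token_lists, min_count=5):
    """Remove special characters and low-frequency words from one chunk of token lists."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    cleaned = []
    for tokens in token_lists:
        kept = [re.sub(r"[^a-z0-9_]", "", tok) for tok in tokens]   # strip special characters
        kept = [tok for tok in kept if tok and counts[tok] >= min_count]
        cleaned.append(kept)
    return cleaned
```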
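The word2vec step can be sketched as follows, assuming the gensim (>= 4) API and the cleaned token lists from above; the vocabulary cut mirrors the optional maximum number of words mentioned in the list, and all sizes are illustrative:

```python
# Sketch: turn a gensim word2vec model into an embedding matrix for a Keras Embedding layer.
import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras import initializers
from tensorflow.keras.layers import Embedding

def build_embedding(token_lists, dim=100, max_words=None, trainable=False):
    model = Word2Vec(sentences=token_lists, vector_size=dim, min_count=1)
    vocab = list(model.wv.index_to_key)[:max_words]               # optional vocabulary cut
    word_index = {word: i + 1 for i, word in enumerate(vocab)}    # index 0 reserved for padding

    matrix = np.zeros((len(word_index) + 1, dim))
    for word, idx in word_index.items():
        matrix[idx] = model.wv[word]

    # Used either as fixed weights (trainable=False) or as initial weights (trainable=True).
    layer = Embedding(input_dim=matrix.shape[0], output_dim=dim,
                      embeddings_initializer=initializers.Constant(matrix),
                      trainable=trainable)
    return word_index, matrix, layer
```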
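The two message representations stored in the final input can be illustrated with the small sketch below (placeholder names; word_index and matrix come from the word2vec step). The averaged vectors and index sequences are then merged with the actionshist counts and labels.

```python
# Sketch of the two message representations described in the "Final input" step.
import numpy as np

def average_vector(tokens, word_index, matrix):
    """Mean word2vec vector of one message (zeros if nothing is in the vocabulary)."""
    idx = [word_index[t] for t in tokens if t in word_index]
    return matrix[idx].mean(axis=0) if idx else np.zeros(matrix.shape[1])

def index_message(tokens, word_index, max_len=200):
    """Index sequence of one message, truncated and zero-padded for RNN training."""
    idx = [word_index[t] for t in tokens if t in word_index][:max_len]
    return idx + [0] * (max_len - len(idx))
```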
To reproduce the previous results without NLP, run the training with train_baseline.py. For the hyperparameter optimization, Bayesian optimization with scikit-optimize is used.
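A minimal sketch of such a Bayesian optimization with scikit-optimize; the search space, the objective, and the train_and_evaluate helper are placeholders, not the ones actually used in train_baseline.py:

```python
# Sketch of Bayesian hyperparameter optimization with scikit-optimize (placeholder space/objective).
from skopt import gp_minimize
from skopt.space import Integer, Real

space = [
    Integer(32, 512, name="hidden_units"),
    Real(1e-5, 1e-2, prior="log-uniform", name="learning_rate"),
    Real(0.0, 0.5, name="dropout"),
]

def objective(params):
    hidden_units, learning_rate, dropout = params
    # Train the baseline model with these hyperparameters and return the
    # validation loss (lower is better); train_and_evaluate is a placeholder.
    return train_and_evaluate(hidden_units, learning_rate, dropout)

result = gp_minimize(objective, space, n_calls=50, random_state=0)
print("best hyperparameters:", result.x)
```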
To train a single NLP model, run train.py (one possible model architecture is sketched at the end of this section). For the hyperparameter optimization there are three options:
- Run scikit-optimize with multiple threads: threaded_skopt.py
- Run on SWAN with Spark (experimental): train_on_spark.ipynb. The following options should be chosen for SWAN:
Software Stack: Bleeding Edge
Memory: 10Gb
Spark cluster: Cloud Containers (Kubernetes)
and the following Spark settings:
spark.dynamicAllocation.enabled=False
spark.executor.instances=n (you can have up to 60)
spark.executor.memory=12g (you can specify even 14-15g if you want)
spark.executor.cores=3
spark.kubernetes.executor.request.cores=3
- Train the model distributed across multiple GPUs with the NNLO framework (experimental): example_nlp.py
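One possible shape of the NLP model trained by train.py is sketched below. The layer sizes, the second "counts" input, and the number of action classes are illustrative assumptions, not the actual architecture: the indexed error message goes through the word2vec-initialized embedding layer and an RNN, the per-site error counts are concatenated, and a softmax predicts the operator action.

```python
# Illustrative model sketch (sizes and inputs are assumptions, not the real train.py architecture).
from tensorflow.keras import layers, models, initializers

def build_model(embedding_matrix, max_len=200, n_count_features=50, n_actions=5):
    vocab_size, emb_dim = embedding_matrix.shape

    # Indexed error message -> frozen word2vec embedding -> RNN.
    msg_in = layers.Input(shape=(max_len,), name="indexed_message")
    x = layers.Embedding(vocab_size, emb_dim,
                         embeddings_initializer=initializers.Constant(embedding_matrix),
                         trainable=False)(msg_in)
    x = layers.LSTM(64)(x)

    # Per-site error counts as a second, flat input.
    counts_in = layers.Input(shape=(n_count_features,), name="site_error_counts")
    merged = layers.Concatenate()([x, counts_in])
    merged = layers.Dense(64, activation="relu")(merged)
    out = layers.Dense(n_actions, activation="softmax", name="action")(merged)

    model = models.Model(inputs=[msg_in, counts_in], outputs=out)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```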