Repository for the analysis of log files of failing workflows of the CMS Experiment. The goal is to predict the operator's actions for failing workflows stored in the WTC Console. The inputs to the machine learning models are the error logs of the failing jobs and information about the frequency of each error per site.
To run the analysis of the WMArchive entries with Apache Spark on SWAN and to filter the error log snippets, run the notebook filter_wm.ipynb. The recommended options for SWAN are:
Software Stack: Bleeding Edge
Memory: 10Gb
Spark cluster: General Purpose (Analytix)
The output is saved in pandas DataFrames.
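The sketch below illustrates, under assumptions, what this filtering step can look like. The input path, the storage format, and the record fields (job state, task, exit code, error snippet, site) are placeholders and will differ from the actual WMArchive schema used in filter_wm.ipynb.

```python
# Minimal sketch of the kind of filtering done in filter_wm.ipynb.
# The path and all field names below are placeholders, not the real WMArchive schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("filter_wm").getOrCreate()

wmarchive = spark.read.parquet("/path/to/wmarchive")  # placeholder path and format

errors = (
    wmarchive
    .filter(F.col("meta_data.jobstate") == "jobfailed")        # keep failing jobs only
    .select(
        F.col("task").alias("task_name"),
        F.col("steps.errors.exitCode").alias("exit_code"),
        F.col("steps.errors.details").alias("error_snippet"),
        F.col("steps.site").alias("site"),
    )
)

# Collect the (comparatively small) filtered result into a pandas frame,
# matching the "output is saved in pandas DataFrames" step above.
errors_pdf = errors.toPandas()
errors_pdf.to_pickle("wm_errors.pkl")
```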
The preprocessing can be run with input.py. It consists of the following steps:
- Tokenization: uses the NLTK Treebank word tokenizer, considers only unique messages, and saves them in chunks.
- Cleaning: filters out low-frequency words and special characters, cleans the tokens, and saves the result in chunks (a rough sketch of the tokenization and cleaning steps follows this list).
- Message selection: since there are multiple messages per key, only one message out of the sequence is chosen.
- word2vec: running word2vec on the error messages creates an embedding matrix that can be used in the Keras embedding layer, either as initial weights or as fixed weights. Additionally, a harder filtering / maximum number of words can be specified to further reduce the vocabulary (see the embedding sketch after this list).
- Final input: in the final step the counts and labels from the actionshist are merged with the error messages. Both the averaged word2vec vector per message and the indexed message for supervised training with RNNs are stored (both representations are sketched after this list). The output is a sparse pandas DataFrame.
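A rough sketch of the tokenization and cleaning steps (function names, chunk size, and frequency threshold are placeholders, not the actual input.py implementation):

```python
# Sketch of the tokenization and cleaning steps with placeholder names.
import re
from collections import Counter
from nltk.tokenize import word_tokenize  # requires the NLTK "punkt" data

def tokenize_unique(messages, chunk_size=10000):
    """Tokenize only the unique error messages and yield them in chunks."""
    unique = list(dict.fromkeys(messages))            # drop duplicate messages, keep order
    for start in range(0, len(unique), chunk_size):
        chunk = unique[start:start + chunk_size]
        yield [word_tokenize(msg.lower()) for msg in chunk]

def clean_tokens(token_lists, min_count=5):
    """Remove special characters and low-frequency words from one chunk of token lists."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    cleaned = []
    for tokens in token_lists:
        kept = [re.sub(r"[^a-z0-9_]", "", tok) for tok in tokens]   # strip special characters
        kept = [tok for tok in kept if tok and counts[tok] >= min_count]
        cleaned.append(kept)
    return cleaned
```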
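The word2vec step can be sketched as follows, assuming the gensim (>= 4) API and the cleaned token lists from above; the vocabulary cut mirrors the optional maximum number of words mentioned in the list, and all sizes are illustrative:

```python
# Sketch: turn a gensim word2vec model into an embedding matrix for a Keras Embedding layer.
import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras import initializers
from tensorflow.keras.layers import Embedding

def build_embedding(token_lists, dim=100, max_words=None, trainable=False):
    model = Word2Vec(sentences=token_lists, vector_size=dim, min_count=1)
    vocab = list(model.wv.index_to_key)[:max_words]               # optional vocabulary cut
    word_index = {word: i + 1 for i, word in enumerate(vocab)}    # index 0 reserved for padding

    matrix = np.zeros((len(word_index) + 1, dim))
    for word, idx in word_index.items():
        matrix[idx] = model.wv[word]

    # Used either as fixed weights (trainable=False) or as initial weights (trainable=True).
    layer = Embedding(input_dim=matrix.shape[0], output_dim=dim,
                      embeddings_initializer=initializers.Constant(matrix),
                      trainable=trainable)
    return word_index, matrix, layer
```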
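The two message representations stored in the final input can be illustrated with the small sketch below (placeholder names; word_index and matrix come from the word2vec step). The averaged vectors and index sequences are then merged with the actionshist counts and labels.

```python
# Sketch of the two message representations described in the "Final input" step.
import numpy as np

def average_vector(tokens, word_index, matrix):
    """Mean word2vec vector of one message (zeros if nothing is in the vocabulary)."""
    idx = [word_index[t] for t in tokens if t in word_index]
    return matrix[idx].mean(axis=0) if idx else np.zeros(matrix.shape[1])

def index_message(tokens, word_index, max_len=200):
    """Index sequence of one message, truncated and zero-padded for RNN training."""
    idx = [word_index[t] for t in tokens if t in word_index][:max_len]
    return idx + [0] * (max_len - len(idx))
```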
To reproduce the previous results without NLP, run the training with train_baseline.py. For the hyperparameter optimization, Bayesian optimization with scikit-optimize is used.
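A minimal sketch of such a Bayesian optimization with scikit-optimize; the search space, the objective, and the train_and_evaluate helper are placeholders, not the ones actually used in train_baseline.py:

```python
# Sketch of Bayesian hyperparameter optimization with scikit-optimize (placeholder space/objective).
from skopt import gp_minimize
from skopt.space import Integer, Real

space = [
    Integer(32, 512, name="hidden_units"),
    Real(1e-5, 1e-2, prior="log-uniform", name="learning_rate"),
    Real(0.0, 0.5, name="dropout"),
]

def objective(params):
    hidden_units, learning_rate, dropout = params
    # Train the baseline model with these hyperparameters and return the
    # validation loss (lower is better); train_and_evaluate is a placeholder.
    return train_and_evaluate(hidden_units, learning_rate, dropout)

result = gp_minimize(objective, space, n_calls=50, random_state=0)
print("best hyperparameters:", result.x)
```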
To train a single NLP model, run train.py (one possible model architecture is sketched at the end of this section). For the hyperparameter optimization there are three options:
- Run scikit-optimize with multiple threads: threaded_skopt.py
- Run on SWAN with Spark (experimental): train_on_spark.ipynb. The following options should be chosen for SWAN:
Software Stack: Bleeding Edge
Memory: 10Gb
Spark cluster: Cloud Containers (Kubernetes)
and the following Spark settings:
spark.dynamicAllocation.enabled=False
spark.executor.instances=n (you can have up to 60)
spark.executor.memory=12g (you can specify even 14-15g if you want)
spark.executor.cores=3
spark.kubernetes.executor.request.cores=3
- Train the model distributed across multiple GPUs with the NNLO framework (experimental): example_nlp.py
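One possible shape of the NLP model trained by train.py is sketched below. The layer sizes, the second "counts" input, and the number of action classes are illustrative assumptions, not the actual architecture: the indexed error message goes through the word2vec-initialized embedding layer and an RNN, the per-site error counts are concatenated, and a softmax predicts the operator action.

```python
# Illustrative model sketch (sizes and inputs are assumptions, not the real train.py architecture).
from tensorflow.keras import layers, models, initializers

def build_model(embedding_matrix, max_len=200, n_count_features=50, n_actions=5):
    vocab_size, emb_dim = embedding_matrix.shape

    # Indexed error message -> frozen word2vec embedding -> RNN.
    msg_in = layers.Input(shape=(max_len,), name="indexed_message")
    x = layers.Embedding(vocab_size, emb_dim,
                         embeddings_initializer=initializers.Constant(embedding_matrix),
                         trainable=False)(msg_in)
    x = layers.LSTM(64)(x)

    # Per-site error counts as a second, flat input.
    counts_in = layers.Input(shape=(n_count_features,), name="site_error_counts")
    merged = layers.Concatenate()([x, counts_in])
    merged = layers.Dense(64, activation="relu")(merged)
    out = layers.Dense(n_actions, activation="softmax", name="action")(merged)

    model = models.Model(inputs=[msg_in, counts_in], outputs=out)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```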