Currently, when we evaluate a constraint classifier, the evaluation runs inference once per label and evaluates each label independently. This involves a large amount of redundant computation. Each time we perform inference, we could instead count the TP/FP/etc. for all labels involved in the head of the inference, and evaluate all of them simultaneously by passing over each head object only once.
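
To make the proposal concrete, here is a minimal sketch in Python of what single-pass counting could look like. It is not the project's API: `evaluate_single_pass`, `infer`, `precision_recall`, and the representation of a head as a list of `(candidate, gold_label)` pairs are all hypothetical names and assumptions chosen for illustration. The only point is that inference is called once per head and the counts for every label are updated from that one result.

```python
from collections import Counter


def evaluate_single_pass(heads, infer):
    """Count TP/FP/FN for every label with one inference call per head.

    Assumptions (hypothetical, for illustration only):
    - each head is a list of (candidate, gold_label) pairs;
    - infer(head) returns one predicted label per candidate, in order.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for head in heads:
        predictions = infer(head)  # the single inference call for this head
        for (_, gold), predicted in zip(head, predictions):
            if predicted == gold:
                tp[gold] += 1
            else:
                fp[predicted] += 1  # predicted label was wrongly assigned
                fn[gold] += 1       # gold label was missed
    return tp, fp, fn


def precision_recall(label, tp, fp, fn):
    """Per-label precision and recall; 0.0 when a denominator is zero."""
    p_den = tp[label] + fp[label]
    r_den = tp[label] + fn[label]
    precision = tp[label] / p_den if p_den else 0.0
    recall = tp[label] / r_den if r_den else 0.0
    return precision, recall


# Toy usage: a stand-in "inference" that looks predictions up in a table
# and records how many times it was called.
inference_calls = []
toy_predictions = {"a": "PER", "b": "PER", "c": "PER", "d": "O"}


def toy_infer(head):
    inference_calls.append(1)
    return [toy_predictions[candidate] for candidate, _ in head]


toy_heads = [
    [("a", "PER"), ("b", "LOC")],
    [("c", "PER"), ("d", "O")],
]
tp, fp, fn = evaluate_single_pass(toy_heads, toy_infer)
```

In this toy run, inference is invoked once for each of the two heads regardless of how many labels exist, whereas the per-label approach described above would invoke it once per head for every label.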