We will be releasing the code late December; apologies for the delay. We provide below the highlights of the paper.
Jai Bardhan1 , Cyrin Neeraj1, Subhadip Mitra1, Tanumoy Mandal2
(Accepted to the Machine Learning and the Physical Sciences workshop at NeurIPS 2024)
At collider experiments at the LHC, beams of high-energy particles are collided against one another to search for signatures of interesting New Physics (NP) phenomena. Because these NP phenomena are rare, separating them from the huge, commonplace Standard Model (SM) background is an extremely challenging task. Sophisticated deep learning models are trained on simulated data to separate interesting signal events from the SM background, and are then deployed on real collision data. The signal-plus-background hypothesis is tested against the background-only hypothesis, with the discovery significance quantified by the $Z$ score.
Usually, a binary cross-entropy (BCE) loss, or some weighted variant of it, is used to train classifiers to separate signal events from the backgrounds. Training then makes no use of first-principles knowledge of the scattering rates of the processes considered in a search, which can be obtained from quantum field theoretic calculations. Furthermore, the BCE loss is poorly aligned with the metric we perform the hypothesis testing with.
Our motivations are, therefore, two-fold:
- Systematically incorporating the scattering rates of the processes considered in a particle physics search while training the classifier.
- Usual losses for classification tasks (like BCE) optimise the signal-to-background ratio ($r = N_s/N_b$), which need not maximise the significance, as the $Z$ score depends on both the ratio and the absolute number (i.e., set size) of signal or background events ($Z \approx \sqrt{N_s}\cdot\sqrt{r} = r\sqrt{N_b}$).
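A quick numerical check makes the second point concrete (plain Python, with illustrative counts not taken from the paper): two selections with the same ratio $r$ can have very different significances.

```python
import math

def significance(n_s: float, n_b: float) -> float:
    """Approximate discovery significance Z ~ N_s / sqrt(N_b),
    equivalently Z = sqrt(N_s) * sqrt(r) = r * sqrt(N_b) with r = N_s / N_b."""
    return n_s / math.sqrt(n_b)

# Two working points with the SAME ratio r = 0.1 ...
z_small = significance(n_s=10, n_b=100)      # r = 0.1
z_large = significance(n_s=1000, n_b=10000)  # r = 0.1

# ... but very different significances: the absolute counts matter.
print(z_small)  # 1.0
print(z_large)  # 10.0
```

A classifier that only optimises $r$ cannot distinguish these two working points, even though one is ten times more significant than the other.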
Here we ask (and answer): can we derive a loss function that directly maximises the significance?
We must consider some points before constructing a loss function based on the $Z$ score. The score is defined on sets of events (through the counts $N_s$ and $N_b$), so it does not directly provide the per-event gradients needed for training.
We look for a smooth submodular function. A submodular function is a set function that captures the concept of diminishing returns and has a property analogous to concavity. Formally, submodularity can be defined as follows:
A set function $f: 2^V \to \mathbb{R}$ is submodular if, for all $A, B \subseteq V$,

$$f(A) + f(B) \geq f(A \cup B) + f(A \cap B),$$

or equivalently, for all $A \subseteq B \subseteq V$ and $x \in V \setminus B$,

$$f(A \cup \{x\}) - f(A) \geq f(B \cup \{x\}) - f(B).$$
Submodular functions can be optimised using greedy techniques, and it is possible to find (near-)optimal solutions in polynomial time. However, these discrete optimisation techniques cannot be used directly to train a model that requires a gradient.
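As a concrete illustration of diminishing returns (a textbook example, not from the paper), the set-cover objective is submodular: the marginal gain of adding a subset shrinks as the chosen collection grows.

```python
def coverage(sets, chosen):
    """Set-cover objective: number of universe elements covered
    by the union of the chosen subsets (a classic submodular function)."""
    covered = set()
    for i in chosen:
        covered |= sets[i]
    return len(covered)

# Four subsets of a small universe, indexed 0..3.
sets = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 6}]

# Diminishing returns: adding subset 3 helps less when starting
# from the larger collection B than from A (with A a subset of B).
A, B = {0}, {0, 2}
gain_A = coverage(sets, A | {3}) - coverage(sets, A)  # covers new element 6
gain_B = coverage(sets, B | {3}) - coverage(sets, B)  # covers nothing new
assert gain_A >= gain_B
```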
The Lovász extension allows us to associate a continuous, convex function with any submodular function.
For a set function $f: 2^V \to \mathbb{R}$ with $f(\emptyset) = 0$ and $|V| = n$, the Lovász extension $\hat{f}: [0,1]^n \to \mathbb{R}$ is given by

$$\hat{f}(\boldsymbol{m}) = \sum_{i=1}^{n} m_{\pi(i)} \left[ f(\{\pi(1), \ldots, \pi(i)\}) - f(\{\pi(1), \ldots, \pi(i-1)\}) \right],$$

where $\pi$ is a permutation sorting the components of $\boldsymbol{m}$ in decreasing order, $m_{\pi(1)} \geq m_{\pi(2)} \geq \cdots \geq m_{\pi(n)}$.
For the Lovász extension to be convex, the set function must be submodular.
Additionally, the Lovász extension is a true extension: evaluated at the vertices of the hypercube, it coincides with the original set function.
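The construction above can be sketched in a few lines (our own illustration, not the paper's code): sort the components of $\boldsymbol{m}$ in decreasing order and accumulate the marginal gains of the set function along the resulting prefix chain.

```python
def lovasz_extension(f, m):
    """Lovász extension of a set function f (with f(frozenset()) == 0),
    evaluated at a point m in [0, 1]^n.
    f maps a frozenset of indices to a float; m is a list of floats."""
    order = sorted(range(len(m)), key=lambda i: m[i], reverse=True)
    total, prefix = 0.0, []
    prev = f(frozenset())
    for i in order:
        prefix.append(i)
        cur = f(frozenset(prefix))
        total += m[i] * (cur - prev)  # marginal gain weighted by m_i
        prev = cur
    return total

# Sanity check of the extension property: at a vertex of the
# hypercube, the extension coincides with the set function itself.
f = lambda s: min(len(s), 2)  # a small submodular function
assert lovasz_extension(f, [1, 0, 1]) == f(frozenset({0, 2}))
```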
We propose the following set function, $\Delta_Z$, as a surrogate to the $Z$ score,
where,
- $y$ ($\tilde{y}$) $\to$ ground-truth (predicted) label
- $\nu_i$ $\to$ number of events of process type $i$ ($i \in S \cup B$)
- $n_i, p_i$ $\to$ number of false negatives, false positives
- $\mathcal{L} \to$ luminosity at which the experiment is performed
- $\epsilon,\ \sum_{i \in S} \sigma_i \mathcal{L}/\sqrt{\epsilon}$ $\to$ added to ensure $\Delta_Z(\emptyset) = \boldsymbol{0}$
In the paper, we prove that the surrogate above is submodular (Appendix 1) and, hence, obtain a convex loss function via its Lovász extension.
Choice of error $\boldsymbol{m}$ (from the Lovász extension)
We take the error to be given by the hinge loss,
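To show how the pieces fit together, here is a schematic Lovász-hinge-style loss (entirely our illustration: `toy_surrogate` is a stand-in submodular function, not the paper's $\Delta_Z$). Per-event hinge errors are computed from the classifier scores, and the Lovász extension of the set-function surrogate is evaluated at that error vector.

```python
def hinge_errors(y_true, scores):
    """Per-event hinge error m_i = max(0, 1 - y_i * s_i),
    with labels y_i in {-1, +1} and raw classifier scores s_i."""
    return [max(0.0, 1.0 - y * s) for y, s in zip(y_true, scores)]

def toy_surrogate(s):
    """Stand-in submodular set function (NOT the paper's Delta_Z):
    concave in the number of misclassified events."""
    return len(s) ** 0.5

def lovasz_loss(y_true, scores, f=toy_surrogate):
    """Lovász-hinge-style loss: Lovász extension of f at the error vector."""
    m = hinge_errors(y_true, scores)
    order = sorted(range(len(m)), key=lambda i: m[i], reverse=True)
    loss, prefix, prev = 0.0, [], f(frozenset())
    for i in order:
        prefix.append(i)
        cur = f(frozenset(prefix))
        loss += m[i] * (cur - prev)
        prev = cur
    return loss
```

With labels in $\{-1, +1\}$, a batch classified correctly with margin at least one gives zero loss, while misclassified events contribute through the marginal gains of the surrogate.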
There is only one free parameter in our loss: $\epsilon$.
Our goal is to separate the signal ($s$) from two background processes ($b_1$ and $b_2$); we train with the RAdam optimiser.
We train the linear classifier using the BCE loss and the $\Delta_Z$-based loss for two cases:
- Case 1: $\sigma_{b_1} = 1$ fb, $\sigma_{b_2} = 100$ fb; $\sigma_{s} = 0.1$ fb.
- Case 2: $\sigma_{b_1} = 100$ fb, $\sigma_{b_2} = 1$ fb; $\sigma_{s} = 0.1$ fb.
We set the luminosity to
Left Panel: Decision boundaries for Case 1; Right Panel: Decision boundaries for Case 2.
Feature distributions are the same for both cases; the cases differ in the scattering rates of the processes considered.
The classifier trained with the $\Delta_Z$-based loss adapts its decision boundary to the scattering rates of the processes, whereas the BCE-trained classifier, which sees only the feature distributions, does not.
(From the top) First panel: The estimated
Second panel: The distribution of the
Third panel: Class efficiencies vs.
For the scans presented above, we demand
We plot the ROC curves for Case 1 (Case 2 gives similar results). Let the total background efficiency be the fraction of all background events in the dataset that pass the selection, and the true background efficiency is given by the cross-section-weighted average,

$$\epsilon_B^{\text{true}} = \frac{\sum_{i \in B} \sigma_i\, \epsilon_{b_i}}{\sum_{i \in B} \sigma_i},$$

where $\epsilon_{b_i}$ is the efficiency with which events of background process $b_i$ pass the selection.
Left panel: ROC Curve for dataset (total) background efficiency vs signal efficiency; Right panel: ROC Curve for true background efficiency vs signal efficiency.
The true background efficiency differs from the total background efficiency in that it accounts for the cross sections of the background processes. The ROC curves show that the classifier trained with the $\Delta_Z$-based loss performs better on this physically relevant metric.
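The distinction is easy to see numerically (the cross sections are those of Case 1; the per-process efficiencies are illustrative, not from the paper): the true efficiency is dominated by the large-cross-section background, while a plain per-dataset average is not.

```python
def true_bkg_efficiency(sigmas, effs):
    """Cross-section-weighted background efficiency:
    sum_i sigma_i * eps_i / sum_i sigma_i."""
    num = sum(s * e for s, e in zip(sigmas, effs))
    return num / sum(sigmas)

# Case 1 cross sections: sigma_b1 = 1 fb, sigma_b2 = 100 fb.
# Suppose a cut rejects b2 well but b1 poorly (illustrative numbers):
eps_true = true_bkg_efficiency(sigmas=[1.0, 100.0], effs=[0.5, 0.01])

# A dataset with equal event counts per process would give the plain
# average (0.5 + 0.01) / 2 = 0.255, but the true efficiency is far
# smaller because the leaky process b1 has a tiny cross section.
print(eps_true)  # ~0.0149
```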
While our results are promising, further tests are needed to fully characterise and understand the benefits and limitations of the proposed loss.
Finally, we note that while it is possible to introduce rate-dependent weights directly in the BCE loss, tuning them is an empirical task; the weights that yield the best performance need not simply be the rates of the processes. In contrast, our loss incorporates the rates systematically through the surrogate $\Delta_Z$.






