1 Introduction
Measurements in many areas of research, such as high-energy particle physics, are typically based on the statistical inference of one or more parameters of interest, defined by the likelihood with the observables building the dataset and the parameters of the statistical model. The likelihood would have to be evaluated for the dataset spanning a high-dimensional input space, which is computationally expensive and typically infeasible. The dimension of the dataset can be reduced by engineering high-level observables and by using summary statistics. Analysts create high-level observables to reduce the dimension of a single observation, ideally without losing information about the parameters of interest. An example from high-energy particle physics is the use of the invariant mass of a decay system instead of the kinematic properties of all its constituents. The number of observations can be reduced by applying a summary statistic, for which histograms are frequently used so that the statistical model can be expressed in the form of a likelihood based on Poisson statistics. The dimension is thus reduced from the number of observations to the number of bins in the histogram, whereby the analyst tries to lose as little information as possible by optimizing the location and number of the bins. Applying both methods reduces the dimension of the dataset in both the number of observations and the number of observables.
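The invariant mass mentioned above can serve as a concrete illustration of a high-level observable. The following minimal Python sketch condenses the four-momenta of all decay products into a single number. The function name and the example four-vectors are hypothetical and not taken from the analysis; natural units are assumed.

```python
import math

def invariant_mass(particles):
    """Invariant mass of a system of particles given as (E, px, py, pz) tuples.

    Natural units (c = 1) are assumed, so energy, momentum and mass share
    one unit, e.g. GeV.
    """
    energy = sum(p[0] for p in particles)
    px = sum(p[1] for p in particles)
    py = sum(p[2] for p in particles)
    pz = sum(p[3] for p in particles)
    mass_squared = energy**2 - (px**2 + py**2 + pz**2)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(mass_squared, 0.0))

# Hypothetical two-body decay: two massless particles flying back to back.
decay_products = [
    (62.5, 0.0, 0.0, 62.5),
    (62.5, 0.0, 0.0, -62.5),
]
mass = invariant_mass(decay_products)
```

Eight kinematic inputs are reduced to one observable here, which is the kind of reduction per observation that the text describes.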
This paper discusses an analysis strategy using machine learning techniques that avoids the suboptimal performance introduced by the reduction of dimensionality, resulting in estimates of the parameters of interest that are close to optimal. We put emphasis on the applicability of this approach to analyses commonly performed in high-energy particle physics at the LHC [1, 2], like the discovery of the Higgs boson in 2012 [3, 4].
Section 2 presents the method in detail and section 3 puts the proposed technique in the context of related work. Section 4 shows the performance of the method with a simple example using pseudo-experiments of a two-component mixture model with signal and background, and section 5 applies the same approach to a more complex example from high-energy particle physics.
2 Method
The method is built on top of an initial dataset used for the statistical inference of the parameters of interest, characterized by its number of observations and its number of observables. To simplify the statistical evaluation, we want to reduce the number of observables by engineering high-level observables. Besides manual crafting of such features, a suitable approach taken from machine learning is to use a neural network (NN) function with free parameters. After application of the NN, we get a transformed dataset in which the number of observables equals the number of output nodes of the NN architecture.
To reduce the dataset further, the number of observations is compressed using a histogram. Histograms are widely used as a summary statistic since counts are well described by Poisson statistics and are therefore well suited to build the statistical model of the analysis. For example, in high-energy particle physics, many statistical models and well-established methods for describing systematic uncertainties are based on binned Poisson likelihoods and could only be replaced with enormous effort. The resulting dataset consists of the bin counts of the histogram, whose dimension equals the number of NN output nodes. The count operation for a single bin in the histogram can be written as a sum over all observations of a function that is one if the NN output lies inside the bin and zero otherwise.
(1) 
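A minimal Python sketch of the count operation for a one-dimensional bin is given here, together with the derivative of a Gaussian that stands in for the gradient of the count, as described in the paragraph that follows. Several details are assumptions of this sketch rather than statements from the text: the half-open bin convention, the Gaussian being centred on the bin with a maximum of one, and all function names and example values.

```python
import math

def count_in_bin(values, lower, upper):
    """Exact count of values inside the half-open bin [lower, upper)."""
    return sum(1 for v in values if lower <= v < upper)

def surrogate_gradient(value, lower, upper):
    """Derivative of a Gaussian that stands in for the gradient of the count.

    Assumptions of this sketch: the Gaussian is centred on the bin, has a
    maximum of one and a standard deviation equal to the bin half-width.
    """
    center = 0.5 * (lower + upper)
    sigma = 0.5 * (upper - lower)
    gauss = math.exp(-0.5 * ((value - center) / sigma) ** 2)
    return -(value - center) / sigma**2 * gauss

nn_outputs = [0.05, 0.30, 0.35, 0.45, 0.90]  # hypothetical NN outputs
lower, upper = 0.25, 0.50
count = count_in_bin(nn_outputs, lower, upper)
gradients = [surrogate_gradient(v, lower, upper) for v in nn_outputs]
```

The count itself stays exact. Only the gradient is replaced, so an output below the bin centre receives a positive gradient and an output above it a negative one, which gives the optimizer a direction even for outputs outside the bin.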
So that we can propagate the gradient from the result of the statistical inference to the free parameters of the NN, the histogram has to be differentiable. Since the derivative of the count operation is ill-defined on the edges of the bin and zero otherwise, the gradient is not suitable for optimization. Therefore, we use a smoothed approximation of the gradient [5], shown in figure 1 for a one-dimensional bin. The approximation uses the similarity of the count operation to a suitably normalized Gaussian function with the standard deviation being the half-width of the bin. We replace only the gradient of the operation and not the calculation of the count itself.
On top of the reduced dataset, we build the statistical model using a binned likelihood that depends on the parameters of the statistical model. For a mixture model with the two processes signal and background, the binned likelihood describing the statistical component is given by
(2) 
Here, the likelihood is a product over the histogram bins of Poisson distributions for the observation, and the parameter of interest scales the expectation of the signal process.
Moreover, the formulation of the statistical model allows systematic uncertainties to be implemented by adding nuisance parameters to the set of parameters. For the model in equation 2, a single nuisance parameter controlling a systematic variation of the expected bin contents results in
(3) 
where a normalized Gaussian constrains the nuisance parameter. If the systematic variation is asymmetric, the additional nuisance term can be written as
(4) 
or with any other differentiable formulation [6].
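The structure of such a binned likelihood can be sketched in a few lines of Python. The sketch assumes expected bin contents of the form signal strength times signal plus background plus a nuisance-controlled shift, a unit Gaussian constraint, and one simple piecewise-linear choice for the asymmetric variation. All names and numbers are hypothetical, and the asymmetric form is only one of many differentiable options rather than the formulation used in the paper.

```python
import math

def poisson_nll(observed, expected):
    """Negative log of the Poisson probability, dropping the constant log(d!)."""
    return expected - observed * math.log(expected)

def expected_counts(mu, eta, signal, background, shift_up, shift_down):
    """Expected bin contents mu * s + b plus a nuisance-controlled shift.

    shift_up is (up-variation - nominal) and shift_down is
    (nominal - down-variation). The piecewise-linear interpolation is an
    assumption of this sketch; equal shifts give the symmetric case.
    """
    return [
        mu * s + b + max(eta, 0.0) * up + min(eta, 0.0) * down
        for s, b, up, down in zip(signal, background, shift_up, shift_down)
    ]

def nll(mu, eta, observed, signal, background, shift_up, shift_down):
    """Negative log-likelihood: Poisson terms plus the Gaussian constraint."""
    expected = expected_counts(mu, eta, signal, background, shift_up, shift_down)
    stat = sum(poisson_nll(d, lam) for d, lam in zip(observed, expected))
    constraint = 0.5 * eta**2  # unit Gaussian, up to an additive constant
    return stat + constraint

# Hypothetical three-bin histogram.
signal = [1.0, 5.0, 10.0]
background = [50.0, 20.0, 5.0]
shift_up = [5.0, 2.0, 0.5]
shift_down = [4.0, 1.5, 0.5]

# Observation replaced by the nominal expectation (mu = 1, eta = 0).
asimov = expected_counts(1.0, 0.0, signal, background, shift_up, shift_down)
nll_nominal = nll(1.0, 0.0, asimov, signal, background, shift_up, shift_down)
nll_shifted = nll(1.2, 0.5, asimov, signal, background, shift_up, shift_down)
```

Because the observation equals the nominal expectation, the negative log-likelihood is smallest at the nominal parameter values and grows for any other choice.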
The performance of an analysis is measured in terms of the variance of the estimate for the parameters of interest, in our case the variance of the estimated signal strength. We build a differentiable estimate of the variance using the Fisher information [7] of the likelihood in equation 3, given by
(5) 
Because the maximum likelihood estimator is asymptotically efficient [8, 9], the variance of the estimates for the parameters is asymptotically close to
(6) 
Assuming, without loss of generality, that the first diagonal element corresponds to the parameter of interest, the loss function used to minimize the variance of its estimate with respect to the free parameters of the NN function is the first diagonal element of the covariance matrix in equation 6.
To be independent of the statistical fluctuations of the observation, the optimization is performed on an Asimov dataset [1]. This artificial dataset replaces the observation with the nominal expectation and serves as a representative of the median expected outcome of the analysis under the signal-plus-background hypothesis.
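For a model with one parameter of interest and one nuisance parameter, this loss can be written in closed form, which the following Python sketch illustrates. It assumes expected bin contents that depend linearly on both parameters, a symmetric variation and a unit Gaussian constraint. Under these assumptions, the expected Hessian of the negative log-likelihood on the Asimov dataset has the entries computed below. All names and numbers are hypothetical; an automatic-differentiation framework would obtain the same matrix numerically.

```python
def fisher_information(mu, eta, signal, background, delta):
    """Fisher information for (mu, eta) evaluated on the Asimov dataset.

    For expected counts lam_i = mu * s_i + b_i + eta * delta_i and observed
    counts equal to lam_i, the Hessian of the negative log-likelihood is
    sum_i (dlam_i/dtheta_a) * (dlam_i/dtheta_b) / lam_i, plus 1 in the
    (eta, eta) entry from the unit Gaussian constraint.
    """
    f_mm = f_me = f_ee = 0.0
    for s, b, d in zip(signal, background, delta):
        lam = mu * s + b + eta * d
        f_mm += s * s / lam
        f_me += s * d / lam
        f_ee += d * d / lam
    f_ee += 1.0
    return [[f_mm, f_me], [f_me, f_ee]]

def variance_of_mu(fisher):
    """First diagonal element of the inverse of a 2x2 Fisher matrix."""
    (f_mm, f_me), (_, f_ee) = fisher
    determinant = f_mm * f_ee - f_me * f_me
    return f_ee / determinant

# Hypothetical three-bin histogram.
signal = [1.0, 5.0, 10.0]
background = [50.0, 20.0, 5.0]
delta = [5.0, 2.0, 0.5]

fisher = fisher_information(1.0, 0.0, signal, background, delta)
loss = variance_of_mu(fisher)      # variance including the nuisance parameter
stat_only = 1.0 / fisher[0][0]     # variance if the nuisance were fixed
```

The variance including the nuisance parameter is never smaller than the purely statistical one. The gap between the two closes when the bins carrying most of the signal are only weakly affected by the systematic variation.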
Given the assumption that the dimensionality reduction performed by the NN together with the histogram is a sufficient statistic, the optimization can find an NN function that gives the best estimate for the parameter of interest, similar to a statistical inference performed on the initial high-dimensional dataset with an unbinned likelihood.
A graphical overview of the proposed method is given in figure 2.
3 Related work
The approach in [10] identifies the problem that a histogram has no suitable derivative and therefore replaces the summary statistic by means of a softmax function, which shares similarities with a count. This does not allow a likelihood to be constructed trivially on top of the summary and is therefore only suited for likelihood-free inference problems. Further, the approach parametrizes the systematic variations in the input space and not in the reduced space after the application of the summary statistic, which rules out describing systematic variations with reweighting techniques in the form of statistical weights. These points constrain the applicability of the approach. We circumvent these problems by leaving the computation of the histogram unchanged and approximating only its gradient.
The strategy of allowing a NN to find the best compression of the data has also been discussed in [11]. This approach shows that the NN is able to learn a summary statistic that is a close approximation of a sufficient statistic, yielding a powerful statistical inference. Similar to [10], the method is based on likelihood-free inference, which restricts its applicability to our target use cases in high-energy particle physics at the LHC.
A related approach to including systematic uncertainties in the training of the NN is the explicit decorrelation against the systematic variation. For example, the idea has been discussed on the basis of an adversarial architecture [12] and an approach penalizing the variation using approximated bin counts [5]. These strategies are not aware of the analysis objective, such as the variance of a parameter of interest, and therefore the decorrelation is subject to manual optimization. For a large number of nuisances, this optimization procedure is computationally expensive and typically infeasible.
4 Application to a simple example based on pseudo-experiments
A simple example based on pseudo-experiments and a known likelihood in the input space is used to illustrate our approach. The distributions of the signal and background components in the input space are shown in figure 3. We assume a systematic uncertainty on the mean of the background process, modelled by shifts of this mean, which represent the systematic variations in equation 4.
The NN architecture is a fully-connected feed-forward network with one hidden layer. The initialization follows the Glorot algorithm [13] and the activation function is a rectified linear unit [14]. The output layer has a single node with a sigmoid activation function.
We use eight bins for the histogram of the NN output and compute the variance of the estimate for the parameter of interest, which serves as the loss. The operations are implemented using TensorFlow as the computational graph library [15, 16], and we use the provided automatic differentiation and the Adam algorithm [17] to optimize the free parameters with the objective of minimizing this loss. The systematic variations can be implemented either with reweighting techniques using statistical weights or with duplicates of the nominal dataset containing the simulated variations; we chose the latter solution. Each gradient step is performed on the full dataset of simulated events for each process. The training is stopped if the loss has not improved for a fixed number of gradient steps, and the model with the smallest loss on the validation dataset is used for further analysis. We found that the convergence is more stable if the model is first optimized only on the statistical part of the likelihood shown in equation 2, and we therefore apply this pretraining for a number of gradient steps. We apply statistical weights to scale the expectations of signal and background to fixed values. The dataset is split in half for training and validation, and all results are computed from a statistically independent dataset of the same size as the original one.
The best possible expected result in terms of the variance of the estimate for the parameter of interest is given by a fit of the unbinned statistical model without dimensionality reduction. Alternatively, we can get an asymptotically close result by using a binned likelihood with a sufficiently large number of bins in the two-dimensional input space. The latter approach, with equidistant bins in the range shown in figure 3, results in the profile shown in figure 4. The best-fit value of the parameter of interest is always at the value assumed in the Asimov dataset. Further, we find the uncertainty of the parameter of interest in all fits by profiling the likelihood [18] rather than using the approximation by the covariance matrix in equation 6.
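The difference between profiling the likelihood and using the covariance matrix can be illustrated with a toy model. The Python sketch below scans the parameter of interest, minimizes the negative log-likelihood over the nuisance parameter at each point, and finds the value at which twice the difference to the minimum reaches one. The three-bin histogram, the symmetric variation, and the golden-section and bisection searches are assumptions of this sketch and do not reproduce the fitting tools used for the results.

```python
import math

# Hypothetical three-bin histogram with a symmetric systematic variation.
SIGNAL = [1.0, 5.0, 10.0]
BACKGROUND = [50.0, 20.0, 5.0]
DELTA = [5.0, 2.0, 0.5]
ASIMOV = [s + b for s, b in zip(SIGNAL, BACKGROUND)]  # mu = 1, eta = 0

def nll(mu, eta):
    """Binned Poisson negative log-likelihood with a unit Gaussian constraint."""
    total = 0.5 * eta**2
    for d, s, b, delta in zip(ASIMOV, SIGNAL, BACKGROUND, DELTA):
        lam = mu * s + b + eta * delta
        total += lam - d * math.log(lam)
    return total

def profiled_nll(mu, low=-3.0, high=3.0, tolerance=1e-9):
    """Minimize the NLL over eta with a golden-section search."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = low, high
    while b - a > tolerance:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if nll(mu, c) < nll(mu, d):
            b = d
        else:
            a = c
    return nll(mu, 0.5 * (a + b))

def upper_uncertainty(mu_hat=1.0, mu_max=5.0, tolerance=1e-6):
    """Distance from mu_hat to the point where 2 * delta NLL equals one."""
    nll_min = nll(mu_hat, 0.0)  # the Asimov dataset is fitted best at nominal

    def excess(mu):
        return 2.0 * (profiled_nll(mu) - nll_min) - 1.0

    a, b = mu_hat, mu_max
    while b - a > tolerance:
        mid = 0.5 * (a + b)
        if excess(mid) < 0.0:
            a = mid
        else:
            b = mid
    return 0.5 * (a + b) - mu_hat

upper = upper_uncertainty()
```

For this toy model the profiled interval is close to, but not identical with, the square root of the corresponding diagonal element of the inverse Fisher information, since the profile of a Poisson likelihood is not exactly parabolic.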
We obtain all results in this paper with validated statistical tools, RooFit and RooStats [19, 20, 21], as used by most publications analyzing data of the LHC experiments.
The first comparison to this best-possible result is done by training the NN not on the variance of the estimate for the parameter of interest but on the cross-entropy loss, with signal and background weighted to the same expectation. This approach has been used in multiple analyses in high-energy particle physics [22, 23]. The NN function is a sufficient statistic, and therefore optimal, if no systematic uncertainties have to be considered for the statistical inference, as for the likelihood in equation 2 [10]. The resulting function is shown in the input space and by the distribution of the output in figure 7. The NN learns to project the two-dimensional input space onto the diagonal, which is trivially the optimal dimensionality reduction in this simple example. If we apply the statistical model including the systematic uncertainty to the histograms in figure 7, the uncertainty of the fitted parameter of interest is worse than the best possible result obtained above.
As a consistency check for our new strategy described in section 2, we train the NN on the variance of the estimate for the parameter of interest given by equation 6, but without adding the nuisance parameter modelling the systematic uncertainty. The resulting NN function in the input space, the distribution of the outputs and the profile of the likelihood are shown in figure 7. As expected, the plane of the function in the input space is qualitatively similar, resulting in a performance comparable to the training on the cross-entropy loss. It should be noted that the systematic uncertainty has been included again for the statistical inference.
When adding the nuisance parameter to the likelihood, the training of the NN results in the function shown in figure 7. The uncertainty of the fitted parameter of interest is considerably decreased, which lowers the residual difference to the optimal result. The function in the input space in figure 7 shows that the training successfully identified the signal-enriched region with less contribution from the systematic uncertainty, resulting in histogram counts with high signal statistics and a small uncertainty from the variation of the background process. Figure 7 also shows that the NN function is decorrelated against the systematic uncertainty because the profile of the likelihood changes only little if we remove the systematic uncertainty from the statistical model. The proposed method shares this feature with other approaches for decorrelation of the NN function, such as those discussed in section 3. The difference is that the strength of the decorrelation is not a hyperparameter but is controlled by the higher-level objective, which enables us to find directly the best trade-off between the statistical and systematic uncertainties contributing to the estimate of the parameter of interest. The correlation of the parameter of interest with the parameter controlling the systematic variation is reduced when moving from the training on the cross-entropy loss to the training on the variance of the parameter of interest.
5 Application to a more complex analysis task typical for high-energy particle physics
In this section, we apply the proposed method to a problem typical for data analysis in high-energy particle physics at the LHC. We use a subset of the dataset published for the Higgs boson machine learning challenge [24], extended by a systematic variation. The goal of the challenge is to achieve the best possible significance for the signal process representing Higgs boson decays to two tau leptons, overlaid by the background simulated as a mixture of different physical processes [24]. We pick four variables from the dataset, namely PRI_met, DER_mass_vis, DER_pt_h and DER_deltaeta_jet_jet, and select only events that have all of these features defined. In addition to the event weights provided with the dataset, we scale the signal expectation with a factor of two. The final dataset has 244.0 and 35140.1 (106505 and 131480) weighted (unweighted) events for the signal and background process, respectively. The systematic uncertainty in the dataset is assumed to be an uncertainty on the missing transverse energy, implemented as a transformation of this variable and propagated to the other variables using reweighting. The distributions of the variables including the systematic variations are shown in figures 8 to 10.
The NN is trained only on three of the four variables, excluding the missing transverse energy. The systematic variations propagated to the remaining variables are thus correlated via a hidden variable, representing a more complex scenario than the simple example in section 4. We split the dataset using one third for training and validation of the NN, and two thirds for the results presented in this paper. The NN architecture and the training procedure are the same as implemented for the simple example in section 4, with the difference that we standardize the input ranges by subtracting the mean and dividing by the standard deviation of each input.
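The standardization of the inputs can be sketched as follows in Python. The sketch assumes that the mean and standard deviation are determined once on the training sample and then reused for every other sample, and it uses the population standard deviation. Both choices, the function names and the example values are assumptions and not taken from the paper.

```python
import statistics

def fit_standardization(training_values):
    """Mean and (population) standard deviation of one input variable."""
    mean = statistics.mean(training_values)
    std = statistics.pstdev(training_values)
    return mean, std

def standardize(values, mean, std):
    """Shift to zero mean and scale to unit standard deviation."""
    return [(v - mean) / std for v in values]

# Hypothetical values of one input variable in the training sample.
training_values = [20.0, 35.0, 50.0, 65.0, 80.0]
mean, std = fit_standardization(training_values)
standardized = standardize(training_values, mean, std)

# The same transformation is reused for any other sample.
other_sample = standardize([50.0, 95.0], mean, std)
```

Reusing the parameters from the training sample keeps the NN inputs on the same scale for all datasets passed through the network.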
An (asymptotically) optimal result as derived for the previous example is not available since the likelihood in the input space is not known. Instead, we use the training on the cross-entropy loss as reference. Using the variance of the signal strength as training objective, but without implementing the systematic variations of the input distributions in the loss function, the result for the signal strength shows a similar uncertainty compared to this reference. However, using the full likelihood from equation 3 as training objective, the inclusion of the systematic variations yields an improvement in the uncertainty of the signal strength compared to the training on the cross-entropy loss. The histograms and profiles of the likelihood used for extracting the results are shown in figures 14 to 14. For the assessment of the distributions of the NN output, it should be noted that, in contrast to the training based on the cross-entropy loss, the training based on the variance gives no preference for signal (background) events to obtain values close to 1 (0). Similar to the result from the simple example in section 4, the profiles of the likelihood for all scenarios show that the training on the variance removes the dependence on the systematic uncertainty, yielding a smaller variance of the signal strength. On the other hand, the training on the cross entropy best optimizes the estimate of the signal strength in the absence of systematic uncertainties, as expected from our previous discussion. With the proposed strategy, the NN function learns to decorrelate against the systematic uncertainty, visible in the correlation of the signal strength with the parameter controlling the systematic variation, which drops when moving from the training on the cross entropy to the training on the variance of the parameter of interest, based on the full likelihood information as given in equation 3.
To improve the estimate of the signal strength for the approach with the NN trained on the cross-entropy loss, a possible strategy could be to increase the number of histogram bins to better exploit the separation between the signal and background processes. Figure 15 shows the development of the performance with the number of bins for the training on the cross-entropy loss and the training on the likelihood via the variance of the signal strength. The training on the cross-entropy loss results in an estimate of the signal strength that is correlated with the nuisance parameter and whose uncertainty falls with the number of bins, with a difference between the results obtained with only the statistical uncertainties and with both the statistical and systematic uncertainties taken into account in the statistical inference. In contrast, the strategy with the NN trained on the variance shows a reduction of the correlation between the signal strength and the nuisance parameter when moving from two to eight bins in the input histogram for the statistical inference. The estimate remains robust against the systematic variation for all tested configurations, yielding a smaller variance for the estimate of the signal strength compared to the training on the cross-entropy loss. Including the systematic uncertainty in the inference, the comparison between the training based on the variance and the training based on the cross entropy shows a variance of the signal strength estimate improved by 0.07 on average, and this improvement is stable across the tested configurations.
It should be noted that in practice the granularity of the binning is limited by the statistics of data and the simulation. Limited statistical precision in the simulation is usually taken into account by introducing dedicated systematic uncertainties in the statistical model that typically degrade the performance of the analysis for a large number of bins.
6 Summary
We have presented a novel approach to optimize statistical inference in the presence of systematic uncertainties when using dimensionality reduction of the dataset and likelihoods based on Poisson statistics. Neural networks and the differentiable approximation for the gradient of a histogram enable us to optimize directly the variance of the estimate of the parameters of interest, taking into account the nuisance parameters that represent the systematic uncertainties of the measurement. The proposed method yields an improved performance for data analysis influenced by systematic uncertainties in comparison to conventional strategies using classification-based objectives for the dimensionality reduction. The improvements are discussed using a simple example based on pseudo-experiments with a known likelihood in the input space, and we show that the technique is able to perform a statistical inference close to optimal by leveraging the given information about the systematic uncertainties. The applicability of the method to more complex analyses is demonstrated with an example typical for data analyses in high-energy particle physics.
Acknowledgments
We thank Lorenzo Moneta and Andrew Gilbert for helpful discussions and feedback, which greatly improved the manuscript.
References
 (1) Cowan, G., Cranmer, K., Gross, E., Vitells, O.: Asymptotic formulae for likelihood-based tests of new physics. Eur. Phys. J. C 71(2) (2011) 1554
 (2) The ATLAS and CMS collaborations: Procedure for the LHC Higgs boson search combination in summer 2011. Technical report, ATL-PHYS-PUB-2011-011, CMS NOTE 2011/005 (2011)
 (3) The CMS collaboration: Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC. Phys. Lett. B 716(1) (2012) 30
 (4) The ATLAS collaboration: Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC. Phys. Lett. B 716(1) (2012) 1
 (5) Wunsch, S., Jörger, S., Wolf, R., Quast, G.: Reducing the dependence of the neural network function to systematic uncertainties in the input space. Computing and Software for Big Science 4(1) (Feb 2020)
 (6) Conway, J.S.: Incorporating nuisance parameters in likelihoods for multisource spectra (2011)
 (7) Fisher, R.A.: Theory of statistical estimation. Mathematical Proceedings of the Cambridge Philosophical Society 22(5) (1925) 700–725
 (8) Cramér, H.: Mathematical methods of statistics. Volume 9. Princeton university press (1999)
 (9) Rao, C.R.: Information and the accuracy attainable in the estimation of statistical parameters. In: Breakthroughs in statistics. Springer (1992) 235–247
 (10) De Castro, P., Dorigo, T.: INFERNO: Inference-Aware Neural Optimisation. Comput. Phys. Commun. 244 (2019) 170–179
 (11) Charnock, T., Lavaux, G., Wandelt, B.D.: Automatic physical inference with information maximizing neural networks. Phys. Rev. D97(8) (2018) 083004
 (12) Louppe, G., et al.: Learning to pivot with adversarial networks. In: Advances in Neural Information Processing Systems. (2017) 982

 (13) Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics. (2010) 249–256
 (14) Glorot, X., et al.: Deep sparse rectifier neural networks. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics. (2011) 315–323
 (15) Abadi, M., Agarwal, A., Barham, P., et al.: TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)
 (16) Dillon, J.V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., Patton, B., Alemi, A., Hoffman, M.D., Saurous, R.A.: Tensorflow distributions. CoRR abs/1711.10604 (2017)
 (17) Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
 (18) James, F.: Statistical methods in experimental physics. World Scientific Publishing Company (2006)
 (19) Antcheva, I., Ballintijn, M., Bellenot, B., et al.: ROOT - A C++ framework for petabyte data storage, statistical analysis and visualization. Computer Physics Communications 180(12) (2009) 2499–2512
 (20) Moneta, L., Belasco, K., Cranmer, K.S., Kreiss, S., Lazzaro, A., Piparo, D., Schott, G., Verkerke, W., Wolf, M.: The RooStats project. In: 13 International Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT2010), SISSA (2010) PoS(ACAT2010)057.
 (21) Verkerke, W., Kirkby, D.: The RooFit toolkit for data modeling (2003)
 (22) The CMS collaboration: Search for ttH production in the H → bb decay channel with leptonic tt decays in proton-proton collisions at √s = 13 TeV. JHEP 03 (2019) 026
 (23) The CMS collaboration: Measurement of Higgs boson production and decay to the ττ final state. (2019)
 (24) Adam-Bourdarios, C., et al.: Learning to discover: the Higgs boson machine learning challenge https://higgsml.lal.in2p3.fr/documentation/.