The job of the muon ID algorithm is to select the most muon-like track from all of the reconstructed tracks within a given slice. Since νµ CC event selection is handled by the method described in section 8.1.2, the tracks selected by the muon ID from this sample of events are assumed to be the primary muons produced in the νµ CC interactions. To select the most muon-like track in a slice, I compute a “muon score” for each track in the slice and take the track with the highest score. This score is computed using a k-nearest-neighbor (kNN) classification algorithm implemented with the TMVA package [79] in ROOT [31, 32]. I chose this method because it is simple to implement.
A kNN classification algorithm works by building up a phase space density-map for both “signal” and “background” events, distributed according to some set of input variables. An example of such a distribution is shown in figure 7.9. A collection of signal events and background events must be fed to the algorithm to build the initial version of this density-map (a process referred to as “training” the kNN). Once the training is complete, an unknown event can be classified by searching the density-map for its k nearest neighbors. The score for that unclassified event is the relative probability that the event is of “signal type,” calculated as
\[
P_s = \frac{\sum_{n=1}^{N_s} W_s^{(n)}}{\sum_{n=1}^{N_s} W_s^{(n)} + \sum_{m=1}^{N_b} W_b^{(m)}}
\]
where Ns (Nb) is the number of signal (background) events among the nearest neighbors, Ws and Wb are weights that can be set for each individual event, and Ns + Nb = k. In the simple case of a training sample composed of an equal number of signal and background
events, all with weight W = 1,
\[
P_s = \frac{N_s}{N_s + N_b} = \frac{N_s}{k}. \tag{7.5}
\]
A minimum value of Ps can then be chosen to select a sample of signal events that meets some desired efficiency and purity.
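The score and the cut described above can be illustrated with a minimal Python sketch. This is not the TMVA implementation: the function names, the `(is_signal, weight)` tuple layout, and the toy numbers are all assumptions made for illustration. The first function computes the weighted signal fraction among the neighbors, and the second computes the efficiency and purity obtained by requiring a minimum value of Ps.

```python
from typing import Sequence, Tuple


def signal_probability(neighbors: Sequence[Tuple[bool, float]]) -> float:
    """Weighted signal fraction among the k nearest neighbors.

    Each neighbor is a hypothetical (is_signal, weight) pair.
    """
    signal_sum = sum(w for is_sig, w in neighbors if is_sig)
    total = sum(w for _, w in neighbors)
    if total <= 0:
        raise ValueError("neighbor weights must sum to a positive value")
    return signal_sum / total


def efficiency_and_purity(
    scored: Sequence[Tuple[float, bool]], cut: float
) -> Tuple[float, float]:
    """Efficiency and purity of the selection score >= cut.

    `scored` holds hypothetical (score, is_signal) pairs.
    """
    n_signal = sum(1 for _, is_sig in scored if is_sig)
    selected = [(p, is_sig) for p, is_sig in scored if p >= cut]
    n_selected_signal = sum(1 for _, is_sig in selected if is_sig)
    efficiency = n_selected_signal / n_signal if n_signal else 0.0
    purity = n_selected_signal / len(selected) if selected else 0.0
    return efficiency, purity


# Toy neighborhood with unit weights: 3 signal and 1 background neighbor,
# so the score reduces to Ns / k.
unit_weight_neighbors = [(True, 1.0)] * 3 + [(False, 1.0)]
print(signal_probability(unit_weight_neighbors))
```

With unit weights the first function reduces to Ns/k, as in equation 7.5. Scanning `cut` over a range of values and recording the two returned numbers is one simple way to pick a working point.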
Given that the input variables are likely to have different units, the variable with the widest distribution will dominate a normal Euclidean metric used to compute distances within the density-map. Therefore, a weighted Euclidean metric is used when determining the nearest neighbors to an unclassified event. This weighted distance r between an event in the density-map x and the unclassified event y is given by
\[
r^2 = \sum_{i=1}^{d} \frac{(x_i - y_i)^2}{w_i^2}
\]
where the index i runs over the d phase space variables, and wi is the width of the xi distribution for the combined sample of signal and background events. So that it is not distorted by events in the tails of the distribution, this width wi can be chosen to contain only a certain percentage of the events around the central region of the distribution. Overall, the kNN classification method works well for situations in which the borders between signal and background are vague or irregularly shaped.
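The effect of the width scaling can be seen in a short Python sketch. It is an illustration under stated assumptions, not TMVA's code: `central_width`, `scaled_distance_sq`, and `k_nearest` are hypothetical helpers, the quantile rule is a simple nearest-rank choice, and the three points are made up. The second variable is given a range roughly 100 times wider than the first, so it dominates the ranking unless each difference is divided by its width.

```python
from typing import List, Sequence


def central_width(values: Sequence[float], fraction: float = 0.8) -> float:
    """Width of the interval holding the central `fraction` of the values.

    Uses a simple nearest-rank quantile (an assumption for this sketch).
    """
    if not values or not 0.0 < fraction <= 1.0:
        raise ValueError("need at least one value and 0 < fraction <= 1")
    ordered = sorted(values)
    tail = (1.0 - fraction) / 2.0
    lo_index = round(tail * (len(ordered) - 1))
    hi_index = round((1.0 - tail) * (len(ordered) - 1))
    return ordered[hi_index] - ordered[lo_index]


def scaled_distance_sq(
    x: Sequence[float], y: Sequence[float], widths: Sequence[float]
) -> float:
    """Squared distance with each coordinate difference divided by its width."""
    return sum(((xi - yi) / wi) ** 2 for xi, yi, wi in zip(x, y, widths))


def k_nearest(
    points: Sequence[Sequence[float]],
    query: Sequence[float],
    widths: Sequence[float],
    k: int,
) -> List[int]:
    """Indices of the k points closest to `query` in the scaled metric."""
    order = sorted(
        range(len(points)),
        key=lambda n: scaled_distance_sq(points[n], query, widths),
    )
    return order[:k]


# Made-up training points (x1, x2); x2 spans a much wider range than x1.
points = [(0.0, 100.0), (2.0, 0.0), (0.5, 10.0)]
query = (0.0, 0.0)
print(k_nearest(points, query, widths=(1.0, 100.0), k=2))  # scaled metric
print(k_nearest(points, query, widths=(1.0, 1.0), k=2))    # plain Euclidean
```

In this toy case the two metrics pick different neighbors, because the unscaled distance is driven almost entirely by the second variable.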
Figure 7.9: An example of the phase space density-map for a kNN with 2 input variables, x1 and x2 (taken from [79]). The filled circles are the signal events and the open circles are the background events. The star represents an unclassified event with the “nearest neighborhood” drawn in a circle around it. In this example, the number of signal and background events within the circle is roughly the same. Therefore Ps ≈ 0.5.
I created the BPF muon ID kNN using version 5.34/25 of ROOT [31, 32] with version 4.2.0 of the TMVA toolkit [79]. The signal and background samples were tracks reconstructed from far detector events simulated with GENIE. The events (slices) from which the training samples were picked were required to have no cell hits within 50 cm of a wall, at least 10 hits in each view, and to be true νµ CC events. Signal and background tracks from these events were divided according to whether or not the particle that contributed the most energy to the track was a muon (µ±). The number of signal (background) tracks passing all cuts in the training sample was roughly 410,000 (460,000). The default mode for the kNN classification scheme (which was used for this training) is to scale the background events so that the total sum of the event weights is the same for signal and background. Each event in the background sample was therefore given a weight of Wb = 0.893. The distribution width (wi) of each input variable was set to the width of the central 80% of the values for that variable.
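The background weight follows from requiring equal total weight in the two samples. The short Python sketch below shows the arithmetic, assuming unit weights for signal tracks (an assumption, as is the helper name). It uses the rounded track counts quoted above, so it gives about 0.891 instead of exactly 0.893; the small difference is presumably due to the rounding of the quoted counts.

```python
def class_balancing_weight(n_signal: int, n_background: int) -> float:
    """Per-event background weight that equalizes the summed weights.

    Assumes every signal event carries unit weight.
    """
    if n_background <= 0:
        raise ValueError("background sample must not be empty")
    return n_signal / n_background


# Rounded counts from the text, not the exact training sample sizes.
w_b = class_balancing_weight(410_000, 460_000)
print(round(w_b, 3))
```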
To choose the number of neighbors k, I trained several kNNs using k = 20, 40, and 80. In each case, I examined the percentage of slices in which the track with the best muon ID value was a true muon, both for all true νµ CC events and for true νµ CC events with a best muon ID value > 0.4. For both of these samples, the percentages showed essentially no variation across the different values of k. I chose to use k = 80 since the percentages were slightly higher for this value in the sample of true νµ CC events.
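The comparison metric used here can be sketched in a few lines of Python. The data layout (a slice as a list of `(score, is_true_muon)` pairs) and the function name are assumptions for illustration, and the toy slices are made up. The function picks the highest-scoring track in each slice, optionally requires that score to exceed a threshold, and reports how often the chosen track is a true muon.

```python
from typing import Optional, Sequence, Tuple

Track = Tuple[float, bool]  # (muon ID score, is a true muon)


def best_track_muon_fraction(
    slices: Sequence[Sequence[Track]], min_score: Optional[float] = None
) -> float:
    """Fraction of slices whose highest-scoring track is a true muon.

    Slices with no tracks are skipped. If `min_score` is given, only slices
    whose best score exceeds it are counted.
    """
    n_used = 0
    n_correct = 0
    for tracks in slices:
        if not tracks:
            continue
        best_score, best_is_muon = max(tracks, key=lambda t: t[0])
        if min_score is not None and best_score <= min_score:
            continue
        n_used += 1
        n_correct += int(best_is_muon)
    return n_correct / n_used if n_used else 0.0


# Made-up slices for illustration only.
toy_slices = [
    [(0.9, True), (0.2, False)],
    [(0.3, False), (0.1, True)],
    [(0.6, True)],
]
print(best_track_muon_fraction(toy_slices))
print(best_track_muon_fraction(toy_slices, min_score=0.4))
```

Evaluating this quantity on the same test slices for classifiers trained with each candidate k gives the kind of side-by-side comparison described above.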
The distribution of muon ID values for signal and background tracks is shown in figure 7.10. The data used to generate both plots in this figure was taken from a set of far detector events simulated with GENIE, separate from the training sample, but selected by the same criteria used to select the training events. The second plot in figure 7.10 shows the same distribution for background tracks only, broken down by particle type. The background track sample was approximately 40% protons, 40% photons, and 20% π±.
Figure 7.10: Top: Distribution of BPF muon ID values for signal and background tracks (normalized to unit area). Bottom: Distribution of BPF muon ID values for background tracks only, broken down by the particle that contributed the most energy to the track.
Since I am using this muon ID value to select the most “muon-like” track from a set of events selected as νµ CC, the best performance metric for the BPF muon ID value is how often it correctly picks the muon in this scenario. Table 7.1 shows the percentage breakdowns by true particle type for the most muon-like track in a slice selected by the BPF muon ID. The breakdowns are given for νµ CC events selected by truth, νµ CC events selected by the ReMId algorithm discussed in section 8.1.2, and true νµ CC events selected by ReMId (included for comparison since the ReMId-selected sample does include a small number of NC events). ReMId also uses a kNN, with input variables from a different tracking algorithm, and is used to select events for the standard NOvA νµ CC disappearance analysis. I use ReMId to select the events for my νµ CC disappearance analysis as well. For the ReMId-selected sample, the BPF muon ID correctly selects a muon track over 96% of the time. The misidentified non-muon tracks are roughly an equal mixture of protons, photons, and π±.
Particle    True νµ CC    ReMId selected    ReMId selected and true νµ CC
µ±          87.6          96.3              97.9
γ           4.3           1.1               0.7
p           4.9           0.9               0.6
π±          3.0           1.6               0.8
Table 7.1: Percentages for the particle that contributed the most energy to the track selected by the BPF muon ID to be the most “muon-like” track within a slice. Roughly one third of all events are NC. The ReMId selection algorithm is discussed in section 8.1.2.