{ "cells": [ { "cell_type": "markdown", "id": "e6260cf0", "metadata": {}, "source": [ "# **BDTs at work: the $\\Omega$ analysis**\n", "\n", "The goal of this tutorial is to provide an example of binary classification with machine learning techniques applied to an ALICE analysis. This tutorial is based on the measurement of the invariant mass of the $\\mathrm{\\Omega}$, through its cascade decay channel $\\mathrm{\\Omega^-} \\rightarrow \\mathrm{\\Lambda} + K^- \\rightarrow p + \\pi^- + K^-$. We will need two samples: \n", "- Real data: Pb--Pb collisions at $\\sqrt{s_{\\mathrm{NN}}} = 5.02$ TeV (LHC18qr, subsample)\n", "- Anchored MC production: LHC21l5\n", "\n", "At the end of the tutorial we will be able to see the peak of the $\\mathrm{\\Omega}$!\n" ] }, { "cell_type": "markdown", "id": "8f35bcb5", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "0bb651de", "metadata": {}, "source": [ "#### First, we need some libraries ####" ] }, { "cell_type": "code", "execution_count": null, "id": "61974b28", "metadata": {}, "outputs": [], "source": [ "### standard scientific Python libraries\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "\n", "import uproot ### to read, convert, inspect ROOT TTrees\n" ] }, { "cell_type": "markdown", "id": "dd7657f9", "metadata": {}, "source": [ "One tip before starting: to access the documentation associated with each function we are going to call, just press Shift+Tab after the opening parenthesis of the function" ] }, { "cell_type": "markdown", "id": "5ec88a72", "metadata": {}, "source": [ "## Reading trees with uproot, handling them with pandas" ] }, { "cell_type": "markdown", "id": "4090cdb0", "metadata": {}, "source": [ "Uproot (https://github.com/scikit-hep/uproot4) is a Python package that provides tools for reading/writing ROOT files using Python and NumPy (it does not depend on ROOT) and is primarily intended to stream data into machine learning libraries in Python." ] }, { "cell_type": "code", "execution_count": null, "id": "cf8f4fa1", "metadata": {}, "outputs": [], "source": [ "## first we have to download the trees\n", "\n", "!curl -L https://cernbox.cern.ch/s/V05rgkoJfGe8x7K/download --output AnalysisResults-mc_reduced.root\n", "!curl -L https://cernbox.cern.ch/s/ReP4m9tDJ6UfivD/download --output AnalysisResults_reduced.root" ] }, { "cell_type": "code", "execution_count": null, "id": "6d4b8dfa", "metadata": {}, "outputs": [], "source": [ "mc_file = uproot.open(\"AnalysisResults-mc_reduced.root\")" ] }, { "cell_type": "code", "execution_count": null, "id": "a309d9cf", "metadata": {}, "outputs": [], "source": [ "mc_file.keys()" ] }, { "cell_type": "code", "execution_count": null, "id": "5e77483d", "metadata": {}, "outputs": [], "source": [ "mc_file[\"XiOmegaTree\"].keys()" ] }, { "cell_type": "code", "execution_count": null, "id": "633858e6", "metadata": {}, "outputs": [], "source": [ "## convert the tree into a dictionary of numpy arrays" ] }, { "cell_type": "code", "execution_count": null, "id": "df46a49b", "metadata": {}, "outputs": [], "source": [ "numpy_mc = mc_file[\"XiOmegaTree\"].arrays(library=\"np\")" ] }, { "cell_type": "code", "execution_count": null, "id": "7cee450e", "metadata": {}, "outputs": [], "source": [ "## print the array" ] },
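{ "cell_type": "markdown", "id": "aa10bb20", "metadata": {}, "source": [ "A possible way to inspect the result (a minimal sketch: it assumes the tree contains a pt branch, as used later in the tutorial):" ] }, { "cell_type": "code", "execution_count": null, "id": "aa10bb21", "metadata": {}, "outputs": [], "source": [ "## arrays(library='np') returns a plain dictionary: branch name -> numpy array\n", "print(list(numpy_mc.keys())[:5])\n", "print(numpy_mc['pt'][:10])" ] },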
{ "cell_type": "markdown", "id": "75d53747", "metadata": {}, "source": [ "Not easy to handle! Better to use Pandas. Pandas is a library that provides data structures and analysis tools for Python. The two primary data structures of pandas are Series (1-dimensional) and DataFrame (2-dimensional), and we will work with both.\n", "- Series are 1-dimensional ndarrays with axis labels.\n", "- DataFrames are 2-dimensional tabular data structures with labeled axes (rows and columns).\n", "\n", "For more details: https://pandas.pydata.org/pandas-docs/stable/" ] }, { "cell_type": "code", "execution_count": null, "id": "458120eb", "metadata": {}, "outputs": [], "source": [ "## same exercise as before, but use pandas this time" ] }, { "cell_type": "code", "execution_count": null, "id": "931a5e2e", "metadata": {}, "outputs": [], "source": [ "## inspect the dataframe, use the .head() function" ] }, { "cell_type": "code", "execution_count": null, "id": "0d9a12c2", "metadata": {}, "outputs": [], "source": [ "pd_mc.columns ## the suffix MC indicates the generated quantity" ] }, { "cell_type": "markdown", "id": "71773ecb", "metadata": {}, "source": [ "#### Query and eval operations to apply selections and generate new columns" ] }, { "cell_type": "code", "execution_count": null, "id": "23efae44", "metadata": {}, "outputs": [], "source": [ "## let's focus only on Omegas\n", "pd_mc.query(\"isOmega and abs(pdg)==3334 and flag==1\", inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "a57c33d4", "metadata": { "scrolled": true }, "outputs": [], "source": [ "## have a look at the generated momentum and decay-length distributions\n", "plt.hist(pd_mc[\"ctMC\"], bins=1000, range=[1,15]);\n", "plt.yscale(\"log\")\n", "plt.xlabel(r\"$ct$ (cm)\");\n", "plt.ylabel(\"Counts\");" ] }, { "cell_type": "code", "execution_count": null, "id": "d50b8df1", "metadata": {}, "outputs": [], "source": [ "## select only the MC particles that are reconstructed, but keep the full df\n", "df_reco = pd_mc.query(\"isReconstructed==True\")" ] }, { "cell_type": "code", "execution_count": null, "id": "5f45447d", "metadata": {}, "outputs": [], "source": [ "## compute the relative momentum resolution using eval" ] }, { "cell_type": "code", "execution_count": null, "id": "bd4c8f50", "metadata": {}, "outputs": [], "source": [ "## compute the efficiency vs momentum using numpy histograms" ] },
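{ "cell_type": "markdown", "id": "aa20bb30", "metadata": {}, "source": [ "One possible solution for the two exercises above (a sketch: it assumes the reconstructed and generated transverse momenta are stored in the pt and ptMC columns):" ] }, { "cell_type": "code", "execution_count": null, "id": "aa20bb31", "metadata": {}, "outputs": [], "source": [ "## relative transverse-momentum resolution (eval adds a new column to a copy of the dataframe)\n", "df_reco = df_reco.eval('ptRes = (pt - ptMC) / ptMC')\n", "plt.hist(df_reco['ptRes'], bins=100, range=[-0.2, 0.2]);\n", "plt.xlabel(r'$(p_{T} - p_{T}^{MC}) / p_{T}^{MC}$');\n", "plt.ylabel('Counts');\n", "plt.show()\n", "\n", "## efficiency vs generated pT: ratio between reconstructed and generated candidates\n", "pt_bins = np.linspace(0, 10, 21)\n", "gen_counts, _ = np.histogram(pd_mc['ptMC'], bins=pt_bins)\n", "reco_counts, _ = np.histogram(df_reco['ptMC'], bins=pt_bins)\n", "gen_counts = gen_counts.astype(float)\n", "efficiency = np.divide(reco_counts, gen_counts, out=np.zeros_like(gen_counts), where=gen_counts > 0)\n", "pt_centres = 0.5 * (pt_bins[1:] + pt_bins[:-1])\n", "plt.plot(pt_centres, efficiency, 'o');\n", "plt.xlabel(r'$p_{T}^{MC}$ (GeV/$c$)');\n", "plt.ylabel('Efficiency');" ] },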
{ "cell_type": "markdown", "id": "faa35ff7", "metadata": {}, "source": [ "## Machine Learning" ] }, { "cell_type": "markdown", "id": "3360d2a2", "metadata": {}, "source": [ "Supervised learning is a subcategory of ML well known in HEP. Supervised learning\n", "algorithms are employed for discriminating between two or more classes, signal and\n", "background in our case, starting from a set of examples called the training set. Each\n", "element of the training sample, an $\\Omega$ candidate in this tutorial, has a label\n", "containing its class (signal / background), which is known a priori: the training\n", "process fixes the internal parameters of the learning algorithm in order to maximize\n", "the separation power among the classes. The goal of the training is to teach the model a common pattern in the data that can be used to properly classify an independent sample, in our case the real data sample. The output of the supervised model, or score, is evaluated starting from the candidate properties, which are called features. The score is related to the probability of the candidate belonging to the different classes.\n", "\n", "\n", "In this tutorial Boosted Decision Trees (BDTs) will be used for tagging real $\\Omega$ candidates. The core of every BDT model is the decision tree (DT) algorithm. A\n", "DT is a flowchart-like binary structure where an internal node represents a feature (or attribute) of the\n", "candidate, each branch represents a decision rule, and each leaf node represents the\n", "outcome. The topmost node in a decision tree is known as the root node. The DT\n", "works by combining a sequence of simple binary tests (each branch of the tree) to\n", "classify a data point in terms of its features. Each test consists of a linear threshold\n", "applied to one of the features, which helps the model predict the class\n", "of every candidate. \n", "\n", "\n", "\n", "\n", "The training of a DT consists of the automatic procedure that builds the tree recursively starting from the training set. The main flaw of the DT is that it is prone to so-called overfitting: this means that the model is able to perfectly classify the training set if deep enough (the depth\n", "is defined as the length of the longest path from the root to a leaf), but it does not generalize well to new data. Overfitting occurs when the model memorizes the training\n", "set rather than learning a general pattern in the data. To overcome this problem,\n", "BDT algorithms combine numerous shallow trees, each using a subsample of the\n", "features. In particular, in the boosting procedure the DTs are constructed sequentially,\n", "each one compensating for the candidates misclassified by the previous trees. The\n", "resulting model, the BDT, maintains high performance on both the training and the\n", "test set.\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "e4d8f663", "metadata": {}, "source": [] }, { "cell_type": "markdown", "id": "8118424f", "metadata": {}, "source": [ "First we will import hipe4ml: https://github.com/hipe4ml/hipe4ml\n", "\n", "This is a package developed in ALICE containing useful methods and classes for dealing with ML analyses. Two main classes are implemented:\n", "- TreeHandler, wrapping uproot and pandas methods: it allows for the conversion and handling of the training samples\n", "- ModelHandler, a common interface for many ML methods" ] }, { "cell_type": "code", "execution_count": null, "id": "5689902b", "metadata": {}, "outputs": [], "source": [ "from hipe4ml.model_handler import ModelHandler\n", "from hipe4ml.tree_handler import TreeHandler\n", "from hipe4ml import analysis_utils\n", "from hipe4ml import plot_utils" ] }, { "cell_type": "code", "execution_count": null, "id": "ee025443", "metadata": {}, "outputs": [], "source": [ "### Read TTrees with hipe4ml" ] }, { "cell_type": "code", "execution_count": null, "id": "3040140d", "metadata": {}, "outputs": [], "source": [ "hdl_mc = TreeHandler(\"AnalysisResults-mc_reduced.root\", \"XiOmegaTree\")\n", "hdl_data = TreeHandler(\"AnalysisResults_reduced.root\", \"XiOmegaTree\")" ] }, { "cell_type": "code", "execution_count": null, "id": "2e991f67", "metadata": {}, "outputs": [], "source": [ "### select only reconstructed MC Omega candidates: they will be our signal sample for the training\n", "hdl_mc.apply_preselections(\"abs(pdg)==3334 and isOmega==1 and isReconstructed==1\") " ] }, { "cell_type": "code", "execution_count": null, "id": "066f3883", "metadata": {}, "outputs": [], "source": [ "hdl_mc.print_summary()" ] }, { "cell_type": "code", "execution_count": null, "id": "de14af1e", "metadata": {}, "outputs": [], "source": [ "### How to select the background? We can take it from the candidates available in\n", "### the sidebands of our real data sample! Use the range: mass < 1.660 or mass > 1.685" ] },
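{ "cell_type": "markdown", "id": "aa30bb40", "metadata": {}, "source": [ "One possible way to build the background sample (a minimal sketch: it assumes the get_subset method of the hipe4ml TreeHandler, which returns a new handler and leaves hdl_data untouched):" ] }, { "cell_type": "code", "execution_count": null, "id": "aa30bb41", "metadata": {}, "outputs": [], "source": [ "## background sample: candidates in the sidebands of the data invariant-mass distribution\n", "hdl_bkg = hdl_data.get_subset('mass < 1.660 or mass > 1.685')" ] },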
{ "cell_type": "code", "execution_count": null, "id": "8bf8febc", "metadata": {}, "outputs": [], "source": [ "print(len(hdl_bkg), len(hdl_mc))" ] }, { "cell_type": "markdown", "id": "7ebc9db7", "metadata": {}, "source": [ "The bkg data sample is much larger than the signal one. This does not represent a problem, as the BDT is not sensitive to the relative abundances of the classes. However, to speed up the training process, only 20% of the background will be used for the training. Use the shuffle_data_frame() function." ] },
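{ "cell_type": "markdown", "id": "aa40bb50", "metadata": {}, "source": [ "For example (a sketch: the frac argument of shuffle_data_frame is assumed to behave like the pandas sampling fraction):" ] }, { "cell_type": "code", "execution_count": null, "id": "aa40bb51", "metadata": {}, "outputs": [], "source": [ "## keep a random 20% of the background candidates to speed up the training\n", "hdl_bkg.shuffle_data_frame(frac=0.2)\n", "print(len(hdl_bkg), len(hdl_mc))" ] },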
{ "cell_type": "code", "execution_count": null, "id": "a687fd10", "metadata": {}, "outputs": [], "source": [ "## now we restrict the data sample to the signal region, removing the sidebands used for the background\n", "hdl_data.apply_preselections(\"mass > 1.660 and mass < 1.685\", inplace=True)" ] }, { "cell_type": "markdown", "id": "d4865b5b", "metadata": {}, "source": [ "Data prepared! Now, before the training, we need to visualize the feature properties.\n", "hipe4ml plot utilities allow for\n", "- comparing the features of different samples (plot_distr)\n", "- evaluating the correlations among the features (plot_corr)" ] }, { "cell_type": "code", "execution_count": null, "id": "b780d1a5", "metadata": {}, "outputs": [], "source": [ "cols_to_be_compared = ['pt', 'ct', 'mass', 'dcaBachPV', 'dcaV0PV', 'dcaV0piPV', 'dcaV0prPV', 'dcaV0tracks', \n", " 'dcaBachV0','cosPA', 'cosPAV0', 'tpcNsigmaV0Pr']" ] }, { "cell_type": "code", "execution_count": null, "id": "7ff1ce8e", "metadata": { "scrolled": true }, "outputs": [], "source": [ "## some matplotlib tuning is needed to display all the features\n", "plot_utils.plot_distr([hdl_mc, hdl_bkg], cols_to_be_compared, \n", " bins=50, labels=['Signal', 'Background'],\n", " log=True, density=True, figsize=(12, 12), alpha=0.3, grid=False);" ] }, { "cell_type": "markdown", "id": "fb517d0d", "metadata": {}, "source": [ "#### Some questions...\n", "- Which variables do we expect to be relevant for the training? Can we use all of them? Pros and cons?\n", "- Why do we see some spikes in the DCA distributions?" ] }, { "cell_type": "markdown", "id": "537801fc", "metadata": {}, "source": [ "And correlations are important as well: the model can potentially exploit them to perform a better classification. Moreover, there could be some potentially dangerous correlations, such as those with the invariant mass of the particle of interest" ] }, { "cell_type": "code", "execution_count": null, "id": "aa9dafe7", "metadata": {}, "outputs": [], "source": [ "plot_utils.plot_corr([hdl_mc, hdl_bkg], cols_to_be_compared, labels=['Signal', 'Background']);" ] }, { "cell_type": "markdown", "id": "6246caf8", "metadata": {}, "source": [ "Considerations? Doubts? Let's now define the training columns and build the training sample" ] }, { "cell_type": "code", "execution_count": null, "id": "7021f9d4", "metadata": {}, "outputs": [], "source": [ "training_cols = ['dcaBachPV', 'dcaV0PV', 'dcaV0piPV', 'dcaV0prPV', 'dcaV0tracks', \n", " 'dcaBachV0','cosPA', 'cosPAV0', 'tpcNsigmaV0Pr']" ] }, { "cell_type": "markdown", "id": "7b5a2f3b", "metadata": {}, "source": [ "Now we split our data into a training and a test set. To do this, we use the train_test_generator function from hipe4ml" ] }, { "cell_type": "code", "execution_count": null, "id": "3f9e203b", "metadata": {}, "outputs": [], "source": [ "train_test_data = analysis_utils.train_test_generator([hdl_bkg, hdl_mc], [0, 1], test_size=0.5, random_state=42)" ] }, { "cell_type": "code", "execution_count": null, "id": "96b0f709", "metadata": {}, "outputs": [], "source": [ "### let's print the train_test_data variable and see what we have" ] }, { "cell_type": "markdown", "id": "d381a426", "metadata": {}, "source": [ "## Training and testing a BDT" ] }, { "cell_type": "markdown", "id": "19dce8e0", "metadata": {}, "source": [ "We will use the BDT of XGBoost (https://github.com/dmlc/xgboost): boosting is implemented with a gradient descent method. It features a few hyperparameters that can be tuned to improve the performance and reduce overfitting, even if the algorithm works smoothly out of the box.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "3c35b321", "metadata": {}, "outputs": [], "source": [ "import xgboost as xgb" ] }, { "cell_type": "code", "execution_count": null, "id": "8dde06f5", "metadata": {}, "outputs": [], "source": [ "xgb_model = xgb.XGBClassifier()" ] }, { "cell_type": "code", "execution_count": null, "id": "66d9ddfc", "metadata": {}, "outputs": [], "source": [ "### wrap the classifier into the ModelHandler" ] }, { "cell_type": "code", "execution_count": null, "id": "3906a9b1", "metadata": {}, "outputs": [], "source": [ "model_hdl = ModelHandler(xgb_model, training_cols)" ] }, { "cell_type": "code", "execution_count": null, "id": "fd9820af", "metadata": { "scrolled": true }, "outputs": [], "source": [ "model_hdl.fit(train_test_data[0], train_test_data[1])" ] }, { "cell_type": "markdown", "id": "80debc44", "metadata": {}, "source": [ "Training done! Let's apply the model to the test sample" ] }, { "cell_type": "code", "execution_count": null, "id": "47a1608c", "metadata": {}, "outputs": [], "source": [ "score_test = model_hdl.predict(train_test_data[2])" ] }, { "cell_type": "code", "execution_count": null, "id": "1ff0d942", "metadata": {}, "outputs": [], "source": [ "#### plot the score distribution" ] }, { "cell_type": "markdown", "id": "34eeed48", "metadata": {}, "source": [ "Two peaks are clearly distinguishable: do they correspond to the signal and the background? Let's plot the two distributions separately (with a legend)" ] }, { "cell_type": "code", "execution_count": null, "id": "546d591c", "metadata": {}, "outputs": [], "source": [ "\n" ] },
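{ "cell_type": "markdown", "id": "aa50bb60", "metadata": {}, "source": [ "A possible implementation (a sketch: the test-set labels are taken from train_test_data[3], where 1 marks the signal and 0 the background):" ] }, { "cell_type": "code", "execution_count": null, "id": "aa50bb61", "metadata": {}, "outputs": [], "source": [ "labels_test = np.array(train_test_data[3])\n", "plt.hist(score_test[labels_test == 0], bins=100, alpha=0.5, density=True, label='Background');\n", "plt.hist(score_test[labels_test == 1], bins=100, alpha=0.5, density=True, label='Signal');\n", "plt.yscale('log')\n", "plt.xlabel('BDT score');\n", "plt.ylabel('Normalised counts');\n", "plt.legend();" ] },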
{ "cell_type": "markdown", "id": "30c7d43f", "metadata": {}, "source": [ "Well separated peaks on the test set! This is what we want.\n", "Now we can evaluate the performance on the test set: we can use the ROC curve, which is built by plotting\n", "the true positive rate (TPR) against the false positive rate (FPR) as a function of the\n", "score threshold, where TPR and FPR are defined as:\n", "\n", "\n", "$TPR=\\frac{\\sum TP}{\\sum TP + \\sum FN} \\hspace{2cm} FPR=\\frac{\\sum FP}{\\sum FP + \\sum TN} $\n", "\n", "and in our case represent the signal selection efficiency and the fraction of background\n", "surviving the selection, respectively, as a function of the score. The most common way to evaluate the performance of a BDT is to\n", "compute the area under the ROC curve, called AUC: a perfect classifier will have\n", "a ROC AUC equal to 1, whereas a random classifier will have a ROC AUC equal\n", "to 0.5.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "4942d0f7", "metadata": {}, "outputs": [], "source": [ "plot_utils.plot_roc(train_test_data[3], score_test);" ] }, { "cell_type": "markdown", "id": "8a685817", "metadata": {}, "source": [ "Repeat this exercise with the training set: what do you get?" ] }, { "cell_type": "markdown", "id": "a9fd8516", "metadata": {}, "source": [ "The training ROC AUC is slightly higher than the test-set one. This is a systematic behaviour due to a small amount of overfitting. We can see it also by plotting the BDT output distributions for the training and test sets" ] }, { "cell_type": "code", "execution_count": null, "id": "5c99bd1f", "metadata": {}, "outputs": [], "source": [ "plot_utils.plot_output_train_test(model_hdl, train_test_data, density=True, bins=100, logscale=True);" ] }, { "cell_type": "markdown", "id": "87a3ee67", "metadata": {}, "source": [ "Now, before applying the BDT to data we can have a look at which variables are relevant for the training. We will use the feature importance implemented in the SHAP library (https://github.com/slundberg/shap). In the context of machine learning, the Shapley value is used to evaluate the contribution of each feature to the model output, and it is calculated by averaging the marginal contributions of each feature to the model output. The marginal contribution of a feature is the difference in the model output when the feature is present or absent. The variables that are\n", "more important for the model are those that have a higher marginal contribution, and consequently higher Shapley values.\n", "\n", "In hipe4ml the function plot_feature_imp implements the algorithm: try to use it!" ] }, { "cell_type": "code", "execution_count": null, "id": "2fdfad72", "metadata": {}, "outputs": [], "source": [ "plot_utils.plot_feature_imp(train_test_data[2], train_test_data[3], model_hdl) " ] }, { "cell_type": "markdown", "id": "7581cb21", "metadata": {}, "source": [ "Two plots are produced: how should we interpret them?" ] }, { "cell_type": "markdown", "id": "f0835461", "metadata": {}, "source": [ "**Bonus**: repeat the training using a different model and compare its performance with the XGB BDT. All the sklearn and keras (NN) models can be used to feed the ModelHandler" ] }, { "cell_type": "markdown", "id": "0b1a1bf2", "metadata": {}, "source": [ "### Applying the BDT" ] }, { "cell_type": "markdown", "id": "9c4b23b8", "metadata": {}, "source": [ "Now that the model has been tested, we can use it to classify the real data sample. Try to look at the invariant mass distribution as a function of the selection on the BDT score" ] }, { "cell_type": "code", "execution_count": null, "id": "0f4cb507", "metadata": {}, "outputs": [], "source": [ "hdl_data.apply_model_handler(model_hdl)" ] }, { "cell_type": "code", "execution_count": null, "id": "c8d0d998", "metadata": {}, "outputs": [], "source": [ "### very nice! But how to decide which threshold is the best one? Many methods can be used, but it is important to evaluate the selection efficiency\n", "### as a function of the score. Exercise: write a function for computing the BDT efficiency vs score. Would you compute it on the training or the test set?" ] },
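{ "cell_type": "markdown", "id": "aa60bb70", "metadata": {}, "source": [ "One possible implementation (a sketch, evaluated on the test set so that the result is not biased by overfitting):" ] }, { "cell_type": "code", "execution_count": null, "id": "aa60bb71", "metadata": {}, "outputs": [], "source": [ "def bdt_efficiency(labels, scores, threshold):\n", "    ## fraction of true signal candidates passing a given score threshold\n", "    labels = np.array(labels)\n", "    scores = np.array(scores)\n", "    return np.sum(scores[labels == 1] > threshold) / np.sum(labels == 1)\n", "\n", "thresholds = np.linspace(np.min(score_test), np.max(score_test), 100)\n", "efficiencies = [bdt_efficiency(train_test_data[3], score_test, thr) for thr in thresholds]\n", "plt.plot(thresholds, efficiencies);\n", "plt.xlabel('BDT score threshold');\n", "plt.ylabel('Signal efficiency');" ] },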
{ "cell_type": "code", "execution_count": null, "id": "89fd7004", "metadata": {}, "outputs": [], "source": [ "### hipe4ml has a function to do that! \n", "eff_array, score_array = analysis_utils.bdt_efficiency_array(train_test_data[3], score_test, 1000)" ] }, { "cell_type": "code", "execution_count": null, "id": "dfb2b291", "metadata": {}, "outputs": [], "source": [ "plot_utils.plot_bdt_eff(score_array, eff_array);" ] }, { "cell_type": "code", "execution_count": null, "id": "0e694087", "metadata": {}, "outputs": [], "source": [ "## Now we can try to fit the invariant mass spectrum" ] }, { "cell_type": "code", "execution_count": null, "id": "65b778aa", "metadata": {}, "outputs": [], "source": [ "from scipy.optimize import curve_fit\n", "from scipy import integrate\n", "\n", "def fit_invmass(df, fit_range=[1.660, 1.685]):\n", "\n", "    # histogram of the data\n", "    counts, bins = np.histogram(df, bins=40, range=fit_range)\n", "    bin_width = bins[1] - bins[0]\n", "\n", "    # define functions for fitting\n", "    def gaus_function(x, N, mu, sigma):\n", "        return N * np.exp(-(x-mu)**2/(2*sigma**2))\n", "\n", "    def pol1_function(x, a, b):\n", "        return (a + x*b)\n", "\n", "    def fit_function(x, a, b, N, mu, sigma):\n", "        return pol1_function(x, a, b) + gaus_function(x, N, mu, sigma)\n", "\n", "    # x axis ranges for plots\n", "    x_point = 0.5 * (bins[1:] + bins[:-1])\n", "    r = np.arange(fit_range[0], fit_range[1], 0.00001)\n", "\n", "    # fit the invariant mass distribution with fit_function() pol1+gauss\n", "    popt, pcov = curve_fit(fit_function, x_point, counts, p0 = [100, -1, 100, 1.673, 0.001])\n", "\n", "    # plot data\n", "    plt.errorbar(x_point, counts, yerr=np.sqrt(counts), fmt='.', ecolor='k', color='k', elinewidth=1., label='Data')\n", "\n", "    # plot pol1 and gauss obtained in the fit separately\n", "    plt.plot(r, gaus_function(r, N=popt[2], mu=popt[3], sigma=popt[4]), label='Gaus', color='red')\n", "    plt.plot(r, pol1_function(r, a=popt[0], b=popt[1]), label='pol1', color='green', linestyle='--')\n", "\n", "    # plot the global fit\n", "    plt.plot(r, fit_function(r, *popt), label='pol1+Gaus', color='blue')\n", "\n", "    # compute the signal and background counts (integrals divided by the bin width) and the significance\n", "    signal = integrate.quad(gaus_function, fit_range[0], fit_range[1], args=(popt[2], popt[3], popt[4]))[0] / bin_width\n", "    background = integrate.quad(pol1_function, fit_range[0], fit_range[1], args=(popt[0], popt[1]))[0] / bin_width\n", "    print(f'Signal counts: {signal:.0f}')\n", "    print(f'Background counts: {background:.0f}')\n", "    significance = signal / np.sqrt(signal + background)\n", "    print(f'Significance: {significance:.1f}')\n", "\n", "    # Add some axis labels\n", "    plt.legend()\n", "    plt.xlabel('$M_{\Lambda K}$ $(\mathrm{GeV/}\it{c}^2)$')\n", "    plt.ylabel('Counts')\n", "    plt.show()\n" ] }, { "cell_type": "code", "execution_count": null, "id": "ec466f52", "metadata": {}, "outputs": [], "source": [ "### you can also use some packages for implementing ROOT-like plots in python\n", "import mplhep as hep\n", "hep.style.use(hep.style.ROOT)" ] }, { "cell_type": "code", "execution_count": null, "id": "ea2b948d", "metadata": {}, "outputs": [], "source": [ "## try again to fit the invariant mass" ] }, { "cell_type": "code", "execution_count": null, "id": "cd3e0a0f", "metadata": {}, "outputs": [], "source": [ "## modify the fit function to get the signal counts, then plot them as a function of the BDT score" ] },
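{ "cell_type": "markdown", "id": "aa70bb80", "metadata": {}, "source": [ "One possible approach for the exercise above (a sketch: it assumes that apply_model_handler stored the BDT output in a column named model_output, the hipe4ml default, and it reuses a simplified pol1+Gaussian fit to extract the raw signal counts at a few working points taken from the BDT-efficiency curve):" ] }, { "cell_type": "code", "execution_count": null, "id": "aa70bb81", "metadata": {}, "outputs": [], "source": [ "from scipy.optimize import curve_fit\n", "\n", "def get_signal_counts(masses, fit_range=[1.660, 1.685]):\n", "    ## simplified version of fit_invmass that only returns the Gaussian signal counts\n", "    counts, bins = np.histogram(masses, bins=40, range=fit_range)\n", "    bin_width = bins[1] - bins[0]\n", "    centres = 0.5 * (bins[1:] + bins[:-1])\n", "    def fit_function(x, a, b, N, mu, sigma):\n", "        return a + b * x + N * np.exp(-(x - mu)**2 / (2 * sigma**2))\n", "    ## note: for very tight selections the fit may need better starting parameters\n", "    popt, _ = curve_fit(fit_function, centres, counts, p0=[100, -1, 100, 1.673, 0.001])\n", "    ## analytic integral of the Gaussian divided by the bin width -> signal counts\n", "    return popt[2] * abs(popt[4]) * np.sqrt(2 * np.pi) / bin_width\n", "\n", "## score thresholds corresponding to a few BDT-efficiency working points\n", "df_data = hdl_data.get_data_frame()\n", "target_effs = np.arange(0.70, 0.99, 0.04)\n", "cuts = [score_array[np.argmin(np.abs(eff_array - eff))] for eff in target_effs]\n", "signal_counts = [get_signal_counts(df_data.query(f'model_output > {cut}')['mass']) for cut in cuts]\n", "plt.plot(target_effs, signal_counts, 'o');\n", "plt.xlabel('BDT efficiency');\n", "plt.ylabel('Signal counts');" ] },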
"id": "fe227835", "metadata": {}, "source": [ "For a dedicated fitting routine you can have a look at:\n", "- https://github.com/flarefly/flarefly" ] }, { "cell_type": "markdown", "id": "b8e59c6e", "metadata": {}, "source": [ "### **Bonus**: Repeat the excercise with the Xi" ] }, { "cell_type": "markdown", "id": "9697ff26", "metadata": {}, "source": [ "### **Bonus**: optimize your BDT" ] }, { "cell_type": "markdown", "id": "179c291d", "metadata": {}, "source": [ "The XGBoost Classifier has many hyperparameters that control the complexity of\n", "the model. Few of them are listed here, for a complete description see : https://xgboost.readthedocs.io/en/stable/parameter.html\n", "\n", "- n_estimators: Number of trees in the BDT\n", "- max_depth: Maximum depth of a tree\n", "- eta: learning rate of the algorithm. It controls the step size of the gradient descent algorithm\n", "\n", "\n", "The optimisation of the hyperparameters is a key step to obtain the best performance from the algorithm and prevent overfitting. In hipe4ml the Optuna package is employed for the optimisation through the method ModelHandler.optimize_params_optuna. (https://github.com/optuna/optuna)\n", "\n", "The Optuna package provides a wide choice of algorithms for the hyperparameter optimization. The default one is the TPESampler, which is known to provide robust performance in few iterations. The difference between other approaches, like grid search or random search, and the TPESampler optimisation is that the latter takes into account past evaluations when choosing the hyperparameter set to evaluate next.\n", "\n", "A set of hyperparameters should be tested on different samples to avoid overfitting problems. Since the number of events is limited, an approach called cross validation is used. It has been proved that the cross validation removes the dependence of the model on the data sample.\n", "\n", "In the cross validation procedure, the original sample is divided in n parts called folds (in this case 5 folds are used). For each set of hyperparameters, n-1 folds are used for the optimisation and the remaining one as test. This operation is repeated after permuting the folds used for optimisation and for testing and the final result is the mean value of all the permutations.\n", "\n", "The ModelHandler automatically updates the hyperparameters after their optimisation." ] }, { "cell_type": "code", "execution_count": null, "id": "f5f5fe9d", "metadata": {}, "outputs": [], "source": [ "### Excercise: try to optimize your hyperparams with optuna and compare the performance! Be aware: the optimisation is CPU expensive and take some time" ] }, { "cell_type": "code", "execution_count": null, "id": "6ecd8364", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 5 }