Python Tutorial: Introducing XGBoost
---
Now let's talk about what you're actually here for: the hottest library in supervised machine learning, XGBoost.
XGBoost is an incredibly popular machine learning library for good reason. It was originally developed as a C++ command-line application. After it was used to win a popular machine learning competition, the package started being adopted within the ML community.
As a result, bindings, or functions that tap into the core C++ code, started appearing in a variety of other languages, including Python, R, Scala, and Julia. We will cover the Python API in this course.
What makes XGBoost so popular? Its speed and performance. Because the core XGBoost algorithm is parallelizable, it can harness all of the processing power of modern multi-core computers. Furthermore, it is parallelizable onto GPUs and across networks of computers, making it feasible to train models on very large datasets, on the order of hundreds of millions of training examples.
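For concreteness, here is a minimal sketch of what tapping that multi-core and GPU parallelism looks like in the Python API. It assumes XGBoost 2.0 or later, where the `device` parameter was introduced; the parameter choices are illustrative, not from this course's code.

```python
import xgboost as xgb

# On a multi-core CPU, n_jobs=-1 uses every available core,
# and tree_method="hist" selects the fast histogram algorithm.
clf_cpu = xgb.XGBClassifier(tree_method="hist", n_jobs=-1)

# With a CUDA-capable GPU and a GPU-enabled XGBoost build (>= 2.0),
# training moves onto the GPU with a single parameter change.
clf_gpu = xgb.XGBClassifier(tree_method="hist", device="cuda")
```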
However, XGBoost's speed isn't the package's real draw. Ultimately, a fast but poorly performing machine learning algorithm is not going to have wide adoption within the community. What makes XGBoost so popular is that it consistently outperforms almost all other single-algorithm methods in machine learning competitions and has been shown to achieve state-of-the-art performance on a variety of benchmark machine learning datasets. Here's an example of how to use XGBoost on a classification problem.
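The code being walked through isn't reproduced on this page, so here is a reconstruction consistent with the description that follows. The file name classification_data.csv and the specific classifier parameters are plausible assumptions rather than the course's exact values; the trailing comments mark the line numbers referenced below.

```python
import xgboost as xgb                                          # line 1
import pandas as pd                                            # line 2
import numpy as np                                             # line 3
from sklearn.model_selection import train_test_split          # line 4
class_data = pd.read_csv("classification_data.csv")           # line 5 (file name assumed)
X, y = class_data.iloc[:, :-1], class_data.iloc[:, -1]        # line 6
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)                     # line 7
xg_cl = xgb.XGBClassifier(objective="binary:logistic",
                          n_estimators=10, random_state=123)   # line 8 (parameters assumed)
xg_cl.fit(X_train, y_train)                                    # line 9
preds = xg_cl.predict(X_test)                                  # line 10
accuracy = float(np.sum(preds == y_test)) / y_test.shape[0]    # line 11
print("accuracy: %f" % accuracy)                               # line 12
```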
In lines 1-4, we import the libraries and functions we will be using, including xgboost and the train_test_split function from scikit-learn. Remember, you always build a machine learning model using train/test splits of your data: some portion of your data is used for training, and the remainder is held out for testing, to ensure that your model doesn't overfit and can generalize to unseen data.
In lines 5 and 6, we load our data in from a file and split the entire dataset into a matrix of samples by features, called X by convention, and a vector of target values, called y by convention.
In line 7, we create our train/test split, keeping 20% of the data for testing.
In line 8, we instantiate our XGBoost classifier with some parameters that we will cover shortly.
Lines 9 and 10 should look familiar to you. XGBoost has a scikit-learn-compatible API, and this is it! It uses the fit/predict pattern you have likely seen before: we fit, or train, our algorithm on the training set, and then evaluate it by generating predictions on the test set and comparing those predictions to the actual target labels.
Lines 11 and 12 evaluate the accuracy of the trained model on the test set and print the result to the screen.
Given that XGBoost is this popular, let's get to using it already!