Split Data for Machine Learning

Splitting data ensures that there are independent sets for training, testing, and validation. Data can be divided into sequential blocks where the order is preserved (e.g. time series) or with random selection (shuffle). Cross-validation demonstrates the effect of choosing alternating test sets.
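As a rough illustration of the three ideas above, the sketch below splits a small array sequentially, splits it after shuffling, and then rotates the test set across folds. It uses only NumPy; the helper names (sequential_split, shuffled_split, k_fold_indices) and the 20% test fraction are assumptions chosen for this example, not taken from the video.

```python
import numpy as np

def sequential_split(data, test_fraction=0.2):
    """Keep the original order: the last block becomes the test set."""
    n_test = int(round(len(data) * test_fraction))
    split = len(data) - n_test
    return data[:split], data[split:]

def shuffled_split(data, test_fraction=0.2, seed=0):
    """Randomly permute the samples, then split as above."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    return sequential_split(data[idx], test_fraction)

def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs; each fold is the test set once."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

data = np.arange(10)  # stand-in for ordered samples, e.g. a time series

train_seq, test_seq = sequential_split(data)    # order preserved
train_shuf, test_shuf = shuffled_split(data)    # random selection
folds = list(k_fold_indices(len(data), k=5))    # alternating test sets

for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

The sequential version suits ordered data such as time series, where shuffling would leak future values into training. The fold generator here does not shuffle; shuffling the indices first is a reasonable variation when sample order carries no meaning.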

0:00 Train, Validate, Test
2:04 Split DataFrame
3:05 Split by Index
4:30 Split Numpy Array
7:49 Cross Validation
15:50 Overview
18:27 Overfit Detection

The test set evaluates the model fit independently of the training data. A separate validation set is used to tune the hyper-parameters, so that they are not overfit to the training data and the test set stays untouched for the final evaluation. Scikit-learn has a train_test_split function whose test_size argument sets the fraction of the data to reserve for testing.
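A minimal sketch of this with scikit-learn, assuming a 60 / 20 / 20 split into train, validation, and test sets. The array shapes, the split fractions, and the random_state value are illustrative assumptions; calling train_test_split twice is one common way to obtain three sets, not necessarily the approach used in the video.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 50 samples with 2 features each
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# First split: reserve 20% of all samples for the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Second split: 25% of the remaining 80% gives a 20% validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

# For ordered data (e.g. time series), disable shuffling so the
# test set is the final block of samples
X_tr_seq, X_te_seq, y_tr_seq, y_te_seq = train_test_split(
    X, y, test_size=0.2, shuffle=False)

print(len(X_train), len(X_val), len(X_test))
```

Fixing random_state makes the shuffled split repeatable between runs, which helps when comparing models on the same partitions.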