filmov
tv
[Data Analysis with Python] 11. Binning in Python

Показать описание
In this video, we'll be talking about binning as a method of data preprocessing. Binning is when you group values together into bins. For example, you can bin age into 0-5, 6-10, 11-15 and so on. Sometimes binning can improve accuracy of the predictive models. In addition, sometimes we use data binning to group a set of numerical values into a smaller number of bins to have a better understanding of the data distribution. As example, "price" here is an attribute range from 5,000 to 45,500. Using binning, we categorize the price into three bins: low price, medium price and high prices. In the actual car dataset, price is a numerical variable ranging from 5,188 to 45,400. It has 201 unique values. We can categorize them into three bins, low, medium and high price cars. In Python, we can easily implement the binning. We would like three bins of equal bin width, so we need four numbers as dividers that are equal distance apart. First, we use the NumPy function linspace to return the array bins that contains four equally spaced numbers over the specified interval of the price. We create a list group underscore names that contains the different bin names. We use the Pandas function cut to segment and sort the data values into bins. You can then use histograms to visualize the distribution of the data after they've been divided into bins. This is the histogram that we plotted based on the binning that we applied in the price feature. From the plot, it is clear that most cars have a low price and only very few cars have high price.