Handling Non-Numeric Data - Practical Machine Learning Tutorial with Python p.35

In this machine learning tutorial, we cover how to work with non-numerical data. This is useful with any form of machine learning, since all of them require data in numerical form, even though real-world data is not always numerical.

Comments

For future viewers of this amazing tutorial:
if you are getting an error like AttributeError: 'DataFrame' object has no attribute 'convert_objects',
use

    df.apply(pd.to_numeric, errors='ignore')

instead of the convert_objects call.

It worked for me; I hope it works for you too. GL

dantusqq
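
In recent pandas releases, errors='ignore' is itself reported as deprecated, so here is a minimal sketch of the same idea that tries each column and keeps it unchanged when conversion fails. The helper name to_numeric_where_possible and the sample values are made up for illustration and are not from the tutorial.

```python
import pandas as pd

def to_numeric_where_possible(df):
    """Return a copy of df with every fully numeric-looking column converted."""
    out = df.copy()
    for name in out.columns:
        try:
            # The default errors="raise" fails loudly on non-numeric text.
            out[name] = pd.to_numeric(out[name])
        except (ValueError, TypeError):
            # Leave columns such as names or ticket codes unchanged.
            pass
    return out

# Hypothetical sample data: one numeric-looking column, one text column.
df = pd.DataFrame({"age": ["22", "38"], "name": ["Braund", "Cumings"]})
converted = to_numeric_where_possible(df)
```

Unlike a bare df.apply(...), this returns a new DataFrame, so the result has to be assigned to a variable to take effect.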

Hi Sentdex. I think you are making a mistake by encoding categorical variables as continuous numerical values. Technically, you should create a dummy variable for each class (if the variable has more than 2 classes). For example, if you have a variable 'city' and the following dataset [(n0, Tokyo), (n1, Rome), (n2, Rome), (n3, London)], you should transform it as follows: [(n0, 1, 0, 0), (n1, 0, 1, 0), (n2, 0, 1, 0), (n3, 0, 0, 1)]
This is the correct way according to statistics books. If you just convert it to a continuous numerical variable, the algo won't know that you are not talking about a quantity. If you feed that data to a linear regression, for example, you would end up with weird results such as "y grows 0.25 for each unit increase of city", which doesn't make sense. On the other hand, if you use the dummy variables you end up with something like "y grows 0.25 when 'london' is 1 and everything else is 0", which is much more meaningful.
Pandas has a method pd.get_dummies() that does that process for you, but it's very heavy on memory, because if you have a variable with 1000 categories you will have 1000 more columns. I'm looking for a way to solve that problem, maybe sparse arrays? But how do we combine a sparse array with a non-sparse array at the moment of algo-fitting?
Can we work out something?

RobertoFrobs
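
The dummy-variable transformation described in the comment above can be sketched with pd.get_dummies on the same four-row city example. The sparse=True variant is shown only as one possible answer to the memory question; whether a given estimator accepts sparse-backed columns is not covered here.

```python
import pandas as pd

# The four-row example from the comment above.
df = pd.DataFrame({
    "id": ["n0", "n1", "n2", "n3"],
    "city": ["Tokyo", "Rome", "Rome", "London"],
})

# One 0/1 column per city; dtype=int avoids boolean output on newer pandas.
dummies = pd.get_dummies(df["city"], prefix="city", dtype=int)

# sparse=True stores the same columns as sparse arrays,
# which only keep the nonzero entries.
sparse_dummies = pd.get_dummies(df["city"], prefix="city", sparse=True, dtype=int)

# Dense and sparse columns can sit side by side in one DataFrame.
combined = pd.concat([df[["id"]], sparse_dummies], axis=1)
```

Note that get_dummies orders the new columns alphabetically (London, Rome, Tokyo), not in order of first appearance as in the comment's example.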

Converting an entire DF from strings to numeric values per column in a single line:

    def fac_df(df):
        # Factorizes every column in the dataframe. pd.factorize numbers
        # values from 0 (missing values get -1), so the +1 makes codes start at 1.
        df = df.apply(lambda x: pd.factorize(x)[0] + 1)
        return df

This function does pretty much everything your functions do, but in a single line. Hope this is helpful to some of you.

PapayaPaii

Whenever I face a problem and google it, you come to my help. I appreciate that, thank you especially

enesercin

If anyone gets a DeprecationWarning while using cross_validation, use model_selection instead.

athulreji_
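
As a small illustration of that swap, the import below uses sklearn.model_selection. The toy X and y lists are made up for this sketch and are not the tutorial's data.

```python
# Older scikit-learn releases: from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten one-feature samples with alternating labels.
X = [[i] for i in range(10)]
y = [i % 2 for i in range(10)]

# Hold out 20% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
```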

I think I saw a couple of people who already mentioned something like this, but I'll say it again. Enumerating categorical types typically isn't a good way of encoding non-numerical data.
If the possible strings have some sort of order to them, like "strongly disagree," "disagree," "neither," "agree," "strongly agree," then maybe you can use a simple enumeration. But if it's something like race, gender, or some other categorical type which has no order, then enumeration is a bad approach.
Instead, you can use something called "One-Hot" encoding. This will split the categorical feature into multiple features of all the different options.

So for a male entry, the features will be,
sex_male = 1, sex_female = 0

For a female,
sex_male = 0, sex_female = 1

iamstickfigure
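
A minimal sketch of the distinction drawn in the comment above: the ordered survey answers get an enumeration that respects their order, while the unordered sex column is one-hot encoded. The column names and sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows: one ordered feature, one unordered feature.
df = pd.DataFrame({
    "answer": ["agree", "strongly disagree", "neither", "agree"],
    "sex": ["male", "female", "female", "male"],
})

# Ordered categories: the code is the position on the scale (0 to 4).
scale = ["strongly disagree", "disagree", "neither", "agree", "strongly agree"]
df["answer_code"] = pd.Categorical(df["answer"], categories=scale, ordered=True).codes

# Unordered categories: one 0/1 column per option (sex_female, sex_male).
encoded = pd.get_dummies(df, columns=["sex"], dtype=int)
```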

You can use df._convert(numeric=True) instead.

MohitRudrarajuSuresh

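In case you wanted a simpler version of converting the non-numerical data:

    import numpy as np

    # The function name was lost in the original comment; this one is a placeholder.
    def convert_non_numeric(df):
        for column in df.columns:
            # Only touch columns whose first value is not already a number
            if type(df[column].values[0]) != np.int64 and type(df[column].values[0]) != np.float64:
                xformlist = dict()
                # The iterable was lost in the original; enumerating the unique values is assumed.
                for item, value in enumerate(df[column].unique()):
                    xformlist[value] = item
                df[column] = df[column].apply(lambda x: xformlist[x])
        return df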

aaronge

Great Job Sentdex. The resemblance between you and Snowden is uncanny btw.

mr_sandiego

Use LabelEncoder from sklearn.preprocessing to represent categorical data, in place of the handle_non_numerical_data(df) function.

athulreji_

Can't we use this?
le = preprocessing.LabelEncoder()
df = df.apply(le.fit_transform)

gamingbugs

Alternative method to map string values to numbers:

    # Loop through all columns
    for col_name in df:
        # Select one column at a time
        col = df[col_name]
        # Run if the column is non-numeric
        if not np.issubdtype(col.dtype, np.number):
            # Get all the possible values inside the column
            values = col.drop_duplicates()
            # Assign to each value the index of its first occurrence
            mapping = pd.Series(values.index.values, index=values).to_dict()
            # Replace each value in the column with the index defined above
            df[col_name] = df[col_name].replace(mapping)

ElChe-Ko

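We can use df[column].unique() to get a list of all unique values in a column. Then we create a dictionary with the unique values and numbers (let's say xformlist). Then we can use a lambda function to do the whole assignment in one go without using lists.

    df[column] = df[column].apply(lambda x: xformlist[x])

Here is my version of the function:

    # Only the inner loop header and the return survived in the original comment;
    # the name and the remaining lines are reconstructed from the description above.
    def handle_non_numerical_data(df):
        for column in df.columns:
            xformlist = {}
            for item, value in enumerate(df[column].unique()):
                xformlist[value] = item
            df[column] = df[column].apply(lambda x: xformlist[x])
        return df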

WritankarBhattacharya

It was an amazing tutorial. It is quite informative and is helping me a lot in getting better at coding.
Thanks a lot

decode

    for n in df.columns.values:
        df[n] = pd.to_numeric(df[n], errors="ignore")

That works for me.

daliamokhtar

If you look at where the NA values are, you will notice that a lot of them are in the age column, which is probably bad because when you set those values to zero you are essentially distorting your data. You'd rather fill them with the median value or drop them.

batatambor
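
To make the distortion concrete, here is a small sketch comparing the three options on a made-up age column. The values are illustrative, not taken from the Titanic data.

```python
import pandas as pd

# Hypothetical ages with two missing entries.
df = pd.DataFrame({"age": [22.0, None, 38.0, None, 26.0]})

# Option 1: fill with zero, which drags the average age down.
filled_zero = df["age"].fillna(0)

# Option 2: fill with the median of the known ages.
filled_median = df["age"].fillna(df["age"].median())

# Option 3: drop the rows where age is missing.
dropped = df.dropna(subset=["age"])
```

With these numbers the median of the known ages is 26, so the zero-filled column averages 17.2 while the median-filled one averages 27.6.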

This is so great. Thank you Sentdex. Now I hope to find a way to add an intermittent step in the for loop that saves the dictionary - to use as a label key.

deborahweissner
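
One possible way to keep such a label key is sketched below. It uses pd.factorize, which returns the unique values alongside the codes, and is not the tutorial's own function. The helper name factorize_with_key and the sample rows are hypothetical.

```python
import pandas as pd

def factorize_with_key(df):
    """Encode non-numeric columns and return the code-to-label key for each."""
    out = df.copy()
    label_key = {}
    for name in out.columns:
        if not pd.api.types.is_numeric_dtype(out[name]):
            # codes[i] is the position of row i's value in uniques
            codes, uniques = pd.factorize(out[name])
            out[name] = codes
            label_key[name] = dict(enumerate(uniques))
    return out, label_key

# Hypothetical sample rows: one text column, one numeric column.
df = pd.DataFrame({"sex": ["male", "female", "male"], "fare": [7.25, 71.28, 8.05]})
encoded, label_key = factorize_with_key(df)
```

Looking up label_key["sex"][code] then turns an encoded value back into its original string.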

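    def to_numeric_data(df):
        column = df.columns.values
        for columns in column:
            # Check the dtype of each column (not of the column index), and
            # convert only when it is neither int64 nor float64
            if df[columns].dtype != np.int64 and df[columns].dtype != np.float64:
                # The right-hand side was lost in the original comment;
                # pd.factorize is an assumed stand-in.
                df[columns] = pd.factorize(df[columns])[0]
        return df

You can also give this a try.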

suvarghaghoshdastidar

For converting to numeric (pandas 0.19.1), use df.apply(pd.to_numeric, errors='ignore')

amyxst

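I think the following code works well. Sorry if I'm wrong.

    import numpy as np
    from sklearn.preprocessing import LabelEncoder

    # The function name was lost in the original comment; this one is a placeholder.
    def encode_non_numeric(df):
        enc = LabelEncoder()
        columns_list = df.columns.values
        for col in columns_list:
            if df[col].dtype != np.int64 and df[col].dtype != np.float64:
                # Fit the encoder on this column's values, then replace them with codes
                unique_item = df[col].tolist()
                enc.fit(unique_item)
                df[col] = enc.transform(unique_item)
        return df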

yathiraju
