Handling Non-Numeric Data - Practical Machine Learning Tutorial with Python p.35

In this machine learning tutorial, we cover how to work with non-numerical data. This is useful with any form of machine learning, since all of them require data to be in numerical form, even though real-world data is not always numerical.

Comments
Author

For future viewers of this amazing tutorial:
if you get an error like AttributeError: 'DataFrame' object has no attribute 'convert_objects'
use

df = df.apply(pd.to_numeric, errors='ignore')

instead of the convert_objects call.

It worked for me; hope it works for you too. GL

dantusqq
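
A minimal sketch of the same idea as the comment above, written without errors='ignore', since recent pandas releases (2.2 and later, as far as I know) warn that this option is deprecated. The helper name to_numeric_where_possible and the sample columns are made up for illustration.

```python
import pandas as pd


def to_numeric_where_possible(df):
    """Return a copy of df where each fully numeric-looking column is converted."""
    out = df.copy()
    for col in out.columns:
        try:
            # Raises if any value in the column cannot be parsed as a number.
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            # The column holds real text (e.g. "male"); leave it unchanged.
            pass
    return out


# Hypothetical sample data: numbers stored as strings next to a text column.
raw = pd.DataFrame({"age": ["22", "38", "26"], "sex": ["male", "female", "female"]})
converted = to_numeric_where_possible(raw)
```

Text columns are left as they are, so they still need one of the encodings discussed in the other comments.
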
Author

Hi Sentdex. I think you are making a mistake encoding categorical variables as continuous numerical. Technically, you should create a dummy variable for each class (if the variable has more than 2 classes). For example, if you have a variable 'city', and the following dataset [(n0, Tokyo), (n1, Rome), (n2, Rome), (n3, London)], you should transform it as follows: [(n0, 1, 0, 0), (n1, 0, 1, 0), (n2, 0, 1, 0), (n3, 0, 0, 1)]
This is the correct way according to statistics books. If you just convert it to a continuous numerical variable, the algo won't know that you are not talking about a quantity. If you feed that data to a linear regression, for example, you would end up with weird results such as "y grows 0.25 for each unit increase of city", which doesn't make sense. On the other hand, if you use the dummy variables you end up with something like "y grows 0.25 when 'london' is 1 and everything else is 0", which is much more meaningful.
Pandas has a method pd.get_dummies() that does that process for you, but it's very heavy on memory, because if you have a variable with 1000 categories you will have 1000 more columns. I'm looking for a way to solve that problem, maybe sparse arrays? But how do we combine a sparse array with a non-sparse array at the moment of algo-fitting?
Can we work out something?

RobertoFrobs
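
One possible answer to the memory question above, as a rough sketch rather than a tested recipe: pd.get_dummies accepts sparse=True, and the resulting sparse columns can sit next to ordinary dense columns in the same DataFrame. The "age" column and its values are invented for the example; whether a given estimator handles sparse pandas columns efficiently is something to check separately.

```python
import pandas as pd

# The 'city' example from the comment, plus a hypothetical dense numeric column.
df = pd.DataFrame({
    "age": [30, 41, 25, 52],
    "city": ["Tokyo", "Rome", "Rome", "London"],
})

# One dummy column per city, stored as sparse arrays instead of dense ones.
dummies = pd.get_dummies(df["city"], prefix="city", sparse=True)

# Sparse and dense columns can be combined in a single frame.
features = pd.concat([df[["age"]], dummies], axis=1)

# Dense view, only for inspecting the result on this tiny example.
dense_view = dummies.sparse.to_dense().astype(int)
```

The dummy columns come out in sorted order (city_London, city_Rome, city_Tokyo), so the column order differs from the hand-written example in the comment, but each row still has exactly one 1.
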
Author

Converting an entire DF from strings to numeric values per column in a single line:

def fac_df(df):
    # Factorizes every column in the dataframe; codes start at 1 (missing values become 0).
    df = df.apply(lambda x: pd.factorize(x)[0] + 1)
    return df

This function does pretty much everything your functions do, but in a single line. Hope this is helpful to some of you.

PapayaPaii
Author

Whenever I face a problem and google it, you come to my help. I appreciate that, thank you especially

enesercin
Author

If anyone gets a DeprecationWarning while using cross_validation, use model_selection instead.

athulreji_
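
For reference, a small sketch of what that swap looks like for train_test_split. The arrays here are placeholders, not the tutorial's data.

```python
import numpy as np
# Old form, removed from newer scikit-learn releases:
#   from sklearn import cross_validation
#   cross_validation.train_test_split(...)
from sklearn.model_selection import train_test_split

# Placeholder features and labels, only to show the call.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
```
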
Author

I think I saw a couple of people who already mentioned something like this, but I'll say it again. Enumerating categorical types isn't typically a good way of encoding non-numerical data.
If the possible strings have some sort of order to them, like "strongly disagree," "disagree," "neither," "agree," "strongly agree", then maybe you can use a simple enumeration. But if it's something like race, gender, or some other categorical type which has no order, then enumeration is a bad approach.
Instead, you can use something called "One-Hot" encoding. This will split the categorical feature into multiple features of all the different options.

So for a male entry, the features will be,
sex_male = 1, sex_female = 0

For a female,
sex_male = 0, sex_female = 1

iamstickfigure
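
A short sketch of both cases from the comment above, assuming pandas: an ordered categorical for the agree/disagree scale, and one-hot columns for sex. The column names "answer" and "answer_code" and the sample rows are invented for the example.

```python
import pandas as pd

# Hypothetical survey rows.
df = pd.DataFrame({
    "answer": ["agree", "strongly disagree", "neither", "strongly agree"],
    "sex": ["male", "female", "female", "male"],
})

# Ordered feature: the integer codes follow the order given in `levels`.
levels = ["strongly disagree", "disagree", "neither", "agree", "strongly agree"]
df["answer_code"] = pd.Categorical(df["answer"], categories=levels, ordered=True).codes

# Unordered feature: one 0/1 column per option.
sex_dummies = pd.get_dummies(df["sex"], prefix="sex", dtype=int)

encoded = pd.concat([df[["answer_code"]], sex_dummies], axis=1)
```

Here "strongly disagree" becomes 0 and "strongly agree" becomes 4, while sex is split into sex_female and sex_male.
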
Author

You can use df._convert(numeric=True) instead.

MohitRudrarajuSuresh
Author

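In case you wanted a simpler version of converting the non-numerical data:

# (function name assumed; it was missing from the comment as posted)
def handle_non_numerical_data(df):
    for column in df.columns:
        # Only convert columns whose values are not already numeric
        if type(df[column].values[0]) != np.int64 and type(df[column].values[0]) != np.float64:
            xformlist = dict()
            # Map each unique value in the column to an integer
            for item, value in enumerate(df[column].unique()):
                xformlist[value] = item
            df[column] = df[column].apply(lambda x: xformlist[x])
    return df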

aaronge
Author

Great Job Sentdex. The resemblance between you and Snowden is uncanny btw.

mr_sandiego
Author

Use LabelEncoder from sklearn.preprocessing to represent categorical data.
Use df = instead of handle_non_numerical_data(df) function.

athulreji_
Author

Can't we use this?
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
df = df.apply(le.fit_transform)

gamingbugs
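
A hedged variation on the LabelEncoder suggestions above: df.apply(le.fit_transform) also re-encodes columns that are already numeric, so this sketch encodes only the non-numeric ones and keeps one fitted encoder per column. The function name label_encode_text_columns and the sample data are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def label_encode_text_columns(df):
    """Label-encode non-numeric columns; return the new frame and the encoders."""
    out = df.copy()
    encoders = {}
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            enc = LabelEncoder()
            # astype(str) turns missing values into the string "nan",
            # which then gets its own label.
            out[col] = enc.fit_transform(out[col].astype(str))
            encoders[col] = enc
    return out, encoders


# Hypothetical sample: one numeric column, one text column.
sample = pd.DataFrame({"age": [22, 38, 26], "sex": ["male", "female", "female"]})
encoded, encoders = label_encode_text_columns(sample)

# The stored encoder can translate the integers back to the original strings.
restored = list(encoders["sex"].inverse_transform(encoded["sex"]))
```

LabelEncoder assigns labels in sorted order, so "female" becomes 0 and "male" becomes 1. The caveat from the other comments still applies: these integers imply an order that unordered categories do not have.
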
Author

Alternative method to map string values to numbers:

# Loop through all columns
for col_name in df:
    # Select one column at a time
    col = df[col_name]
    # Run only if the column is non-numeric
    if not np.issubdtype(col.dtype, np.number):
        # Get all the distinct values in the column
        values = col.drop_duplicates()
        # Map each distinct value to the row index of its first occurrence
        mapping = pd.Series(values.index.values, index=values).to_dict()
        # Replace each value in the column with its mapped index
        df[col_name] = df[col_name].replace(mapping)

ElChe-Ko
Author

We can use df[column].unique() to get a list of all unique values in a column. Then we create a dictionary with the unique values and numbers (let's say xformlist). Then we can use a lambda function to do the whole assignment in one go without using lists.
df[column] = df[column].apply(lambda x: xformlist[x])
Here is my version of the function

            for item, value in enumerate(df[column].unique()):

    return df

WritankarBhattacharya
Author

It was an amazing tutorial. It is quite informative and is helping me a lot in getting better at coding.
Thanks a lot

decode
Author

for n in df.columns.values:
    df[n] = pd.to_numeric(df[n], errors="ignore")
That works for me.

daliamokhtar
Author

If you look at where the NA values are, you will notice that a lot of data is missing in the age column, which is probably bad because when you set those values to zero you are essentially distorting your data. You'd rather fill them with the median value or drop them.

batatambor
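
A small sketch of the two options mentioned above (median fill or dropping rows), assuming pandas. The ages are made up, and "age_filled" is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
df = pd.DataFrame({"age": [22.0, np.nan, 26.0, 38.0, np.nan]})

# See how much is missing before deciding what to do.
missing_count = int(df["age"].isna().sum())

# Option 1: fill missing ages with the median of the known ages.
median_age = df["age"].median()
df["age_filled"] = df["age"].fillna(median_age)

# Option 2: drop the rows where age is missing.
dropped = df.dropna(subset=["age"])
```

With fillna(0) the two missing passengers would look like newborns to the model; the median keeps them near the middle of the age distribution instead.
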
Author

This is so great. Thank you Sentdex. Now I hope to find a way to add an intermediate step in the for loop that saves the dictionary, to use as a label key.

deborahweissner
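
One way this could be done, sketched with pd.factorize instead of the tutorial's hand-written loop: factorize returns both the integer codes and the unique values, so the code-to-label dictionary can be stored per column. The function name factorize_with_keys and the sample data are invented for the example.

```python
import pandas as pd


def factorize_with_keys(df):
    """Encode non-numeric columns and keep a {code: original value} key per column."""
    out = df.copy()
    label_keys = {}
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            # codes follow order of first appearance; missing values get -1.
            codes, uniques = pd.factorize(out[col])
            out[col] = codes
            label_keys[col] = dict(enumerate(uniques))
    return out, label_keys


# Hypothetical sample: one text column, one numeric column.
sample = pd.DataFrame({"sex": ["male", "female", "female"], "fare": [7.25, 71.28, 7.92]})
encoded, label_keys = factorize_with_keys(sample)
```

label_keys["sex"] can then be used to look up which string each integer stands for.
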
Author

def to_numeric_data(df):
    column = df.columns.values
    for columns in column:
        # Check the dtype of the column itself, not of the column index
        if df[columns].dtype != np.int64 and df[columns].dtype != np.float64:
            df[columns] =
    return df
You can also use this; give it a try.

suvarghaghoshdastidar
Author

For converting to numeric (pandas 0.19.1), use df.apply(pd.to_numeric, errors='ignore')

amyxst
Author

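I guess the following code works well! Sorry if I'm wrong

import numpy as np
from sklearn.preprocessing import LabelEncoder

# (function name assumed; it was missing from the comment as posted)
def handle_non_numerical_data(df):
    enc = LabelEncoder()
    columns_list = df.columns.values
    for col in columns_list:
        # Encode only the columns that are not already numeric
        if df[col].dtype != np.int64 and df[col].dtype != np.float64:
            unique_item = df[col].tolist()
            enc.fit(unique_item)
            temp = enc.transform(unique_item)
            df[col] = temp
    return df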

yathiraju