Data cleansing importance in PySpark | Multiple date formats, clean special characters in header

This video covers why data cleaning matters and why Python is important in PySpark.
It explains how to handle a column that contains multiple date formats and how to remove special characters from header names.

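# yyyy-MM-dd format only
# Data cleaning steps
from pyspark.sql.functions import coalesce, to_date

def dynamic_date(col, frmts=("yyyy-MM-dd", "dd-MMM-yyyy", "ddMMMMyyyy", "MM-dd-yyyy", "MMM/yyyy/dd")):
    # Try each format in turn; coalesce keeps the first non-null parse
    return coalesce(*[to_date(col, f) for f in frmts])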
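
To see what the try-each-format pattern does for a single value, here is a plain-Python analogue built on datetime.strptime. It is only a sketch, not Spark code: the name parse_any, the sample strings, and the Python format strings (my assumed equivalents of the Spark patterns above) are all illustrative.

```python
from datetime import datetime

# Assumed Python equivalents of the Spark patterns used in dynamic_date
PY_FORMATS = ("%Y-%m-%d", "%d-%b-%Y", "%d%B%Y", "%m-%d-%Y", "%b/%Y/%d")

def parse_any(text, formats=PY_FORMATS):
    """Return the date from the first format that parses, else None (hypothetical helper)."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            # This format did not match; fall through to the next one
            continue
    return None

samples = ["2021-01-05", "05-Jan-2021", "05January2021",
           "01-05-2021", "Jan/2021/05", "not a date"]
parsed = [parse_any(s) for s in samples]
print(parsed)
```

The order of the formats matters: the first one that parses wins, so ambiguous inputs are resolved by whichever pattern is listed earlier.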

import re
#data process
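
The header-cleaning code itself is not included in the description. A minimal sketch of one way to do it with re follows; the function name clean_header, the sample headers, and the choice to lowercase and use underscores are assumptions, not necessarily what the video does.

```python
import re

def clean_header(name):
    """Hypothetical helper: normalise one column name."""
    # Replace each run of non-alphanumeric characters with one underscore,
    # trim underscores left at either end, then lowercase the result.
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()

raw_headers = ["Emp ID#", "first-name!", " Join Date (DOJ) "]
cleaned = [clean_header(h) for h in raw_headers]
print(cleaned)

# With a Spark DataFrame `df`, the cleaned names could be applied as:
# df = df.toDF(*[clean_header(c) for c in df.columns])
```

Two different raw headers can collapse to the same cleaned name, so it is worth checking for duplicates before renaming columns.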
Comments
Author

Really impressive technique; I need to learn these kinds of scenario-based questions.

satishrb
Author

Thanks for making this video, Mr. Venu, as this is a very important use case in real-time scenarios.

santoshkumarthammineni