Hadoop Certification - CCA - Pyspark - Reading and Saving Sequence Files

Comments
Author

Hi, thanks for your videos. I am getting the exception below while running these commands:
Read: dataRDD=sc.sequenceFile("/user/cloudera/pyspark/departmentsSeq", "org.apache.hadoop.io.IntWritable", "org.apache.hadoop.io.Text")

Save: dataRDD.map(lambda x: tuple(x.split(", ", 1))).saveAsNewAPIHadoopFile("/user/cloudera/pyspark/departmentsSequence", "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat", keyClass="org.apache.hadoop.io.IntWritable", valueClass="org.apache.hadoop.io.Text")

Exception: AttributeError: 'tuple' object has no attribute 'split'

Can you please suggest the possible cause?

sanjibdharitd
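
A likely cause of the error above, sketched under assumptions: `sc.sequenceFile` returns an RDD of (key, value) tuples rather than text lines, so `x` in the lambda is already a tuple and has no `split` method. The snippet below reproduces the difference with plain Python lists standing in for the two RDD shapes; the sample records and the `to_pair` helper are hypothetical and not taken from the video.

```python
# Plain-Python stand-ins (hypothetical sample data) for the two RDD shapes.
# An RDD from sc.textFile(...) holds one string per line:
text_records = ["2, Fitness", "3, Footwear"]
# An RDD from sc.sequenceFile(...) holds one (key, value) tuple per record:
seq_records = [(2, "Fitness"), (3, "Footwear")]


def to_pair(line):
    """Split a 'key, value' text line into an (int key, str value) pair."""
    key, value = line.split(", ", 1)
    return (int(key), value)


# Strings have .split, so this works for text input.
pairs_from_text = [to_pair(line) for line in text_records]

# Tuples do not have .split, which reproduces the reported error.
try:
    [rec.split(", ", 1) for rec in seq_records]
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Sequence-file records are already pairs, so no split is needed;
# int(...) only makes the key type explicit.
pairs_from_seq = [(int(key), value) for key, value in seq_records]

print(error_message)
print(pairs_from_seq)
```

In PySpark the corresponding change would probably be to drop the `map` step, or to replace it with something like `lambda rec: (int(rec[0]), rec[1])`, before calling `saveAsNewAPIHadoopFile`. The `split`-based lambda fits data loaded with `sc.textFile`, not data loaded with `sc.sequenceFile`.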
Author

For any technical discussions or doubts, please use our forum: discuss.itversity.com
To practice on a state-of-the-art big data cluster, please sign up at labs.itversity.com
The lab is under free preview until 12/31/2016. After that, subscription
charges are $14.99 per 31 days, $34.99 per 93 days, and $54.99 per 185 days.

itversity
Author

Hello Sir,
Do we have to load and store Avro data files too? If so, please help me find the solution.

"Convert a set of data values in a given format stored in HDFS into new data values and/or a new data format and write them into HDFS." I wonder whether some of the files stored in HDFS are Avro.

Thank you
Uma

umak