3.4 Spark Cache vs Persist | Spark Interview Questions

As part of our Spark interview question series, we want to help you prepare for your Spark interviews. We will discuss various Spark topics such as lineage, reduceByKey vs groupByKey, YARN client mode vs YARN cluster mode, etc.

In this video we cover the difference between Spark cache() and persist().

Comments

Just to highlight a correction regarding persist with memory and disk: if the memory is not enough to hold the entire data, it won't spill only the remaining data to disk; rather, it will persist the entire data to disk instead of memory. :)

harshitdamani

Your way of explaining is amazing. Please show, in practice, how we can implement this concept in code.

praneethbhat

2. The cache method persists the DataFrame or RDD with the default storage level. It is shorthand for calling persist() with no arguments; the default is MEMORY_ONLY for RDDs and MEMORY_AND_DISK for DataFrames and Datasets.

3. The persist method lets you specify a storage level for the DataFrame or RDD, such as MEMORY_ONLY, MEMORY_ONLY_SER, DISK_ONLY, or MEMORY_AND_DISK.

pandurangbhadange
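
Since several comments in this thread ask for code, here is a minimal PySpark sketch of the two calls described above. It assumes a local PySpark installation; the function name `cache_vs_persist_demo`, the app name, and the sample data are illustrative and not taken from the video. Note that PySpark does not expose the `_SER` levels from the Scala/Java API, so the sketch uses `DISK_ONLY` as the explicit level.

```python
def cache_vs_persist_demo():
    """Return the storage levels Spark reports for a cached and a persisted RDD."""
    # Imported here so the module can be loaded even where Spark is not installed.
    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[2]")
        .appName("cache-vs-persist")  # hypothetical app name
        .getOrCreate()
    )
    sc = spark.sparkContext
    try:
        # cache(): shorthand for persist() with the default storage level.
        cached = sc.parallelize(range(1000)).map(lambda x: x * 2)
        cached.cache()

        # persist(): the storage level is chosen explicitly.
        persisted = sc.parallelize(range(1000)).map(lambda x: x + 1)
        persisted.persist(StorageLevel.DISK_ONLY)

        # Both calls are lazy; an action is needed to materialize the data.
        cached.count()
        persisted.count()

        levels = (str(cached.getStorageLevel()), str(persisted.getStorageLevel()))

        # Release the storage once the data is no longer needed.
        cached.unpersist()
        persisted.unpersist()
        return levels
    finally:
        spark.stop()
```

Calling `cache_vs_persist_demo()` returns the two level descriptions as strings, which is a quick way to confirm which level each call actually applied.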

Hi Savvy, I like your videos; thanks for posting. I have one technical question: during caching, what happens to the cached data if one of the JVMs crashes or a memory failure occurs on one of the data nodes?

plabanrout

With cache, if the data does not fit, it will be recomputed when we use that DataFrame, as per the Databricks documentation.

badri
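
What happens when the data does not fit in memory depends on the storage level. The following is a simplified pure-Python model (not Spark code) based on the storage-level descriptions in the Spark RDD programming guide, where the behavior applies per partition: memory-only levels leave partitions that do not fit uncached and recompute them on demand, while memory-and-disk levels write those partitions to disk. The names `LevelFlags`, `LEVELS`, and `on_memory_shortfall` are hypothetical helpers for illustration.

```python
from typing import NamedTuple


class LevelFlags(NamedTuple):
    use_memory: bool
    use_disk: bool
    serialized: bool  # True if data is kept in serialized form


# Illustrative table only; the keys mirror Spark's StorageLevel constant names
# from the Scala/Java API.
LEVELS = {
    "MEMORY_ONLY": LevelFlags(True, False, False),
    "MEMORY_ONLY_SER": LevelFlags(True, False, True),
    "MEMORY_AND_DISK": LevelFlags(True, True, False),
    "MEMORY_AND_DISK_SER": LevelFlags(True, True, True),
    "DISK_ONLY": LevelFlags(False, True, True),
}


def on_memory_shortfall(level_name: str) -> str:
    """Describe what happens to partitions that do not fit in memory."""
    flags = LEVELS[level_name]
    if not flags.use_memory:
        return "always on disk"
    if flags.use_disk:
        return "spill to disk"
    return "recompute from lineage"


for name in LEVELS:
    print(f"{name}: {on_memory_shortfall(name)}")
```

The serialized variants trade extra CPU on each read for a smaller memory footprint, which is the main reason to consider them when memory is tight.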

Can you please make some videos on Spark with the PySpark/Python APIs as well? There may be only minor differences, but it is good to understand them.

albinchandy

Why would we go with cache instead of persist? Persist will also do the job, right?

karunm

Sir, can we have the difference between serialization and deserialization?

kaleshavali

Hi Bro,
Could you please answer the following question, which I faced in an interview?

I have 3 CSV files, a.csv, b.csv, and c.csv, with sizes of 10 MB, 1 GB, and 100 GB. I want to join these files on some columns. What issues will we face when joining them in memory using Spark?

ravir

Thanks, but there is not much info regarding DISK_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, and MEMORY_AND_DISK_SER: the various trade-offs and use cases, and when to use which.

SpiritOfIndiaaa

As per the new update, the default StorageLevel for cache() is now MEMORY_AND_DISK.

thelifehackerpro
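
One way to check the default on your own Spark version is to cache a DataFrame and inspect its `storageLevel` property. This is a minimal sketch, assuming a local PySpark installation; the function name `dataframe_cache_level` and the app name are illustrative.

```python
def dataframe_cache_level():
    """Return (useMemory, useDisk) for a DataFrame cached with cache()."""
    # Imported here so the module can be loaded even where Spark is not installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("df-cache-level")  # hypothetical app name
        .getOrCreate()
    )
    try:
        df = spark.range(1000)
        df.cache()   # default storage level for DataFrames
        df.count()   # action that materializes the cache

        level = df.storageLevel
        flags = (level.useMemory, level.useDisk)

        df.unpersist()
        return flags
    finally:
        spark.stop()
```

If both flags come back true, the cached DataFrame is using a memory-and-disk level, in contrast to the MEMORY_ONLY default of RDD.cache().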

Thanks, but it would be good if you included small code samples in most of your videos, whenever possible, to demonstrate the concepts. I think that would be much more helpful.

anannyamukherjee

Hi sir, I have a doubt: what is the difference between cache() and a broadcast variable?

cindyalex

Your voice is too low, sir. Please correct your mic settings.

brogames

The voice is very low. Kindly look into it.

guruyadavraj