Cache and Persist
Cache and Persist Techniques
# Spark Session
from pyspark.sql import SparkSession
spark = (
SparkSession
.builder
.appName("Understand Caching")
.master("local[*]")
.config("spark.executor.memory", "512M")
.getOrCreate()
)
spark
Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 25/02/21 12:42:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
SparkSession - in-memory
df_movies = spark.read.format("csv").option("header",True).load("data/ImdbMovieDataset.csv")
df_movies.count()
1048575
df_movies.show()
+------+--------------------+------------+----------+--------+------------+----------+-------+-----+---------+---------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ | id| title|vote_average|vote_count| status|release_date| revenue|runtime|adult| budget| imdb_id|original_language| original_title| overview| popularity| tagline| genres|production_companies|production_countries| spoken_languages| keywords| +------+--------------------+------------+----------+--------+------------+----------+-------+-----+---------+---------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ | 27205| Inception| 8.364| 34495|Released| 7/15/2010| 825532764| 148|FALSE|160000000|tt1375666| en| Inception|"Cobb, a skilled ...| the implantation...| 83.952|Your mind is the ...|Action, Science F...|Legendary Picture...|United Kingdom, U...|English, French, ...| |157336| Interstellar| 8.417| 32571|Released| 11/5/2014| 701729206| 169|FALSE|165000000|tt0816692| en| Interstellar|The adventures of...| 140.241|Mankind was born ...|Adventure, Drama,...|Legendary Picture...|United Kingdom, U...| English|rescue, future, s...| | 155| The Dark Knight| 8.512| 30619|Released| 7/16/2008|1004558444| 152|FALSE|185000000|tt0468569| en| The Dark Knight|Batman raises the...| 130.643|Welcome to a worl...|Drama, Action, Cr...|DC Comics, Legend...|United Kingdom, U...| English, Mandarin|joker, sadism, ch...| | 19995| Avatar| 7.573| 29815|Released| 12/15/2009|2923706026| 162|FALSE|237000000|tt0499549| en| Avatar|In the 22nd centu...| 79.932|Enter the world o...|Action, Adventure...|Dune Entertainmen...|United States of ...| English, Spanish|future, society, ...| | 24428| The Avengers| 7.71| 
29166|Released| 4/25/2012|1518815515| 143|FALSE|220000000|tt0848228| en| The Avengers|When an unexpecte...| 98.082|Some assembly req...|Science Fiction, ...| Marvel Studios|United States of ...|English, Hindi, R...|new york city, su...| |293660| Deadpool| 7.606| 28894|Released| 2/9/2016| 783100000| 108|FALSE| 58000000|tt1431045| en| Deadpool|The origin story ...| 72.735|Witness the begin...|Action, Adventure...|20th Century Fox,...|United States of ...| English|superhero, anti h...| |299536|Avengers: Infinit...| 8.255| 27713|Released| 4/25/2018|2052415039| 149|FALSE|300000000|tt4154756| en|Avengers: Infinit...|As the Avengers a...| 154.34|An entire univers...|Adventure, Action...| Marvel Studios|United States of ...| English, Xhosa|sacrifice, magic,...| | 550| Fight Club| 8.438| 27238|Released| 10/15/1999| 100853753| 139|FALSE| 63000000|tt0137523| en| Fight Club|"A ticking-time-b...| until an eccentr...| 69.498|Mischief. Mayhem....| Drama|Regency Enterpris...|United States of ...| English| |118340|Guardians of the ...| 7.906| 26638|Released| 7/30/2014| 772776600| 121|FALSE|170000000|tt2015381| en|Guardians of the ...|Light years from ...| 33.255|All heroes start ...|Action, Science F...| Marvel Studios|United States of ...| English|spacecraft, based...| | 680| Pulp Fiction| 8.488| 25893|Released| 9/10/1994| 213900000| 154|FALSE| 8500000|tt0110912| en| Pulp Fiction|A burger-loving h...| 74.862|Just because you ...| Thriller, Crime|Miramax, A Band A...|United States of ...|English, Spanish,...|drug dealer, boxe...| | 13| Forrest Gump| 8.477| 25409|Released| 6/23/1994| 677387716| 142|FALSE| 55000000|tt0109830| en| Forrest Gump|A man with a low ...| 92.693|The world will ne...|Comedy, Drama, Ro...|Paramount, The St...|United States of ...| English|vietnam war, viet...| | 671|Harry Potter and ...| 7.916| 25379|Released| 11/16/2001| 976475550| 152|FALSE|125000000|tt0241527| en|Harry Potter and ...|Harry Potter has ...| 185.482|Let the magic begin.| Adventure, 
Fantasy|Warner Bros. Pict...|United Kingdom, U...| English|witch, school fri...| | 1726| Iron Man| 7.64| 24874|Released| 4/30/2008| 585174222| 126|FALSE|140000000|tt0371746| en| Iron Man|After being held ...| 72.897|Heroes aren't bor...|Action, Science F...| Marvel Studios|United States of ...|English, Persian,...|middle east, supe...| | 68718| Django Unchained| 8.171| 24672|Released| 12/25/2012| 425368238| 165|FALSE|100000000|tt1853728| en| Django Unchained|With the help of ...| 54.224|Life, liberty and...| Drama, Western|The Weinstein Com...|United States of ...|English, French, ...|rescue, friendshi...| | 278|The Shawshank Red...| 8.702| 24649|Released| 9/23/1994| 28341469| 142|FALSE| 25000000|tt0111161| en|The Shawshank Red...|Framed in the 194...| 122.61|Fear can hold you...| Drama, Crime|Castle Rock Enter...|United States of ...| English|prison, friendshi...| |299534| Avengers: Endgame| 8.263| 23857|Released| 4/24/2019|2800000000| 181|FALSE|356000000|tt4154796| en| Avengers: Endgame|After the devasta...| 91.756| Avenge the fallen.|Adventure, Scienc...| Marvel Studios|United States of ...|English, Japanese...|superhero, time t...| | 603| The Matrix| 8.206| 23815|Released| 3/30/1999| 463517383| 136|FALSE| 63000000|tt0133093| en| The Matrix|Set in the 22nd c...| 78.564|Welcome to the Re...|Action, Science F...|Village Roadshow ...|United States of ...| English|man vs machine, m...| | 597| Titanic| 7.9| 23637|Released| 11/18/1997|2264162353| 194|FALSE|200000000|tt0120338| en| Titanic|101-year-old Rose...| 102.348|Nothing on Earth ...| Drama, Romance|Paramount, 20th C...|United States of ...|English, French, ...|epic, ship, drown...| |475557| Joker| 8.168| 23425|Released| 10/1/2019|1074458282| 122|FALSE| 55000000|tt7286456| en| Joker|During the 1980s,...| 54.522|Put on a happy face.|Crime, Thriller, ...|Warner Bros. 
Pict...|Canada, United St...| English|dream, street gan...| | 120|The Lord of the R...| 8.402| 23323|Released| 12/18/2001| 871368364| 179|FALSE| 93000000|tt0120737| en|The Lord of the R...|Young hobbit Frod...| 87.037|One ring to rule ...|Adventure, Fantas...|New Line Cinema, ...|New Zealand, Unit...| English|based on novel or...| +------+--------------------+------------+----------+--------+------------+----------+-------+-----+---------+---------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ only showing top 20 rows
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType
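# Casting to int drops the fractional part (8.364 becomes 8), so "> 8" below keeps only ratings of 9 and above.
df_movies_cast = df_movies.withColumn("vote_average", df_movies["vote_average"].cast(IntegerType()))
# Alternative that keeps the decimals:
# df_movies_cast = df_movies.withColumn("vote_average", col("vote_average").cast("decimal(38,6)"))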
df_movies_cast.where(col("vote_average") > 8).show()
+------+--------------------+------------+----------+--------+------------+--------+-------+-----+-------+----------+-----------------+-----------------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ | id| title|vote_average|vote_count| status|release_date| revenue|runtime|adult| budget| imdb_id|original_language| original_title| overview| popularity| tagline| genres|production_companies|production_countries| spoken_languages| keywords| +------+--------------------+------------+----------+--------+------------+--------+-------+-----+-------+----------+-----------------+-----------------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ |665399|BTS World Tour: L...| 9| 306|Released| 10/9/2019| 0| 231|FALSE| 0| NULL| ko| BTS World Tour: L...|BTS perform their...| 4.892| NULL| Music, Documentary|Big Hit Entertain...| South Korea| Korean|making of, concer...| |596601|Prison High Pressure| 9| 210|Released| 3/7/2019| 0| 116| TRUE| 0| NULL| fr| Prison sous haute...|Behind the scenes...| 297.563| NULL| NULL| Marc Dorcel| France| French| prison| |730647|Break the Silence...| 9| 175|Released| 9/10/2020| 8954945| 90|FALSE| 0|tt12850582| ko|브레이크 더 사일런스: 더 무비|K-pop sensation B...| 7.198| Borahae.| Music, Documentary|Big Hit Entertain...| South Korea| Korean|pop star, pop mus...| |529414|Franco Escamilla:...| 9| 118|Released| 6/8/2018| 0| 66|FALSE| 0| tt8467922| es| Franco Escamilla:...|Mexican stand-up ...| 7.013| NULL| Comedy| NULL| Mexico| Spanish|comedian, stand-u...| |939984|BTS Permission to...| 9| 113|Released| 3/12/2022|32600000| 195|FALSE| 0|tt18687124| ko| BTS Permission to...|Join us as BTS an...| 6.877| NULL| Music|HYBE, Big Hit Ent...| South Korea| English, Korean|concert, live per...| |863291|BTS 2021 
Muster: ...| 9| 86|Released| 6/14/2021| 0| 139|FALSE| 0| NULL| en| BTS 2021 Muster: ...|"BTS 2021 Muster ...| 2021 in South Ko...| 4.509| NULL| Music|HYBE, Big Hit Ent...| South Korea| Korean| |832199|Twenty One Pilots...| 9| 82|Released| 5/21/2021| 0| 60|FALSE| 0|tt14717082| en| Twenty One Pilots...|"A one-night live...| 5.242|There is always a...| Music|Fueled By Ramen, ...|United States of ...| English|behind the scenes...| |672490|EXO PLANET #2 The...| 9| 81|Released| 3/9/2016| 0| 164|FALSE| 0| NULL| fr| EXO PLANET #2 The...| NULL| 4.476| NULL| Music, Documentary| NULL| NULL| NULL| NULL| |769234|Dua Lipa: Studio ...| 9| 77|Released| 11/27/2020| 0| 70|FALSE|1500000|tt13891738| en| Dua Lipa: Studio ...|Dua Lipa's kaleid...| 5.456| NULL| Music| Warner Music UK| United Kingdom|English, Spanish,...|pop music, concer...| |284521|Scooby-Doo! Meets...| 9| 66|Released| 10/15/2003| 0| 87|FALSE| 0| NULL| en| Scooby-Doo! Meets...|In yet another hi...| 6.085| NULL|Animation, Comedy...|Hanna-Barbera Pro...|United States of ...| English|criminal investig...| | 48919|Scooby-Doo's A Nu...| 9| 63|Released| 12/1/1984| 0| 22|FALSE| 0| tt1183445| en| Scooby-Doo's A Nu...|The evil is set t...| 3.779| NULL|Animation, Family...|Hanna-Barbera Pro...|United States of ...| English| NULL| |260825|Scooby-Doo! Winte...| 9| 59|Released| 10/8/2002| 0| 75|FALSE| 0| tt5896802| en| Scooby-Doo! Winte...|Celebrate the sea...| 5.955|'Tis the season t...|Animation, Family...|Hanna-Barbera Pro...|United States of ...| English|holiday, compilat...| |288446|Scooby-Doo! and t...| 9| 58|Released| 9/27/2012| 0| 63|FALSE| 0| tt2730302| en| Scooby-Doo! and t...|"The gang flies o...| including the vi...| a giant Wakumi b...| Old Monster. The...| 3.804|Go Wild as Scooby...|Animation, Family...|Hanna-Barbera Pro...| |263311|Scooby-Doo! and t...| 9| 57|Released| 10/23/2012| 0| 0|FALSE| 0| NULL| en| Scooby-Doo! 
and t...|DVD compilation o...| 5.129|Beware of the ful...|Animation, Comedy...|Hanna-Barbera Pro...|United States of ...| English, French| NULL| |295421|Scooby Doo and Th...| 9| 56|Released| 8/30/2011| 0| 63|FALSE| 0| NULL| en| Scooby Doo and Th...|"A DVD compilatio...| will our friends...| or will their lo...| where a seaweed-...| it'll be lights ...| 3.246| NULL|Animation, Advent...| |806439|Morat: Balas Perd...| 9| 54|Released| 3/26/2021| 0| 116|FALSE| 0|tt14217578| es| Morat: Balas Perd...|"Morat, the band ...| 2.94| NULL| Music, Documentary|Amazon Studios, C...| Colombia, Spain| Spanish|concert, recital,...| |316272|Scooby-Doo! and t...| 9| 53|Released| 8/30/2011| 0| 87|FALSE| 0| NULL| en| Scooby-Doo! and t...|Splash into actio...| 3.689| NULL|Animation, Family...|Hanna-Barbera Pro...|United States of ...| English| NULL| |272438|Scooby-Doo! and t...| 9| 52|Released| 8/30/2011| 0| 0|FALSE| 0| tt3413334| en| Scooby-Doo! and t...|3 robot-themed ep...| 3.051| NULL|Animation, Family...|Hanna-Barbera Pro...|United States of ...| English| robot| |441348|What's New Scooby...| 9| 52|Released| 9/12/2007| 0| 100|FALSE| 0| NULL| en| What's New Scooby...|Snoop along with ...| 2.956| NULL|Animation, Comedy...|Warner Bros. Anim...|United States of ...| Mandarin, English|talking dog, drag...| |486646|Natale in casa Cu...| 9| 51|Released| 12/24/1977| 0| 133|FALSE| 0| tt5250966| it| Natale in casa Cu...|Luca Cupiello, li...| 2.42| NULL|Comedy, Drama, TV...| RAI| Italy| Italian|theater play, fam...| +------+--------------------+------------+----------+--------+------------+--------+-------+-----+-------+----------+-----------------+-----------------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ only showing top 20 rows
# Cache the dataset. cache() is lazy, so follow it with an action.
# Prefer count() as that action: it scans every partition, so the whole dataset is cached
# (show() or take() may materialize only some of the partitions).
df_movies_cast.cache().count()
25/02/21 12:43:00 WARN MemoryStore: Not enough space to cache rdd_33_2 in memory! (computed 106.1 MiB so far) 25/02/21 12:43:00 WARN BlockManager: Persisting block rdd_33_2 to disk instead. 25/02/21 12:43:01 WARN MemoryStore: Not enough space to cache rdd_33_2 in memory! (computed 106.1 MiB so far) 25/02/21 12:43:02 WARN MemoryStore: Not enough space to cache rdd_33_2 in memory! (computed 106.1 MiB so far)
1048575
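Go to the Storage tab of the Spark UI and check the cached entry:
Storage Level: Disk Memory Deserialized 1x Replicated
Size in Memory and Size on Disk: the part that did not fit in memory was persisted to disk, as the MemoryStore warnings above show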
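The Storage Level label is built from the flags of the storage level. The sketch below mirrors that composition so the label can be read from code as well. It assumes the object passed in exposes useDisk, useMemory, useOffHeap, deserialized and replication attributes, as pyspark.StorageLevel does; describe_level is a hypothetical helper name, and the exact wording in the UI may differ between Spark versions.

```python
def describe_level(level) -> str:
    """Build a UI-style label from the flags of a storage level object."""
    parts = []
    if level.useDisk:
        parts.append("Disk")
    if level.useMemory:
        # Off-heap memory is labelled separately from on-heap memory.
        parts.append("Memory (off heap)" if level.useOffHeap else "Memory")
    parts.append("Deserialized" if level.deserialized else "Serialized")
    parts.append(f"{level.replication}x Replicated")
    return " ".join(parts)
```

With a live session, describe_level(df_movies_cast.storageLevel) should give the same label that the Storage tab shows for the cached DataFrame.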
# Run the same filter again
df_movies_cast.where(col("vote_average") > 8).show()
# This ran much faster because the data is cached: 2 s uncached vs 0.3 ms cached
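To compare the uncached and cached runs with measured numbers rather than reading them off the UI, a small timing wrapper is enough. This is a minimal sketch; timed is a hypothetical helper name and it measures wall-clock time on the driver only.

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    return result, elapsed
```

For example, calling timed(lambda: df_movies_cast.where(col("vote_average") > 8).show()) once before and once after df_movies_cast.cache().count() gives the two durations to compare.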
If you open the SQL / DataFrame tab and check the DAG, the query no longer reads the CSV file; it reads from an InMemoryTableScan instead.
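The same check can be made from code instead of the UI. The sketch below assumes that explain() prints the physical plan to standard output, as PySpark's DataFrame.explain does, and that a read from the cache appears in that plan as an InMemoryTableScan node. plan_text and reads_from_cache are hypothetical helper names.

```python
import io
from contextlib import redirect_stdout


def plan_text(df) -> str:
    """Capture what df.explain() prints and return it as a string."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        df.explain()
    return buffer.getvalue()


def reads_from_cache(df) -> bool:
    """Return True if the printed plan mentions an in-memory table scan."""
    return "InMemoryTableScan" in plan_text(df)
```

Calling reads_from_cache(df_movies_cast.where(col("vote_average") > 8)) after the cache has been materialized is expected to return True, and False once the DataFrame has been unpersisted.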
The default storage level for cache() is MEMORY_AND_DISK for DataFrames (recent PySpark versions report it as MEMORY_AND_DISK_DESER) and MEMORY_ONLY for RDDs.
Remove the data from the cache, then run the filter again
df_movies_cast.unpersist()
DataFrame[id: string, title: string, vote_average: int, vote_count: string, status: string, release_date: string, revenue: string, runtime: string, adult: string, budget: string, imdb_id: string, original_language: string, original_title: string, overview: string, popularity: string, tagline: string, genres: string, production_companies: string, production_countries: string, spoken_languages: string, keywords: string]
# Create new dataset df_cache from df_movies_cast
df_cache = df_movies_cast.cache()
df_cache.count()
25/02/21 12:43:39 WARN MemoryStore: Not enough space to cache rdd_51_2 in memory! (computed 106.1 MiB so far) 25/02/21 12:43:39 WARN BlockManager: Persisting block rdd_51_2 to disk instead. 25/02/21 12:43:41 WARN MemoryStore: Not enough space to cache rdd_51_2 in memory! (computed 106.1 MiB so far) 25/02/21 12:43:41 WARN MemoryStore: Not enough space to cache rdd_51_2 in memory! (computed 106.1 MiB so far)
1048575
# Read from the original dataset df_movies_cast
df_movies_cast.where("vote_average > 8").show()
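The result came back almost instantly. cache() returns the same DataFrame it was called on, so df_cache and df_movies_cast share one query plan. Spark looks up cached data by plan, so a query written against the original name still reads from the cache.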
Caching data after a filter
df_cache.unpersist()
DataFrame[id: string, title: string, vote_average: int, vote_count: string, status: string, release_date: string, revenue: string, runtime: string, adult: string, budget: string, imdb_id: string, original_language: string, original_title: string, overview: string, popularity: string, tagline: string, genres: string, production_companies: string, production_countries: string, spoken_languages: string, keywords: string]
df_cache_filter = df_movies_cast.where(col("vote_average") > 8).cache()
df_cache_filter.count()
35373
Note that only the filtered rows are cached here: the count is 35373, not the full 1048575.
Now let's apply a different filter.
df_cache_filter_new = df_movies_cast.where(col("vote_average") < 6)
df_cache_filter_new.show()  # show() returns None, so it is called separately from the assignment
+------+--------------------+------------+----------+--------+------------+----------+-------+-----+---------+---------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ | id| title|vote_average|vote_count| status|release_date| revenue|runtime|adult| budget| imdb_id|original_language| original_title| overview| popularity| tagline| genres|production_companies|production_countries| spoken_languages| keywords| +------+--------------------+------------+----------+--------+------------+----------+-------+-----+---------+---------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ |297761| Suicide Squad| 5| 20097|Released| 8/3/2016| 746846894| 123|FALSE|175000000|tt1386697| en| Suicide Squad|From DC Comics co...| 35.356|Worst. Heroes. Ever.|Action, Adventure...|DC Entertainment,...|United States of ...|English, Japanese...|secret mission, s...| |209112|Batman v Superman...| 5| 17081|Released| 3/23/2016| 873637528| 152|FALSE|250000000|tt2975590| en|Batman v Superman...|Fearing the actio...| 78.589| Who will win?|Action, Adventure...|Warner Bros. Pict...|Canada, United Ki...| English|superhero, based ...| | 14161| 2012| 5| 11300|Released| 10/10/2009| 791217826| 158|FALSE|200000000|tt1190080| en| 2012|"Dr. 
Adrian Helms...| Curtis struggles...| volcanic eruptio...| 51.914| We Were Warned.|Action, Adventure...|Columbia Pictures...|United States of ...| |216015|Fifty Shades of Grey| 5| 11101|Released| 2/11/2015| 571006128| 125|FALSE| 40000000|tt2322441| en|Fifty Shades of Grey|When college seni...| 96.283| Are you curious?|Drama, Romance, T...|Focus Features, M...|United States of ...| English|bad smell, based ...| | 9738| Fantastic Four| 5| 8765|Released| 6/29/2005| 333535934| 106|FALSE|100000000|tt0120667| en| Fantastic Four|"During a space v...| 23.572|4 times the actio...|Action, Adventure...|Kumar Mobilienges...|Germany, United S...| English|mask, friendship,...| | 87101| Terminator Genisys| 5| 7870|Released| 6/23/2015| 440603537| 126|FALSE|155000000|tt1340138| en| Terminator Genisys|The year is 2029....| 67.345| Reset the future|Science Fiction, ...|Skydance, Paramou...|Canada, United St...| English|future, artificia...| | 217|Indiana Jones and...| 5| 7629|Released| 5/21/2008| 786636033| 122|FALSE|185000000|tt0367882| en|Indiana Jones and...|Set during the Co...| 47.907|The adventure con...| Adventure, Action|Paramount, Lucasf...|United States of ...|English, German, ...|treasure, mexico ...| | 91314|Transformers: Age...| 5| 7581|Released| 6/25/2014|1104054072| 165|FALSE|210000000|tt2109248| en|Transformers: Age...|"As humanity pick...|"" Autobots and D...| a group of powerful| ingenious busine...| powerful Transfo...| 57.581|This is not war, ...|Science Fiction, ...| | 58595|Snow White and th...| 5| 7503|Released| 5/30/2012| 396600000| 127|FALSE|170000000|tt1735898| en|Snow White and th...|After the Evil Qu...| 34.976|The Fairytale is ...|Adventure, Fantas...|Universal Picture...|United States of ...| English|magic, immortalit...| | 1979|Fantastic Four: R...| 5| 7381|Released| 6/13/2007| 301913131| 92|FALSE|130000000|tt0486576| en|Fantastic Four: R...|The Fantastic Fou...| 30.047|Discover the secr...|Adventure, Fantas...|1492 Pictures, Be...|Germany, United 
K...|English, Japanese...|mask, surfboard, ...| |121856| Assassin's Creed| 5| 7328|Released| 12/21/2016| 240697856| 116|FALSE|125000000|tt2094766| en| Assassin's Creed|Through unlocked ...| 26.133|Your destiny is i...|Action, Adventure...|New Regency Pictu...|France, United Ki...|English, Spanish,...|assassin, spain, ...| |257344| Pixels| 5| 7061|Released| 7/16/2015| 244874809| 105|FALSE| 88000000|tt2120120| en| Pixels|Video game expert...| 32.028| Game On.|Action, Comedy, S...|Sony Pictures, Co...|China, United Sta...| English|new york city, lo...| | 44912| Green Lantern| 5| 6806|Released| 6/14/2011| 219851172| 114|FALSE|200000000|tt1133985| en| Green Lantern|For centuries, a ...| 32.041|In our darkest ho...|Adventure, Action...|DC Entertainment,...|United States of ...| English|superhero, transf...| |223702| Sausage Party| 5| 6789|Released| 7/11/2016| 140705322| 89|FALSE| 19000000|tt1700841| en| Sausage Party|Frank leads a gro...| 44.804|Always use condim...|Adventure, Animat...|Columbia Pictures...|Canada, United St...| English|supermarket, paro...| |282035| The Mummy| 5| 6741|Released| 6/6/2017| 409231607| 110|FALSE|125000000|tt2345759| en| The Mummy|Though safely ent...| 21.893|Welcome To A New ...|Fantasy, Thriller...|Secret Hideout, S...|China, Japan, Uni...| Arabic, English|egypt, monster, s...| | 98566|Teenage Mutant Ni...| 5| 6391|Released| 8/7/2014| 485004754| 101|FALSE|125000000|tt1291150| en|Teenage Mutant Ni...|When a kingpin th...| 76.327|Mysterious. 
Dange...|Science Fiction, ...|Nickelodeon Movie...|United States of ...| English|new york city, va...| | 76757| Jupiter Ascending| 5| 6353|Released| 2/4/2015| 183987723| 127|FALSE|176000003|tt1617661| en| Jupiter Ascending|In a universe whe...| 53.857|Expand your unive...|Science Fiction, ...|Anarchos Producti...|United States of ...| English, Russian|jupiter, space, s...| | 68728|Oz the Great and ...| 5| 6232|Released| 3/7/2013| 491868548| 130|FALSE|200000000|tt1623205| en|Oz the Great and ...|Oscar Diggs, a sm...| 29.84|In Oz, nothing is...|Fantasy, Adventur...|Roth Films, Walt ...|United States of ...| English|witch, magic, cir...| | 82700| After Earth| 5| 6205|Released| 5/30/2013| 243843127| 100|FALSE|130000000|tt1815862| en| After Earth|One thousand year...| 24.446|Danger is real, f...|Science Fiction, ...|Columbia Pictures...|United States of ...| English| dystopia| |439079| The Nun| 5| 6076|Released| 9/5/2018| 365582797| 96|FALSE| 22000000|tt5814060| en| The Nun|A priest with a h...| 357.731|Pray For Forgiveness|Horror, Mystery, ...|New Line Cinema, ...|United States of ...|English, Romanian...|rome, italy, nun,...| +------+--------------------+------------+----------+--------+------------+----------+-------+-----+---------+---------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ only showing top 20 rows
Observe that if you now check the DAG, the filter reads the original CSV and not the cached data.
This affects performance, because the cache is bypassed and the file is scanned again.
Hence, be careful about what you cache.
Persist¶
# Remove cache
spark.catalog.clearCache()
# MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2
import pyspark
df_persist = df_movies.persist(pyspark.StorageLevel.MEMORY_ONLY)
df_persist.write.format("noop").mode("overwrite").save()
Now check the 'Storage' tab. The entire dataset is held in memory and nothing has spilled to disk. The data is also serialized now, unlike with cache().
Because serialized data is more compact, it can be stored completely in memory.
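MEMORY_ONLY_SER and MEMORY_AND_DISK_SER cannot be used in PySpark; its MEMORY_ONLY and MEMORY_AND_DISK levels already store the data serialized.
MEMORY_ONLY_2 - Same as MEMORY_ONLY, but each cached partition is replicated on two executors (one extra copy on a peer), not twice on the same executor.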
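The labels on the 'Storage' tab are easier to read once a storage level is seen as a handful of flags. The sketch below is a plain-Python model, not PySpark's own class: `LevelSketch` and `describe` are hypothetical names, the field names mirror the attributes of `pyspark.StorageLevel`, and the flag values are assumed to match the PySpark constants of the same name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelSketch:
    """Plain-Python stand-in for a storage level; NOT pyspark.StorageLevel itself."""
    useDisk: bool
    useMemory: bool
    useOffHeap: bool
    deserialized: bool
    replication: int = 1


def describe(level: LevelSketch) -> str:
    """Hypothetical helper: build a label in the style shown on the 'Storage' tab."""
    parts = []
    if level.useDisk:
        parts.append("Disk")
    if level.useMemory:
        parts.append("Memory")
    if level.useOffHeap:
        parts.append("OffHeap")
    parts.append("Deserialized" if level.deserialized else "Serialized")
    parts.append(f"{level.replication}x Replicated")
    return " ".join(parts)


# Flag values assumed to match the PySpark constants of the same name.
MEMORY_ONLY = LevelSketch(False, True, False, False)
MEMORY_ONLY_2 = LevelSketch(False, True, False, False, 2)
MEMORY_AND_DISK = LevelSketch(True, True, False, False)
DISK_ONLY = LevelSketch(True, False, False, False)

for name, level in [
    ("MEMORY_ONLY", MEMORY_ONLY),
    ("MEMORY_ONLY_2", MEMORY_ONLY_2),
    ("MEMORY_AND_DISK", MEMORY_AND_DISK),
    ("DISK_ONLY", DISK_ONLY),
]:
    print(f"{name:<16} -> {describe(level)}")
```

In a live session, `df_persist.storageLevel` returns the level actually applied to a DataFrame, so it can be compared against what the 'Storage' tab reports.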
# MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2
import pyspark
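# Drop the earlier MEMORY_ONLY cache first; otherwise the new storage level may not take effect
df_movies.unpersist()
df_persist = df_movies.persist(pyspark.StorageLevel.MEMORY_ONLY_2)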
df_persist.write.format("noop").mode("overwrite").save()
25/02/21 13:00:35 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
25/02/21 13:00:35 WARN BlockManager: Block rdd_92_0 replicated to only 0 peer(s) instead of 1 peers
25/02/21 13:00:35 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
25/02/21 13:00:35 WARN BlockManager: Block rdd_92_1 replicated to only 0 peer(s) instead of 1 peers
25/02/21 13:00:43 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
25/02/21 13:00:43 WARN BlockManager: Block rdd_92_2 replicated to only 0 peer(s) instead of 1 peers
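Observe that the 'Storage' tab now shows 'Memory Serialized 2x Replicated'. The warnings above appear because the session runs in local mode, where there is no peer to hold the second replica.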
spark.stop()