Python Tutorial: Understanding Computer Storage & Big Data
---
Hello and welcome to Parallel Computing with Dask. I'm Dhavide Aruliah. This course is about the Python library Dask that helps data scientists scale their analyses painlessly.
First, let's ask what "Big Data" means. Roughly speaking, it's more data than a single machine can accommodate. To understand that precisely, we need to discuss computer storage.
Let's clarify storage units. In conventional metric units, prefixes like kilo-, mega-, giga-, & tera- denote factors of 1,000. For instance, kilowatt, megawatt, gigawatt, & terawatt mean 10^3, 10^6, 10^9, and 10^12 watts respectively. Computers, however, work in binary (base 2) notation, not base 10. The base unit of data storage is the binary digit, or bit; eight bits make one byte of data. With data, then, it's conventional to use kilo-, mega-, giga-, & tera- to scale units up by 2^10, or 1,024 (because that's almost 1,000). So a kilobyte is 2^10 bytes, a megabyte is 2^20 bytes, a gigabyte is 2^30 bytes, and so on.
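To make that scaling concrete, here's a minimal sketch (my own illustration, not code from the course):

```python
# Binary storage prefixes: each step scales up by another factor of 2**10 = 1,024
KB = 2 ** 10   # kilobyte: roughly a thousand bytes
MB = 2 ** 20   # megabyte: roughly a million bytes
GB = 2 ** 30   # gigabyte: roughly a billion bytes
TB = 2 ** 40   # terabyte: roughly a trillion bytes

print(KB, MB, GB, TB)   # 1024 1048576 1073741824 1099511627776
```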
Hard disks are used for permanent storage. Currently, hard disks can accommodate terabytes of data but require milliseconds to access it.
Computers use Random Access Memory (RAM) for temporary storage. Data in RAM vanishes when the computer shuts down. Data can be retrieved from RAM in nanoseconds to microseconds, but the largest RAM configurations are a handful of gigabytes at best today.
This table (modified from a table by Brendan Gregg) tells the story. It shows data transfer times from RAM, hard disks, and the internet (say, from San Francisco to New York). What matters is the relative time scales: nanoseconds versus microseconds versus milliseconds. If we pretend a transfer from RAM takes one second, then a transfer from a solid-state disk takes 7 to 21 minutes, a transfer from a rotational hard disk takes two and a half hours to a full day, and a transfer over the internet takes almost 4 days.
This is what big data really means: when data overflows from RAM to hard disk or the cloud, the relative processing time effectively jumps from seconds to minutes or days.
With that in mind, let's examine memory usage on my laptop.
We need tools from the psutil & os modules, so we import those. We then define a custom function memory_footprint() to return how much memory (in MB) the Python process uses. We don't have to understand its inner workings, but notice the result returned is rescaled from bytes to megabytes.
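A sketch of what such a function might look like (the course's exact implementation may differ), using psutil's process API:

```python
import os
import psutil

def memory_footprint():
    '''Return the memory (in MB) currently used by this Python process.'''
    mem = psutil.Process(os.getpid()).memory_info().rss  # resident set size, in bytes
    return mem / 1024 ** 2                               # rescale from bytes to megabytes
```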
We import NumPy and call `memory_footprint()` before doing any work. We then create a NumPy array `x` that requires 50 MB of storage and call `memory_footprint()` again to see the memory usage afterwards. Sure enough, the process pulled roughly another 50 MB from the operating system to create `x`.
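Roughly like this (the array's size and random contents are assumptions for illustration, and `memory_footprint()` is the sketch above):

```python
import numpy as np

before = memory_footprint()
N = (1024 ** 2) // 8          # number of 8-byte floats that fill 1 MB
x = np.random.randn(50 * N)   # an array requiring about 50 MB
after = memory_footprint()
print(f'Memory used to create x: {after - before:.1f} MB')
```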
We then compute `x` squared & again compute memory use before & after. We find an additional 50 MB of RAM is acquired for this computation (even though the computed result `x**2` is not actually bound to an identifier).
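That step might look like this (again a sketch reusing the `memory_footprint()` above):

```python
before = memory_footprint()
x ** 2                        # temporary result; never bound to an identifier
after = memory_footprint()
print(f'Extra memory acquired: {after - before:.1f} MB')
```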
We can use a NumPy array's `nbytes` attribute to find out its actual memory requirements. The result is given in bytes. Dividing by 1024^2 gives the size in megabytes (a more readable number).
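For example, with the `x` created above:

```python
print(x.nbytes)              # memory requirement in bytes
print(x.nbytes / 1024 ** 2)  # about 50 MB; a more readable number
```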
We can also create a Pandas DataFrame `df` using `x`. The DataFrame has a method `memory_usage()` that returns an integer Series summarizing its memory footprint in bytes. As before, we can divide by 1024^2 to get the memory footprint in megabytes.
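A sketch of that step (building a single-column DataFrame from `x` is an assumption about the example's layout):

```python
import pandas as pd

df = pd.DataFrame(x)                             # DataFrame built from the array x
print(df.memory_usage(index=False))              # bytes used per column
print(df.memory_usage(index=False) / 1024 ** 2)  # the same figures in megabytes
```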
So, before we dive into Dask, try some exercises to make sure you know how much memory you're using.
#PythonTutorial #ComputerStorage #Python #DataCamp #BigData #parallelprogramming #dask