filmov
tv
How to open, edit and export data from large/giant CSV files of several million rows

Показать описание
In this video I show how to open a CSV file of 74 million rows. Excel has a limit of a little under a million and a half rows and Google Sheets has a similar limit. I'm assuming the audience is someone who knows how to use a spreadsheet but has little to know programming eperience. Files these large (over 3GB in this case) can't be opened by python by regular means either. I'll be using RStudio to make this happen.
The file I'll open is from sales from Grupo Bimbo in Mexico, and I'll filter our certain products and filter it by the OXXO stores, which have nearly 10,000 in the data set.
My code in R:
(nota: YouTube doesn't allow me to add the “angle brackets” in the commen section which is necessary and part of R. I've changed this symbol for “{angle bracket}” in the code below, don't use the "*" symbol either).
#Download necessary libraries:
#Import the libraries:
* library(plyr)
* library(plyr)
* library(plyr)
#Upload the CSV:
# break down the 74 million row file into two files with the first two million rows:
#Create a table filtering a column by certain values:
*trainfilter {angle bracket}- subset(train, Producto_ID == 30007 | Producto_ID == 30008 | Producto_ID == 36533 | Producto_ID == 36534 | Producto_ID == 36535 | Producto_ID == 37075 | Producto_ID == 37159 | Producto_ID == 37076 | Producto_ID == 40985)
# Example for merging two columns. The code will work similar to the filter above, but you'll be filtering for almost 10,000 OXXO stores:
*OXXO_STORES {angle bracket}- read_csv("bimbo_cliente_tablas_OXXO.csv")
*OXXO_MERGED {angle bracket}- merge(train,OXXO_STORES,by="Cliente_ID")
#Increase the memory used by RStudios if you get an error similar to “Error: cannot allocate vector of size 283.0 Mb”:
#demuestra cuanta memoria estas utilizando:
#incrementa el número de memoria que utilizas:
#How to download a table in R into a CSV file in your computer
#If your filtered file is still over a million and a half rows you can download it to your computer and divided into two in R, and download the smaller tables. It will be similar to the table used above, but you'll have to change all intergers into characters and account for any new columns added with the merge:
The file I'll open is from sales from Grupo Bimbo in Mexico, and I'll filter our certain products and filter it by the OXXO stores, which have nearly 10,000 in the data set.
My code in R:
(nota: YouTube doesn't allow me to add the “angle brackets” in the commen section which is necessary and part of R. I've changed this symbol for “{angle bracket}” in the code below, don't use the "*" symbol either).
#Download necessary libraries:
#Import the libraries:
* library(plyr)
* library(plyr)
* library(plyr)
#Upload the CSV:
# break down the 74 million row file into two files with the first two million rows:
#Create a table filtering a column by certain values:
*trainfilter {angle bracket}- subset(train, Producto_ID == 30007 | Producto_ID == 30008 | Producto_ID == 36533 | Producto_ID == 36534 | Producto_ID == 36535 | Producto_ID == 37075 | Producto_ID == 37159 | Producto_ID == 37076 | Producto_ID == 40985)
# Example for merging two columns. The code will work similar to the filter above, but you'll be filtering for almost 10,000 OXXO stores:
*OXXO_STORES {angle bracket}- read_csv("bimbo_cliente_tablas_OXXO.csv")
*OXXO_MERGED {angle bracket}- merge(train,OXXO_STORES,by="Cliente_ID")
#Increase the memory used by RStudios if you get an error similar to “Error: cannot allocate vector of size 283.0 Mb”:
#demuestra cuanta memoria estas utilizando:
#incrementa el número de memoria que utilizas:
#How to download a table in R into a CSV file in your computer
#If your filtered file is still over a million and a half rows you can download it to your computer and divided into two in R, and download the smaller tables. It will be similar to the table used above, but you'll have to change all intergers into characters and account for any new columns added with the merge: