How to import a 7M+ rows data set into PostgreSQL? (SQL for beginners)

preview_player
Показать описание
This is already unzipped, so you won't need the dtrx step, either! (Check the steps below OR the article for the steps.)

-----------------------------------------------

Or start with SQL for Data Analysis - episode 1:

How to install SQL Workbench for postgreSQL:

-----------------------------------------------
STEPS -- UPDATED in 2022 (not the same as in the video):
1. Open Terminal and login (ssh) to your dataserver!

2. Download the “flight delays” data!

3. Format your data!

4. Now we have to give permission to our postgreSQL user to create tables and load data into them. This will need multiple steps. Here’s a gif first (note: my username is “dataguy” - your’s might be something

4.1. First sudo to the user called “postgres”:
sudo -u postgres -i

4.2. Then start postgreSQL:
psql

4.3. The prompt will change to this one postgres=#! Type:
alter user [your_user_name] superuser;

4.4. This turns your original user into a super user! Go back to your user!
Exit from postgreSQL:
\q
Then exit from the user called “postgres”:
exit

4.5. Finally access your original user’s postgreSQL database from the command line:
psql -d postgres

5. Now all you need to do is create the table by simply copy-pasting this lines into your terminal:

CREATE TABLE flight_delays (
year INTEGER,
month INTEGER,
dayofmonth INTEGER,
dayofweek INTEGER,
deptime INTEGER,
arrtime INTEGER,
flightnum INTEGER,
tailnum VARCHAR,
airtime INTEGER,
arrdelay INTEGER,
depdelay INTEGER,
origin VARCHAR,
dest VARCHAR,
distance INTEGER);

6. And finally copy the data from the csv file you have just downloaded!

7. Go back to SQL Workbench and make a simple SELECT statement… but make sure, that you use the LIMIT clause too. Now you have over 7.000.000 rows of data - even though postgreSQL can handle it easily, your computer might be frozen, if you try print all that data on your screen. So try something like this first:
SELECT * FROM test_test LIMIT 10;

-----------------------------------------------
Рекомендации по теме
Комментарии
Автор

UPDATE: the data set is now here:
This is already unzipped, so you won't need the dtrx step, either! (Check the video description OR the article for steps.)

datatomi
Автор

Hi, how would you do this, if in your table have FOREIGN KEY and subquerys?
Please, I'll be grateful if you answer.

andresareiza
Автор

I have 5 million records in a CSV file which I tried to load into a PostgreSQL table using a copy command just like the one you used. However for reasons unknown only 375, 000 records were copied in the target table. Please help me out with this. Thanks in advance

MuneerAhmed-urbi
Автор

Hi, could you please Explain how to read file from S3 and deploy this in AWS EC2 instance ? Would be happy to understand how to use pg_bulkload instead of copy

shajilakp
Автор

postgres=# COPY PUBLIC.flight_delays_all FROM '/users/SPM/desktop/2007.csv' DELIMITER ', ' CSV HEADER;
ERROR: could not open file "/users/SPM/desktop/2007.csv" for reading: Permission denied
HINT: COPY FROM instructs the PostgreSQL server process to read a file. You may want a client-side facility such as psql's \copy.

shamilponnath
Автор

Good video, do you have any video how to import huge amount of data from folder - with subfolders? And data type is in my example .dif files?

mirrrvelll
Автор

¿Es necesario hacer todos los pasos del servicio remoto para obtener la data flight_delays?. Ayuda por favor. Necesito el archivo csv.

JeanPierreAguirreUrdaniga
Автор

Hello ... could you help me with the procedure for uploading a .csv file with postgresql on UBUNTU?

prathibhakae
Автор

Moreover SELECT * FROM fligh_delays LIMIT 10 ; it goes through but return nothing

mohamedziedan
Автор

Is there a way to use that copy command and eliminate duplicates ? I think mysql supports that with their import tool

cyclonus
Автор

Great vid! You really should be using SSH with authentication key instead of password.

hgiagiamou
Автор

Thanks for share Tomi.I have a similar task for this week.

SergioFerrazSouza
Автор

I followed ur instructions, however i got this >> ERROR: missing data for column "flightnum"
CONTEXT: COPY flight_delays, line 6933096: "2007, 12, 21, 5, 804, "

mohamedziedan
Автор

if you are working with bigdata it is not funny to create table by writing columns, it should be create from header of csv file automatically but anyway thanks for good video!

romanbond