Beware of CSV Copies in PostgreSQL Which Include a Primary Key!
Slow COPY commands got you down? Try without the primary key!
Posted on April 18, 2021
While working on the hackathon project behind my blog post "Analyzing Park and Rail (P+Rail) Parking Availability Across The Entirety of Switzerland", I had to migrate a PostgreSQL table I had stored on my local machine to my production server. (The table was populated locally by a batch job, since the processor on my laptop is way faster than the small single-CPU droplet I use for the production site.)
Naïve Attempt
Not really thinking ahead, I naïvely generated a .csv with the following command:
COPY "ParkingAvailabilities" TO '/Users/chris/downloads/ParkingAvailabilities.csv' DELIMITER ',' CSV HEADER;
I then sent my .csv file to the droplet using the scp command:
scp -r /Users/chris/Downloads/ParkingAvailabilities.csv root@<<DROPLET IP HERE>>:/root
Logging into the droplet, I first moved the .csv file from root's home directory to the /tmp folder, since the postgres user won't be able to access a file under /root:
mv ~/ParkingAvailabilities.csv /tmp/
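Before kicking off the import, a quick way to confirm that the postgres user can actually read the file from its new location (assuming sudo is available on the droplet) is:
sudo -u postgres head -n 1 /tmp/ParkingAvailabilities.csv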
I then attempted to import the .csv directly into my production table:
COPY "ParkingAvailabilities" FROM '/tmp/ParkingAvailabilities.csv' WITH (FORMAT csv, HEADER true);
The command was dispatched successfully, but I noticed a strange problem. I kept monitoring the progress of the import by checking the COUNT of the table I was importing into:
SELECT COUNT(*) FROM "ParkingAvailabilities";
I saw that something like 4,000 rows were inserted quickly, but then progress began to slow: the count crept to 5,000, then 5,100, slower and slower, until the import seemed to nearly hang. I had never seen this before, so I did a bit of research.
It turns out that PostgreSQL was checking the primary key for each new CSV row it tried to import. Essentially, for every incoming row, it has to verify that the key doesn't collide with anything already in the table and keep the primary key index up to date, and with roughly 3 million rows to load, that of course gets slower and slower as more rows are inserted!
For a development-to-production migration scenario, this doesn't make sense anyway: it's best to let PostgreSQL manage its own primary key creation.
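To make that concrete, if the table is defined with a generated Id column along the lines of this simplified, hypothetical sketch (the real table has more columns and different types), PostgreSQL will happily generate fresh Id values on its own during the import:
CREATE TABLE "ParkingAvailabilities" (
    "Id" bigserial PRIMARY KEY,
    "FacilityName" text,
    "EstimatedParkingSpacesAvailable" integer,
    "ParkingSpacesTotal" integer
    -- ...remaining columns omitted for brevity
);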
So, to get the fastest COPY possible, I needed to remove that primary key from my CSV data.
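(For completeness: another trick that comes up for bulk loads is to drop the primary key constraint before the COPY and re-add it afterwards, so the uniqueness check and index build happen once at the end instead of once per row. I stuck with stripping the column instead, but a sketch would look something like this, where the constraint name is just a guess at PostgreSQL's default naming:)
ALTER TABLE "ParkingAvailabilities" DROP CONSTRAINT "ParkingAvailabilities_pkey";
-- run the COPY here, then:
ALTER TABLE "ParkingAvailabilities" ADD CONSTRAINT "ParkingAvailabilities_pkey" PRIMARY KEY ("Id");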
The Solution: Remove the Primary Key Column from the CSV File!
Luckily, I was smart enough to realize I didn't have to re-export the data and move it to the production server with scp again. Instead, I used the handy Bash command cut on the existing .csv file I already had to strip off the primary key column, which in my case was the first column of the CSV:
cut -d, -f 1 --complement ParkingAvailabilities.csv > ParkingAvailabilitiesNoID.csv
Here, the -d flag sets the field delimiter to a comma, the -f flag selects field 1, which for me was the Id column (my primary key), and --complement inverts the selection so that every field except field 1 is kept. I then write the output to a new file, ParkingAvailabilitiesNoID.csv.
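One caveat: --complement is a GNU coreutils extension, so the BSD cut that ships with macOS doesn't support it; there, cut -d, -f 2- ParkingAvailabilities.csv > ParkingAvailabilitiesNoID.csv should give the same result, since it keeps every field from the second one onwards. And since cut isn't CSV-aware, this simple approach relies on the leading Id field being a plain number, which was the case for my data. Either way, a quick peek at the headers of both files is a nice sanity check that only the Id column was dropped:
head -n 1 ParkingAvailabilities.csv
head -n 1 ParkingAvailabilitiesNoID.csv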
I then unfortunately had to modify the COPY command as well, explicitly specifying the columns I wanted to import from the .csv file, since PostgreSQL will complain when it sees a mismatch between the table's columns and the columns in the .csv file. The COPY command with these columns specified looked like this:
COPY "ParkingAvailabilities" ("DateCalculatedFor", "DayOfWeek", "FacilityName","EstimatedParkingSpacesAvailable", "ParkingSpacesTotal", "ParkingForecastData", "Created", "Updated") FROM '/tmp/ParkingAvailabilitiesNoID.csv' WITH (FORMAT csv, HEADER true);
With the Id column removed, this command completed successfully in a matter of seconds - no hangs or memory explosions, since PostgreSQL didn't have to make those computationally heavy primary key checks anymore!
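As one last sanity check, the number of lines in the CSV (minus one for the header) should match the table's row count. On the droplet:
wc -l /tmp/ParkingAvailabilitiesNoID.csv
...and back in psql:
SELECT COUNT(*) FROM "ParkingAvailabilities";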
Thanks!
That's it for this quick post! I hope you'll be able to use this pattern of commands when you also need to migrate data to your production server!
Cheers 🍻
-Chris