Beware of CSV Copies in PostgreSQL Which Include a Primary Key!

Slow COPY commands got you down? Try without the primary key!

Posted on April 18, 2021

While working on the hackathon project behind my blog post "Analyzing Park and Rail (P+Rail) Parking Availability Across The Entirety of Switzerland", I had to migrate a PostgreSQL table I had stored on my local machine to my production server. (The table was populated locally from a batch job since my processor is way faster on my laptop than the small single CPU droplet I use for the production site.

Naïve Attempt

Not really thinking ahead, I naïvely generated a .csv with the following command:

COPY "ParkingAvailabilities" TO '/Users/chris/downloads/ParkingAvailabilities.csv' DELIMITER ',' CSV HEADER;

I then sent my .csv file to the droplet using the scp command:

scp -r /Users/chris/Downloads/ParkingAvailabilities.csv root@<<DROPLET IP HERE>>:/root

Logging into the droplet, I first moved the .csv file from the root to the tmp folder, since the postgres user won't be able to access a file in the root/ folder:

mv ~/ParkingAvailabilities.csv /tmp/

I then attempted to import the .csv directly into my production table:

COPY "ParkingAvailabilities" FROM '/tmp/ParkingAvailabilities.csv' WITH (FORMAT csv, HEADER true);

The command successfully dispatched, but I noticed a strange problem. I kept monitoring the progress of the import by checking the COUNT of the table I was importing to:

SELECT COUNT(*) FROM "ParkingAvailabilities";

I saw that something like 4000 rows were rapidly inserted, but the progress began to slow, showing things like 5000, then 5100, and slower and slower until the insert progress seemed to nearly hang. I had never seen this before, so I did a bit of research.

It turns out what PostgreSQL was examining each primary key for each new CSV row that it tried to import. Essentially in this case, for each new row it tries to import, it will check the primary key against all 3 million rows in the csv. This of course gets incrementally slower and slower for each row that is inserted!

For a development-to-production migration scenario, this doesn't make sense anyway: it's best to let PostgreSQL manage it's own primary key creation.

So, to get the fastest COPY possible, I needed to remove that primary key from my CSV data.

The Solution: Remove the Primary Key Column from the CSV File!

Luckily, I was smart enough to realize didn't have to re-export the data and move it to the production server with SCP again. I instead used the handy Bash command cut on the existing .csv file I already had to strip off the primary key column, which I my case, was the first column of the csv:

cut -d, -f 1 —complements ParkingAvailabilities.csv > ParkingAvailabilitiesNoID.csv

Here, I used the -d flag for delete, the -f flag which is for field, and I specify field 1, which for me was the Id column, my primary key. I then write the file to a new csv file, ParkingAvailabilitiesNoID.csv

I then unfortunately had to modify the COPY command as well, explicitly specifying the columns I wanted to import from the csv file, since PostgreSQL will complain when it sees a mismatch between the table and the number of columns in the .csv file. The COPY command with these columns specified looked like this:

COPY "ParkingAvailabilities" ("DateCalculatedFor", "DayOfWeek", "FacilityName","EstimatedParkingSpacesAvailable", "ParkingSpacesTotal", "ParkingForecastData", "Created", "Updated") FROM '/tmp/ParkingAvailabilitiesNoID.csv' WITH (FORMAT csv, HEADER true);

With the Id column removed, this command successfully completed in a matter of seconds - no hangs or memory explosions since PostgreSQL didn't have to make those computationally heavy primary key checks anymore!

Thanks!

That's it for this quick post! I hope you'll be able to use this pattern of commands when you also need to migrate data to your production server!

Cheers 🍻

-Chris

Next / Previous Post:

Find more posts by tag:

-~{/* */}~-