While working on the hackathon project behind my blog post "Analyzing Park and Rail (P+Rail) Parking Availability Across The Entirety of Switzerland", I had to migrate a PostgreSQL table I had stored on my local machine to my production server. (The table was populated locally from a batch job since my processor is way faster on my laptop than the small single CPU droplet I use for the production site.
Not really thinking ahead, I naïvely generated a
.csv with the following command:
COPY "ParkingAvailabilities" TO '/Users/chris/downloads/ParkingAvailabilities.csv' DELIMITER ',' CSV HEADER;
I then sent my
.csv file to the droplet using the
scp -r /Users/chris/Downloads/ParkingAvailabilities.csv root@<<DROPLET IP HERE>>:/root
Logging into the droplet, I first moved the
.csv file from the root to the
tmp folder, since the
postgres user won't be able to access a file in the
mv ~/ParkingAvailabilities.csv /tmp/
I then attempted to import the
.csv directly into my production table:
COPY "ParkingAvailabilities" FROM '/tmp/ParkingAvailabilities.csv' WITH (FORMAT csv, HEADER true);
The command successfully dispatched, but I noticed a strange problem. I kept monitoring the progress of the import by checking the
COUNT of the table I was importing to:
SELECT COUNT(*) FROM "ParkingAvailabilities";
I saw that something like 4000 rows were rapidly inserted, but the progress began to slow, showing things like 5000, then 5100, and slower and slower until the insert progress seemed to nearly hang. I had never seen this before, so I did a bit of research.
It turns out what PostgreSQL was examining each primary key for each new CSV row that it tried to import. Essentially in this case, for each new row it tries to import, it will check the primary key against all 3 million rows in the csv. This of course gets incrementally slower and slower for each row that is inserted!
For a development-to-production migration scenario, this doesn't make sense anyway: it's best to let PostgreSQL manage it's own primary key creation.
So, to get the fastest COPY possible, I needed to remove that primary key from my CSV data.
The Solution: Remove the Primary Key Column from the CSV File!
Luckily, I was smart enough to realize didn't have to re-export the data and move it to the production server with SCP again. I instead used the handy Bash command
cut on the existing
.csv file I already had to strip off the primary key column, which I my case, was the first column of the csv:
cut -d, -f 1 —complements ParkingAvailabilities.csv > ParkingAvailabilitiesNoID.csv
Here, I used the
-d flag for delete, the
-f flag which is for field, and I specify field
1, which for me was the
Id column, my primary key. I then write the file to a new csv file,
I then unfortunately had to modify the
COPY command as well, explicitly specifying the columns I wanted to import from the
csv file, since PostgreSQL will complain when it sees a mismatch between the table and the number of columns in the
.csv file. The
COPY command with these columns specified looked like this:
COPY "ParkingAvailabilities" ("DateCalculatedFor", "DayOfWeek", "FacilityName","EstimatedParkingSpacesAvailable", "ParkingSpacesTotal", "ParkingForecastData", "Created", "Updated") FROM '/tmp/ParkingAvailabilitiesNoID.csv' WITH (FORMAT csv, HEADER true);
Id column removed, this command successfully completed in a matter of seconds - no hangs or memory explosions since PostgreSQL didn't have to make those computationally heavy primary key checks anymore!
That's it for this quick post! I hope you'll be able to use this pattern of commands when you also need to migrate data to your production server!