multi-level #activerecord-import
Don't know why pull request comments do not show up readily in Google search, but this was very relevant to what I need.
mass updating using #sql #ruby @postgresql
classic article showing raw sql is the fastest
bulk_updateable is a module for mass updating in Postgres, written before upsert became generally available (coming up in 9.5). Note that it does not upsert; it only updates in bulk.
bulk_insert seems to achieve the same thing as activerecord-import, which i already use
database query optimization
Completed 200 OK in 1364ms (Views: 658.7ms | ActiveRecord: 519.8ms)
Completed 200 OK in 77ms (Views: 57.7ms | ActiveRecord: 4.2ms)
without transaction:
  Completed 302 Found in 13438ms (ActiveRecord: 4055.4ms)
  Completed 200 OK in 645ms (Views: 603.9ms | ActiveRecord: 39.3ms)
with transaction:
  Completed 302 Found in 11166ms (ActiveRecord: 3050.6ms)
  Completed 200 OK in 564ms (Views: 393.8ms | ActiveRecord: 78.1ms)
without transaction:
  Completed 302 Found in 12376ms (ActiveRecord: 3641.6ms)
  Completed 200 OK in 78ms (Views: 69.8ms | ActiveRecord: 6.8ms)
with transaction:
  Completed 302 Found in 10167ms (ActiveRecord: 2133.2ms)
  Completed 200 OK in 642ms (Views: 458.3ms | ActiveRecord: 70.6ms)
with transaction outside:
  Completed 302 Found in 10215ms (ActiveRecord: 2227.3ms)
  Completed 200 OK in 558ms (Views: 483.6ms | ActiveRecord: 24.5ms)
with transaction outside:
  Completed 302 Found in 9564ms (ActiveRecord: 2008.1ms)
double transaction:
  Completed 302 Found in 9630ms (ActiveRecord: 1929.7ms)
without:
  Completed 302 Found in 12418ms (ActiveRecord: 3495.7ms)
with multiple create (timing_per_row):
  Completed 302 Found in 20588ms (ActiveRecord: 3302.2ms)
with multiple create at top level (alltimings):
  Completed 302 Found in 16489ms (ActiveRecord: 5036.2ms)
  Completed 200 OK in 63ms (Views: 55.7ms | ActiveRecord: 6.1ms)
Completed 302 Found in 15355ms (ActiveRecord: 4982.4ms)
Completed 200 OK in 64ms (Views: 56.8ms | ActiveRecord: 6.0ms)
double transaction
UPDATE : 5/1/2014 - not much diff; this is in development
Completed 302 Found in 9703ms (ActiveRecord: 3558.0ms)
running into server timeout in production
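To compare runs like the ones logged above without eyeballing them, the timings can be pulled out of the "Completed ..." lines and averaged. This is only a rough sketch: the regex and the helper names parse_completed and average_total_ms are made up for illustration, and the sample lines are the explicitly labeled "without transaction" and "with transaction" 302 responses from these notes.

```ruby
# Hypothetical pattern for Rails "Completed ..." log lines; the Views part is optional.
COMPLETED_LINE = /Completed (\d{3}) .*? in (\d+)ms \((?:Views: ([\d.]+)ms \| )?ActiveRecord: ([\d.]+)ms\)/

# Returns a hash of timings for one log line, or nil if the line does not match.
def parse_completed(line)
  match = COMPLETED_LINE.match(line)
  return nil unless match

  {
    status: match[1].to_i,
    total_ms: match[2].to_i,
    views_ms: match[3] && match[3].to_f,
    active_record_ms: match[4].to_f
  }
end

# Averages the total request time over all lines that parse.
def average_total_ms(lines)
  totals = lines.map { |line| parse_completed(line) }.compact.map { |t| t[:total_ms] }
  return nil if totals.empty?

  totals.sum.to_f / totals.size
end

without_transaction = [
  "Completed 302 Found in 13438ms (ActiveRecord: 4055.4ms)",
  "Completed 302 Found in 12376ms (ActiveRecord: 3641.6ms)"
]
with_transaction = [
  "Completed 302 Found in 11166ms (ActiveRecord: 3050.6ms)",
  "Completed 302 Found in 10167ms (ActiveRecord: 2133.2ms)"
]

puts "without transaction: #{average_total_ms(without_transaction)}ms"
puts "with transaction:    #{average_total_ms(with_transaction)}ms"
```

With only two samples per variant this says little about variance, so it is a sanity check rather than a benchmark.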
NOTES:
http://stackoverflow.com/questions/2509320/saving-multiple-objects-in-a-single-call-in-rails
http://stackoverflow.com/questions/15784305/batch-insertion-in-rails-3
http://www.postgresql.org/docs/9.2/static/sql-copy.html
http://ruby-journal.com/how-to-import-millions-records-via-activerecord-within-minutes-not-hours/
http://stackoverflow.com/questions/15317837/bulk-insert-records-into-active-record-table
Speeding up CSV imports with Rails
I was recently working on a project that required frequent importing of S3-hosted CSV files containing hundreds of thousands of users. My first pass at the import was fairly standard:
    require 'rubygems'
    require 'fog'
    require 'csv'

    start_time = Time.now
    counter = 0
    tenant = Tenant.first

    connection = Fog::Storage.new({
      :provider => 'AWS',
      :aws_access_key_id => 'xxx',
      :aws_secret_access_key => 'xxx'
    })
    directory = connection.directories.get("xxx")
    file = directory.files.get('imports/test-import-medium.txt')
    body = file.body

    CSV.parse(body, col_sep: "|", headers: true) do |row|
      row_hash = row.to_hash
      user = User.new(
        first_name: row_hash["FirstName"],
        last_name: row_hash["LastName"],
        address: row_hash["Address1"],
        address2: row_hash["Address2"],
        city: row_hash["City"],
        state: row_hash["State"],
        zip: row_hash["ZipCode"],
        email: row_hash["Email"],
        gender: row_hash["Gender"],
      )
      user.set_random_password
      user.memberships.build(tenant: tenant, status: Membership::STATUSES[:created])
      user.save!
      counter += 1
    end

    end_time = Time.now
    puts "#{counter} users imported #{((end_time - start_time) / 60).round(2)} minutes (#{( counter / (end_time - start_time)).round(2)} users/second)"
I ran that version against a test file containing 25,000 records. The result was -- well, I'm not sure what the result was because I stopped it over an hour in when it had only imported 5,000 records.
After scratching my head for a minute and doing a little debugging, I discovered that the slowest part of the script was the call to "user.save!". Sure, connecting to S3 and streaming down the file wasn't lightning-fast, but the real culprit was saving a user. It was extraordinarily slow for some reason.
Then it hit me: the reason I was setting a password via my custom "set_random_password" method was that the User model used has_secure_password. My workflow didn't even require the user to have a password at this point, but has_secure_password requires a password on creation, so I was passing in a dummy password to pass the validation. Generating the password was fast, but encrypting it was where things were slowing down: has_secure_password hashes passwords with bcrypt, which is deliberately expensive. I was taking a significant hit for something that I didn't even need.
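To get a feel for that gap, here is a rough sketch that times password generation against a deliberately slow hash. has_secure_password relies on bcrypt, which lives in a gem, so this sketch swaps in PBKDF2 from Ruby's bundled OpenSSL as a stand-in; the sample size, iteration count, and helper names are assumptions for illustration, not values from my app.

```ruby
require 'benchmark'
require 'openssl'
require 'securerandom'

SAMPLE_SIZE = 20      # hypothetical number of users to simulate
ITERATIONS = 50_000   # hypothetical work factor for the stand-in hash

# Cheap step: produce a random dummy password.
def random_password
  SecureRandom.hex(16)
end

# Expensive step: PBKDF2 as a stand-in for bcrypt (not what has_secure_password uses).
def slow_hash(password)
  salt = SecureRandom.random_bytes(16)
  OpenSSL::PKCS5.pbkdf2_hmac(password, salt, ITERATIONS, 32, OpenSSL::Digest.new('SHA256'))
end

passwords = []
generate_seconds = Benchmark.realtime do
  SAMPLE_SIZE.times { passwords << random_password }
end

digests = []
hash_seconds = Benchmark.realtime do
  passwords.each { |password| digests << slow_hash(password) }
end

puts format('generating: %.4fs, hashing: %.4fs', generate_seconds, hash_seconds)
```

The absolute numbers depend on the machine and the work factor, but the hashing step should dominate by a wide margin, which is the same shape of problem I was seeing on every save.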
So, my first step was to rip out has_secure_password and write my own encryption/authentication methods. That way, I had control over the validation which meant I no longer needed to pass in a password during this import process. That alone was a huge gain. After that change, I re-ran the import and the result was:
25000 users imported 12.25 minutes (34.01 users/second)
Twelve minutes was a significant improvement over the first pass, but at that rate, it would still take 3-5 hours to process a file containing 200k rows (assuming the rate slowed with larger files). So, I began looking for big optimization gains.
After some quick googling, I found https://github.com/zdennis/activerecord-import. This looked extremely promising, so I cranked out some code to test it out:
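For reference, the best-case projection at a constant rate can be worked out directly. This is a small sketch with a made-up helper name (projected_minutes); it assumes the measured rate holds for the larger file, which is exactly the assumption I did not trust.

```ruby
# Hypothetical helper: minutes needed to import `rows` records at a constant rate.
def projected_minutes(rows, users_per_second)
  (rows / users_per_second.to_f / 60).round(2)
end

# Rate measured on the 25,000-record test file (12.25 minutes total).
measured_rate = 25_000 / (12.25 * 60)

puts "rate: #{measured_rate.round(2)} users/second"
puts "200k rows: #{projected_minutes(200_000, measured_rate)} minutes"
```

At a constant rate that works out to roughly 98 minutes, so the 3-5 hour figure already bakes in a substantial slowdown on larger files.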
    start_time = Time.now
    counter = 0
    tenant = Tenant.first
    users = []

    ...

    CSV.parse(body, col_sep: "|", headers: true) do |row|
      row_hash = row.to_hash
      user = User.new(
        first_name: row_hash["FirstName"],
        last_name: row_hash["LastName"],
        address: row_hash["Address1"],
        address2: row_hash["Address2"],
        city: row_hash["City"],
        state: row_hash["State"],
        zip: row_hash["ZipCode"],
        email: row_hash["Email"],
        gender: row_hash["Gender"]
      )
      user.memberships.build(tenant: tenant, status: Membership::STATUSES[:created])
      users << user
      counter += 1
    end

    User.import users

    end_time = Time.now
    puts "#{counter} users imported #{((end_time - start_time) / 60).round(2)} minutes (#{( counter / (end_time - start_time)).round(2)} users/second)"
The good news: it was FAST
25000 users imported 1.57 minutes (265.99 users/second)
The bad news: it completely ignored the build() association and didn't save the membership records. This was because activerecord-import can't handle associations. When it inserts records, nothing is returned, so building an associated model just wouldn't work. As frustrating as that was, though, I wasn't willing to abandon the activerecord-import path -- 265 records/second was a huge improvement over 34.
My next pass is where things got a little less attractive, but I was willing to sacrifice some elegance for the sake of speed. I decided to import the users, then loop through the CSV again, get the User.id for each freshly inserted record, manually build a Membership model, then import an array of Memberships. My theory was that doing two loops and a lookup of every user to get their id would still be faster than a non-activerecord-import solution.
    require 'rubygems'
    require 'fog'
    require 'csv'

    start_time = Time.now
    counter = 0
    tenant = Tenant.first
    users = []
    memberships = []

    ...

    CSV.parse(body, col_sep: "|", headers: true) do |row|
      row_hash = row.to_hash
      user = User.new(
        first_name: row_hash["FirstName"],
        last_name: row_hash["LastName"],
        address: row_hash["Address1"],
        address2: row_hash["Address2"],
        city: row_hash["City"],
        state: row_hash["State"],
        zip: row_hash["ZipCode"],
        email: row_hash["Email"],
        gender: row_hash["Gender"]
      )
      users << user
      counter += 1
    end

    User.import users

    counter = 0
    CSV.parse(body, col_sep: "|", headers: true) do |row|
      row_hash = row.to_hash
      user = User.where("email = ?",row_hash["Email"]).first
      membership = Membership.new(
        tenant: tenant,
        status: Membership::STATUSES[:created],
        user: user
      )
      memberships << membership
      counter += 1
    end

    Membership.import memberships
I ran it again and the results were:
25000 users imported 2.72 minutes (153.3 users/second)
OK, so not as fast as before, but still plenty fast. Even with 200k records, this would certainly finish within 30-45 minutes. I felt good about the path I was on, but began looking for other improvements. First, I was getting an entire user record in the second loop, when I really only needed the User.id:
    user = User.where("email = ?",row_hash["Email"]).first
So, I replaced that line with the following:
    user_id = User.where("email = ?",row_hash["Email"]).select(:id).first.id
I was also making unnecessary calls within the loop to get the value of Membership::STATUSES[:created], so I moved that out of the loop and set a "status" variable once. I was also passing in an entire Tenant object, when again, all I needed was the Tenant.id, so I fixed that too.
My final version looked like this:
    ...

    start_time = Time.now
    users = []
    memberships = []
    counter = 0
    tenant_id = Tenant.first.id
    status = Membership::STATUSES[:created]

    ...

    CSV.parse(body, col_sep: "|", headers: true) do |row|
      row_hash = row.to_hash
      user = User.new(
        first_name: row_hash["FirstName"],
        last_name: row_hash["LastName"],
        address: row_hash["Address1"],
        address2: row_hash["Address2"],
        city: row_hash["City"],
        state: row_hash["State"],
        zip: row_hash["ZipCode"],
        email: row_hash["Email"],
        gender: row_hash["Gender"]
      )
      users << user
      counter += 1
    end

    User.import users

    counter = 0
    CSV.parse(body, col_sep: "|", headers: true) do |row|
      row_hash = row.to_hash
      user_id = User.where("email = ?",row_hash["Email"]).select(:id).first.id
      membership = Membership.new(
        tenant_id: tenant_id,
        status: status,
        user_id: user_id
      )
      memberships << membership
      counter += 1
    end

    Membership.import memberships

    end_time = Time.now
    puts "#{counter} users imported in #{((end_time - start_time) / 60).round(2)} minutes."
And when I ran it, I got the following:
25000 users imported 2.4 minutes (173.71 users/second)
Those last little optimizations yielded an improvement of about 20 records per second. From where I started, the import process went from 5-10 hours to under thirty minutes for 200k records. That's a huge improvement and the time savings will yield huge rewards for my client. Not a bad day's work.
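One further variation I did not try, so treat it as an untested idea rather than part of the results above: the second loop still issues one lookup per CSV row, and that could probably be replaced by building an email-to-id hash once. The sketch below is plain Ruby with in-memory stand-ins (the ImportedUser struct, the sample emails, tenant_id, and status are all hypothetical); in a Rails app the pairs might come from a single query such as User.where(email: emails).pluck(:email, :id).

```ruby
# Stand-in for rows already inserted by User.import; in a Rails app the
# [email, id] pairs would come from the database instead.
ImportedUser = Struct.new(:id, :email)

imported_users = [
  ImportedUser.new(1, 'ada@example.com'),
  ImportedUser.new(2, 'grace@example.com')
]
csv_emails = ['grace@example.com', 'ada@example.com']

# Build the lookup once instead of querying once per CSV row.
ids_by_email = imported_users.each_with_object({}) do |user, lookup|
  lookup[user.email] = user.id
end

tenant_id = 42       # hypothetical value
status = 'created'   # hypothetical value

# fetch raises KeyError if a CSV email has no imported user, which surfaces bad data early.
membership_attributes = csv_emails.map do |email|
  { tenant_id: tenant_id, status: status, user_id: ids_by_email.fetch(email) }
end

puts membership_attributes.inspect
```

The trade-off is memory: the hash holds every email from the file, which should be fine for a few hundred thousand rows but is worth keeping in mind for much larger imports.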