Top Posts Tagged with #activerecord-import

multi-level #activerecord-import

dont know why pull request comments do not show up readily in google search, but this was very relevant to what i need

•18+ Adults Only

Watch Anya Live on Cam

Anya is live and ready to show you everything. Watch her strip, dance, and perform exclusive shows just for you. Interact in real-time and make your fantasies come true.

✓ Live Streaming✓ Interactive Chat✓ Private Shows✓ HD Quality✓ Free Actions

Free to watch • No registration required • HD streaming

mass updating using #sql #ruby @postgresql

classic article showing raw sql is the fastest

bulk_updateable is a module for mass updating in postgres, obviously before upsert is generally available (coming up in 9.5). but this does not upsert, only updates in bulk.

bulk_insert seems to achieve the same thing as activerecord-import, which i already use

#pg #db #ruby #activerecord #activerecord-import

database query optimization

Completed 200 OK in 1364ms (Views: 658.7ms | ActiveRecord: 519.8ms) Completed 200 OK in 77ms (Views: 57.7ms | ActiveRecord: 4.2ms)

without transaction Completed 302 Found in 13438ms (ActiveRecord: 4055.4ms) Completed 200 OK in 645ms (Views: 603.9ms | ActiveRecord: 39.3ms)

with transaction Completed 302 Found in 11166ms (ActiveRecord: 3050.6ms) Completed 200 OK in 564ms (Views: 393.8ms | ActiveRecord: 78.1ms)

without transaction Completed 302 Found in 12376ms (ActiveRecord: 3641.6ms) Completed 200 OK in 78ms (Views: 69.8ms | ActiveRecord: 6.8ms)

with transaction Completed 302 Found in 10167ms (ActiveRecord: 2133.2ms) Completed 200 OK in 642ms (Views: 458.3ms | ActiveRecord: 70.6ms)

with transaction outside Completed 302 Found in 10215ms (ActiveRecord: 2227.3ms) Completed 200 OK in 558ms (Views: 483.6ms | ActiveRecord: 24.5ms)

with transaction outside Completed 302 Found in 9564ms (ActiveRecord: 2008.1ms)

double transaction Completed 302 Found in 9630ms (ActiveRecord: 1929.7ms)

without Completed 302 Found in 12418ms (ActiveRecord: 3495.7ms)

with multiple create (timing_per_row) Completed 302 Found in 20588ms (ActiveRecord: 3302.2ms)

with multiple create at top level (alltimings) Completed 302 Found in 16489ms (ActiveRecord: 5036.2ms) Completed 200 OK in 63ms (Views: 55.7ms | ActiveRecord: 6.1ms)

Completed 302 Found in 15355ms (ActiveRecord: 4982.4ms) Completed 200 OK in 64ms (Views: 56.8ms | ActiveRecord: 6.0ms)

double transaction

UPDATE : 5/1/2014 - not much diff; this is in development

Completed 302 Found in 9703ms (ActiveRecord: 3558.0ms)

running into server timeout in production

NOTES:

http://stackoverflow.com/questions/2509320/saving-multiple-objects-in-a-single-call-in-rails

http://stackoverflow.com/questions/15784305/batch-insertion-in-rails-3

http://www.postgresql.org/docs/9.2/static/sql-copy.html

http://ruby-journal.com/how-to-import-millions-records-via-activerecord-within-minutes-not-hours/

http://stackoverflow.com/questions/15317837/bulk-insert-records-into-active-record-table

#sql #activerecord #performance #db #activerecord-import #upsert

Speeding up CSV imports with rails

I was recently working on a project that required frequent importing of S3-hosted CSV files containing hundreds of thousands of users. My first pass at the import was fairly standard:

require 'rubygems' require 'fog' require 'csv' start_time = Time.now counter = 0 tenant = Tenant.first connection = Fog::Storage.new({ :provider => 'AWS', :aws_access_key_id => 'xxx', :aws_secret_access_key => 'xxx' }) directory = connection.directories.get("xxx") file = directory.files.get('imports/test-import-medium.txt') body = file.body CSV.parse(body, col_sep: "|", headers: true) do |row| row_hash = row.to_hash user = User.new( first_name: row_hash["FirstName"], last_name: row_hash["LastName"], address: row_hash["Address1"], address2: row_hash["Address2"], city: row_hash["City"], state: row_hash["State"], zip: row_hash["ZipCode"], email: row_hash["Email"], gender: row_hash["Gender"], ) user.set_random_password user.memberships.build(tenant: tenant, status: Membership::STATUSES[:created]) user.save! counter += 1 end end_time = Time.now puts "#{counter} users imported #{((end_time - start_time) / 60).round(2)} minutes (#{( counter / (end_time - start_time)).round(2)} users/second)"

I ran that version against a test file containing 25,000 records. The result was -- well, I'm not sure what the result was because I stopped it over an hour in when it had only imported 5,000 records.

After scratching my head for a minute and doing a little debugging, I discovered that the slowest part of the script was the "user.save!." Sure, connecting to s3 and streaming down the file wasn't lightning-fast, but the real culprit was saving a user. It was extraordinarily slow for some reason.

Then it hit me, the reason I was setting a password via my custom "set_random_password" method was because the User model used has_secure_password. My workflow didn't even require the user to have a password at this point, but has_secure_password requires a password on creation, so I was passing in a dummy password to pass the validation. The generating the password was fast, but encrypting the passwords was where things were slowing down. I was taking a significant hit for something that I didn't even need.

So, my first step was to rip out has_secure_password and write my own encryption/authentication methods. That way, I had control over the validation which meant I no longer needed to pass in a password during this import process. That alone was a huge gain. After that change, I re-ran the import and the result was:

25000 users imported 12.25 minutes (34.01 users/second)

12 minutes was a significant improvement from the first pass, but at that rate, it would still take 3-5 hours to process a file containing 200k rows (assuming the rate slowed with larger files). So, I began looking for big optimization gains.

After some quick googling, I found https://github.com/zdennis/activerecord-import. This looked extremely promising, so I cranked out some code to test it out

start_time = Time.now counter = 0 tenant = Tenant.first users = [] ... CSV.parse(body, col_sep: "|", headers: true) do |row| row_hash = row.to_hash user = User.new( first_name: row_hash["FirstName"], last_name: row_hash["LastName"], address: row_hash["Address1"], address2: row_hash["Address2"], city: row_hash["City"], state: row_hash["State"], zip: row_hash["ZipCode"], email: row_hash["Email"], gender: row_hash["Gender"] ) user.memberships.build(tenant: tenant, status: Membership::STATUSES[:created]) users << user counter += 1 end User.import users end_time = Time.now puts "#{counter} users imported #{((end_time - start_time) / 60).round(2)} minutes (#{( counter / (end_time - start_time)).round(2)} users/second)"

The good news: it was FAST

25000 users imported 1.57 minutes (265.99 users/second)

The bad news, it completely ignored the build() association and didn't save the membership records. This was because activerecord-import can't handle associations. When it inserts records, nothing is returned, so building an associated model just wouldn't work. As frustrating as that was though, I wasn't willing to abandon the activerecord-import path -- 265 records/second was a huge impremovent over 34.

My next pass is where things got a little less attractive, but I was willing to sacrafice some elegance for the sake of speed. I decided to import the users, then loop through the CSV again, get the User.id for each freshly inserted record, manually build a Membership model, then import an array of Memberships. My theory was that doing two loops and a lookup of every user to get their id would still be faster than a non activerecord-import solution.

require 'rubygems' require 'fog' require 'csv' start_time = Time.now counter = 0 tenant = Tenant.first users = [] memberships = [] ... CSV.parse(body, col_sep: "|", headers: true) do |row| row_hash = row.to_hash user = User.new( first_name: row_hash["FirstName"], last_name: row_hash["LastName"], address: row_hash["Address1"], address2: row_hash["Address2"], city: row_hash["City"], state: row_hash["State"], zip: row_hash["ZipCode"], email: row_hash["Email"], gender: row_hash["Gender"] ) users << user counter += 1 end User.import users counter = 0 CSV.parse(body, col_sep: "|", headers: true) do |row| row_hash = row.to_hash user = User.where("email = ?",row_hash["Email"]).first membership = Membership.new( tenant: tenant, status: Membership::STATUSES[:created], user: user ) memberships << membership counter += 1 end Membership.import memberships

I ran it again and the results were:

25000 users imported 2.72 minutes (153.3 users/second)

OK, so not as fast as before, but still plenty fast. Even with 200k records, this would certainly finish within 30-45 minutes. I felt good about the path I was on, but began looking for other improvements. First, I was getting an entire user record in the second loop, when I really only needed the User.id:

user = User.where("email = ?",row_hash["Email"]).first

So, I replaced that line with the following:

user_id = User.where("email = ?",email).select(:id).first.id

I was also making unecessary calls within the loop to get the value of Membership::STATUSES[:created], so I moved that out of the loop and set a "status" variable once. I was also passing in an entire Tenant object, when again, all I needed was the Tenant.id, so I fixed that too.

My final version looked like this

... start_time = Time.now users = [] memberships = [] counter = 0 tenant_id = Tenant.first.id status = Membership::STATUSES[:created] ... CSV.parse(body, col_sep: "|", headers: true) do |row| row_hash = row.to_hash user = User.new( first_name: row_hash["FirstName"], last_name: row_hash["LastName"], address: row_hash["Address1"], address2: row_hash["Address2"], city: row_hash["City"], state: row_hash["State"], zip: row_hash["ZipCode"], email: row_hash["Email"], gender: row_hash["Gender"] ) users << user counter += 1 end User.import users counter = 0 CSV.parse(body, col_sep: "|", headers: true) do |row| row_hash = row.to_hash user_id = User.where("email = ?",row_hash["Email"]).select(:id).first.id membership = Membership.new( tenant_id: tenant_id, status: status, user_id: user_id ) memberships << membership counter += 1 end Membership.import memberships end_time = Time.now puts "#{counter} users imported in #{((end_time - start_time) / 60).round(2)} minutes."

And when I ran it, I got the following

25000 users imported 2.4 minutes (173.71 users/second)

Those last little optimizations had a 20 records per second improvement. From where I started, the import process went from 5-10 hours to under thirty minutes for 200k records. That's a huge improvement and the time savings will yield huge rewards for my client. Not a bad day's work.

#rails #active-record #optimization #CSV #activerecord-import

multi-level #activerecord-import

dont know why pull request comments do not show up readily in google search, but this was very relevant to what i need

#activerecord-import #gem