5/03/2021

Python - pyodbc and Batch Inserts to SQL Server (or pyodbc fast_executemany, not so fast...)

 

I recently had a project in which I needed to transfer a 60 GB SQLite database to SQL Server.  After some research I found the sqlite3 and pyodbc modules, and set about scripting connections and insert statements.  

The basic form of my script is to import the modules, set up the database connections, and iterate (via cursor) over the rows of a SELECT statement, building INSERT statements and executing them.  
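
For reference, here is a minimal sketch of that setup; the SQLite path and the ODBC connection string below are placeholders, not the ones from the actual project:

import sqlite3
import pyodbc

# Source: the SQLite database file (path is a placeholder)
sqlite3Conn = sqlite3.connect("source.db")
sqlite3Cursor = sqlite3Conn.cursor()

# Destination: SQL Server via ODBC (connection details are placeholders)
pyodbcConn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=TargetDB;Trusted_Connection=yes;"
)
pyodbcCursor = pyodbcConn.cursor()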

The issue here is that this method sends single inserts one at a time, yielding less than satisfactory performance.  Inserting 35M+ rows in this fashion takes ~5 hours on my system.

Further research turned up general community suggestions such as pyodbc's executemany and fast_executemany, both of which I test below.

In addition, based on my prior DBA experience, my initial thought was to create a string beginning with BEGIN TRANSACTION, concatenate a batch of INSERT statements onto it, end the string with COMMIT TRANSACTION, and finally pass each batch to pyodbc's cursor.execute as a single string.

I did some testing; my results are below.

For this experiment I used a production dataset consisting of a single SQLite table with 18,050,355 rows.  The table has four columns with datatypes of INTEGER, INTEGER, TEXT, and REAL.  The REAL column was cast to an integer, as it was improperly defined in the SQLite database and had been validated to contain only integer values.

1 by 1 inserts (~80 minutes)

The 1 by 1 inserts were performed like this:

select_stmt="SELECT col0, col1,col2,CAST(col3 AS INT) FROM table"
for row in sqlite3Cursor.execute(select_stmt):
    insert_stmt = f"INSERT INTO table(col0,col1,col2,col3) VALUES ({row[0]},{row[1]},'{row[2]}',{row[3]})"
    pyodbcCursor.execute(insert_stmt)

Using SQL Server Profiler I could see the 1 by 1 inserts happening as individual statements.

executemany (~95 minutes)

For this test I converted the insert into a parameterized (prepared) statement and fed it, along with the SELECT cursor from SQLite, into the executemany function:

insert_stmt = f"INSERT INTO table(col0,col1,col2,col3) VALUES (?,?,?,?)"
pyodbcCursor.executemany(insert_stmt,sqlite3Cursor.execute(select_stmt))

Via Profiler we can see that we're now running prepared statements.

Unfortunately, this took longer than the 1 by 1 inserts.

fast_executemany (~75 minutes)

For this test I simply added a line of code:

pyodbcCursor.fast_executemany = True
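
For clarity, here's a sketch of how that line combines with the executemany call from the previous test:

# Enable pyodbc's fast path before calling executemany
pyodbcCursor.fast_executemany = True
insert_stmt = "INSERT INTO table(col0,col1,col2,col3) VALUES (?,?,?,?)"
pyodbcCursor.executemany(insert_stmt, sqlite3Cursor.execute(select_stmt))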

On the profiler side of things this looked exactly like executemany, with repetitive executions of single prepared statements.

This did run quite a bit faster than executemany; however, it was only about 5 minutes faster than the 1 by 1 inserts.

SQL Server Transaction (~20 minutes)

For the code modification on this test, I reverted to a row-by-row cursor loop that builds batches of 100 inserts per transaction.  It looks like this:

counts = 0
select_stmt = "SELECT col0, col1, col2, CAST(col3 AS INT) FROM table"
insert_stmt = "BEGIN TRANSACTION \r\n"
for row in sqlite3Cursor.execute(select_stmt):
    counts += 1
    # Append each INSERT to the current batch string
    insert_stmt += f"INSERT INTO table(col0,col1,col2,col3) VALUES ({row[0]},{row[1]},'{row[2]}',{row[3]}) \r\n"
    # The modulus is the batch size
    if counts % 100 == 0:
        insert_stmt += "COMMIT TRANSACTION"
        pyodbcCursor.execute(insert_stmt)
        pyodbcCursor.commit()
        insert_stmt = "BEGIN TRANSACTION \r\n"
# Send the last, partial batch (if any)
if insert_stmt != "BEGIN TRANSACTION \r\n":
    insert_stmt += "COMMIT TRANSACTION"
    pyodbcCursor.execute(insert_stmt)
    pyodbcCursor.commit()

From profiler we can now see batches happening, 100 inserts at a time!
This also completes in about 20 minutes, which is 3 to 4 times faster than the next best test.

For what it's worth, I was able to get this down to 12 minutes with a batch size of 400 rows.  Batches larger than ~450 rows started dropping inserts; troubleshooting determined that I was running into this issue:
https://docs.microsoft.com/en-us/troubleshoot/sql/connect/fail-run-large-batch-sql-statements

TL;DR: I had to add SET NOCOUNT ON; to address this; however, batch sizes larger than 400 did not give me a significant speed improvement.
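
For illustration, roughly what that change looks like in the batching loop above (a sketch; only the batch "header" string and the modulus change):

# Start every batch with SET NOCOUNT ON so SQL Server doesn't return a
# rows-affected message for each INSERT, which is what broke batches
# larger than ~450 rows
batch_header = "SET NOCOUNT ON;\r\nBEGIN TRANSACTION \r\n"
insert_stmt = batch_header
# ...same loop as above, with counts % 400 as the batch check and
# insert_stmt reset to batch_header after every commit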

Counter Log Review

I also took a performance counter log during testing.  The tests were performed in the order outlined above: 1 by 1, executemany, fast_executemany, and SQL Server Transactions.  You can see four very distinct groups of executions in the graphs below.  The last test, SQL Server Transactions with 100 inserts per batch, ended at 1:30.  There was another, failed test with a batch size of 1,000 between 1:30 and the end of the graphs.

Red is the SQL Server Transaction log drive (I’m in SIMPLE recovery mode).
Orange is the SQL Server Disk for data files.

Memory looked fine; it did not get overwhelmed.




11 comments:

  1. Hi everyone, Blogger stats are showing high interest in this blog post, so I thought I'd add that you can further increase performance through the use of multithreading:
    from multiprocessing.dummy import Pool as ThreadPool

    1. How would you go about doing this? With the example above, would you, for example, have 5 workers and run each batch?

    2. I used:
      from multiprocessing.dummy import Pool as ThreadPool

      I then created a function to run the batched insert statements for a given table.
      Last, from '__main__' I called the function:
      pool = ThreadPool(2)
      results = pool.map(insert_function, table_names)
      pool.close()
      pool.join()

      In my case insert_function ran the batched inserts for one table, and table_names was a list of about a dozen table names.

      HTH

    3. I can insert 42,000 rows in 2.5 minutes.  The problem was that if any row is not correct, the whole transaction is cancelled. :(  My solution was to send those rows one by one.  Thanks for the code.

  2. Hi, thanks a lot for posting this!
    I've been struggling with inserting data into SQL Server using Python and getting horrendous speeds. I've tried all the suggestions you link at the beginning and then some other simple, already-written solutions. None has given any clear benefit. This post convinced me that I should follow in your footsteps and try this as the next thing.

  3. Is there an error in the "SQL Server Transaction" code?

    I don't see how the new insert_stmt values are being appended.

    Should:
    insert_stmt = f"INSERT INTO table(col0,col1,col2,col3) VALUES ({row[0]},{row[1]},'{row[2]}',{row[3]}) \r\n"

    Be this?:
    insert_stmt += f"INSERT INTO table(col0,col1,col2,col3) VALUES ({row[0]},{row[1]},'{row[2]}',{row[3]}) \r\n"

    OR:
    insert_stmt = f"{insert_stmt}INSERT INTO table(col0,col1,col2,col3) VALUES ({row[0]},{row[1]},'{row[2]}',{row[3]}) \r\n"

  4. Good catch, yes, it should be insert_stmt += ...

  5. Thanks for your post. I am working on migrating data from Sybase to SQL Server, and this was very helpful.

  6. One issue you may have is that autocommit wasn't set to False before running the fast_executemany test. Set it to False before you run it and it should act as a bulk insert instead.

  7. Thank you so much! This has worked wonders. 21 million rows copied in 10 minutes.
