I recently had a project in which I needed to transfer a 60 GB SQLite database to SQL Server. After some research I found the sqlite3 and pyodbc modules, and set about scripting connections and insert statements.
The basic form of my script is to import the modules, set up the database connections, and iterate (via cursor) over the rows of the SELECT statement, creating INSERT statements and executing them.
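In outline, that setup looks something like this (a minimal sketch; the file path, server, database, and driver names are placeholders, not my actual values):

import sqlite3
import pyodbc

# Source: the SQLite file (placeholder path)
sqlite3Conn = sqlite3.connect(r"C:\data\source.db")
sqlite3Cursor = sqlite3Conn.cursor()

# Destination: SQL Server via ODBC (placeholder server/database/driver)
pyodbcConn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=TargetDB;Trusted_Connection=yes;"
)
pyodbcCursor = pyodbcConn.cursor()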
The issue here is that this method sends single inserts one at a time, yielding less than satisfactory performance. Inserting 35M+ rows in this fashion takes ~5 hours on my system.
After some research I found the general community suggesting solutions like the following:
- executemany
- fast_executemany
- dumping data to CSV files and using BULK INSERT (https://towardsdatascience.com/how-i-made-inserts-into-sql-server-100x-faster-with-pyodbc-5a0b5afdba5)
- using pandas and sqlalchemy (https://github.com/mkleehammer/pyodbc/issues/619; a sketch follows below)
- https://github.com/mkleehammer/pyodbc/issues/812
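For context, the pandas/SQLAlchemy suggestion from the links above boils down to something like this (a sketch only, with placeholder connection details; I did not benchmark this approach here):

import pandas as pd
from sqlalchemy import create_engine

src = create_engine("sqlite:///source.db")          # placeholder SQLite file
dst = create_engine(
    "mssql+pyodbc://localhost/TargetDB"
    "?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes",
    fast_executemany=True,
)

# Read the SQLite table in chunks and append each chunk to the SQL Server table
for chunk in pd.read_sql_query("SELECT col0, col1, col2, col3 FROM table",
                               src, chunksize=100000):
    chunk.to_sql("table", dst, if_exists="append", index=False)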
In addition, based on my prior DBA experience, my initial thought was to create a string with BEGIN TRANSACTION, concatenate a batch of INSERT statements, end the string with a COMMIT TRANSACTION, and finally pass each batch to pyodbcCursor.execute as a single string.
I did some testing; please find my results below.
For this experiment I used a production dataset consisting of a single SQLite table with 18,050,355 rows. The table has four columns with datatypes of integer, integer, text, and real. The real column was converted to an integer as it was improperly defined in the SQLite database and was validated to be an integer.
1 by 1 inserts (~80 minutes)
The 1 by 1 inserts were performed like this:
select_stmt="SELECT col0, col1,col2,CAST(col3 AS INT) FROM table"
for row in sqlite3Cursor.execute(select_stmt):
insert_stmt = f"INSERT INTO table(col0,col1,col2,col3) VALUES ({row[0]},{row[1]},'{row[2]}',{row[3]})"
pyodbcCursor.execute(insert_stmt)
Using SQL Server Profiler, I could see the 1 by 1 inserts happening.
executemany (~95 minutes)
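For this test the INSERT is parameterized and batches of rows are passed to pyodbcCursor.executemany. A minimal sketch of that wiring (the fetch size is illustrative, not necessarily what I used):

select_stmt = "SELECT col0, col1, col2, CAST(col3 AS INT) FROM table"
insert_stmt = "INSERT INTO table(col0,col1,col2,col3) VALUES (?,?,?,?)"

sqlite3Cursor.execute(select_stmt)
while True:
    rows = sqlite3Cursor.fetchmany(10000)   # illustrative batch size
    if not rows:
        break
    pyodbcCursor.executemany(insert_stmt, rows)
    pyodbcCursor.commit()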
fast_executemany (~75 minutes)
For this test I simply added a line of code: pyodbcCursor.fast_executemany = True
On the profiler side of things this looked exactly like executemany, with repetitive executions of single prepared statements.
This did run quite a bit faster than executemany; however, it was only 5 minutes faster than the 1 by 1 inserts.
SQL Server Transaction (~20 minutes)
https://docs.microsoft.com/en-us/sql/t-sql/language-elements/begin-transaction-transact-sql?view=sql-server-ver15
For this test, I reverted to a row-by-row cursor loop which builds batches of 100 inserts per transaction. It looks like this:
counts = 0
select_stmt = "SELECT col0, col1, col2, CAST(col3 AS INT) FROM table"
insert_stmt = "BEGIN TRANSACTION \r\n"
for row in sqlite3Cursor.execute(select_stmt):
    counts += 1
    insert_stmt += f"INSERT INTO table(col0,col1,col2,col3) VALUES ({row[0]},{row[1]},'{row[2]}',{row[3]}) \r\n"
    # The modulus is the batch size
    if counts % 100 == 0:
        insert_stmt += "COMMIT TRANSACTION"
        pyodbcCursor.execute(insert_stmt)
        pyodbcCursor.commit()
        insert_stmt = "BEGIN TRANSACTION \r\n"
# get the last (partial) batch
if insert_stmt != "BEGIN TRANSACTION \r\n":
    insert_stmt += "COMMIT TRANSACTION"
    pyodbcCursor.execute(insert_stmt)
    pyodbcCursor.commit()
This completes in about 20 minutes, which is 3 to 4 times faster than the next best test.
While experimenting with larger batch sizes I ran into the issue described here:
https://docs.microsoft.com/en-us/troubleshoot/sql/connect/fail-run-large-batch-sql-statements
TL;DR: I had to add SET NOCOUNT ON; to address it. Beyond that, I saw no significant speed improvement with batch sizes larger than 400.
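Concretely, the fix just means starting each batch string with SET NOCOUNT ON; so SQL Server stops sending a row-count message for every INSERT. A sketch of the adjusted loop (shown with a batch size of 400):

# Same loop as above, but each batch now starts with SET NOCOUNT ON
batch_header = "SET NOCOUNT ON;\r\nBEGIN TRANSACTION \r\n"

counts = 0
select_stmt = "SELECT col0, col1, col2, CAST(col3 AS INT) FROM table"
insert_stmt = batch_header
for row in sqlite3Cursor.execute(select_stmt):
    counts += 1
    insert_stmt += f"INSERT INTO table(col0,col1,col2,col3) VALUES ({row[0]},{row[1]},'{row[2]}',{row[3]}) \r\n"
    if counts % 400 == 0:  # larger batches showed no significant gain for me
        insert_stmt += "COMMIT TRANSACTION"
        pyodbcCursor.execute(insert_stmt)
        pyodbcCursor.commit()
        insert_stmt = batch_header
# flush the final partial batch
if insert_stmt != batch_header:
    insert_stmt += "COMMIT TRANSACTION"
    pyodbcCursor.execute(insert_stmt)
    pyodbcCursor.commit()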
Counter Log Review
Red is the SQL Server Transaction log drive (I’m in SIMPLE recovery mode).
Orange is the SQL Server Disk for data files.
Hi everyone, Blogger stats are showing high interest in this blog post, so I thought I'd add that you can further increase performance by using multithreading:
from multiprocessing.dummy import Pool as ThreadPool
How would you go about doing this? With the example above, would you have, for example, 5 workers and run each batch?
I used:
from multiprocessing.dummy import Pool as ThreadPool
I then created a function to run the batched insert statements
Last, from '__main__' I called the function:
pool = ThreadPool(2)
results = pool.map([function], [variable])
pool.close()
pool.join()
In my case the variable was a list of table names and I sent in a list of about a dozen.
HTH
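A rough sketch of that pattern (insert_table and the table names below are illustrative placeholders, not my actual code):

from multiprocessing.dummy import Pool as ThreadPool

def insert_table(table_name):
    # Illustrative worker: open its own sqlite3/pyodbc connections here and
    # run the batched-transaction insert loop from the post for table_name.
    return table_name

if __name__ == '__main__':
    tables = ['table1', 'table2', 'table3']   # placeholder list of table names
    pool = ThreadPool(2)
    results = pool.map(insert_table, tables)
    pool.close()
    pool.join()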
I can insert 42,000 rows in 2.5 minutes. The problem was that if any row is not correct, the whole transaction is cancelled. :( My solution was to send those rows one by one. Thanks for the code.
Hi, thanks a lot for posting this!
I've been struggling with inserting data using python into SQL Server and getting horrendous speeds. I've tried all the suggestions you link in the beginning and then some other simple/already written solutions. None has given any clear benefit. This post convinced me that I should follow in your footsteps and try that as the next thing.
Is there an error in "SQL Server Transaction" Code?
I don't see how the new insert_stmt values are being appended.
Should:
insert_stmt = f"INSERT INTO table(col0,col1,col2,col3) VALUES ({row[0]},{row[1]},'{row[2]}',{row[3]}) \r\n"
Be this?:
insert_stmt += f"INSERT INTO table(col0,col1,col2,col3) VALUES ({row[0]},{row[1]},'{row[2]}',{row[3]}) \r\n"
OR:
insert_stmt = f"{insert_stmt}INSERT INTO table(col0,col1,col2,col3) VALUES ({row[0]},{row[1]},'{row[2]}',{row[3]}) \r\n"
good catch, yes it should be insert_stmt += ...
Thanks for your post. I am working on migrating data from Sybase to SQL Server and this was very helpful.
Thanks!
One issue you may have is that autocommit wasn't set to false before running the fast_executemany test. Set this to false before you run it and it should act as a bulk insert instead.
Thank you so much! This has worked wonders. 21 million rows copied in 10 minutes.