Python - pyodbc and Batch Inserts to SQL Server (or pyodbc fast_executemany, not so fast...)


I recently had a project in which I needed to transfer a 60 GB SQLite database to SQL Server.  After some research I found the sqlite3 and pyodbc modules, and set about scripting connections and insert statements.  

The basic form of my script is to import the modules, setup the database connections, and iterate (via cursor) over the rows of the select statement creating insert statements and executing them.  

The issue here is that this method results in single inserts being sent one at a time yielding less than satisfactory performance.  Inserting 35m+ rows in this fashion takes ~5hrs on my system.

After some researching I found the general community suggesting solutions like the following:

In addition, based on my prior DBA experience, my initial thought was to create a string with BEGIN TRANSACTION, concatenate a batch of INSERT statements, end the string with a COMMIT TRANSACTION, and finally pass the batches of transactions as strings to pyodbc.execute.

I did some testing, please find my results below.

For this experiment I used a production dataset consisting of a single SQLite table with 18,050,355 rows.  The table has four columns with datatypes of integer, integer, text, and real.  The real column was converted to an integer as it was improperly defined in the SQLite database and was validated to be an integer.

1 by 1 inserts (~80 minutes)

The 1 by 1 inserts were executed by performing inserts as such:

select_stmt="SELECT col0, col1,col2,CAST(col3 AS INT) FROM table"
for row in sqlite3Cursor.execute(select_stmt):
    insert_stmt = f"INSERT INTO table(col0,col1,col2,col3) VALUES ({row[0]},{row[1]},'{row[2]}',{row[3]})"

Using SQL Server Profiler I could see the 1 by 1 inserts happening:

executemany (~95 minutes)

For this test I modified the insert statement into a prepared statement and fed in the insert statement and the select from SQLite as the input parameters for the executemany function:

insert_stmt = f"INSERT INTO table(col0,col1,col2,col3) VALUES (?,?,?,?)"

Via profiler we can see that we're now running prepared statements:

Unfortunately, this took longer than the 1 by 1 inserts.

fast_executemany (~75 minutes)

For this test I simply added a line of code:

pyodbcCursor.fast_executemany = True

On the profiler side of things this looked exactly like executemany, with repetitive executions of single prepared statements.

This did run quite a bit faster then executemany, however it only ran 5 minutes faster than the 1 by 1 inserts.

SQL Server Transaction (~20 minutes)

For the code modification on this test, I reverted back to a row by row cursor loop which builds the batches of 100 inserts per transaction.  It looks like this:

counts = 0
select_stmt="SELECT col0, col1,col2,CAST(col3 AS INT) FROM table"
insert_stmt="BEGIN TRANSACTION \r\n"
for row in sqlite3Cursor.execute(select_stmt):
    counts += 1
    insert_stmt = f"INSERT INTO table(col0,col1,col2,col3) VALUES ({row[0]},{row[1]},'{row[2]}',{row[3]}) \r\n"
    #The modulus is the batch size
    if counts % 100 == 0:    
        insert_stmt+="COMMIT TRANSACTION"
        insert_stmt="BEGIN TRANSACTION \r\n"
#get the last batch
if insert_stmt!="BEGIN TRANSACTION \r\n":
    insert_stmt+="COMMIT TRANSACTION"

From profiler we can now see batches happening, 100 inserts at a time!
This also completes in 20 minutes which is 3 to 4 times faster than the next best test.

For what it's worth I was able to get this down to 12 mins with a batch size of 400 rows.  Batches larger than ~450 rows started dropping inserts.  Troubleshooting resulted in the determination that I was running into this issue:

TL;DR; I had to add SET NOCOUNT ON; in order to address this, however there was not a significant speed improvement with batch sizes larger than 400, for me.

Counter Log Review

I also took a performance counter log during testing.  The tests were performed in the order outlined above: 1 by 1, executemany, fast_executemany, and SQL Server Transactions.  You can see in the graphs below, 4 very distinct groups of executions.  The last test, SQL Server Transactions 100 inserts per batch ended at 1:30.  There was another failed test of 1000 transactions per batch between 1:30 and the end of the graphs below.

Red is the SQL Server Transaction log drive (I’m in SIMPLE recovery mode).
Orange is the SQL Server Disk for data files.

Memory looked fine, it did not get overwhelmed.