Recently I had a situation where one of the SQL Servers we are supporting had over 1 TB of tempdb log and after a short digging around, one of my colleagues noticed that this is caused by a long running purge in SQL Server MDW (Management Data Warehouse). The purge job (mdw_purge_data) was running for more than 15 days until we noticed the problem. It was quite strange to me and I started troubleshooting it. The purge job should be quite quick and as far as my expperience goes, it runs in several minutes tops. But this was not the case, though. So I looked first at what exactly the job does – it is a stored procedure in the MDW database – core.sp_purge_data.
Looking at the definition, the procedure was doing a lot of things and starting from the beginning would take me quite a lot of time and effort to troubleshoot. Luckily, the procedure was still running and I was able to trace where exactly the purging got stuck. Ultimately it was one of the stored procedures called from within the purge job – [core].[sp_purge_orphaned_notable_query_text]. Going further, this procedure’s definition is quite simple – it is deleting by default records from mdw database on batches of 500 records, using the following statement:
DELETE TOP (@delete_batch_size) snapshots.notable_query_text FROM snapshots.notable_query_text AS qt, #tmp_notable_query_text AS tmp WHERE tmp.[sql_handle] = qt.[sql_handle]
OK, now I got it – the #tmp_notable_query_text should be somehow involved. I checked how this table is populated:
SELECT qt.[sql_handle] INTO #tmp_notable_query_text FROM snapshots.notable_query_text AS qt WHERE NOT EXISTS ( SELECT snapshot_id FROM snapshots.query_stats AS qs WHERE qs.[sql_handle] = qt.[sql_handle])
Obviously this statement was causing me the headaches – I tried to execute the select statement to see how many records would it return. I have to say, that my impression so far was that everything around MDW – collection, display and purging is optimized and I would not be suffering from not optimized stored procedures/code. Well, it is not the case it seems – my select statement was not completing in more than 2 minutes, so I immediately canceled it and looked at the estimated execution plan. What worried me there is that there were two clustered index scans and the one on core.query_stats table was with relative cost of 99%. Looking at the table snapshot.query_stats, the records there were more than a million (the purge was not running for some days now) and the scan was not effective at all, especially when both of the default created indexes were clustered and included almost all of the columns in the tables. Based on the query I have, I went with creating non-clustered index on sql_handle column in both tables – snapshots.query_stats and snapshots.notable_query_text.
After doing that, I ran the select statement again and the result was returned immediately. So I went and ran the purge procedure right away and waited for some minutes to complete. The results you can see for yourselves – last successful runs took almost 4 days each and the last one, which we cancelled was running for 17 days. After the optimization – the purge ran for 32 minutes and have been running since then for not more than minute.
So be careful with MDW and monitor the purge job :)
Edit 27.03.2013 – one of my colleagues just implemented the same NC index on another server and the purge job ran from 1 day and 3 hours to 36 seconds!! :-)