Update: This issue was fixed in SQL Server 2008 R2 SP1. For more details, check out this KB article. Click here to download the workaround script.
With the release of SQL Server 2008 came the Management Data Warehouse (MDW), a built-in way to collect performance data from SQL Server, from the server it runs on, and even your own performance data through custom data collectors. There are plenty of resources on how to set this up and build your own collections, so I won’t focus too much on that. Instead, this post is about improving the performance of the nightly purge job that is created when you set up the MDW.
When you set up your MDW database, a job named mdw_purge_data_[<your MDW db name>] is created. This job runs the stored procedure core.sp_purge_data, which performs the purge in two parts:
- Snapshots that need to be purged are deleted from core.snapshots_internal. Data in the related tables (snapshots.performance_counter_instance_values, snapshots.query_stats, etc.) is then deleted via cascading triggers.
- Query plans and text are removed from snapshots.notable_query_text and snapshots.notable_query_plan. Since these tables do not have a snapshot_id associated with them, rows have to be removed based on whether the corresponding sql_handle is still present in snapshots.query_stats. (This step was added in SQL 2008 R2, and in CU5 for SQL 2008, to correct the problem of orphaned plans taking up all the space in the database.)
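Before digging into the purge itself, it can be useful to gauge how many orphaned plans have accumulated. The sketch below is a simple count built from the same six-column correlation the purge procedure uses (the table and column names are taken from the purge query shown later in this post):

```sql
-- Rough count of orphaned plan rows that step 2 of the purge would delete.
-- The six-column match mirrors the NOT EXISTS check in core.sp_purge_data.
SELECT COUNT(*) AS orphaned_plans
FROM snapshots.notable_query_plan AS qp
WHERE NOT EXISTS (
    SELECT 1
    FROM snapshots.query_stats AS qs
    WHERE qs.[sql_handle] = qp.[sql_handle]
      AND qs.plan_handle = qp.plan_handle
      AND qs.plan_generation_num = qp.plan_generation_num
      AND qs.statement_start_offset = qp.statement_start_offset
      AND qs.statement_end_offset = qp.statement_end_offset
      AND qs.creation_time = qp.creation_time);
```

On a large MDW this count query shares the same join cost the purge does, so expect it to be slow until the index discussed below is in place.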
Seems fairly straightforward, right? On our system, #1 would run quickly, usually in just a few minutes on our 200GB MDW database. #2, however, would take hours. In fact, if you didn’t set the @run_duration parameter, it would run for days and block incoming query stats in the process. This was a problem.
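If you want to confirm what the duration-cap parameter is called in your build before relying on it (the name has varied slightly across versions), you can inspect the procedure directly with the standard catalog procedures:

```sql
-- Verify the purge procedure's parameter names and types on your build
USE MDW;  -- substitute your MDW database name
EXEC sp_help 'core.sp_purge_data';      -- lists parameters and types
EXEC sp_helptext 'core.sp_purge_data';  -- prints the procedure body
```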
The query below accounts for 99% of the runtime:
WHILE (@rows_affected = @delete_batch_size)
BEGIN
    DELETE TOP (@delete_batch_size) snapshots.notable_query_plan
    FROM snapshots.notable_query_plan AS qp
    WHERE NOT EXISTS (
        SELECT snapshot_id
        FROM snapshots.query_stats AS qs
        WHERE qs.[sql_handle] = qp.[sql_handle]
          AND qs.plan_handle = qp.plan_handle
          AND qs.plan_generation_num = qp.plan_generation_num
          AND qs.statement_start_offset = qp.statement_start_offset
          AND qs.statement_end_offset = qp.statement_end_offset
          AND qs.creation_time = qp.creation_time);

    SET @rows_affected = @@ROWCOUNT;

    IF (@rows_affected > 0)
    BEGIN
        RAISERROR ('Deleted %d orphaned rows from snapshots.notable_query_plan', 0, 1, @rows_affected) WITH NOWAIT;
    END
END
According to the comment above the query, it was done this way to avoid lock escalation and transaction log growth caused by deleting the 10-50MB (each) query plans. That's a legitimate concern, but it’s not worth repeatedly joining against the query_stats table, which in my environment has 43 million rows and uses 15GB of disk space. Joining on it (especially on six different columns) is expensive and not something you want to do over and over again. That said, there is a missing index that improves query time somewhat. Without it, the purge ran for days; with it, it ran in under 12 hours.
CREATE NONCLUSTERED INDEX [Ix_query_stats_sql_handle]
ON [snapshots].[query_stats] ([sql_handle] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
      IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF,
      ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
ON [PRIMARY]
Still, 12 hours for a purge is bad, so I decided to rework the query. Throwing caution to the wind, I did two things: first, I grabbed the handles for the plans that were going to be removed and put them into a temp table. Second, I deleted the query plans by joining on that temp table rather than on query_stats. This improved things dramatically: the purge went from running in 12 hours to 11 minutes, purging 2,200 rows from notable_query_plan at its last run.
Here’s the rewrite of the query above:
SELECT sql_handle, plan_handle, plan_generation_num,
       statement_start_offset, statement_end_offset, creation_time
INTO #nqp
FROM snapshots.notable_query_plan AS qp
WHERE NOT EXISTS (
    SELECT *
    FROM snapshots.query_stats AS qs
    WHERE qs.[sql_handle] = qp.[sql_handle]
      AND qs.plan_handle = qp.plan_handle
      AND qs.plan_generation_num = qp.plan_generation_num
      AND qs.statement_start_offset = qp.statement_start_offset
      AND qs.statement_end_offset = qp.statement_end_offset
      AND qs.creation_time = qp.creation_time);

SET @rows_affected = @delete_batch_size;

WHILE (@rows_affected > 0)
BEGIN
    DELETE TOP (@delete_batch_size) FROM snapshots.notable_query_plan
    FROM #nqp AS n
    WHERE n.sql_handle = notable_query_plan.sql_handle
      AND notable_query_plan.plan_handle = n.plan_handle
      AND notable_query_plan.plan_generation_num = n.plan_generation_num
      AND notable_query_plan.statement_end_offset = n.statement_end_offset
      AND notable_query_plan.statement_start_offset = n.statement_start_offset
      AND notable_query_plan.creation_time = n.creation_time;

    SET @rows_affected = @@ROWCOUNT;

    IF (@rows_affected > 0)
    BEGIN
        SELECT @errormsg = CAST(GETDATE() AS nvarchar) + ' Deleted %d orphaned rows from snapshots.notable_query_plan';
        RAISERROR (@errormsg, 0, 1, @rows_affected) WITH NOWAIT;
    END
END

DROP TABLE #nqp;
The cost to the transaction log and tempdb really wasn’t that bad. The log grew to about 6.5GB and tempdb to about 200MB. Not bad for 2,200 rows.
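If you want to watch log and tempdb growth yourself while the purge runs, the standard commands below will do it (nothing here is MDW-specific):

```sql
-- Log usage per database, as a percentage of the log file
DBCC SQLPERF(LOGSPACE);

-- tempdb file sizes; sys.database_files reports size in 8KB pages
SELECT name, size * 8 / 1024 AS size_mb
FROM tempdb.sys.database_files;
```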
Here is the script to modify your sp_purge_data procedure to include the fix.