Author Archives: todd

MDW Data Collection Upload may use up all available ports on a large server

Click here to download the workaround package.

Recently we started having odd issues with a few of our larger clusters.  Windows authentication would randomly fail, accompanied with Netlogon errors in the event log stating the domain was unavailable.  Even odder was when this was happening, if you did a nslookup on the server’s domain you’d only get IPv6 addresses back, even though normally they are accompanied by several IPv4 addresses as well.

We started looking into the obvious things:  DC problems, network issues, DNS resolution issues, firewalls, IP conflicts, network card issues but could find nothing.  This only happened on a couple clusters out of many servers, so if it was a network or DC issue, you’d think you’d see it on other servers.

These particular clusters had some things in common.  They were both 2-node clusters made with similar hardware:  Dell R810’s with 32 cores and hyperthreading enabled.  These also had 4 SQL instances installed and about 20 SAN drives assigned to each cluster.  They were running Windows 2008 R2 and SQL 2008 R2.  They also had MDW data collection enabled on all their instances.

I noticed one day that when we had the problem, netstat would show several thousand outbound connections being made and quickly dropped to our MDW server.  The local port number on those connections would climb to and reach 65534 and then our authentication/dns problems would show themselves.  While all of these connections were in a TIME_WAIT state, the port # is not immediately dropped after a connection is dropped so it can’t be reused immediately.  So, no more outbound connections to the DNS server or DC can be made for authentication, causing our problems.

I tracked the cause down to the upload job for the ServerActivity collection set.  Specifically, the SSIS package that is run to upload data (located at Data Collector\PerfCountersUpload on every SQL 2008 server), has a lookup task under the data flow “DFT – Bulk Insert Collected data into MDW” which grabs the performance counter ID from snapshots.performance_counter_instances to be populated when the data load to snapshots.performance_counter_values happens later in the data flow.  This is set to Partial Cache, which means for each and every performance_counter_value, it seemingly does the following:

1) Logs into the server hosting the MDW database.
2) Does a select from performance_counter_instances for a single performance counter path.
3) Logs out.

On a smaller server, this may not be a problem as there aren’t many counters to get the ID for.  On these particular servers because of the number of instances, drives, processors, processes running on the server, there are a lot of instances of the LogicalDisk, Process, Processor and MSSQL* performance counters.  Therefore, a lot of connections are made and quickly dropped, eating up all the available port numbers.

The workaround I came up with is to set the lookup task “LKU – Lookup into performance_counter_instances to obtain performance_counter_id for all counter paths that get inserted” to Full cache instead of Partial.  This grabs the whole performance_counter_instance table, but in my environment with about 35 instances running collection, it is only 5000 rows.  I’ve also experimented with changing the lookup to a merge join, but the results are similar.

Performing this fix isn’t as simple as editing the package in BIDS and uploading it.  There are a couple constraints that must be dropped before you can replace the package.  Not to mention it has to be done on every server with the Server Activity (or any custom performance counter collector) enabled.  The script below will disable the constraints, creates a backup copy, upload the package, then reenable the constraints.  Naturally, you’ll need to replace the package source location with one that makes sense in your environment.  Also, if you don’t have xp_cmdshell enabled on your server, you can just omit that part and do it manually, or copy it using whatever method you’re used to.

USE [msdb]
GO

IF  EXISTS (SELECT * FROM sys.foreign_keys WHERE object_id = OBJECT_ID(N’[dbo].[FK_syscollector_collector_types_internal_upload_sysssispackages]‘) AND parent_object_id = OBJECT_ID(N’[dbo].[syscollector_collector_types_internal]‘))
ALTER TABLE [dbo].[syscollector_collector_types_internal] DROP CONSTRAINT [FK_syscollector_collector_types_internal_upload_sysssispackages]
GO

declare @path nvarchar(255), @packagename nvarchar(255), @packagefoldername nvarchar(255)
select @packagename = ‘PerfCountersUpload’, @packagefoldername=’Data Collector’
select @path=’\\mdwserver\MDWDeployment\’+@packagename+’.dtsx’
declare @cmd nvarchar(1000)
select @cmd=’dtutil /SQL “‘+@packagefoldername+’\'+@packagename+’” /SourceServer ‘+@@SERVERNAME+’ /DestServer ‘+@@SERVERNAME+’ /Q /COPY SQL;”‘+@packagefoldername+’\'+@packagename+’BAK”‘
exec xp_cmdshell @cmd
select @cmd=’dtutil /FILE “‘+@path+’” /DestServer ‘+@@SERVERNAME+’ /Q /COPY SQL;”‘+@packagefoldername+’\'+@packagename+’”‘

exec xp_cmdshell @cmd
GO

ALTER TABLE [dbo].[syscollector_collector_types_internal]  WITH CHECK ADD  CONSTRAINT [FK_syscollector_collector_types_internal_upload_sysssispackages] FOREIGN KEY([upload_package_folderid], [upload_package_name])
REFERENCES [dbo].[sysssispackages] ([folderid], [name])
GO

ALTER TABLE [dbo].[syscollector_collector_types_internal] CHECK CONSTRAINT [FK_syscollector_collector_types_internal_upload_sysssispackages]
GO

You can download the edited package here.

I posted this on connect as well.

SQL 2008 Management Data Warehouse – Fixing Long-running purges

Update:  This issue was fixed in SQL 2008 R2 SP1.  For more details, check out this KB article.

Click here to download the workaround script.

With the release of SQL Server 2008 came the Management Data Warehouse, a built-in way to collect performance data from SQL, the server it is running on and even your own performance data through the use of custom data collectors.  There are plenty of resources on how to set this up and make your own collections, so I won’t focus too much on that. Instead, this will be about improving performance for the nightly purge job that is created when you setup the MDW.

When you setup your MDW database, a job named mdw_purge_data_[<your MDW db name>] is created.  This job runs the stored procedure core.sp_purge_data which performs the purge in two parts:

  1. Snapshots needing purged get deleted from core.snapshots_internal.  Data in the related tables (snapshots.performance_counter_instance_values, snapshots.query_stats, etc) is then deleted via cascading triggers.
  2. Query plans and text are removed from snapshots.notable_query_text and snapshots.notable_query_plans.  Since these tables do not have a snapshot_id associated with them, they have to be removed based on whether the corresponding sql_handle is missing from snapshots.query_stats or not. (By the way, this was added in SQL 2008 R2 and CU5 for SQL 2008 to correct the problem with orphan plans taking up all the space in the database)

Seems fairly straight-forward, right?  On our system, #1 would run quickly.. usually in just a few minutes on our 200GB MDW database.  #2, however, would take hours.  In fact, if you didn’t set the @run_duration parameter, it would run for days and block incoming query stats in the process.  This was a problem.

The query below accounts for 99% of the runtime:

WHILE (@rows_affected = @delete_batch_size)
BEGIN
DELETE TOP (@delete_batch_size) snapshots.notable_query_plan
FROM snapshots.notable_query_plan AS qp
WHERE NOT EXISTS (
SELECT snapshot_id
FROM snapshots.query_stats AS qs
WHERE qs.[sql_handle] = qp.[sql_handle] AND qs.plan_handle = qp.plan_handle
AND qs.plan_generation_num = qp.plan_generation_num
AND qs.statement_start_offset = qp.statement_start_offset
AND qs.statement_end_offset = qp.statement_end_offset
AND qs.creation_time = qp.creation_time);
SET @rows_affected = @@ROWCOUNT;
IF(@rows_affected > 0)
BEGIN
RAISERROR ('Deleted %d orphaned rows from snapshots.notable_query_plan', 0, 1, @rows_affected) WITH NOWAIT;
END

According to the comment above the query, it was done this way to avoid lock escalation and transaction logs growing due to the deletion of the 10-50MB (each) query plans.  Ok, this is a concern, but it’s not worth joining on the query_stats table that, in my environment, has 43 million rows and uses 15GB of disk space.  Joining on it (especially on 6 different columns) is expensive and not something you want to do over and over again.  That being said, there is a missing index that improves query time somewhat.  Without it, the purge ran for days.  With it, it ran in <12 hours.

CREATE NONCLUSTERED INDEX [Ix_query_stats_sql_handle] ON [snapshots].[query_stats]
([sql_handle] ASC) WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]

Still, 12 hours for a purge is bad, so I decided to rework the query.  Throwing caution to the wind, I decided to do two things:  first, I’d grab the handles for the plans that I was going to be removing and shove them into a temp table.  Second, I would then delete all the query plans by joining on that temp table rather than query_stats.  This improved things dramatically.  The purge went from running in 12 hours to 11 minutes, purging 2200 rows from notable_query_plan at last run.

Here’s the rewrite of the query above:

select sql_handle,plan_handle, plan_generation_num,statement_start_offset,statement_end_offset,creation_time into #nqp FROM snapshots.notable_query_plan qp
 WHERE NOT EXISTS (
 SELECT *
 FROM snapshots.query_stats AS qs  
 WHERE qs.[sql_handle] = qp.[sql_handle] AND qs.plan_handle = qp.plan_handle
 AND qs.plan_generation_num = qp.plan_generation_num
 AND qs.statement_start_offset = qp.statement_start_offset
 AND qs.statement_end_offset = qp.statement_end_offset
 AND qs.creation_time = qp.creation_time)
 SET @rows_affected = @delete_batch_size;
 WHILE (@rows_affected>0)
 BEGIN
 DELETE TOP (@delete_batch_size) FROM snapshots.notable_query_plan from #nqp n where n.sql_handle=notable_query_plan.sql_handle and notable_query_plan.plan_handle = n.plan_handle and notable_query_plan.plan_generation_num=n.plan_generation_num
 and notable_query_plan.statement_end_offset=n.statement_end_offset and notable_query_plan.statement_start_offset=n.statement_start_offset and notable_query_plan.creation_time=n.creation_time
 SET @rows_affected = @@ROWCOUNT;
 IF(@rows_affected > 0)
 BEGIN
 select @errormsg=CAST(getdate() as nvarchar)+' Deleted %d orphaned rows from snapshots.notable_query_plan'
 RAISERROR (@errormsg, 0, 1, @rows_affected) WITH NOWAIT;
 END
 END
drop table #nqp

The cost on the transaction log and tempdb really wasn’t that great.  The log grew to about 6.5GB and tempdb to about 200mb.  Not bad for 2200 rows.

Here is the script to modify your sp_purge_data procedure to include the fix.

http://www.toddbaker.org/blog/wp-content/uploads/2010/12/sp_purge_data.sql

Hello!

After a long wait I’ve finally decided to make a blog.  Hopefully this will be a source of info for SQL Server professionals out there (including myself) and will be a small addition to the seemingly never-ending source of knowledge coming from the SQL Server community.

A little about me:  I’ve been a production SQL Server DBA since 2004, and in IT since college.  I’ve always been interested in computers and technology and always try to push the envelope for new and better ways to do things.  Outside of work I enjoy golf, riding my bike and have spent far too many hours with a Playstation controller in my hands.

Ok, so enough of the boring introductions.. on to some real stuff!