Recently we started having odd issues with a few of our larger clusters. Windows authentication would randomly fail, accompanied with Netlogon errors in the event log stating the domain was unavailable. Even odder was when this was happening, if you did a nslookup on the server’s domain you’d only get IPv6 addresses back, even though normally they are accompanied by several IPv4 addresses as well.
We started looking into the obvious things: DC problems, network issues, DNS resolution issues, firewalls, IP conflicts, network card issues but could find nothing. This only happened on a couple clusters out of many servers, so if it was a network or DC issue, you’d think you’d see it on other servers.
These particular clusters had some things in common. They were both 2-node clusters made with similar hardware: Dell R810’s with 32 cores and hyperthreading enabled. These also had 4 SQL instances installed and about 20 SAN drives assigned to each cluster. They were running Windows 2008 R2 and SQL 2008 R2. They also had MDW data collection enabled on all their instances.
I noticed one day that when we had the problem, netstat would show several thousand outbound connections being made and quickly dropped to our MDW server. The local port number on those connections would climb to and reach 65534 and then our authentication/dns problems would show themselves. While all of these connections were in a TIME_WAIT state, the port # is not immediately dropped after a connection is dropped so it can’t be reused immediately. So, no more outbound connections to the DNS server or DC can be made for authentication, causing our problems.
I tracked the cause down to the upload job for the ServerActivity collection set. Specifically, the SSIS package that is run to upload data (located at Data Collector\PerfCountersUpload on every SQL 2008 server), has a lookup task under the data flow “DFT – Bulk Insert Collected data into MDW” which grabs the performance counter ID from snapshots.performance_counter_instances to be populated when the data load to snapshots.performance_counter_values happens later in the data flow. This is set to Partial Cache, which means for each and every performance_counter_value, it seemingly does the following:
1) Logs into the server hosting the MDW database.
2) Does a select from performance_counter_instances for a single performance counter path.
3) Logs out.
On a smaller server, this may not be a problem as there aren’t many counters to get the ID for. On these particular servers because of the number of instances, drives, processors, processes running on the server, there are a lot of instances of the LogicalDisk, Process, Processor and MSSQL* performance counters. Therefore, a lot of connections are made and quickly dropped, eating up all the available port numbers.
The workaround I came up with is to set the lookup task “LKU – Lookup into performance_counter_instances to obtain performance_counter_id for all counter paths that get inserted” to Full cache instead of Partial. This grabs the whole performance_counter_instance table, but in my environment with about 35 instances running collection, it is only 5000 rows. I’ve also experimented with changing the lookup to a merge join, but the results are similar.
Performing this fix isn’t as simple as editing the package in BIDS and uploading it. There are a couple constraints that must be dropped before you can replace the package. Not to mention it has to be done on every server with the Server Activity (or any custom performance counter collector) enabled. The script below will disable the constraints, creates a backup copy, upload the package, then reenable the constraints. Naturally, you’ll need to replace the package source location with one that makes sense in your environment. Also, if you don’t have xp_cmdshell enabled on your server, you can just omit that part and do it manually, or copy it using whatever method you’re used to.
IF EXISTS (SELECT * FROM sys.foreign_keys WHERE object_id = OBJECT_ID(N’[dbo].[FK_syscollector_collector_types_internal_upload_sysssispackages]‘) AND parent_object_id = OBJECT_ID(N’[dbo].[syscollector_collector_types_internal]‘))
ALTER TABLE [dbo].[syscollector_collector_types_internal] DROP CONSTRAINT [FK_syscollector_collector_types_internal_upload_sysssispackages]
declare @path nvarchar(255), @packagename nvarchar(255), @packagefoldername nvarchar(255)
select @packagename = ‘PerfCountersUpload’, @packagefoldername=’Data Collector’
declare @cmd nvarchar(1000)
select @cmd=’dtutil /SQL “‘+@packagefoldername+’\'+@packagename+’” /SourceServer ‘+@@SERVERNAME+’ /DestServer ‘+@@SERVERNAME+’ /Q /COPY SQL;”‘+@packagefoldername+’\'+@packagename+’BAK”‘
exec xp_cmdshell @cmd
select @cmd=’dtutil /FILE “‘+@path+’” /DestServer ‘+@@SERVERNAME+’ /Q /COPY SQL;”‘+@packagefoldername+’\'+@packagename+’”‘
exec xp_cmdshell @cmd
ALTER TABLE [dbo].[syscollector_collector_types_internal] WITH CHECK ADD CONSTRAINT [FK_syscollector_collector_types_internal_upload_sysssispackages] FOREIGN KEY([upload_package_folderid], [upload_package_name])
REFERENCES [dbo].[sysssispackages] ([folderid], [name])
ALTER TABLE [dbo].[syscollector_collector_types_internal] CHECK CONSTRAINT [FK_syscollector_collector_types_internal_upload_sysssispackages]
You can download the edited package here.
I posted this on connect as well.