Missing agents, vulnerable communications channels, secret principal names, and invalid names… a day in the life of an OpsMgr 2007 user.

Have you ever tried to configure an MS Operations Manager 2007 agent on a system and had it report no errors, but still have its status listed as “not monitored” in the OpsMgr console?  Have you wasted countless hours over several weeks doing packet captures and advanced system debugging under the incorrect assumption that this was a network communications problem?  Have you ever had your consultant change the primary DNS suffix on a server that you monitor without telling you, thus creating the whole problem in the first place?  No?  Well, read on anyway… if it happens to you later, you will know what to do.

A few months ago we had a consultant on site to set up services on two new Windows Server 2003 hosts.  It was another one of those fun “n-tier J2EE” things.  DNS names were requested for “hyperion10.uvm.edu” and “hyperion11.uvm.edu”.  However, since the hyperion hosts were connected to our  “campus.ad.uvm.edu” domain, their “internal” names were appended with the “campus.ad.uvm.edu” suffix.  Thus, these servers thought of themselves as “hyperion1x.campus.ad.uvm.edu”, even though there was no legitimate DNS entry for these names. 

For most services, this is not a problem.  However, we discovered that some hyperion services advertise themselves using this internal computer name, rather than a name chosen by the application administrator.  To work around the issue, we requested manual DNS entries be generated for “hyperion10.campus.ad.uvm.edu” and “hyperion11.campus.ad.uvm.edu”.  Unfortunately, the consultant decided to change the primary DNS suffix of the hyperion hosts’ internal computer names (to “uvm.edu”) instead of waiting for the new DNS entries.  This solved his problem and did not create any immediate issues, so he moved on.  Months later, this decision would make the OpsMgr admin very unhappy.

Here is what broke… the Hyperion systems now tried to update the “DNS Suffix” attributes of their computer objects in Active Directory.  By default, Server 2003 AD performs “validation” on DNS suffix registrations, and disallows names that are not in the AD forest.  Thus, the DNS suffix change was denied in AD, and an event was logged in the System event log:

Event ID: 5789
Description:  Attempt to update DNS Host Name of the computer object in Active Directory failed.  The updated value was ‘HYPERION10.uvm.edu’.  The following error occurred:
The parameter is incorrect.

This is pretty innocuous, and went unnoted.  However, the mismatch of the computer’s perceived FQDN and its registered FQDN in AD completely broke Kerberos authentication on this system.  Because AD did not know of a host called “hyperion10.uvm.edu”, it never generated a Kerberos SPN (Service Principal Name) for this host.  A legitimate SPN is required for Kerberos auth to function.  NTLM authentication still worked, so no one noticed the problem again. 
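
To make the mismatch concrete, here is a minimal Python sketch (not something we ran at the time) that compares the host SPNs a machine would present for its own names with the SPNs registered on its computer object.  The function names and the sample list of registered SPNs are hypothetical; in practice the registered list would come from something like “setspn -l hyperion10”.

```python
def expected_host_spns(short_name, primary_dns_suffix):
    """Return the host SPNs a machine would present for its own names."""
    fqdn = f"{short_name}.{primary_dns_suffix}"
    return [f"host/{short_name}", f"host/{fqdn}"]


def missing_spns(expected, registered):
    """Return expected SPNs absent from the registered list (case-insensitive)."""
    registered_lower = {spn.lower() for spn in registered}
    return [spn for spn in expected if spn.lower() not in registered_lower]


if __name__ == "__main__":
    # Hypothetical sample: what AD might hold for the computer object
    # after the suffix update was rejected.
    registered = ["HOST/HYPERION10", "HOST/hyperion10.campus.ad.uvm.edu"]
    # What the host believes its name is after the local suffix change.
    expected = expected_host_spns("hyperion10", "uvm.edu")
    for spn in missing_spns(expected, registered):
        print("missing:", spn)
```

With the sample data above, the only SPN reported as missing is the one built from the new “uvm.edu” suffix, which is exactly the principal that Kerberos could not find.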

Two months ago, we installed an Operations Manager 2007 server.  All of our managed servers took their agents without complaint, except for the blasted Hyperion servers.  Since these systems were on the opposite side of a firewall, we naturally blamed the firewall and spent a lot of time performing “Wireshark” packet captures, looking at “netstat” output, and running “procmon” on the management server.  

The breakthrough finally came yesterday when I had a look at the Operations Manager event logs on the hyperion servers (which were running the OpsMgr agents).   The following error was found in the log several times:

Source: OpsMgr Connector
Event ID: 21016
Description:  OpsMgr was unable to set up a secure channel to <fqdn of RMS> and there are no failover hosts…

I did some poking at news.microsoft.com in the operations manager groups.  I searched for threads with “agent” and “monitored” (as in the “not monitored” status of the agents in the console).  There I found the suggestion that Kerberos problems can prevent secure communications between OpsMgr agents and the RMS.  There was a suggestion that Kerberos logging be enabled to rule this out as a problem.  Thus, I added the following reg values to the Hyperion servers:

Key: HKLM\SYSTEM\CurrentControlSet\Control\Lsa\Kerberos\Parameters
Value: REG_DWORD LogLevel
Data: 1
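
If you need to set this on more than one server, the same value can be scripted.  Below is a rough Python sketch that only builds the equivalent “reg add” command line; the function name is my own invention for illustration, and the sketch prints the command rather than running it.  On the target host you would pass the list to subprocess.run from an elevated prompt.

```python
KERBEROS_PARAMS_KEY = r"HKLM\SYSTEM\CurrentControlSet\Control\Lsa\Kerberos\Parameters"


def build_reg_add_command(key=KERBEROS_PARAMS_KEY, value="LogLevel", data=1):
    """Build a reg.exe argument list that sets a REG_DWORD value.

    /f overwrites an existing value without prompting.
    """
    return ["reg", "add", key, "/v", value, "/t", "REG_DWORD", "/d", str(data), "/f"]


if __name__ == "__main__":
    # Print only; nothing is changed on the machine running this sketch.
    print(" ".join(build_reg_add_command()))
```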

A reboot was necessary to activate logging.  Soon we had the culprit captured in the system log:

Source: Kerberos
Event ID: 3
Description:  A Kerberos Error Message was received:
on logon session

Server Name: host/hyperion10.uvm.edu

Ah!  No principal existed for hyperion10.uvm.edu!  And thus, the OpsMgr agent could not create a secure channel with the server using Kerberos, which is the only method implemented in OpsMgr without resorting to certificate-based authentication.

Now that I knew what the problem really was, fixing the problem was easier (although not easy).  The following KB contained info on fixing DNS mismatches between the host and Active Directory:

There, we are instructed to add the required Service Principal Name directly to Active Directory.  This was pretty easy… we just need the Windows Server 2003 Resource Kit Tools, and then we run:

setspn -a host/hyperion10.uvm.edu hyperion10

We also needed to fix the mismatch in DNS suffixes.  The KB above suggests removing the requirement for client computer DNS suffix validation throughout the entire domain.  This sounded like a bad idea to me, so I did some investigating, and found that you can modify the ACL of a computer object in Active Directory to allow the “SELF” object to have “Write DNS Host Name Attributes” rights under the “Properties” tab in the AD Users and Computers MMC (also, there is “Write dNSHostName”… probably the same thing).  I added this right, then rebooted the servers.  The Event IDs discussed above all went away!  Start the party!  Pop the cork!  A quick agent re-install and our bloody Hyperion systems are now being monitored.

I am not sure what the moral of the story is… always grant your consultants rights to your DNS server?  Watch your consultants like a hawk 24×7?  Don’t bother with system monitoring as it is a time sink?  Always take a nap under your desk at lunch time?  Feel free to draw your own conclusions…

Exposing VSS Snapshots as a drive letter

Here is about the most useful bit of script magic I have seen for Windows in quite a while:

The script shown here allows you to create a persistent snapshot of a Windows Server 2003 volume, and then expose it as a drive letter. This opens up all sorts of other scripting possibilities. Most immediately, it allows me to synchronize filesystems from a point-in-time copy on demand or in a scheduled task. I need this for replication of Windows deployment points, and for refreshing our pre-prod ApplicationXtender server from the production environment.

This script uses “VSHADOW.exe”, part of the VSS SDK available here:

VSHADOW cannot expose existing “client accessible” snapshots that were generated by standard Volume Shadow Copy Service scheduled tasks, but you can use this script to schedule your own snaps. VSHADOW can even create a snap, execute arbitrary code (such as “robocopy”), and then delete the snap immediately (using “non-persistent” snapshots).
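
For a rough idea of the VSHADOW calls involved, here is a minimal Python sketch that only assembles the command lines; it is not the script referred to above.  The function names, file paths, and snapshot ID placeholder are hypothetical, and the switches (-p, -nw, -script, -el, -ds, -exec) are as I recall them from the VSHADOW usage text, so check them against “vshadow /?” on your own build before relying on them.

```python
def create_persistent_cmd(volume, vars_script):
    """Create a persistent (-p) snapshot without involving writers (-nw).

    -script asks VSHADOW to write a CMD file that sets variables such as
    SHADOW_ID_1, which later steps can use to find the new snapshot.
    """
    return ["vshadow", "-p", "-nw", f"-script={vars_script}", volume]


def expose_cmd(shadow_id, drive_letter):
    """Expose an existing persistent snapshot locally as a drive letter."""
    return ["vshadow", f"-el={shadow_id},{drive_letter}"]


def delete_cmd(shadow_id):
    """Delete a snapshot by its ID."""
    return ["vshadow", f"-ds={shadow_id}"]


def snapshot_and_run_cmd(volume, command_file):
    """Non-persistent variant: run a command file while the snapshot exists.

    The snapshot goes away when VSHADOW exits.
    """
    return ["vshadow", "-nw", f"-exec={command_file}", volume]


if __name__ == "__main__":
    # Print only; the ID below is a placeholder, not a real snapshot ID.
    shadow_id = "{SHADOW-ID-FROM-VARS-SCRIPT}"
    for cmd in (
        create_persistent_cmd("C:", r"C:\temp\setvars.cmd"),
        expose_cmd(shadow_id, "X:"),
        delete_cmd(shadow_id),
        snapshot_and_run_cmd("C:", r"C:\scripts\sync.cmd"),
    ):
        print(" ".join(cmd))
```

Building the argument lists separately from running them keeps the sketch safe to try anywhere; on the server itself each list could be handed to subprocess.run in turn, with the snapshot ID read from the generated variables script.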