Category Archives: Virtual Servers

Server 2012 boot hang in vSphere/ESXi

Over the past year or so we have been having problems with Server 2012 and 2012 R2 virtual machines hanging during reboot operations. The systems hang at the “splash screen”, showing the Windows logo and the ring of spinning dots… forever!

Finally, I was able to find a fix for this problem here:
http://kb.vmware.com/kb/2092807

From a tip-off here:
https://social.technet.microsoft.com/Forums/windowsserver/en-US/595c3048-4d70-48ad-a78e-9380df1bbd70/windows-2012-r2-sometimes-hangs-at-splash-screen-after-reboot?forum=winserver8gen

The problem? Well, it is probably best to just read the TechNet social thread if you really want to know. It is none too exciting, and all very aggravating. The fix? Run a PowerShell script, then vMotion your machines to force ESXi to re-read the VMX file for your guests.

I am posting my variation on the script in the KB here, because VMware’s script is incomplete, and difficult to read.

#Source: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2092807
#Script intended to correct the bug where Windows Server 2012+ systems on VM Hardware v10 will hang at the
#boot-up splash screen.  Problem is caused by the failure of ESX to clear the TSC (time stamp counter) on a guest soft reset (reboot).

Set-PSDebug -Strict

#Initialize the VIToolkit:
if ( (Get-PSSnapin -Name VMware.VimAutomation.Core -ErrorAction SilentlyContinue) -eq $null ) {
    Add-PsSnapin VMware.VimAutomation.Core
}
[Reflection.Assembly]::LoadWithPartialName("VMware.Vim")

#Connect to your virtual center:
$viServ = "myVCenter.domain.com"
Connect-VIServer -Server $viServ

#Get all VMs in the vCenter:
$vms = Get-VM 
#Loop through the VMs:
ForEach ($vm in $vms){
	#Get a "View" object for each VM.  Views expose useful data that is not contained in the VM object:
	$vmv = $vm | Get-View
	$name = $vmv.Name
	$guestid = $vmv.Summary.Config.GuestId
	
	if ($guestid -like "windows8*Guest") {
		#windows8*Guest will match Windows 8 client 32-bit and 64-bit, as well as 
		# "windows8Server64Guest" (which is Windows Server 2012 and 2012 R2).
		
		#We need to update the VMX file for the VM, which is done using a VirtualMachineConfigSpec.
		$vmx = New-Object VMware.Vim.VirtualMachineConfigSpec
		$opt = New-Object VMware.Vim.OptionValue
		$opt.Key = "monitor_control.enable_softResetClearTSC"
		$opt.Value = "TRUE"
		$vmx.ExtraConfig = @($opt)
		#Queue the reconfiguration; ESXi only re-reads the VMX after a vMotion or power cycle.
		$vmv.ReconfigVM_Task($vmx) | Out-Null
		Write-Host "Edited" $vmv.Name
		$vmv.Name | Out-File -FilePath C:\local\temp\softTscOut.txt -NoClobber -Append
	}
}
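To confirm that a given VM actually picked up the setting, one option is to inspect a copy of its .vmx file (for example, one downloaded through the datastore browser). The following is a minimal sketch, assuming the usual `key = "value"` line format of VMX files; `Get-VmxSetting` is a hypothetical helper, not part of PowerCLI. The same option should also be visible in the VM's `Config.ExtraConfig` array through `Get-View`.

```powershell
# Hypothetical helper: read one key from VMX-formatted text (lines like: key = "value").
# Assumes one setting per line; key comparison is case-insensitive (PowerShell -eq).
function Get-VmxSetting {
    param(
        [Parameter(Mandatory = $true)] [AllowEmptyString()] [string[]] $VmxLines,
        [Parameter(Mandatory = $true)] [string] $Key
    )
    foreach ($line in $VmxLines) {
        # Capture the key and the quoted value on each line.
        if ($line -match '^\s*([^=\s]+)\s*=\s*"(.*)"\s*$') {
            if ($Matches[1] -eq $Key) {
                return $Matches[2]
            }
        }
    }
    return $null  # Key not present in the file.
}

# Example usage against a local copy of a VM's .vmx file:
# $lines = Get-Content -Path 'C:\local\temp\myVM.vmx'
# $value = Get-VmxSetting -VmxLines $lines -Key 'monitor_control.enable_softResetClearTSC'
# if ($value -eq 'TRUE') { 'Setting present' } else { 'Setting missing' }
```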

VMware View – Provisioning/Composing hangs, Event log failures, and more!

VMware Horizon View… great product. View Composer? Thorn in my side.

Two weeks back I completed the upgrade of our View infrastructure from 5.3.2 to 6.0.1. It was a smooth upgrade, seemingly, and I was pretty pleased with how little time it took to complete the job. Victory for our team? Not so much.

Over the next week, I had dozens of complaints from IT staff that recompose operations were failing, that searches for events related to these failures were returning no results (or just not completing at all), and there were multiple odd “I am getting this weird error on my desktops!” complaints.  The desktop errors all turned out to be unrelated to the upgrade (the template was out of disk space, so the user profile could not load, the View Agent installation was broken, etc. etc.), but sorting out the event log and composer problems was harder…

View 6 Event Log database bug:

Following the upgrade, I was looking into increasing the View Event Log query limit per the request of a client, who was not able to view more than the past few hours of events for his pool owing to the default event query limit of 2000 events.  I noticed that these queries, in addition to being short on useful information, also were taking several minutes to complete.  After bumping the query limit to 6000 events, we found that the queries were taking over 30 minutes to complete, and hogging up all the CPU on the Virtual Center server (where the events database is hosted)!  I verified that memory and disk were not bottlenecked on the SQL database (I could not add more CPU because I already was at the SQL Standard Edition max of four cores), and set SQL tracing to look for deadlock events.  After running into a bunch of dead ends, I finally opened a support case with VMware.

Unsurprisingly, the first response was “well, lower your query limit.”  I explained that no, I was not going to do that.  I also pointed out that selecting 6000 records from a 2.4 GB database really should not take 30 minutes, and that engineering just needed to buckle down and fix whatever index was causing the problem.  A few days later, I was given one line of T-SQL to run against the View Events database to add a missing index.  Query executed, index created, and voila!  Event queries started running in seconds, not hours.  Here is the T-SQL:

CREATE INDEX IX_eventid ON dbo.VE_event_data (eventid)

Your table name might be slightly different, depending on the table prefix you selected when setting up the events database.
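Because the table name depends on the prefix chosen at setup, a small helper can generate the statement for your prefix, wrapped in an existence check so that it is safe to run twice. This is a sketch, assuming the events table is always named `<prefix>event_data` in the `dbo` schema, as in the example above; `New-EventIdIndexSql` is hypothetical. Run the resulting statement yourself (for example, in SQL Server Management Studio) after taking a backup.

```powershell
# Hypothetical helper: build an idempotent CREATE INDEX statement for a View events table prefix.
# Assumes the events table is named "<prefix>event_data" in the dbo schema (e.g. VE_event_data).
function New-EventIdIndexSql {
    param(
        [Parameter(Mandatory = $true)] [string] $TablePrefix
    )
    # Reject anything that is not a plain identifier, to avoid building malformed SQL.
    if ($TablePrefix -notmatch '^[A-Za-z0-9_]+$') {
        throw "Unexpected table prefix: $TablePrefix"
    }
    $table = "dbo.${TablePrefix}event_data"
    # sys.indexes lookup skips creation when the index already exists.
    return "IF NOT EXISTS (SELECT 1 FROM sys.indexes WHERE name = 'IX_eventid' AND object_id = OBJECT_ID('$table')) " +
           "CREATE INDEX IX_eventid ON $table (eventid)"
}

# Example:
# New-EventIdIndexSql -TablePrefix 'VE_'
```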

Composer Failures:

We have seen this before… someone recomposes a pool, the job half-finishes then stops, no error.  The task cannot be canceled, the pool cannot be deleted, and all other Composer operations in the infrastructure grind to a halt.  Why?  If you call VMware support, the first thing they will tell you is “cache corruption”.  The next is “stale desktops”.  Huh?

Deleting Stale Desktops:

http://kb.vmware.com/selfservice/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=2015112

Clearing the Connection Server Cache:

No KB for this one that I am aware of.  Here is what they always tell me to do… ready?  You are going to like this…

  1. Shut down all of the connection servers in your farm.
  2. Turn the connection servers back on, one at a time.

Augh!

The worst part is that neither of these solutions worked.  However, what I did find was that after powering the connection servers back on, some composer operations would succeed, but it was only a matter of time before one job failed and brought operations to a halt.  Finally I noticed that when rebooting one of the connection servers (the newest one, used for testing security settings), jammed jobs would immediately resume.

After digging into the logs in C:\ProgramData\VMware\VDM\logs, I found that this Connection Server was reporting literally thousands of “could not connect to vCenter server at URL…” errors per day.  Why?  Because, like a noob, I did not give this connection server an interface to the vCenter server.  Bad on me.  However, these critical failures do not show up in the Windows event logs, nor do they get reported up to the View Administrator console.  I had a bad connection server in my environment that was killing Composer operations, and View Administrator thought everything was peachy.  Boo!  I have complained to VMware support, for what it is worth.  I also fixed the connection server, and things are back to “normal”, whatever that means.
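Since these errors never reached the event logs or View Administrator, searching the connection server logs directly is one quick way to spot them. A minimal sketch, assuming plain-text logs under `C:\ProgramData\VMware\VDM\logs` that contain the literal text "could not connect to vCenter server"; the function name is hypothetical, and the exact message wording may differ between View versions.

```powershell
# Hypothetical helper: count log lines containing an error message, grouped per log file.
# Uses a literal, case-insensitive substring match (Select-String -SimpleMatch).
function Get-LogErrorCount {
    param(
        [Parameter(Mandatory = $true)] [string] $LogPath,
        [string] $Message = 'could not connect to vCenter server'
    )
    Get-ChildItem -Path $LogPath -Recurse -File |
        Select-String -SimpleMatch -Pattern $Message |
        Group-Object -Property Path |
        Sort-Object -Property Count -Descending |
        Select-Object -Property Count, Name
}

# Example usage on each connection server:
# Get-LogErrorCount -LogPath 'C:\ProgramData\VMware\VDM\logs' | Format-Table -AutoSize
```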

I also got my manager to approve using Splunk to collect all View log files, so that I at least will have an easier time of discovering errors when they arise in the future.

vSphere 5.1 – Train Wreck in Slow Motion

vSphere 5.1 arrived this summer to no great fanfare. We waited a few weeks, heard no sounds of howling pain (we did not listen very hard, I guess), and decided to proceed with upgrading vCenter.  I have been digging out of the wreckage ever since.

How do you know if upgrading to vSphere 5.1 is right for you?  Here are a few bullet points to help you decide:

  • Do you have CA-signed (externally trusted, or in-house Enterprise CA server) certificates in use in your current vSphere environment?
  • Are you using an external MS SQL Server to host your vCenter database?  Are you using mirrored SQL databases?
  • Is your environment currently stable and reliable?

If you answered “yes” to any of these questions, do not upgrade to vSphere 5.1.  At least, not yet. Do not deceive yourself into thinking that the vSphere 5.1.0a release will be any help, either.

What is the big problem, you ask?  The major source of pain in this release is the new “Single Sign-On Service” that handles authentication and authorization for all of the other vSphere components.  This component of vSphere has twitchy SSL certificate requirements that are poorly documented by VMware.  The SSL requirements are so touchy that in our case, even the self-signed certs generated by the installer did not work.  Unlike all of the other current vSphere components, it does not support mirrored SQL databases.  It has new permissions requirements in AD that are not documented at all, and at the time of our installation, did not even have a KB entry.  The installer is very buggy, most notably in that it requests that you set an admin password for the SSO Service, and demands password complexity, but it does not inform you when your password is unacceptably long (i.e. longer than 32 characters) or when your password contains illegal characters (i.e. most regular expression special characters).
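Because the installer does not say why a password is rejected, it can save a retry to check a candidate password first. A minimal sketch, assuming the limits described above: at most 32 characters, and none of the characters in the class below. That character set is an assumption about what "most regular expression special characters" covers, not a list published by VMware; `Test-SsoPassword` is hypothetical. The same check with `-MaxLength 24` fits the Update Manager SQL account limit mentioned in item 6 below.

```powershell
# Hypothetical pre-check for an SSO admin password, based on the limits observed above.
# The 32-character limit comes from experience; the character list is an assumption.
function Test-SsoPassword {
    param(
        [Parameter(Mandatory = $true)] [string] $Password,
        [int] $MaxLength = 32
    )
    $problems = @()
    if ($Password.Length -gt $MaxLength) {
        $problems += "Longer than $MaxLength characters"
    }
    # Regular expression metacharacters: \ ^ $ . | ? * + ( ) [ ] { }
    if ($Password -match '[\\^$.|?*+()\[\]{}]') {
        $problems += 'Contains regular expression special characters'
    }
    return $problems  # No output means no known problems.
}

# Example:
# @(Test-SsoPassword -Password $candidate).Count -eq 0
```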

So, if you do upgrade, be prepared for an extended service outage.  Give yourself a long service window.  Have your VMware support contract numbers handy.  Familiarize yourself with the myriad of locations that are used to log vCenter data.  Learn to use PowerShell (get-childitem -recurse | select-string -pattern "configSettingThatThevCenterInstallerBorkedUp") and keep this page bookmarked:

http://derek858.blogspot.com/2012/09/vmware-vcenter-51-installation-part-1.html

Here at UVM, we are indebted to Derek Seaman for his thorough documentation of the vSphere 5.1 installation process and detailed SSL certificate generation instructions.

Following are some installation quirks that we encountered, presented mainly for my own reference, but maybe you will find them useful as well:

  1. “Performance Charts Experienced an Internal Error” seen in the vSphere client after the upgrade:
    This happened because vCenter Web Services did not read the database mirroring configuration from our defined ODBC data sources… it grabbed the primary database only, and not the mirror data.  The fix?  Edit:
    “%ProgramData%\VMware\VMware VirtualCenter\vcdb.properties”
    Find the “url=” line, and append:
    ;failoverPartner\=[mirrorServer]
    (Where [mirrorServer] is the actual DB mirror host name.  Don’t forget the “\” before the “=”.)
  2. Some users with permissions to vCenter 5.0 cannot log in after the upgrade.  In the vSphere web client, these users are marked as “disabled”:
    This occurred for us for two reasons:

    1. The SSO Service installer prompts us for a service account to use during install.  Following installation, the service is seen to be running as “SYSTEM”, and not the specified service account.  Change the service to run with your planned service account using services.msc after the installation.  As an alternative, you could specify those credentials in the vSphere Web Client -> Administration -> Sign-On and Discovery -> Configuration -> Identity Sources.  Edit your identity source, and under “Authentication Source” select “password”, then enter your service account credentials.
    2. The SSO Service needs to read account attributes that cannot be read by a standard user account (at least, not in an AD forest at a Server 2008 R2 functional level).  When we asked VMware support to define the required permissions, they replied: “an account has to have at least read-only permissions over the user and group Organization Units furthermore read permissions also on the properties of the users, such as UserAccessControl.”  After some experimentation, I just gave the SSO Service account “read all properties” rights to the account OU, and login abilities were restored.
  3. Our SSO Service broke when the mirrored database servers that we currently use for vCenter services had a failover event.  During install, I used the standard “failoverPartner=” JDBC connection string property to specify our failover database server.  Unfortunately, the SSO service ignores this property.  I could not identify an acceptable workaround for this problem. Ultimately, I installed a SQL Express instance on our vCenter server to house just the SSO database.  Before settling on that, I tried:
    1. Using SQL Aliases, but this failed because the JDBC driver is not aware of SQL Aliases.
    2. Using a script that edits the local “hosts” file on a database failover event.  I then used this host name alias for the database connections.  This almost worked.  I edited the following files to use the host alias, instead of the actual database server host name:
      %ProgramFiles%\VMware\Infrastructure\SSOServer\webapps\ims\WEB-INF\classes\jndi.properties
      and:
      %ProgramFiles%\VMware\Infrastructure\SSOServer\webapps\lookupservice\WEB-INF\classes\config.properties
      Upon restart, the SSO Service was able to connect to the database, but it did not survive a failover.  Apparently the old database connection information was still in use somewhere, and VMware support was not helpful in identifying all of the database configuration locations for SSO.
    3. While VMware does have command line configuration tools that could have been used to script reconfiguration of the database connection strings, I have deemed that they are too fragile for production use.
  4. The option to authenticate using Windows session credentials in the vSphere Client (traditional version) stopped working after the 5.1 upgrade.  This is a bug that is fixed with the 5.1.0a release.  Unfortunately, the SSO installer for 5.1.0a does not work in upgrade mode.  Aargh!  I had to uninstall the SSO service to get the updated files into place.  Guess what the uninstaller does?  That’s right… it erases the SSO Service database (drops all tables!  Gah!), and deletes all configuration files for the service.  Before you upgrade, make sure that you have an SSO Service backup bundle.  I did, but it was outdated.  I had to re-register all of the vCenter components with SSO manually, which was a pain in the butt.
  5. vSphere Update Manager registered with vCenter using the wrong DNS name.  We could not scan ESXi hosts for updates, because vCenter was telling them to connect to an invalid URL.  To fix, I needed to search the registry for the incorrect host name, and replace with the correct one:
    “HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\VMware, Inc.\VMware Update Manager\VUMServer”
    For good measure I also edited:
    %ProgramFiles(x86)%\VMware\Infrastructure\Update Manager\extension.xml
    To contain the correct host name.  Then we restart the Update Manager services, and we are back in business.
  6. Other fun related to VMware Update Manager… the SQL Account used by Update Manager cannot have a password that exceeds 24 characters in length. Special characters in the SQL Account password also may cause problems.
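The extension.xml edit from item 5 can be scripted. This is a minimal sketch, assuming a plain literal host name swap is what you want; it keeps a `.bak` copy first and writes the file back as UTF-8 without a BOM, so check that this suits your file. `Update-HostNameInFile` is hypothetical, and the registry entries still need to be fixed separately (for example, with regedit).

```powershell
# Hypothetical helper: replace every occurrence of an old host name in a text file,
# keeping a .bak copy. The match is literal (regex-escaped) and case-insensitive.
function Update-HostNameInFile {
    param(
        [Parameter(Mandatory = $true)] [string] $Path,
        [Parameter(Mandatory = $true)] [string] $OldName,
        [Parameter(Mandatory = $true)] [string] $NewName
    )
    Copy-Item -Path $Path -Destination "$Path.bak" -Force
    $text = [System.IO.File]::ReadAllText($Path)
    # Escape the old name so dots are not regex wildcards,
    # and double any "$" in the new name so it is not read as a substitution.
    $updated = $text -replace [regex]::Escape($OldName), $NewName.Replace('$', '$$')
    # File.WriteAllText without an encoding argument writes UTF-8 without a BOM.
    [System.IO.File]::WriteAllText($Path, $updated)
    return ($text -cne $updated)  # True if anything changed.
}

# Example (stop the Update Manager services first):
# Update-HostNameInFile -Path "${env:ProgramFiles(x86)}\VMware\Infrastructure\Update Manager\extension.xml" `
#     -OldName 'wrongname.mydomain.org' -NewName 'myViServer.mydomain.org'
```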

So, VMware is not my favorite company this month.  On to solve more problems.  We still cannot add new permissions to vCenter, and Performance Charts are loading like a slug in winter.

VMware Performance Charts broken again – fix your connection string.

Following the upgrade of our Virtual Center server to the vSphere 5 version, we have been struggling with crashing services and memory exhaustion.  Well, the server got an upgrade from 8 GB to 16 GB of RAM this morning, so it is now swimming in excess memory.  Despite this, the Perf Charts in vCenter have gone dead again.

We were seeing errors similar to those described here:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1012812
We got the “perf charts service experienced an internal error” message when looking at the Performance tab for any object in the vSphere Client.  A look at the latest “vctomcat-stderr[…].log” file in “C:\Program Files\VMware\Infrastructure\tomcat\logs” reveals JDBC connection errors.  The damn service is trying to connect to the standby partner of our SQL database mirrored pair!  So where are the connection strings stored for this service?  Well, no thanks to VMware documentation, I discovered the connection string stored here:
C:\ProgramData\VMware\VMware VirtualCenter\vcdb.properties

All I had to do was append “;failoverPartner=[hostName]” to the line starting with “url=”, then restart the Tomcat service (the “VMware VirtualCenter Management WebServices” service).  Voilà… performance reports are back.
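If you manage more than one vCenter server, the edit can be scripted. A minimal sketch, assuming a single line that starts with exactly `url=`; following the advice in item 1 of the vSphere 5.1 notes above, it escapes the equals sign as `\=`. `Add-FailoverPartner` is hypothetical and works on the file's lines, so back up vcdb.properties before writing the result, then restart the WebServices service as described.

```powershell
# Hypothetical helper: append a failoverPartner property to the "url=" line of vcdb.properties text.
# Lines are passed in and returned, so the logic can be checked before touching the real file.
function Add-FailoverPartner {
    param(
        [Parameter(Mandatory = $true)] [AllowEmptyString()] [string[]] $Lines,
        [Parameter(Mandatory = $true)] [string] $MirrorHost
    )
    foreach ($line in $Lines) {
        # Only touch the JDBC url line, and only if no failover partner is set yet.
        if ($line.StartsWith('url=') -and $line -notmatch 'failoverPartner') {
            # "\=" escapes the equals sign inside the property value.
            $line + ';failoverPartner\=' + $MirrorHost
        } else {
            $line
        }
    }
}

# Example usage (elevated prompt, after backing up the file):
# $path = 'C:\ProgramData\VMware\VMware VirtualCenter\vcdb.properties'
# $new = Add-FailoverPartner -Lines (Get-Content -Path $path) -MirrorHost 'sqlmirror.mydomain.org'
# Set-Content -Path $path -Value $new
```

Because lines that already mention failoverPartner are left alone, running it twice does not append a second partner.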

Now back to fixing everything else that is broken… also known as “everything”.

Discovering orphaned vmdk files in vSphere

On occasion we have found abandoned vmdk files in our vSphere infrastructure. I often have thought we needed to take some time to hunt down and exterminate these orphans. As is often the case, someone else already did the initial research required to make automation of this task possible, but I found I needed to do some updating of the source scripts for improved accuracy, improved formatting, and compatibility with vSphere 4.1:

# getOrphanVMDK.ps1
# Purpose : List all orphaned vmdk on all datastores in all VC's
# Version : v2.0
# Author  : J. Greg Mackinnon, from original by HJA van Bokhoven
# Change  : v1.1  2009.02.14  DE  adapted for ESX 3.5, send email and output file size
# Change  : v1.2  2011.07.12 EN  Updated for ESX 4, collapsed if loops into single conditional
# Change  : v2.0  2011.07.22 EN: 
	# Changed vmdk search to use the VMware.Vim.VmDiskFileQuery object to improve search accuracy
	# Change vmdk matching logic as a result of VmDiskFileQuery usage
	# Pushed discovered orphans into an array of custom PS objects
	# Simplified logging and email output
			
Set-PSDebug -Strict

#Initialize the VIToolkit:
add-pssnapin VMware.VimAutomation.Core
[Reflection.Assembly]::LoadWithPartialName("VMware.Vim")

#Main

[string]$strVC = "myViServer.mydomain.org"								# Virtual Center Server name
[string]$logfile = "c:\local\temp\getOrphanVMDK.log"
[string]$SMTPServer = "mysmtp.mydomain.org"							# Change to a SMTP server in your environment
[string]$mailfrom = "GetOrphanVMDK@myViServer.mydomain.org"	# Change to email address you want emails to be coming from
[string]$mailto = "vmware@mydomain.org"							# Change to email address you would like to receive emails
[string]$mailreplyto = "vmware@mydomain.org"						# Change to email address you would like to reply emails

[int]$countOrphaned = 0
[int64]$orphanSize = 0

# vmWare Datastore Browser query parameters
# See http://pubs.vmware.com/vi3/sdk/ReferenceGuide/vim.host.DatastoreBrowser.SearchSpec.html
$fileQueryFlags = New-Object VMware.Vim.FileQueryFlags
$fileQueryFlags.FileSize = $true
$fileQueryFlags.FileType = $true
$fileQueryFlags.Modification = $true
$searchSpec = New-Object VMware.Vim.HostDatastoreBrowserSearchSpec
$searchSpec.details = $fileQueryFlags
#The .query property is used to scope the query to only active vmdk files (excluding snaps and change block tracking).
$searchSpec.Query = (New-Object VMware.Vim.VmDiskFileQuery)
#$searchSpec.matchPattern = "*.vmdk" # Alternative VMDK match method.
$searchSpec.sortFoldersFirst = $true

if ([System.IO.File]::Exists($logfile)) {
    Remove-Item $logfile
}

#Time stamp the log file
(Get-Date -Format "yyyy-MM-dd HH:mm:ss") + "  Searching Orphaned VMDKs..." | Tee-Object -Variable logdata
$logdata | Out-File -FilePath $logfile -Append
#Connect to vCenter Server
Connect-VIServer $strVC

#Collect array of all VMDK hard disk files in use:
[array]$UsedDisks = Get-View -ViewType VirtualMachine | % {$_.Layout} | % {$_.Disk} | % {$_.DiskFile}
#The following three lines were used before adding the $searchSpec.query property.  We now want to exclude template and snapshot disks from the in-use-disks array.
# [array]$UsedDisks = Get-VM | Get-HardDisk | %{$_.filename}
# $UsedDisks += Get-VM | Get-Snapshot | Get-HardDisk | %{$_.filename}
# $UsedDisks += Get-Template | Get-HardDisk | %{$_.filename}

#Collect array of all Datastores:
#$allDS is a list of datastores, filtered to exclude ESX local datastores (all of which end with "-local1" in our environment), and our ISO storage datastore.
[array]$allDS = Get-Datastore | select -property name,Id | ? {$_.name -notmatch "-local1"} | ? {$_.name -notmatch "-iso$"} | Sort-Object -Property Name

[array]$orphans = @()
Foreach ($ds in $allDS) {
	"Searching datastore: " + [string]$ds.Name | Tee-Object -Variable logdata
	$logdata | Out-File -FilePath $logfile -Append
	$dsView = Get-View $ds.Id
	$dsBrowser = Get-View $dsView.browser
	$rootPath = "["+$dsView.summary.Name+"]"
	$searchResult = $dsBrowser.SearchDatastoreSubFolders($rootPath, $searchSpec)
	foreach ($folder in $searchResult) {
	    foreach ($fileResult in $folder.File) {
			if ($UsedDisks -notcontains ($folder.FolderPath + $fileResult.Path) -and ($fileResult.Path.length -gt 0)) {
				$countOrphaned++
				IF ($countOrphaned -eq 1) {
					("Orphaned VMDKs Found: ") | Tee-Object -Variable logdata
					$logdata | Out-File -FilePath $logfile -Append
				}
				$orphan = New-Object System.Object
				$orphan | Add-Member -type NoteProperty -name Name -value ($folder.FolderPath + $fileResult.Path)
				$orphan | Add-Member -type NoteProperty -name SizeInGB -value ([Math]::Round($fileResult.FileSize/1gb,2))
				#Zero-padded yyyy-MM-dd so that the emailed report sorts correctly by date:
				$orphan | Add-Member -type NoteProperty -name LastModified -value ($fileResult.Modification.ToString("yyyy-MM-dd"))
				$orphans += $orphan
				$orphanSize += $fileResult.FileSize
				$orphan | ft -autosize | out-string | Tee-Object -Variable logdata
				$logdata | Out-File -FilePath $logfile -Append
				[string]("Total Size of orphaned files: " + ([Math]::Round($orphanSize/1gb,2)) + " GB") | Tee-Object -Variable logdata
				$logdata | Out-File -FilePath $logfile -Append
				Remove-Variable orphan
			}
		}
	}
}
(Get-Date -Format "yyyy-MM-dd HH:mm:ss") + "  Finished (" + $countOrphaned + " Orphaned VMDKs Found.)" | Tee-Object -Variable logdata
$logdata | Out-File -FilePath $logfile -Append

if ($countOrphaned -gt 0) {
	[string]$body = "Orphaned VMDKs Found: `n"
	$body += $orphans | Sort-Object -Property LastModified| ft -AutoSize | out-string
	$body += [string]("Total Size of orphaned files: " + ([Math]::Round($orphanSize/1gb,2)) + " GB")
    $SmtpClient = New-Object system.net.mail.smtpClient
    $SmtpClient.host = $SMTPServer
    $MailMessage = New-Object system.net.mail.mailmessage
    $MailMessage.from = $mailfrom
    $MailMessage.To.add($mailto)
    $MailMessage.replyto = $mailreplyto
    $MailMessage.IsBodyHtml = 0
    $MailMessage.Subject = "Info: VMware orphaned VMDKs"
    $MailMessage.Body = $body
	"Mailing report... " | Tee-Object -Variable logdata
	$logdata | Out-File -FilePath $logfile -Append
    $SmtpClient.Send($MailMessage)
}
Disconnect-VIServer -Confirm:$False
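The core of the script is a set difference: every .vmdk the datastore browser returns, minus every disk referenced by a VM. With thousands of disks, `-notcontains` on an array does a linear scan for every file. A minimal sketch of the same comparison using a case-insensitive HashSet, assuming paths in the "[datastore] folder/file.vmdk" form that both `Layout.Disk.DiskFile` and `FolderPath + Path` produce; `Find-OrphanPath` is hypothetical and works on plain strings so it can be tried offline.

```powershell
# Hypothetical helper: return candidate paths that are not in the in-use list.
# A case-insensitive HashSet mirrors PowerShell's case-insensitive -notcontains.
function Find-OrphanPath {
    param(
        [string[]] $UsedPaths = @(),
        [string[]] $CandidatePaths = @()
    )
    $used = New-Object 'System.Collections.Generic.HashSet[string]' ([System.StringComparer]::OrdinalIgnoreCase)
    foreach ($p in $UsedPaths) { [void]$used.Add($p) }
    foreach ($c in $CandidatePaths) {
        # Skip empty names, as the script above does with its Path.length check.
        if ($c.Length -gt 0 -and -not $used.Contains($c)) { $c }
    }
}
```

In the script above, the candidate list would be built from `$folder.FolderPath + $fileResult.Path`, with `$UsedDisks` as the in-use list.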

Adding Drivers to the built-in Windows Recovery Environment

Windows 7 and Windows Server 2008 R2 feature an out-of-box installation of the very useful Windows Recovery Environment (WinRE).  WinRE can save your buttocks… but what if your system is using storage drivers that are not available in the out-of-box WinRE environment?  Such as the VMware Paravirtual SCSI driver (PVSCSI)?

Fortunately, WinRE is just a modified WinPE image, so you can add drivers using DISM.exe, right?  Sure… if you can find the WinPE image that is used by WinRE!  Luckily, there is a tool for this.

Open a command prompt on your Server 2008 R2 system, and run “REAgentC.exe /info”… the output will tell you where to find the image:

Recovery Environment: \\?\GLOBALROOT\device\harddisk0\partition2\Recovery\bb338b68-0d2c-11df-be64-84e1223bd0bb
BCD Id: bb338b68-0d2c-11df-be64-84e1223bd0bb

So, on the second partition of the first disk (also known as the “C:” drive, according to “Diskpart”), you will find a hidden “recovery” directory, with subdirectory “bb338b68-0d2c-11df-be64-84e1223bd0bb”.  Within here is “winre.wim”.
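If you repeat this on several systems, the image path can be derived from the REAgentC output. A minimal sketch, assuming the "Recovery Environment:" line format shown above and that the named partition is mounted as C: (as it is here; confirm with Diskpart on your own system). `Get-WinReWimPath` is hypothetical; on a live system you would feed it the output of `REAgentC.exe /info`.

```powershell
# Hypothetical helper: turn REAgentC /info output into a local path to winre.wim.
# Assumes the "Recovery Environment:" line format shown above and a known drive letter.
function Get-WinReWimPath {
    param(
        [Parameter(Mandatory = $true)] [AllowEmptyString()] [string[]] $ReAgentOutput,
        [string] $DriveLetter = 'C'
    )
    foreach ($line in $ReAgentOutput) {
        # Capture everything after "\partitionN\" (e.g. "Recovery\<guid>").
        if ($line -match 'Recovery Environment:\s*\\\\\?\\GLOBALROOT\\device\\harddisk\d+\\partition\d+\\(.+?)\s*$') {
            return "${DriveLetter}:\" + $Matches[1] + '\winre.wim'
        }
    }
    return $null  # Line not found; the output format may differ on this system.
}

# Example on a live system (elevated prompt):
# Get-WinReWimPath -ReAgentOutput (REAgentC.exe /info)
```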

Now there is simply the matter of injecting the drivers. First, place all the drivers you wish to inject into an easily accessible directory (such as c:\local\temp, in our example), and then run the following commands:

mkdir c:\wimtemp
dism /mount-wim /WimFile:c:\recovery\bb338b68-0d2c-11df-be64-84e1223bd0bb\winre.wim /index:1 /mountdir:c:\wimtemp
dism /image:c:\wimtemp /add-driver /driver:C:\local\temp /recurse
dism /unmount-wim /mountdir:c:\wimtemp /commit
rmdir /q c:\wimtemp

Et voila! We press F8 on the next reboot, select the recovery environment, and suddenly we have full access to the local disk.