Unexpected errors when removing or adding node in SQL Server cluster

Case #

After running SQL Server setup wizard to remove a node from a two-node SQL cluster, various cluster resources are removed and the SQL clustered instance remains in a failed state, even though the SQL Setup wizard had completed with exit code 0 (no errors). When trying to re-add the node, a series of errors occur, as described in this article.

When trying to start the failed resources in the clustered SQL instance, the following error occurs (SQL error code 0x80071736):

and/or the following error (SQL error code 0x8007139a), depending on whether the failed cluster resource has other failing dependency cluster resources or not.

Also when trying to re-add the SQL node to the failed SQL instance, the client network access name is not shown (blank) and the following error occurs (SQL error code 0x84C00001):

Solution #

Before you remove a node from the Windows Server operating system WFC cluster (evict node), it is important to first uninstall/remove any application-level clusters and other dependencies, for instance SQL Server clustered instances. In some cases however it appears that this can cause issues. In these cases, simply evicting the node from the WFC cluster is technically feasible and will not cause issues since the SQL clustered role basically relies on the WFC cluster information and keeps no separate configuration files, other than the registry. If the node which is being removed will be totally removed from the environment, then simply evicting the node can be justified.

First investigate WFC cluster logs, SQL instance error logs, SQL server event logs and SQL Server setup logs.

Check the cluster log on all cluster nodes, as per: https://blog.sqlauthority.com/2015/07/01/sql-server-steps-to-generate-windows-cluster-log/.

The SQL Server instance error log can be reviewed from inside the instance in SQL Server Management Studio. Each execution of Setup creates log files are created with a new timestamped log folder at %programfiles%\Microsoft SQL Server\100\Setup Bootstrap\Log. The time-stamped log folder name format is YYYYMMDD_hhmmss. When Setup is run in an unattended mode, the logs are created at % temp%\sqlsetup.log. All files in the logs folder are archived into the Log.cab file in their respective log folder. The SQL Server Setup logs should be investigated as per instructions at: https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/ms143702(v=sql.105).

Investigate all important cluster registry key groups:

  • Cluster - Computer\HKEY_LOCAL_MACHINE\Cluster
  • Cluster Resources - Computer\HKEY_LOCAL_MACHINE\Cluster\Resources
  • Microsoft SQL Server - Computer\HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server

I had same issue and checked again the cluster.log file, where I found following entries:
r : [sqsrvres] Dependency expression for resource 'INSTANCENAME' is '([af9669ae-e2b9-4548-994f-172d5ab167ed])'
000022b0.00000958::2021/04/05-16:51:40.485 ERR [RES] SQL Server : [sqsrvres] Unable to open the SQL Server instance registry of 'Cluster' with error: 2. Please contact customer support
000022b0.00002fc0::2021/04/05-16:51:40.486 ERR [RES] SQL Server : [sqsrvres] GetRegKeyAccessMask: Could not get registry access mask for registry key Software\Microsoft\Microsoft SQL Server\MSSQL15.INSTANCENAME\Cluster (status 2)).
000022b0.00002fc0::2021/04/05-16:51:40.486 ERR [RES] SQL Server : [sqsrvres] Worker Thread (1C8ECDE0): Failed to retrieve the SQL Server cluster registry key (last error = 2).
000022b0.00000958::2021/04/05-16:51:40.486 ERR [RES] SQL Server : [sqsrvres] SQL Cluster shared data upgrade failed with error 0 (worker retval = 2). Please contact customer support
000022b0.00000958::2021/04/05-16:51:40.486 ERR [RES] SQL Server : [sqsrvres] Failed to prepare environment for online. See previous message for detail. Please contact customer support
000022b0.00000958::2021/04/05-16:51:40.486 INFO [RES] SQL Server : [sqsrvres] SQL Server resource state is changed from 'ClusterResourceOnlinePending' to 'ClusterResourceFailed'
000022b0.00000958::2021/04/05-16:51:40.486 WARN [RHS] Online for resource SQL Server (INSTANCENAME) failed.

For some reason the SQL uninstall node setup had erroneously deleted some registry keys from path Software\Microsoft\Microsoft SQL Server\MSSQL15.INSTANCENAME\Cluster. When i recreated them by using a healthy instance as reference, I was able to run the SQL Server repair wizard successfully.

It appears that one good way to "reset" the registry key paths and values for a specific node of a WFC cluster is to evict the node from the WFC cluster and re-add it. This action needs to be performed with an administrator user which has all required permissions on the WFC cluster (file system, domain and registry) as well as permissions on the SQL cluster, if an SQL cluster is involved. You can try using the default domain admin user and then retest after refining the exact permissions needed.

If there are still registry keys missing and you encounter akward behavior when running the SQL Server "Add node to existing cluster" wizard (i.e. SQL Server clustered instances not shown/populated by the SQL wizard), you should run ProcMon on the affected server at the time when the issue occurs and inspect which registry keys are being accessed. It is important to understand the registry path/key structure for all cluster related components in each WFC node. For this you should more sophisticated registry management tools, such as reg compare or the Nirsoft registry tool (for full registry snapshots or WinMerge for comparison of reg files).

Run Powershell cmdlets to check available cluster resource types and cluster resource state:

  • Get-ClusterResource
  • Get-ClusterResourceType
  • Get-ClusterResourceGroup
  • Update-ClusterNetworkNameResource

Check that all AD accounts (VNO, VCO) and permissions requirements have been met as per: https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2008-R2-and-2008/cc731002(v=ws.10)?redirectedfrom=MSDN#BKMK_requirements.

Also check that DNS forward and reverse resolution works consistently for all clustered instances from all WFC cluster nodes, as per: https://support.delphix.com/Delphix_Virtualization_Engine/MSSQL_Server/Resolving_Error_%22Failed_to_discover_cluster_address_for_cluster_environment%22_(KBA1594)

Also check SPN status for all related computer objects and Windows services, as per: https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2012-r2-and-2012/cc731241(v=ws.11).

Check also that all WFC related network ports and protocols are allowed via the Windows Firewall in all WFC nodes, as per: https://docs.microsoft.com/en-us/troubleshoot/windows-server/networking/service-overview-and-network-port-requirements.

Ensure also that the CLIUSR is present and enabled in all WFC nodes, as per https://docs.microsoft.com/en-gb/archive/blogs/askcore/so-what-exactly-is-the-cliusr-account. It is possible in some cases that a corruption occurs in the CLIUSR profile or more CLIUSR users are created and then deleted, which can lead to a corrupt configuration, as shown in the example below. If you encounter the below case, you will need to reboot the server in question and run Delfprof afterwards to cleanup the invalid user profiles.

Try a SQL Server node repair using SQL Server installation media (only on the passive node each time) as per the following article: https://www.mssqltips.com/sqlservertip/2592/fix-sql-server-agent-on-windows-failover-cluster/. When you repair an instance of SQL Server:

  • All missing or corrupt files are replaced.
  • All missing or corrupt registry keys are replaced.
  • All missing or invalid configuration values are set to default values.

Before you continue, for SQL Server failover clusters, review the following important information:

Possible and preferred cluster resource owners #

It is very important to check both the possible and preferred owner of each cluster resource at two levels:

  • Cluster role level

Right click the SQL Server role, click Properties and check all relevant nodes in the General tab.

  • Cluster role resource level (each role has many cluster resources underneath)

Right click each of the cluster resources underneath the SQL role, click Properties and check all relevant nodes in the Advanced Policies tab.

You need to ensure that the possible and preferred owners are consistent at both the cluster role level and the cluster role resource level for each subordinate resource. It appears that these settings can become inconsistent after removing a SQL Server node from the cluster by running the SQL Server wizard.

Sources #













Powered by BetterDocs