Always On Availability groups resolving state after failover - Remote harden of transaction 'GhostCleanupTask' failed
Some of what follows is my own interpretation, based on my findings so far. If more information is needed, I am happy to add it.
The setup
We have 4 SQL Server 2016 instances, and all 4 underlying servers are in a single WSFC (a strange setup in my eyes, since the Always On AGs pair the instances 2 by 2, but that is a different question).
4 nodes + Fileshare witness
Version: 13.0.5026.0 (all nodes)
The instances that are having the failover issues are instances 1 & 2.
They have 3 AGs between them. AG1, which is having the issue, is primary on instance 1; AG2 and AG3 are primary on instance 2.
Instances 3 and 4 have their own two separate AGs.
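For reference, this layout can be confirmed from any node with a sketch like the following (the DMV and column names are standard; the AG names are ours):

SELECT ag.name AS ag_name,
       ar.replica_server_name,
       ars.role_desc              -- PRIMARY / SECONDARY
FROM sys.availability_groups AS ag
JOIN sys.availability_replicas AS ar
    ON ar.group_id = ag.group_id
JOIN sys.dm_hadr_availability_replica_states AS ars
    ON ars.replica_id = ar.replica_id
ORDER BY ag.name, ars.role_desc;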
It does not seem like any successful failover has happened since the AG was recreated a couple of months ago. The servers have updates pending as well (we will be doing reboots in a maintenance window).
AG1's settings
The week before
Last week, after checking for long-running queries (none) and the synchronization state (green across the board), we initiated a manual failover without data loss on AG1 from instance1 (primary) to instance2 (secondary).
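This is the T-SQL form of the planned manual failover we performed (whether issued directly or via the SSMS wizard, which generates the same statement):

-- Run on instance2, the synchronized secondary being promoted:
ALTER AVAILABILITY GROUP [AG1] FAILOVER;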
The failover failed, leaving both the former primary (instance1) and the former secondary (instance2) in a temporary resolving state. Two databases on the former primary even showed Reverting / In Recovery for a few minutes. In the error log, a message was shown on many occasions (the same 'Remote harden ... failed' errors quoted under "Take #2" below).
The databases were not accessible during the outage, which lasted about 10 minutes in total.
Before failing over I checked (a T-SQL sketch of these checks follows the list):
- currently running queries (no long-running queries)
- DBCC OPENTRAN
- the Always On dashboard (green across the board)
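A minimal sketch of those checks (the DMVs are standard; the exact queries are illustrative, not our verbatim scripts):

-- Oldest active transaction; run in each AG database:
DBCC OPENTRAN;

-- Per-database synchronization state for AG1, as the dashboard shows it:
SELECT ag.name AS ag_name,
       d.name  AS database_name,
       drs.synchronization_state_desc,
       drs.synchronization_health_desc
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_groups AS ag ON ag.group_id   = drs.group_id
JOIN sys.databases           AS d  ON d.database_id = drs.database_id
WHERE ag.name = N'AG1';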
We found a post by Microsoft recommending trace flag 3923. We enabled it on both instances after the outage.
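For completeness, this is the standard way to enable a trace flag globally at runtime (note that, enabled this way, it does not survive an instance restart; a -T3923 startup parameter would be needed to make it permanent):

DBCC TRACEON (3923, -1);   -- -1 = enable globally, for all sessions
DBCC TRACESTATUS (3923);   -- verify it is active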
Take #2, today
We tried the manual failover without data loss again, to see whether the issues were resolved.
Spoiler alert: they were not.
Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9d97 0000:000cad62) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (102:23518:2) failed.
Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9d97 0000:000cad62) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (102:23518:2) failed.
Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9da2 0000:078d645c) started at Feb 19 2019 11:17AM in database 'DB2' at LSN (5909:54307:137) failed.
Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9dee 0000:0017e40a) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (164:21769:2) failed.
Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141ba1a8 0000:0024e912) started at Feb 19 2019 11:17AM in database 'DB3' at LSN (1246:7285:2) failed.
Edit
There are also remote harden errors for other transaction types: FTEPersistIndex, IFTSAutoNested, and INSERT.
Remote harden of transaction 'FTEPersistIndex' (ID 0x00000000141b98b5 0000:078d6459) started at Feb 19 2019 11:16AM in database 'DB2' at LSN (5909:54266:104) failed.
Remote harden of transaction 'IFTSAutoNested' (ID 0x00000000141b98bf 0000:1d20f419) started at Feb 19 2019 11:16AM in database 'DB4' at LSN (93341:183352:3) failed.
Remote harden of transaction 'INSERT' (ID 0x00000000141b989c 0000:1d20f418) started at Feb 19 2019 11:16AM in database 'DB4' at LSN (93341:183346:17) failed.
We canceled the failover and it failed back to the former primary.
The actual question(s)
- The main question: am I right to assume that the failover fails because the GhostCleanupTask (and other tasks) cannot harden the transaction log on the remote replica, or is that a consequence of the failure rather than its cause?
- If it is the cause, what can be done to resolve it?
Something I am looking into: since the problem does not seem to be caused by any long-running queries rolling back, could the secondary databases be in an inconsistent state? They do show up as Synchronized, though.
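To dig into that, a sketch of the kind of deeper per-database check that can be run on the primary, beyond the Synchronized status (standard DMV columns; the queue sizes are in KB, and what counts as suspicious is a judgment call):

SELECT d.name AS database_name,
       drs.is_suspended,
       drs.suspend_reason_desc,
       drs.log_send_queue_size,   -- log not yet sent to the secondary, KB
       drs.redo_queue_size,       -- log not yet redone on the secondary, KB
       drs.last_hardened_lsn,
       drs.last_redone_lsn
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.databases AS d ON d.database_id = drs.database_id;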
Extra info
The only interesting message in the cluster log from last week appeared to be this one:
mscs::TopologyPersister::TryGetNetworkPrivateProperties: (2)' because of 'OpenSubKey failed.'
Failover cluster events
Cluster resource 'AG1' of type 'SQL Server Availability Group' in clustered role 'AG1' failed.
Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.
The Cluster service failed to bring clustered role 'AG1' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.
Event log on the secondary server
First message
Windows cannot load the extensible counter DLL InterceptCountersManager. The first four bytes (DWORD) of the Data section contains the Windows error code.
Second message
The configuration information of the performance library "perf-MSSQLSERVER-sqlctr13.2.5026.0.dll" for the "MSSQLSERVER" service does not match the trusted performance library information stored in the registry. The functions in this library will not be treated as trusted.
SQL Server Log
A large part of the SQL Server error log from when the failover happened is linked here.
The state flips between synchronized and not synchronized repeatedly during the resolving period:
DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint: 1B55DB6F-066D-442B-B97E-E32D6990756A:1
DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint: 1B55DB6F-066D-442B-B97E-E32D6990756A:4
DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint: 1B55DB6F-066D-442B-B97E-E32D6990756A:4
DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint: 1B55DB6F-066D-442B-B97E-E32D6990756A:1
DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint: 1B55DB6F-066D-442B-B97E-E32D6990756A:1
DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint: 1B55DB6F-066D-442B-B97E-E32D6990756A:4
DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint: 1B55DB6F-066D-442B-B97E-E32D6990756A:1
...
sql-server sql-server-2016 availability-groups failover
asked 5 hours ago by Randi Vertongen, edited 31 mins ago
Comments
Those two look unrelated to me. I get those regularly on our secondary, which fails over fine. Is that the only log in the cluster? – George.Palacios, 4 hours ago
If you can repro this, I would recommend opening a support incident with Microsoft. – Tony Hinkle, 4 hours ago
Is this the Bug which resurfaced? – Shanky, 4 hours ago
I'd tend to agree with @TonyHinkle in that case. If there are no error messages, there's not a lot you can do short of raising a support case. I assume your cluster validation reports are all okay? – George.Palacios, 4 hours ago
I had a different kind of problem with AGs before; we tried everything, but nothing solved it. We rebooted our servers one by one (first the secondary servers, then the primary node) and, unexpectedly, that resolved the issue on its own. – Shashank Tiwari, 3 hours ago