Always On Availability groups resolving state after failover - Remote harden of transaction 'GhostCleanupTask' failed

















My explanation is partly opinion based, since it relies on my own findings. If more information is needed, I am happy to add it.



The setup



We have 4 SQL Server 2016 instances, and all 4 underlying servers are in a single WSFC (a strange setup in my eyes, since the Always On AGs are paired up 2 by 2, but that is a different question).



[screenshot of the cluster setup]



4 nodes + Fileshare witness



Version: 13.0.5026.0 (all nodes)



The instances that are having the failover issues are instances 1 & 2. They have 3 AGs between them: AG1 is primary on instance 1 and is the one having the issue; AG2 and AG3 are primary on instance 2.
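For reference, which instance currently holds the primary role for each AG can be confirmed from the standard replica-state DMVs; a minimal sketch, not specific to this setup:

    -- Current role and health of every replica in every availability group.
    -- Some columns (e.g. operational_state_desc) are only populated for the local replica.
    SELECT ag.name            AS ag_name,
           ar.replica_server_name,
           ars.role_desc,
           ars.operational_state_desc,
           ars.connected_state_desc,
           ars.synchronization_health_desc
    FROM sys.dm_hadr_availability_replica_states AS ars
    JOIN sys.availability_replicas AS ar
        ON ar.replica_id = ars.replica_id
    JOIN sys.availability_groups AS ag
        ON ag.group_id = ars.group_id
    ORDER BY ag.name, ars.role_desc;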



Instances 3 and 4 have their own two separate AGs.



It does not seem like any successful failover has happened since the AG was recreated a couple of months ago, and the servers have Windows updates pending as well (we will be doing the reboots in a maintenance window).



AG1's settings



[screenshot of AG1's settings]



The week before



Last week, after checking for long-running queries (none) and the synchronization state (green across the board), we initiated a manual failover without data loss on AG1 from instance1 (primary) to instance2 (secondary).
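For reference, a planned failover without data loss is issued either through the SSMS wizard or with T-SQL along these lines (a minimal sketch, using the AG name from this post):

    -- Run on the secondary replica that should become the new primary (instance2).
    -- Only valid while the databases are SYNCHRONIZED on that replica; this is not
    -- a forced failover (no FORCE_FAILOVER_ALLOW_DATA_LOSS).
    ALTER AVAILABILITY GROUP [AG1] FAILOVER;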



The failover failed, leaving both the former primary (instance1) and the former secondary (instance2) in a temporary resolving state. Two databases on the former primary even showed Reverting / In Recovery for a few minutes. The following message was logged many times in the error log:



[screenshot of the recurring error log message]



The databases were not accessible during the outage, which took about 10 minutes in total.



Before failing over I checked the following (a rough sketch of these checks is shown after the list):




  • Currently running queries (no long-running queries)

  • DBCC OPENTRAN

  • Always On dashboard (green across the board)
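
A minimal sketch of the first two checks in T-SQL (the dashboard check itself is done in SSMS):

    -- 1. Any long-running requests on the primary?
    SELECT session_id, status, command, wait_type, total_elapsed_time
    FROM sys.dm_exec_requests
    WHERE session_id > 50
    ORDER BY total_elapsed_time DESC;

    -- 2. Oldest active transaction in the current database (run in each AG database).
    DBCC OPENTRAN;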


We found a post by Microsoft recommending a trace flag, 3923. We enabled it on both instances after the outage.
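Enabling a trace flag globally is done with DBCC TRACEON; a minimal sketch (the flag number comes from the Microsoft post; DBCC TRACEON does not survive a service restart, so a -T3923 startup parameter would be needed to persist it):

    -- Enable trace flag 3923 globally on both instances.
    DBCC TRACEON (3923, -1);

    -- Verify that it is active.
    DBCC TRACESTATUS (3923);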



Take #2, today



We tried the manual failover without data loss again to see if the issues were resolved.



Spoiler alert: they were not.






Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9d97 0000:000cad62) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (102:23518:2) failed.

Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9d97 0000:000cad62) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (102:23518:2) failed.

Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9da2 0000:078d645c) started at Feb 19 2019 11:17AM in database 'DB2' at LSN (5909:54307:137) failed.

Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141b9dee 0000:0017e40a) started at Feb 19 2019 11:17AM in database 'DB1' at LSN (164:21769:2) failed.

Remote harden of transaction 'GhostCleanupTask' (ID 0x00000000141ba1a8 0000:0024e912) started at Feb 19 2019 11:17AM in database 'DB3' at LSN (1246:7285:2) failed.


Edit



There are also remote harden errors for other transactions: FTEPersistIndex, IFTSAutoNested, and INSERT.



Remote harden of transaction 'FTEPersistIndex' (ID 0x00000000141b98b5 0000:078d6459) started at Feb 19 2019 11:16AM in database 'DB2' at LSN (5909:54266:104) failed.

Remote harden of transaction 'IFTSAutoNested' (ID 0x00000000141b98bf 0000:1d20f419) started at Feb 19 2019 11:16AM in database 'DB4' at LSN (93341:183352:3) failed.

Remote harden of transaction 'INSERT' (ID 0x00000000141b989c 0000:1d20f418) started at Feb 19 2019 11:16AM in database 'DB4' at LSN (93341:183346:17) failed.


We canceled the failover and it failed back to the former primary.



The actual question(s)




  1. The main question: am I right to assume that the failover fails because GhostCleanupTask and the other tasks cannot harden the transaction log on the remote replica, or is that a consequence of the failure rather than the cause?

  2. If it is the cause, what could be done to resolve this?




Something I am looking into, since the problem does not seem to be caused by any long-running queries rolling back: could the secondary databases be in an inconsistent state? They do show up as Synchronized, however.
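A minimal sketch of a query that can confirm the per-database state beyond what the dashboard shows (standard DMVs, nothing specific to this setup):

    -- Per-database synchronization state, queue sizes and suspend reason.
    SELECT ag.name                         AS ag_name,
           ar.replica_server_name,
           DB_NAME(drs.database_id)        AS database_name,
           drs.synchronization_state_desc,
           drs.synchronization_health_desc,
           drs.is_suspended,
           drs.suspend_reason_desc,
           drs.log_send_queue_size,
           drs.redo_queue_size
    FROM sys.dm_hadr_database_replica_states AS drs
    JOIN sys.availability_replicas AS ar
        ON ar.replica_id = drs.replica_id
    JOIN sys.availability_groups AS ag
        ON ag.group_id = drs.group_id
    ORDER BY ag.name, ar.replica_server_name, database_name;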



Extra info



The only interesting message in the cluster log from last week appeared to be this one:



 mscs::TopologyPersister::TryGetNetworkPrivateProperties: (2)' because of 'OpenSubKey failed.'


Failover cluster events




Cluster resource 'AG1' of type 'SQL Server Availability Group' in
clustered role 'AG1' failed.



Based on the failure policies for the resource and role, the cluster
service may try to bring the resource online on this node or move the
group to another node of the cluster and then restart it. Check the
resource and group state using Failover Cluster Manager or the
Get-ClusterResource Windows PowerShell cmdlet.



The Cluster service failed to bring clustered role 'AG1'
completely online or offline. One or more resources may be in a failed
state. This may impact the availability of the clustered role.




Event log secondary server



First message




Windows cannot load the extensible counter DLL
InterceptCountersManager. The first four bytes (DWORD) of the Data
section contains the Windows error code.




Second message




The configuration information of the performance library
"perf-MSSQLSERVER-sqlctr13.2.5026.0.dll" for the "MSSQLSERVER" service
does not match the trusted performance library information stored in
the registry. The functions in this library will not be treated as
trusted.




SQL Server Log



A big part of the SQL Server log from when the failover happened is here.



The state keeps switching between synchronized and not synchronized during the resolving period:




DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint:
1B55DB6F-066D-442B-B97E-E32D6990756A:1



DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint:
1B55DB6F-066D-442B-B97E-E32D6990756A:4



DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint:
1B55DB6F-066D-442B-B97E-E32D6990756A:4



DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint:
1B55DB6F-066D-442B-B97E-E32D6990756A:1



DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint:
1B55DB6F-066D-442B-B97E-E32D6990756A:1



DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint:
1B55DB6F-066D-442B-B97E-E32D6990756A:4



DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint:
1B55DB6F-066D-442B-B97E-E32D6990756A:1
...











sql-server sql-server-2016 availability-groups failover

asked 5 hours ago by Randi Vertongen, edited 31 mins ago


Comments

  • Those two look unrelated to me - I get those regularly on our secondary that fails over fine. Is that the only log in the cluster? – George.Palacios, 4 hours ago

  • If you can repro this, I would recommend opening a support incident with Microsoft. – Tony Hinkle, 4 hours ago

  • Is this the Bug which resurfaced – Shanky, 4 hours ago

  • I'd tend to agree with @TonyHinkle in that case - if there are no error messages there's not a lot you can do short of raising a support case. I assume your cluster validation reports are all okay? – George.Palacios, 4 hours ago

  • I had a different kind of problem with AGs before; we tried everything but it did not solve the issue. We rebooted our servers one by one (first the secondary servers, then the primary node) and, unexpectedly, that resolved the issue on its own. – Shashank Tiwari, 3 hours ago