This is what I have found under normal circumstances:
- Everything is fine
- Then, you get the error
- You might get a couple of other errors, saying stuff doesn't work
- The agents on the affected management server all serve up a false heartbeat failure
- The agents fail over to a different management server
- The heartbeat failures resolve
- The affected management server says, "oh, I'm fine now, turns out nothing was really wrong with me"
- The agents change back to their primary management server
- Everything is now fine again...except you now have inaccurate availability data, a bunch of auto-resolved false alerts, unnecessary state changes, unnecessary e-mails, and probably an upset server team who received those e-mails.
- Then, it all happens over again...
A Microsoft article once described a registry change to work around (or fix) the issue, but the article has since disappeared. If you make this registry change on all of your management servers, it might just fix the issue. It fixed it for me, as well as for other people. As far as I know, this change is unsupported, but it is also easily reversible.
Here is the fix/workaround:
Some Notes First:
- As always, before making any changes, back up your registry and SCOM databases.
- Make this change on ONE Management Server at a time, and give the management server some time to recover after the restart - about 15 minutes should do.
- As far as I know, Microsoft does not support this, so use at your own risk.
The Actual Steps:
- Open your Registry Editor (Start > Run > regedit)
- Navigate to HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters
- Select the PoolManager Folder/Key. (If it does not exist, create it under Parameters)
- Create 2 new DWORD values under PoolManager:
  - PoolLeaseRequestPeriodSeconds, with a Value of 600 (Decimal)
  - PoolNetworkLatencySeconds, with a Value of 120 (Decimal)
- Restart the Management Server
- Give it Time to Recover
- Repeat on all Management Servers
- Let things calm down (probably an hour or more)
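If you would rather not type the values by hand on every management server, the registry part of the steps above can be sketched as a small Python script that prints the text of a .reg file. This is only a rough sketch, not anything from Microsoft: the function name build_reg_file and its remove option are my own invention for illustration, and the key path and the two values are simply the ones listed in the steps. In a .reg file, DWORD values are written in hexadecimal, so 600 becomes 00000258 and 120 becomes 00000078.

```python
# Sketch: build the text of a .reg file for the two PoolManager values.
# build_reg_file and its "remove" flag are hypothetical helpers for this post.

KEY_PATH = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services"
    r"\HealthService\Parameters\PoolManager"
)

# Decimal values taken from the steps above.
POOL_VALUES = {
    "PoolLeaseRequestPeriodSeconds": 600,
    "PoolNetworkLatencySeconds": 120,
}


def build_reg_file(values=POOL_VALUES, key_path=KEY_PATH, remove=False):
    """Return .reg file text that sets (or, with remove=True, deletes) the values."""
    lines = ["Windows Registry Editor Version 5.00", "", f"[{key_path}]"]
    for name, number in values.items():
        if remove:
            # A value of "-" tells the registry import to delete that value.
            lines.append(f'"{name}"=-')
        else:
            # .reg files store DWORDs as 8 hexadecimal digits.
            lines.append(f'"{name}"=dword:{number:08x}')
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    print(build_reg_file(), end="")
```

Save the output to a file (for example poolmanager.reg, a name I made up) and import it on one management server at a time, then restart that server as described above. Importing the file should also create the PoolManager key if it does not exist yet. Calling build_reg_file(remove=True) produces a file that deletes the two values again, which is one way to back the change out.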
Good luck!