Monday, December 31, 2012

All Management Servers Resource Pool Unavailable Hack/Fix

I have seen a lot of posts regarding the "All Management Servers Resource Pool Unavailable" error since SCOM 2012 RC, and I am still seeing numerous posts. I have also not seen anything in SP1 regarding fixing this error. The "All Management Servers Resource Pool Unavailable" error seems to cause trouble where no trouble really exists.

This is what I have found under normal circumstances:

  1. Everything is fine
  2. Then, you get the error
  3. You might get a couple of other errors, saying stuff doesn't work
  4. The agents on the affected management server all serve up a false heartbeat failure
  5. The agents fail to a different management server
  6. The heartbeat failures resolve
  7. The affected management server says, "oh, I'm fine now, turns out nothing was really wrong with me"
  8. The agents change back to their primary management server
  9. Everything is now fine again...except you now have inaccurate availability, a bunch of auto-resolved false alerts, unnecessary state changes, unnecessary e-mails, and probably an upset server team who received those e-mails.
  10. Then, it all happens over again...
A Microsoft article existed at one point, which provided a registry change to work around/fix the issue, but then the article disappeared. If you make this registry change on all of your management servers, it might just fix the issue. It fixed it for me, as well as for other people. As far as I know, this change is unsupported, but it is also easily reversible.

Here is the fix/workaround:

Some Notes First:
  1. As always, before making any changes, back up your registry and SCOM databases.
  2. Make this change on ONE Management Server at a time, and give the management server some time to recover after the restart - about 15 minutes should do.
  3. As far as I know, Microsoft does not support this, so use at your own risk.
The Actual Steps:
  1. Open your Registry Editor (Start > Run > regedit)
  2. Navigate to HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters
  3. Select the PoolManager Folder/Key. (If it does not exist, create it under Parameters)
  4. Create 2 new DWORD values under PoolManager:
    1. PoolLeaseRequestPeriodSeconds, with a Value of 600 (Decimal)
    2. PoolNetworkLatencySeconds, with a Value of 120 (Decimal)
  5. Restart the Management Server
  6. Give it Time to Recover
  7. Repeat on all Management Servers
  8. Let things calm down (probably an hour or more)
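
If you would rather not click through regedit on every management server, the two values above can also be written out as a .reg file. The sketch below is only an illustration: the helper name build_reg_file is made up for this example, it assumes the PoolManager key path and the values from the steps above, and, like the manual change, it is unsupported as far as I know.

```python
# Hypothetical helper (not from Microsoft): builds the text of a .reg file
# for the two PoolManager values described in the steps above.

# Key path taken from the steps; HKLM spelled out as regedit expects in .reg files.
KEY_PATH = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet"
    r"\services\HealthService\Parameters\PoolManager"
)

# Value names and decimal values taken from the steps.
POOL_SETTINGS = {
    "PoolLeaseRequestPeriodSeconds": 600,
    "PoolNetworkLatencySeconds": 120,
}


def build_reg_file(key_path, dwords):
    """Return .reg file text that sets each name to a DWORD value."""
    lines = ["Windows Registry Editor Version 5.00", "", f"[{key_path}]"]
    for name, value in dwords.items():
        # .reg files store DWORDs as 8 hexadecimal digits.
        lines.append(f'"{name}"=dword:{value:08x}')
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    print(build_reg_file(KEY_PATH, POOL_SETTINGS))
```

Save the printed text as a .reg file, import it on ONE management server at a time, and then restart and wait as in the steps above. Importing the file creates the PoolManager key if it does not exist.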
Good luck!

Friday, November 16, 2012

Monitoring Service Manager 2012 with Operations Manager 2012

Note that if you are using System Center 2012: Service Manager, you will not be able to install Operations Manager 2012 agents on the machines. You must use agentless management; this is fine because the Service Manager 2012 Management Pack accounts for agentless monitoring. However, you might want to think about where you place your Service Manager Self Service Portal. If it is placed on a shared SharePoint server, you will not be able to install the 2012 agent to monitor that SharePoint server; it will have to be agentless as well.

The Service Manager MP is fairly thorough and now monitors workflows.
You can download the Management Pack Documentation at the link below. 

Friday, January 27, 2012

Notifications not working in SCOM 2012

Notifications not working in SCOM 2012: "Notifications not working in SCOM 2012"

Operations Manager 2012 RC Console Memory Leak

As you may know by now, the Operations Manager 2012 console has a memory leak. Microsoft has acknowledged the issue and put it into the release notes for RC. It isn't that big of a deal. Just close the console and reopen it if it begins taking too much memory. They will probably fix it in RTM.

Having said that, this is the real warning I want to give. Many administrators will log into the management server and use the console for various reasons. Many times an administrator will disconnect his session, rather than logging out. If you do this, make sure you close the console...because I didn't. The console "ate" all the memory and brought the management server to its knees. I closed and reopened the console, it freed the memory, and all was good.

Tuesday, January 17, 2012

My First Look at Network Monitoring in a Real Environment

I recently deployed System Center 2012: Operations Manager in a development environment. I don't mean a small virtual lab. I mean a real environment with multiple devices that can be monitored via SNMP.

I will try to categorize a little here to make reading more structured.

Initial Discovery
Without reading any documentation, I went into the administration pane, looked around a little and noticed the Network Management category, which includes Discovery rules. First, I created a recursive discovery rule. I decided to use a router as a seed, to which a couple of switches were connected. I only filtered to the subnet I was discovering. I was disappointed when the initial discovery only discovered the router and its interfaces, but NONE of the attached devices. I went back into the discovery, and this time I used a switch as a seed. By using the switch as the seed, I was able to discover all connected devices, as expected.

SO, question 1: Why am I unable to use a router as a seed device for discovery? I don't know the answer to this, and would love to find out.

Besides the one question I have, device discovery was a breeze. It was very easy, the steps are intuitive, and the nodes/interfaces were discovered and related as expected.
In terms of speed, the discovery was quite slow for such a small number of devices on a Gigabit network. My Management Server was well underutilized. I currently don't see this as an issue, but we will see where it takes us later.

Network Device Layouts and Dashboard
Microsoft did a pretty good job providing some nice out-of-the-box views. The views include a couple of dashboards, network vicinity, performance, availability, etc. You can see some screen shots here.

Rules and Monitoring
Besides the normal up/down monitors, Microsoft provides a nice set of SNMP rules and monitors to monitor performance and availability. Many of the rules are off by default. If you want to view a list of the rules from the console, you can scope your rule view to the Node and Interface classes.

So it looks like Microsoft fell a little short on alerting. While alerting exists, there is no correlation. For example, ideally, if you have a router that goes down, and on that router you have a switch and ten devices connected to that switch, you would get ONE ALERT for the router itself, while all other alerts are suppressed. Unfortunately, this is not the case. You will get 12 ALERTS!!! One for the Router, One for the Switch, and ten for the Devices (assuming no backup link). Most of us can live with this, but I hope this is one of the first things that is changed.
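
To make the arithmetic concrete, here is a small sketch of what root-cause correlation would look like for the router/switch/ten-device example. This is not how SCOM 2012 behaves, and none of these names come from SCOM: the topology, build_topology, and root_cause_alerts are all hypothetical and only illustrate the idea of suppressing alerts for nodes whose upstream device is already down.

```python
# Hypothetical illustration of alert correlation; not SCOM code.

def build_topology():
    """Map each node to its upstream parent (None for the top-level router)."""
    parents = {"router": None, "switch": "router"}
    for i in range(1, 11):
        parents[f"device{i}"] = "switch"
    return parents


def root_cause_alerts(parents, down):
    """Keep an alert only for nodes whose upstream parent is not also down."""
    return sorted(node for node in down if parents.get(node) not in down)


if __name__ == "__main__":
    parents = build_topology()
    down = set(parents)  # router fails, so everything behind it is unreachable
    print("alerts without correlation:", len(down))
    print("alerts with correlation:", root_cause_alerts(parents, down))
```

With the whole branch unreachable, the uncorrelated count is 12, while the correlated list contains only the router. If just the switch and its devices were down, the single remaining alert would be for the switch.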

In a sense, this is version 2 of SCOM Network monitoring. However, I am going to call it version 1, because they didn't really try the first time around in SCOM 2007. Overall, they did a fine job. If you are looking to replace your current network monitoring solution, you really need to bring SCOM 2012 up in your environment and take a look first. If you don't have a network monitoring solution, SCOM 2012 will provide a great foundation on which to build and customize.