Why is my EM12c giving a Metric evaluation error for Exadata cell targets?

As part of my Cloud Control journey I ran into a strange problem: a few Exadata Storage Server (cell) targets reported the following error:

Metric evaluation error start - oracle.sysman.emSDK.agent.fetchlet.exception.FetchletException: em_error=Failed to execute_exadata_response.pl ssh -q -o ConnectTimeout=60 -o BatchMode=yes -o StrictHostKeyChecking=no -o PreferredAuthentications=publickey -i /home/oracle/.ssh/id_dsa -l cellmonitor cellcli -xml -e ' list cell attributes msStatus ':

Another symptom was that I received two emails from OEM, one saying that the cell and its services were up:

EM Event: Clear:exacel05.localhost.localdomain - exacel05.localhost.localdomain is Up. MS Status is RUNNING and Ping Status is SUCCESS.

and another one reporting a Metric evaluation error for the same target:

EM Event: Critical:exacel05.localhost.localdomain - Metric evaluation error start - oracle.sysman.emSDK.agent.fetchlet.exception.FetchletException: em_error=Failed to execute_exadata_response.pl ssh -q -o ConnectTimeout=60 -o BatchMode=yes -o StrictHostKeyChecking=no -o PreferredAuthentications=publickey -i /home/oracle/.ssh/id_dsa -l cellmonitor cellcli -xml -e ' list cell attributes msStatus ':

I have to say that the error didn't come up by itself; it appeared after I had to redeploy the Exadata plugin on a few agents. If you have ever had to do this, you know that before removing the plugin from an agent you need to make sure the agent is not the primary monitoring agent for any Exadata targets. In my case a few of the agents were primary Monitoring Agents for the cells, so I had to swap them with the Backup Monitoring Agents before I could redeploy the plugin on them.

After I redeployed the plugin, I tried to restore the initial configuration, but for some reason it got mixed up and I ended up with different agents monitoring the cell targets than at the beginning.

It turned out that one of the monitoring agents wasn't able to connect to the cell, and that's why I got the email notifications and the Metric evaluation errors for the cells. Although this isn't a serious problem, it's quite annoying to receive such alerts and to see these targets in OEM with Metric collection error icons or reported with status Down.

Let's first check in the OEM repository which agents are monitoring that cell target:

SQL> select target_name, target_type, agent_name, agent_type, agent_is_master
from MGMT$AGENTS_MONITORING_TARGETS
where target_name = 'exacel05.localhost.localdomain';

TARGET_NAME                      TARGET_TYPE     AGENT_NAME                         AGENT_TYPE AGENT_IS_MASTER
-------------------------------- --------------- ---------------------------------- ---------- ---------------
exacel05.localhost.localdomain   oracle_exadata  exadb03.localhost.localdomain:3872 oracle_emd               0
exacel05.localhost.localdomain   oracle_exadata  exadb02.localhost.localdomain:3872 oracle_emd               1

Looking at the cell's secure log, we can see that one of the monitoring agents wasn't able to connect because publickey authentication failed:

Oct 23 11:39:54 exacel05 sshd[465]: Connection from port 14594
Oct 23 11:39:54 exacel05 sshd[465]: Failed publickey for cellmonitor from port 14594 ssh2
Oct 23 11:39:54 exacel05 sshd[466]: Connection closed by
Oct 23 11:39:55 exacel05 sshd[467]: Connection from port 27799
Oct 23 11:39:55 exacel05 sshd[467]: Found matching DSA key: cf:99:0a:37:1a:e5:84:dc:a8:8a:b9:6f:0c:fd:05:c5
Oct 23 11:39:55 exacel05 sshd[468]: Postponed publickey for cellmonitor from port 27799 ssh2
Oct 23 11:39:55 exacel05 sshd[467]: Found matching DSA key: cf:99:0a:37:1a:e5:84:dc:a8:8a:b9:6f:0c:fd:05:c5
Oct 23 11:39:55 exacel05 sshd[467]: Accepted publickey for cellmonitor from port 27799 ssh2
Oct 23 11:39:55 exacel05 sshd[467]: pam_unix(sshd:session): session opened for user cellmonitor by (uid=0)
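On a busy cell the secure log is long, so it can help to filter it down to the failed attempts only. The following is a minimal sketch, not something taken from the plugin itself: failed_publickey is a hypothetical helper name, and the sample lines are copied from the excerpt above. It reads log text on standard input, so it can be fed from whatever file your cell writes sshd messages to.

```shell
# Hypothetical helper: print only the sshd "Failed publickey" lines
# for the given user. Reads log text from standard input.
failed_publickey() {
    user="$1"
    grep -E "sshd\[[0-9]+\]: Failed publickey for ${user} "
}

# Sample lines copied from the secure log excerpt above
sample_log='Oct 23 11:39:54 exacel05 sshd[465]: Failed publickey for cellmonitor from port 14594 ssh2
Oct 23 11:39:55 exacel05 sshd[467]: Accepted publickey for cellmonitor from port 27799 ssh2'

# Prints only the first sample line (the failed attempt)
printf '%s\n' "$sample_log" | failed_publickey cellmonitor
```

On the cell itself the same function would be fed from the real log, for example by redirecting the secure log file into it.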

This is confirmed by the cellmonitor user's authorized_keys file, which also shows which monitoring agents were configured initially:

[root@exacel05 .ssh]# grep exadb /home/cellmonitor/.ssh/authorized_keys | cut -d = -f 2

Another way to check which monitoring agents were configured initially is to look at the snmpSubscriber attribute of that cell:

[root@exacel05 ~]# cellcli -e list cell attributes snmpSubscriber

Clearly exadb02 shouldn't be monitoring this target; it should be exadb04 instead. My guess is that when I redeployed the Exadata plugin, the exadb04 agent was no longer eligible to monitor Exadata targets and was replaced with another one.

There are two solutions to this problem:

  1. Move (relocate) target definition and monitoring to the correct agent:

I wasn't able to find a way to do that through the OEM Console, so I used emcli. Based on the MGMT$AGENTS_MONITORING_TARGETS query and the snmpSubscriber attribute I was able to work out which agent was configured initially and which one had to be removed. Then I used emcli to relocate monitoring of that target to the correct agent, the one that was configured initially:

[oracle@oem ~]$ emcli relocate_targets -src_agent=exadb02.localhost.localdomain:3872 -dest_agent=exadb04.localhost.localdomain:3872 -target_name=exacel05.localhost.localdomain -target_type=oracle_exadata -copy_from_src
Moved all targets from exadb02.localhost.localdomain:3872 to exadb04.localhost.localdomain:3872

  2. Reconfigure the cell to use the new monitoring agent:

Add the SSH public key of the current monitoring agent to the cell's authorized_keys file:

Place the oracle user's DSA public key (/home/oracle/.ssh/id_dsa.pub) from exadb02 into exacel05:/home/cellmonitor/.ssh/authorized_keys
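If you script this step, it is worth making the append idempotent so that repeated runs don't pile up duplicate keys. Below is a minimal sketch under a few assumptions: add_key_if_missing is a hypothetical helper name, the public key file contains a single key on one line, and the authorized_keys file already exists and ends with a newline. The demo runs against throwaway files with a made-up placeholder key, not a real one.

```shell
# Hypothetical helper (not part of the Exadata plugin): append the key in
# pubkey_file to auth_file unless that exact line is already present.
# Assumes pubkey_file holds a single key on one line.
add_key_if_missing() {
    pubkey_file="$1"
    auth_file="$2"
    [ -f "$pubkey_file" ] && [ -f "$auth_file" ] || return 1
    key=$(cat "$pubkey_file")
    # -F fixed string, -x whole-line match, -q quiet
    if ! grep -qxF -- "$key" "$auth_file"; then
        printf '%s\n' "$key" >> "$auth_file"
    fi
}

# Demo against throwaway files; the key text is a made-up placeholder
demo_dir=$(mktemp -d)
printf 'ssh-dss AAAAB3placeholder oracle@exadb02\n' > "$demo_dir/id_dsa.pub"
: > "$demo_dir/authorized_keys"
add_key_if_missing "$demo_dir/id_dsa.pub" "$demo_dir/authorized_keys"
add_key_if_missing "$demo_dir/id_dsa.pub" "$demo_dir/authorized_keys"   # second call changes nothing
cat "$demo_dir/authorized_keys"
rm -rf "$demo_dir"
```

On the cell, the first argument would be a copy of exadb02's id_dsa.pub and the second /home/cellmonitor/.ssh/authorized_keys. Because the function only appends to an existing file, the file's ownership and permissions stay as they were.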

and also change the cell snmpSubscriber attribute:

[root@exacel05~]# cellcli -e "alter cell snmpSubscriber=((host='exadb03.localhost.localdomain',port=3872,community=public),(host='exadb02.localhost.localdomain',port=3872,community=public))"
Cell exacel05 successfully altered
[root@exacel05~]# cellcli -e list cell attributes snmpSubscriber
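To compare the subscribers on the cell with the agents listed in the repository, the host names can be pulled out of the snmpSubscriber value. This is only a sketch: subscriber_hosts is a hypothetical helper name, and I am assuming the attribute is printed in the same (host=...,port=...,community=...) form that the alter cell command above takes, with or without quotes around the host.

```shell
# Hypothetical helper: extract the host names from an snmpSubscriber
# value read on standard input, one host per line.
subscriber_hosts() {
    grep -oE "host='?[^',)]+" | sed -E "s/^host='?//"
}

# Sample value copied from the alter cell command above
sample="((host='exadb03.localhost.localdomain',port=3872,community=public),(host='exadb02.localhost.localdomain',port=3872,community=public))"

printf '%s\n' "$sample" | subscriber_hosts
# prints:
# exadb03.localhost.localdomain
# exadb02.localhost.localdomain
```

The resulting list should match the host part of the AGENT_NAME column returned by the repository query for the same cell.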

After these changes, the status of the Exadata Storage Server (cell) target in OEM changed to Up and the metrics were collected without errors.