EMDIAG Repvfy 12c kit - troubleshooting part 1
This blog post continues the EMDIAG repvfy kit series and focuses on how to troubleshoot and solve the problems reported by the kit.
The repository verification kit reported a number of problems with our repository, which we are about to troubleshoot and solve one by one. It’s important to note that some of the problems are related, so solving one problem could also solve another.
Here is the output I got for my OEM repository:
-- --------------------------------------------------------------------- --
-- REPVFY: 2014.0114 Repository: 12.1.0.3.0 23-Jan-2014 13:35:41 --
-- --------------------------------------------------------------------- --
-- Module: Test: 0, Level: 2 --
-- --------------------------------------------------------------------- --
verifyAGENTS
1004. Stuck PING jobs: 10
verifyASLM
verifyAVAILABILITY
1002. Disabled response metrics (16570376): 2
verifyBLACKOUTS
verifyCAT
verifyCORE
verifyECM
1002. Unregistered ECM metadata tables: 2
verifyEMDIAG
1001. Undefined verification modules: 1
verifyEVENTS
verifyEXADATA
2001. Exadata plugin version mismatches: 5
verifyJOBS
2001. System jobs running for more than 24hr: 1
verifyJVMD
verifyLOADERS
verifyMETRICS
verifyNOTIFICATIONS
verifyOMS
1002. Stuck PING jobs: 10
verifyPLUGINS
1003. Plugin metadata versions out of sync: 13
verifyREPOSITORY
verifyTARGETS
1021. Composite metric calculation with inconsistent dependant metadata versions: 3
2004. Targets without an ORACLE_HOME association: 2
2007. Targets with unpromoted ORACLE_HOME target: 2
verifyUSERS
I usually follow this sequence of actions when troubleshooting repository problems (see the sketch after this list):
- Verify the module with the -detail option. Increasing the level might also reveal more problems, or problems related to the current one.
- Dump the module and check for any unusual activity.
- Check the repository database alert log for any errors.
- Check the agent (emagent) logs for any errors.
- Check the OMS logs for any errors.
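As an illustration, here is what the first two steps could look like for the JOBS module. This is only a sketch: the -detail flag and the level concept appear later in this post and in the repvfy header output, but the exact level value here is arbitrary:
# verify the module with details; a higher level may report additional, related problems
./repvfy verify jobs -detail
./repvfy verify jobs -detail -level 9
# dump the module and scan the output for unusual activity
./repvfy dump jobs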
Troubleshooting Stuck PING jobs
Looking at the first problem reported for verifyAGENTS (Stuck PING jobs), we can easily spot the relation between the verifyAGENTS, verifyJOBS and verifyOMS modules, where the same problem occurs. For some reason there are ten ping jobs that are stuck and have been running for more than 24 hours.
The best approach is to run verify against any of these modules with the -detail option. This shows more information and eventually helps analyze the problem. Running the detailed report for AGENTS and OMS didn’t help and didn’t show much information related to the stuck ping jobs. However, by running the detailed report for JOBS we were able to identify the job_id, the job_name and when the job was started:
[oracle@oem bin]$ ./repvfy verify jobs -detail
JOB_ID EXECUTION_ID JOB_NAME START_TIME
-------------------------------- -------------------------------- ---------------------------------------- --------------------
ECA6DE1A67B43914E0432084800AB548 ECA6DE1A67B63914E0432084800AB548 PINGCFMJOB_ECA6DE1A67B33914E0432084800AB 03-DEC-2013 19:02:29
So we can see that the stuck job was started at 19:02 on the 3rd of December, while the check was run on the 23rd of January.
Now we can say that the problem lies with the jobs rather than with the agents or the OMS; the problems in those two modules appeared as a result of the stuck job, so we should focus on the JOBS module.
Running analyze against the job shows the same thing as verify with the -detail option; its usage is appropriate when there are multiple job issues and we want to see the details for a particular one.
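For example, against the stuck job we just found (assuming analyze accepts the same -guid argument as the dump commands below):
[oracle@oem bin]$ ./repvfy analyze job -guid ECA6DE1A67B43914E0432084800AB548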
Dumping the job shows a lot of useful information from the MGMT_ tables; of particular interest are the details of the execution:
[oracle@oem bin]$ ./repvfy dump job -guid ECA6DE1A67B43914E0432084800AB548
[----- MGMT_JOB_EXEC_SUMMARY ------------------------------------------------]
EXECUTION_ID STATUS TLI QUEUE_ID TIMEZONE_REGION SCHEDULED_TIME EXPECTED_START_TIME START_TIME END_TIME RETRIED
-------------------------------- ------------------------- ---------- -------------------------------- ------------------------------ -------------------- -------------------- -------------------- -------------------- ----------
ECA6DE1A67B63914E0432084800AB548 02-Running 1 +00:00 03-DEC-2013 18:59:25 03-DEC-2013 18:59:25 03-DEC-2013 19:02:29 0
Again we can confirm that the job is still running. The next step would be to dump the execution, which shows on which step the job is waiting or hanging. The commands below are just an example, because in my case the job execution didn’t have any steps:
[oracle@oem bin]$ ./repvfy dump execution -guid ECA6DE1A67B43914E0432084800AB548
[oracle@oem bin]$ ./repvfy dump step -id 739148
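Long-running executions can also be spotted by querying the repository directly. A minimal sketch, assuming the numeric status code 2 corresponds to the 02-Running value decoded in the dump above:
SQL> SELECT EXECUTION_ID, STATUS, START_TIME FROM MGMT_JOB_EXEC_SUMMARY WHERE STATUS = 2 AND START_TIME < SYSDATE - 1;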
Checking the job system health can also be useful, as it shows job history, scheduled jobs and some performance metrics:
[oracle@oem bin]$ ./repvfy dump job_health
Back to our problem: we can query MGMT_JOB to get the job name and confirm that it’s a system job run by SYSMAN:
SQL> SELECT JOB_ID, JOB_NAME, JOB_OWNER, JOB_DESCRIPTION, JOB_TYPE, SYSTEM_JOB, JOB_STATUS FROM MGMT_JOB WHERE UPPER(JOB_NAME) LIKE '%PINGCFM%';
JOB_ID JOB_NAME JOB_OWNER JOB_DESCRIPTION JOB_TYPE SYSTEM_JOB JOB_STATUS
-------------------------------- ------------------------------------------------------------ ---------- ------------------------------------------------------------ --------------- ---------- ----------
ECA6DE1A67B43914E0432084800AB548 PINGCFMJOB_ECA6DE1A67B33914E0432084800AB548 SYSMAN This is a Confirm EMD Down test job ConfirmEMDDown 2 0
We can try to stop the job using emcli and the job name:
[oracle@oem bin]$ emcli stop_job -name=PINGCFMJOB_ECA6DE1A67B33914E0432084800AB
Error: The job/execution is invalid (or non-existent)
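Before giving up on emcli, it can be worth double-checking the exact job name stored in the repository, since report output may truncate long names. Assuming your emcli version supports the -name filter of get_jobs (the wildcard here is an assumption):
[oracle@oem bin]$ emcli get_jobs -name='PINGCFMJOB%'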
If that doesn’t work, then use the emdiag kit to clean up the repository side:
./repvfy verify jobs -test 1998 -fix
Please enter the SYSMAN password:
-- --------------------------------------------------------------------- --
-- REPVFY: 2014.0114 Repository: 12.1.0.3.0 27-Jan-2014 18:18:36 --
-- --------------------------------------------------------------------- --
-- Module: JOBS Test: 1998, Level: 2 --
-- --------------------------------------------------------------------- --
-- -- -- - Running in FIX mode: Data updated for all fixed tests - -- -- --
-- --------------------------------------------------------------------- --
The repository is now OK, but this will not remove the stuck thread at the OMS level. In order for the OMS to get healthy again, it needs to be restarted:
cd $OMS_HOME/bin
emctl stop oms
emctl start oms
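To confirm the OMS came back healthy, a quick status check helps:
emctl status oms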
After the OMS was restarted, there were no stuck jobs anymore!
I still wanted to know why this happened. Although there were a few bugs on MOS, they were not very applicable and I didn’t find any of their symptoms in my case. After checking the repository database alert log, I found a few disturbing messages:
Tns error struct:
Time: 03-DEC-2013 19:04:01
TNS-12637: Packet receive failed
ns secondary err code: 12532
.....
opiodr aborting process unknown ospid (15301) as a result of ORA-609
opiodr aborting process unknown ospid (15303) as a result of ORA-609
opiodr aborting process unknown ospid (15299) as a result of ORA-609
2013-12-03 19:07:58.156000 +00:00
I also found a lot of similar messages on the target databases:
Time: 03-DEC-2013 19:05:08
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505
That pretty much matches the time when the job got stuck - 19:02:29. So I assume there was a network glitch at that time which caused the ping job to get stuck. The solution was simply to run repvfy with the fix option and then restart the OMS service.
In case the job gets stuck again after the restart, consider increasing the OMS property oracle.sysman.core.conn.maxConnForJobWorkers. Refer to the following note if that’s the case:
Jobs Are Hanging In Running State (Doc ID 1595628.1)
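For illustration, the property can be changed with emctl set property; the value below is purely an example, so size it according to the note above:
cd $OMS_HOME/bin
emctl set property -name oracle.sysman.core.conn.maxConnForJobWorkers -value 32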
EMDIAG repvfy blog series:
EMDIAG Repvfy 12c kit – installation
EMDIAG Repvfy 12c kit – basics
EMDIAG Repvfy 12c kit – troubleshooting part 1
EMDIAG Repvfy 12c kit – troubleshooting part 2