EMDIAG Repvfy 12c kit - troubleshooting part 1
This blog post continues the EMDIAG repvfy kit series and focuses on how to troubleshoot and solve the problems reported by the kit.
The repository verification kit reported a number of problems with our repository, which we are about to troubleshoot and solve one by one. It’s important to note that some of the problems are related, so solving one problem could also solve another.
Here is the output I got for my OEM repository:
-- --------------------------------------------------------------------- --
-- REPVFY: 2014.0114 Repository: 12.1.0.3.0 23-Jan-2014 13:35:41 --
-- --------------------------------------------------------------------- --
-- Module: Test: 0, Level: 2 --
-- --------------------------------------------------------------------- --
verifyAGENTS
1004. Stuck PING jobs: 10
verifyASLM
verifyAVAILABILITY
1002. Disabled response metrics (16570376): 2
verifyBLACKOUTS
verifyCAT
verifyCORE
verifyECM
1002. Unregistered ECM metadata tables: 2
verifyEMDIAG
1001. Undefined verification modules: 1
verifyEVENTS
verifyEXADATA
2001. Exadata plugin version mismatches: 5
verifyJOBS
2001. System jobs running for more than 24hr: 1
verifyJVMD
verifyLOADERS
verifyMETRICS
verifyNOTIFICATIONS
verifyOMS
1002. Stuck PING jobs: 10
verifyPLUGINS
1003. Plugin metadata versions out of sync: 13
verifyREPOSITORY
verifyTARGETS
1021. Composite metric calculation with inconsistent dependant metadata versions: 3
2004. Targets without an ORACLE_HOME association: 2
2007. Targets with unpromoted ORACLE_HOME target: 2
verifyUSERS
I usually follow this sequence of actions when troubleshooting repository problems (see the sketch after this list):
- Verify the module with the -detail option. Increasing the level might also reveal more problems, or problems related to the current one.
- Dump the module and check for any unusual activity.
- Check the repository database alert log for any errors.
- Check the agent (emagent) logs for any errors.
- Check the OMS logs for any errors.
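As an illustration, here is what the first two steps could look like for the JOBS module. This is only a sketch: the -detail flag and the level concept appear later in this post and in the repvfy header output, but the exact level value here is arbitrary:
# verify the module with details; a higher level may report additional, related problems
./repvfy verify jobs -detail
./repvfy verify jobs -detail -level 9
# dump the module and scan the output for unusual activity
./repvfy dump jobs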
Troubleshooting Stuck PING jobs
Looking at the first problem reported for verifyAGENTS (Stuck PING jobs), we can easily spot the relation between the verifyAGENTS, verifyJOBS and verifyOMS modules, where the same problem occurs. For some reason there are ten ping jobs that are stuck and have been running for more than 24 hours.
The best approach is to run verify against any of these modules with the -detail option. This shows more information and eventually helps analyze the problem. Running the detailed report for AGENTS and OMS didn’t help and didn’t show much information related to the stuck ping jobs. However, by running the detailed report for JOBS we were able to identify the job_id, the job_name and when the job was started:
[oracle@oem bin]$ ./repvfy verify jobs -detail
JOB_ID EXECUTION_ID JOB_NAME START_TIME
-------------------------------- -------------------------------- ---------------------------------------- --------------------
ECA6DE1A67B43914E0432084800AB548 ECA6DE1A67B63914E0432084800AB548 PINGCFMJOB_ECA6DE1A67B33914E0432084800AB 03-DEC-2013 19:02:29
So we can see that the stuck job was started at 19:02 on the 3rd of December, while the check was run on the 23rd of January.
Now we can say that the problem lies with the jobs rather than with the agents or the OMS; the problems in those two modules appeared as a result of the stuck job, so we should focus on the JOBS module.
Running analyze against the job shows the same thing as verify with the -detail option; its usage is appropriate when there are multiple job issues and we want to see the details for a particular one.
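For example, against the stuck job we just found (assuming analyze accepts the same -guid argument as the dump commands below):
[oracle@oem bin]$ ./repvfy analyze job -guid ECA6DE1A67B43914E0432084800AB548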
Dumping the job shows a lot of useful information from the MGMT_ tables; of particular interest are the details of the execution:
[oracle@oem bin]$ ./repvfy dump job -guid ECA6DE1A67B43914E0432084800AB548
[----- MGMT_JOB_EXEC_SUMMARY ------------------------------------------------]
EXECUTION_ID STATUS TLI QUEUE_ID TIMEZONE_REGION SCHEDULED_TIME EXPECTED_START_TIME START_TIME END_TIME RETRIED
-------------------------------- ------------------------- ---------- -------------------------------- ------------------------------ -------------------- -------------------- -------------------- -------------------- ----------
ECA6DE1A67B63914E0432084800AB548 02-Running 1 +00:00 03-DEC-2013 18:59:25 03-DEC-2013 18:59:25 03-DEC-2013 19:02:29 0
Again we can confirm that the job is still running. The next step would be to dump the execution, which shows on which step the job is waiting or hanging. The commands below are just an example, because in my case the job execution didn’t have any steps:
[oracle@oem bin]$ ./repvfy dump execution -guid ECA6DE1A67B43914E0432084800AB548
[oracle@oem bin]$ ./repvfy dump step -id 739148
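Long-running executions can also be spotted by querying the repository directly. A minimal sketch, assuming the numeric status code 2 corresponds to the 02-Running value decoded in the dump above:
SQL> SELECT EXECUTION_ID, STATUS, START_TIME FROM MGMT_JOB_EXEC_SUMMARY WHERE STATUS = 2 AND START_TIME < SYSDATE - 1;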
Checking the job system health can also be useful, as it shows job history, scheduled jobs and some performance metrics:
[oracle@oem bin]$ ./repvfy dump job_health
Back to our problem: we can query MGMT_JOB to get the job name and confirm that it’s a system job run by SYSMAN:
SQL> SELECT JOB_ID, JOB_NAME, JOB_OWNER, JOB_DESCRIPTION, JOB_TYPE, SYSTEM_JOB, JOB_STATUS FROM MGMT_JOB WHERE UPPER(JOB_NAME) LIKE '%PINGCFM%';
JOB_ID JOB_NAME JOB_OWNER JOB_DESCRIPTION JOB_TYPE SYSTEM_JOB JOB_STATUS
-------------------------------- ------------------------------------------------------------ ---------- ------------------------------------------------------------ --------------- ---------- ----------
ECA6DE1A67B43914E0432084800AB548 PINGCFMJOB_ECA6DE1A67B33914E0432084800AB548 SYSMAN This is a Confirm EMD Down test job ConfirmEMDDown 2 0
We can try to stop the job using emcli and the job name:
[oracle@oem bin]$ emcli stop_job -name=PINGCFMJOB_ECA6DE1A67B33914E0432084800AB
Error: The job/execution is invalid (or non-existent)
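Before giving up on emcli, it can be worth double-checking the exact job name stored in the repository, since report output may truncate long names. Assuming your emcli version supports the -name filter of get_jobs (the wildcard here is an assumption):
[oracle@oem bin]$ emcli get_jobs -name='PINGCFMJOB%'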
If that doesn’t work, then use the emdiag kit to clean up the repository side:
./repvfy verify jobs -test 1998 -fix
Please enter the SYSMAN password:
-- --------------------------------------------------------------------- --
-- REPVFY: 2014.0114 Repository: 12.1.0.3.0 27-Jan-2014 18:18:36 --
-- --------------------------------------------------------------------- --
-- Module: JOBS Test: 1998, Level: 2 --
-- --------------------------------------------------------------------- --
-- -- -- - Running in FIX mode: Data updated for all fixed tests - -- -- --
-- --------------------------------------------------------------------- --
The repository is now OK, but this will not remove the stuck thread at the OMS level. In order for the OMS to get healthy again, it needs to be restarted:
cd $OMS_HOME/bin
emctl stop oms
emctl start oms
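To confirm the OMS came back healthy, a quick status check helps:
emctl status oms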
After the OMS was restarted, there were no stuck jobs anymore!
I still wanted to know why this happened. Although there were a few bugs on MOS, they were not very applicable and I didn’t find any of their symptoms in my case. After checking the repository database alert log, I found a few disturbing messages:
Tns error struct:
Time: 03-DEC-2013 19:04:01
TNS-12637: Packet receive failed
ns secondary err code: 12532
.....
opiodr aborting process unknown ospid (15301) as a result of ORA-609
opiodr aborting process unknown ospid (15303) as a result of ORA-609
opiodr aborting process unknown ospid (15299) as a result of ORA-609
2013-12-03 19:07:58.156000 +00:00
I also found a lot of similar messages on the target databases:
Time: 03-DEC-2013 19:05:08
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505
That pretty much matches the time when the job got stuck - 19:02:29. So I assume there was a network glitch at that time which caused the ping job to get stuck. The solution was simply to run repvfy with the fix option and then restart the OMS service.
In case the job gets stuck again after the restart, consider increasing the OMS property oracle.sysman.core.conn.maxConnForJobWorkers. Refer to the following note if that’s the case:
Jobs Are Hanging In Running State (Doc ID 1595628.1)
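For illustration, the property can be changed with emctl set property; the value below is purely an example, so size it according to the note above:
cd $OMS_HOME/bin
emctl set property -name oracle.sysman.core.conn.maxConnForJobWorkers -value 32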
EMDIAG repvfy blog series:
EMDIAG Repvfy 12c kit – installation
EMDIAG Repvfy 12c kit – basics
EMDIAG Repvfy 12c kit – troubleshooting part 1
EMDIAG Repvfy 12c kit – troubleshooting part 2