Speaking at Oracle Open World 2014

September 25th, 2014

I’m more than happy that I will be speaking at this year’s Oracle Open World. The first and only time I attended was back in 2010 and now I’m not only attending but speaking as well!

Jason Arneil and I will talk about what we’ve learned from our Exadata implementations with two of the biggest UK retailers, so please join us:
Session ID: CON2224
Session Title: Oracle Exadata Migrations: Lessons Learned from Retail
Venue / Room: Moscone South – 310
Date and Time: 9/30/14, 15:45 – 16:30

I would like to thank E-DBA and especially Jason for making this happen!

I’m also planning to attend Oaktable World 2014 and the Oracle OpenWorld 2014 Bloggers Meetup for the best part of OOW – really technical sessions and networking!

See you there!

Sve



Speaking at BGOUG 2014 Spring conference

May 30th, 2014

I’ll be speaking at the spring conference of the BGOUG, held between the 13th and 15th of June. I was a regular attendee of the conference for eight years in a row, but since I moved to the UK I had to skip the last two. My session is about Oracle GoldenGate – it will cover the basics, components, usage scenarios, installation and configuration, trail files, GG records and more.

See you there in two weeks.

 


EMDIAG Repvfy 12c kit – troubleshooting part 1

March 26th, 2014

The following blog post continues the EMDIAG repvfy kit series and focuses on how to troubleshoot and solve the problems reported by the kit.

The repository verification kit reported a number of problems with our repository, which we are about to troubleshoot and solve one by one. It’s important to note that some of the problems are related, so solving one problem could also solve another.

Here is the output I’ve got for my OEM repository:

-- --------------------------------------------------------------------- --
-- REPVFY: 2014.0114     Repository: 12.1.0.3.0     23-Jan-2014 13:35:41 --
---------------------------------------------------------------------------
-- Module:                                          Test:   0, Level: 2 --
-- --------------------------------------------------------------------- --

verifyAGENTS

1004. Stuck PING jobs: 10
verifyASLM
verifyAVAILABILITY
1002. Disabled response metrics (16570376): 2
verifyBLACKOUTS
verifyCAT
verifyCORE
verifyECM
1002. Unregistered ECM metadata tables: 2
verifyEMDIAG
1001. Undefined verification modules: 1
verifyEVENTS
verifyEXADATA
2001. Exadata plugin version mismatches: 5
verifyJOBS
2001. System jobs running for more than 24hr: 1
verifyJVMD
verifyLOADERS
verifyMETRICS
verifyNOTIFICATIONS
verifyOMS
1002. Stuck PING jobs: 10
verifyPLUGINS
1003. Plugin metadata versions out of sync: 13
verifyREPOSITORY
verifyTARGETS
1021. Composite metric calculation with inconsistent dependant metadata versions: 3
2004. Targets without an ORACLE_HOME association: 2
2007. Targets with unpromoted ORACLE_HOME target: 2
verifyUSERS

I usually follow this sequence of actions when troubleshooting repository problems:
1. Verify the module with the -detail option (see the example after this list). Increasing the level might also reveal more problems, or problems related to the current one.
2. Dump the module and check for any unusual activity.
3. Check the repository database alert log for any errors.
4. Check the emagent logs for any errors.
5. Check the OMS logs for any errors.
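As a minimal illustration of step 1, this is how a single module could be verified in more detail (the module and level here are arbitrary – any module listed by repvfy -h4 works the same way):

[oracle@oem bin]$ ./repvfy verify agents -detail -level 9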

Troubleshooting stuck PING jobs

Looking at the first problem reported for verifyAGENTS – stuck PING jobs – we can easily spot the relation between the verifyAGENTS, verifyJOBS and verifyOMS modules, where the same problem occurs. For some reason there are ten PING jobs which are stuck and have been running for more than 24 hours.

The best approach is to run verify against any of these modules with the -detail option. This shows more information and eventually helps analyze the problem. Running the detailed report for AGENTS and OMS didn’t help and didn’t show much information related to the stuck PING jobs. However, running the detailed report for JOBS identified the job_id, the job_name and when the job was started:

[oracle@oem bin]$ ./repvfy verify jobs -detail

JOB_ID                           EXECUTION_ID                     JOB_NAME                                 START_TIME
-------------------------------- -------------------------------- ---------------------------------------- --------------------
ECA6DE1A67B43914E0432084800AB548 ECA6DE1A67B63914E0432084800AB548 PINGCFMJOB_ECA6DE1A67B33914E0432084800AB 03-DEC-2013 19:02:29

So we can see that the stuck job was started at 19:02 on the 3rd of December, while the time of the check was the 23rd of January.

Now we can say that there is a problem with the jobs rather than with the agents or the OMS; the problems in those two modules appeared as a result of the stuck job, so we should focus on the JOBS module.

Running analyze against the job shows the same thing as verify with the -detail option; its usage is appropriate when there are multiple job issues and you want to see the details for a particular one.
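For example, to analyze just the stuck job identified above – a sketch, assuming job is one of the object types listed by repvfy -h5:

[oracle@oem bin]$ ./repvfy analyze job -guid ECA6DE1A67B43914E0432084800AB548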

Dumping the job shows a lot of useful info from the MGMT_ tables; of particular interest are the details of the execution:

[oracle@oem bin]$ ./repvfy dump job -guid ECA6DE1A67B43914E0432084800AB548

[----- MGMT_JOB_EXEC_SUMMARY ------------------------------------------------]

EXECUTION_ID                     STATUS                           TLI QUEUE_ID                         TIMEZONE_REGION                SCHEDULED_TIME       EXPECTED_START_TIME  START_TIME           END_TIME                RETRIED
-------------------------------- ------------------------- ---------- -------------------------------- ------------------------------ -------------------- -------------------- -------------------- -------------------- ----------
ECA6DE1A67B63914E0432084800AB548 02-Running                         1                                  +00:00                         03-DEC-2013 18:59:25 03-DEC-2013 18:59:25 03-DEC-2013 19:02:29                               0

Again we can confirm that the job is still running. The next step would be to dump the execution, which shows on which step the job is waiting or hanging. The commands below are just an example, because in my case the job execution didn’t have any steps:

[oracle@oem bin]$ ./repvfy dump execution -guid ECA6DE1A67B43914E0432084800AB548
[oracle@oem bin]$ ./repvfy dump step -id 739148

Checking the job system health can also be useful, as it shows some job history, scheduled jobs and some performance metrics:

[oracle@oem bin]$ ./repvfy dump job_health

Back to our problem: we can query MGMT_JOB to get the job name and confirm it’s a system job run by SYSMAN:

SQL> SELECT JOB_ID, JOB_NAME, JOB_OWNER, JOB_DESCRIPTION, JOB_TYPE, SYSTEM_JOB, JOB_STATUS FROM MGMT_JOB WHERE UPPER(JOB_NAME) like '%PINGCFM%';

JOB_ID                           JOB_NAME                                                     JOB_OWNER  JOB_DESCRIPTION                                              JOB_TYPE        SYSTEM_JOB JOB_STATUS
-------------------------------- ------------------------------------------------------------ ---------- ------------------------------------------------------------ --------------- ---------- ----------
ECA6DE1A67B43914E0432084800AB548 PINGCFMJOB_ECA6DE1A67B33914E0432084800AB548                  SYSMAN     This is a Confirm EMD Down test job                          ConfirmEMDDown            2          0

We can try to stop the job using emcli and the job name:

[oracle@oem bin]$ emcli stop_job -name=PINGCFMJOB_ECA6DE1A67B33914E0432084800AB
Error: The job/execution is invalid (or non-existent)

If that doesn’t work, then use the emdiag kit to clean up the repository part:

./repvfy verify jobs -test 1998 -fix

Please enter the SYSMAN password:

-- --------------------------------------------------------------------- --
-- REPVFY: 2014.0114     Repository: 12.1.0.3.0     27-Jan-2014 18:18:36 --
---------------------------------------------------------------------------
-- Module: JOBS                                     Test: 1998, Level: 2 --
-- --------------------------------------------------------------------- --
-- -- -- - Running in FIX mode: Data updated for all fixed tests - -- -- --
-- --------------------------------------------------------------------- --

The repository is now OK, but this will not remove the stuck thread at the OMS level. In order for the OMS to get healthy again, it needs to be restarted:

cd $OMS_HOME/bin
emctl stop oms
emctl start oms

After the OMS was restarted there were no stuck jobs anymore!

I still wanted to know why this happened. Although there were a few bugs on MOS, none of them were very applicable and I didn’t find any of their symptoms in my case. After checking the repository database alert log I found a few disturbing messages:

  Tns error struct:
Time: 03-DEC-2013 19:04:01
TNS-12637: Packet receive failed
ns secondary err code: 12532
.....
opiodr aborting process unknown ospid (15301) as a result of ORA-609
 opiodr aborting process unknown ospid (15303) as a result of ORA-609
 opiodr aborting process unknown ospid (15299) as a result of ORA-609
2013-12-03 19:07:58.156000 +00:00

I also found a lot of similar messages on the target databases:

  Time: 03-DEC-2013 19:05:08
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505

That pretty much matches the time when the job got stuck – 19:02:29. So I assume there was a network glitch at that time which caused the PING job to get stuck. The solution was simply to run repvfy with the fix option and then restart the OMS service.

If the job gets stuck again after the restart, consider increasing the OMS property oracle.sysman.core.conn.maxConnForJobWorkers.
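A sketch of how that property could be checked and raised (the value 50 is only an illustrative example – pick one appropriate for your environment):

cd $OMS_HOME/bin
./emctl get property -name oracle.sysman.core.conn.maxConnForJobWorkers
./emctl set property -name oracle.sysman.core.conn.maxConnForJobWorkers -value 50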

EMDIAG repvfy blog series:
  • EMDIAG Repvfy 12c kit – installation
  • EMDIAG Repvfy 12c kit – basics
  • EMDIAG Repvfy 12c kit – troubleshoot Availability module
  • EMDIAG Repvfy 12c kit – troubleshoot Exadata module
  • EMDIAG Repvfy 12c kit – troubleshoot Plugins module
  • EMDIAG Repvfy 12c kit – troubleshoot Targets module

Troubleshooting Oracle DBFS mount issues

March 13th, 2014

On Exadata the local drives on the compute nodes are not big enough for larger exports, so DBFS is often configured. In my case I had a 1.2 TB DBFS file system mounted under /dbfs_direct/.

While I was doing some exports yesterday I found that my DBFS wasn’t mounted, and a quick crsctl command to bring it online failed:

[oracle@exadb01 ~]$ crsctl start resource dbfs_mount -n exadb01
 CRS-2672: Attempting to start 'dbfs_mount' on 'exadb01'
 CRS-2674: Start of 'dbfs_mount' on 'exadb01' failed
 CRS-2679: Attempting to clean 'dbfs_mount' on 'exadb01'
 CRS-2681: Clean of 'dbfs_mount' on 'exadb01' succeeded
 CRS-4000: Command Start failed, or completed with errors.

It doesn’t give you any error message or reason why it’s failing, and neither do the other database and grid infrastructure logs. The only useful approach is to enable tracing for the dbfs client and see what’s happening. To enable tracing, edit the mount script and insert the following MOUNT_OPTIONS:

vi $GI_HOME/crs/script/mount-dbfs.sh
MOUNT_OPTIONS=trace_level=1,trace_file=/tmp/dbfs_client_trace.$$.log,trace_size=100

Now start the resource one more time to get the trace file generated. You can also enable tracing with the client directly from the command line:

[oracle@exadb01 ~]$ dbfs_client dbfs_user@ -o allow_other,direct_io,trace_level=1,trace_file=/tmp/dbfs_client_trace.$$.log /dbfs_direct
Password:
Fail to connect to database server.

 

After checking the trace file it’s clear why DBFS was failing to mount – the DBFS database user’s password had expired:

tail /tmp/dbfs_client_trace.100641.log.0
 [43b6c940 03/12/14 11:15:01.577723 LcdfDBPool.cpp:189         ] ERROR: Failed to create session pool ret:-1
 [43b6c940 03/12/14 11:15:01.577753 LcdfDBPool.cpp:399         ] ERROR: ERROR 28001 - ORA-28001: the password has expired

[43b6c940 03/12/14 11:15:01.577766 LcdfDBPool.cpp:251         ] DEBUG: Clean up OCI session pool...
 [43b6c940 03/12/14 11:15:01.577805 LcdfDBPool.cpp:399         ] ERROR: ERROR 24416 - ORA-24416: Invalid session Poolname was specified.

[43b6c940 03/12/14 11:15:01.577844 LcdfDBPool.cpp:444         ] CRIT : Fail to set up database connection.

 

The account had the DEFAULT profile, which has a default PASSWORD_LIFE_TIME of 180 days:

SQL> select username, account_status, expiry_date, profile from dba_users where username='DBFS_USER';

USERNAME                       ACCOUNT_STATUS                   EXPIRY_DATE       PROFILE
------------------------------ -------------------------------- ----------------- ------------------------------
DBFS_USER                      EXPIRED                          03-03-14 14:56:12 DEFAULT

Elapsed: 00:00:00.02
SQL> select password from sys.user$ where name= 'DBFS_USER';

PASSWORD
------------------------------
A4BC1A17F4AAA278

Elapsed: 00:00:00.00
SQL> alter user DBFS_USER identified by values 'A4BC1A17F4AAA278';

User altered.

Elapsed: 00:00:00.03
SQL> select username, account_status, expiry_date, profile from dba_users where username='DBFS_USER';

USERNAME                       ACCOUNT_STATUS                   EXPIRY_DATE       PROFILE
------------------------------ -------------------------------- ----------------- ------------------------------
DBFS_USER                      OPEN                             09-09-14 11:09:43 DEFAULT


SQL> select * from dba_profiles where resource_name = 'PASSWORD_LIFE_TIME';

PROFILE                        RESOURCE_NAME                    RESOURCE LIMIT
------------------------------ -------------------------------- -------- ----------------------------------------
DEFAULT                        PASSWORD_LIFE_TIME               PASSWORD 180

 

After resetting the database user’s password, DBFS mounted successfully!

If you are using a dedicated database for DBFS, make sure you set PASSWORD_LIFE_TIME to UNLIMITED to avoid similar issues.
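A minimal sketch, assuming the DBFS user stays on the DEFAULT profile (a dedicated profile just for the DBFS user would be a cleaner option):

SQL> ALTER PROFILE DEFAULT LIMIT PASSWORD_LIFE_TIME UNLIMITED;

Profile altered.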

 

 


OEM 12c installation fails if parallel_max_servers too high

February 21st, 2014

Just a quick post regarding OEM 12c installation: recently I had to install OEM 12c and the installation failed during the repository configuration step with the following error:

ORA-12801: error signaled in parallel query server P151

This was caused by a known bug; the workaround is to decrease the number of parallel query servers on the repository database and start the installation over. The database had cpu_count set to 64 and parallel_max_servers set to 270. After setting parallel_max_servers to a lower value the installation completed successfully.
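A sketch of the workaround (64 is just an illustrative lower value – pick one suitable for your system):

SQL> ALTER SYSTEM SET parallel_max_servers=64 SCOPE=BOTH;

System altered.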

For more information refer to:
EM 12c: Enterprise Manager Cloud Control 12c Installation Fails At Repository Configuration With Error: ORA-12805: parallel query server died unexpectedly (Doc ID 1539444.1)

 


Consider database timezone when created during DST

February 13th, 2014

Not long ago a customer asked me to create a new database and refresh it from production. Nothing special here – the database was quickly created and then refreshed using a network import for a few schemas. A few weeks later I was told that the database had a timestamp problem. The date and time were correct, but the time zone was different from production:

SQL> SELECT DBTIMEZONE FROM DUAL;

DBTIMEZONE
------
+01:00

Looking back, I tried to find out why that happened and quickly found the answer in the documentation:
If you do not specify the SET TIME_ZONE clause, then the database uses the operating system time zone of the server.

Of course, at the time the database was created the time zone of the server was +01:00 (daylight saving time) and the database inherited it. The next logical step was simply to change the time zone to the correct one (UTC):

SQL> ALTER DATABASE SET TIME_ZONE='+00:00';
ALTER DATABASE SET TIME_ZONE='+00:00'
*
ERROR at line 1:
ORA-30079: cannot alter database timezone when database has TIMESTAMP WITH LOCAL TIME ZONE columns

Right – that won’t work if there are tables with columns of type TIMESTAMP WITH LOCAL TIME ZONE and there is data in those tables. Unfortunately the only solution is to export the data, drop the users, change the time zone and then import the data back. Also, for the change to take effect, the database must be restarted.

You can simply list the columns of that type and export just those tables; I had a lot of them and decided to export/import the whole database, as it was small and used for testing anyway:

SQL> select owner, table_name, column_name, data_type from all_tab_columns where data_type like '%WITH LOCAL TIME ZONE' and owner='MY_USER';

OWNER      TABLE_NAME  COLUMN_NAME  DATA_TYPE
---------- ----------- ------------ ----------------------------------
MY_USER    INVENTORY   DSTAMP       TIMESTAMP(6) WITH LOCAL TIME ZONE
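A minimal sketch of the whole procedure with Data Pump, assuming a full export is acceptable (the directory, dump file and connecting user are placeholders):

[oracle@host ~]$ expdp system full=y directory=DP_DIR dumpfile=full.dmp
SQL> -- drop the users owning TIMESTAMP WITH LOCAL TIME ZONE data, then:
SQL> ALTER DATABASE SET TIME_ZONE='+00:00';
SQL> SHUTDOWN IMMEDIATE
SQL> STARTUP
[oracle@host ~]$ impdp system full=y directory=DP_DIR dumpfile=full.dmp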

RMAN fails to allocate channel with Tivoli Storage Manager

February 6th, 2014

I was recently configuring backups on a customer’s Exadata with IBM TSM Data Protection for Oracle and ran into a weird RMAN error. The configuration was Oracle Database 11.2, TSM client version 6.1 and TSM server version 5.5, and this was the error:

[oracle@oraexa01 ~]$ rman target /

Recovery Manager: Release 11.2.0.3.0 - Production on Wed Jan 29 16:41:54 2014

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

connected to target database: TESTDB (DBID=2128604199)

RMAN> run {
2> allocate channel c1 device type 'SBT_TAPE';
3> }

using target database control file instead of recovery catalog
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of allocate command on c1 channel at 01/29/2014 16:42:01
ORA-19554: error allocating device, device type: SBT_TAPE, device name:
ORA-27000: skgfqsbi: failed to initialize storage subsystem (SBT) layer
Linux-x86_64 Error: 106: Transport endpoint is already connected
Additional information: 7011
ORA-19511: Error received from media manager layer, error text:
SBT error = 7011, errno = 106, sbtopen: system error

You get this message because the Tivoli Storage Manager API error log file (the errorlogname option specified in the dsm.sys file) is not writable by the oracle user.

Just change the file permissions, or change the option to point to a file under /<writable_path>/, and retry your backup:

[root@oraexa01 ~]# chmod a+w /usr/tivoli/tsm/client/ba/bin/dsmerror.log
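Alternatively, repoint the API error log in dsm.sys to a writable location – an illustrative stanza, where the server name and path are placeholders:

SErvername  TSMSRV1
   ERRORLOGName  /home/oracle/tsm/dsmerror.log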

This time RMAN allocates the channel successfully:

[oracle@oraexa01 ~]$ rman target /

Recovery Manager: Release 11.2.0.3.0 - Production on Wed Jan 29 16:42:52 2014

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

connected to target database: TESTDB (DBID=2128604199)

RMAN> run {
2> allocate channel c1 device type 'SBT_TAPE';
3> }

using target database control file instead of recovery catalog
allocated channel: c1
channel c1: SID=807 instance=TESTDB device type=SBT_TAPE
channel c1: Data Protection for Oracle: version 5.5.1.0
released channel: c1

Oracle GI 12.1 error when using NFS

January 16th, 2014

I had quite an interesting case recently where I had to build a stretched cluster for a customer using Oracle GI 12.1 and place the quorum voting disk on NFS. There is a document on OTN regarding stretched clusters and using NFS as a third location for the voting disk, but at the moment it covers 11.2 only. Assuming there is no difference in the NFS parameters, I used the Linux parameters from that document and mounted the NFS share on the cluster nodes.

Later on, when I tried to add the third voting disk to the ASM disk group, I got this strange error:

SQL> ALTER DISKGROUP OCRVOTE ADD  QUORUM DISK '/vote_nfs/vote_3rd' SIZE 10000M /* ASMCA */
Thu Nov 14 11:33:55 2013
NOTE: GroupBlock outside rolling migration privileged region
Thu Nov 14 11:33:55 2013
Errors in file /install/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_26408.trc:
ORA-17503: ksfdopn:3 Failed to open file /vote_nfs/vote_3rd
ORA-17500: ODM err:Operation not permitted
Thu Nov 14 11:33:55 2013
Errors in file /install/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_33427.trc:
ORA-17503: ksfdopn:3 Failed to open file /vote_nfs/vote_3rd
ORA-17500: ODM err:Operation not permitted
NOTE: Assigning number (1,3) to disk (/vote_nfs/vote_3rd)
NOTE: requesting all-instance membership refresh for group=1
Thu Nov 14 11:33:55 2013
ORA-15025: could not open disk "/vote_nfs/vote_3rd"
ORA-17503: ksfdopn:3 Failed to open file /vote_nfs/vote_3rd
ORA-17500: ODM err:Operation not permitted
WARNING: Read Failed. group:1 disk:3 AU:0 offset:0 size:4096
path:Unknown disk
incarnation:0xeada1488 asynchronous result:'I/O error'
subsys:Unknown library krq:0x7f715f012d50 bufp:0x7f715e95d600 osderr1:0x0 osderr2:0x0
IO elapsed time: 0 usec Time waited on I/O: 0 usec
NOTE: Disk OCRVOTE_0003 in mode 0x7f marked for de-assignment
Errors in file /install/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_33427.trc  (incident=83441):
ORA-00600: internal error code, arguments: [kfgscRevalidate_1], [1], [0], [], [], [], [], [], [], [], [], []
ORA-15080: synchronous I/O operation failed to read block 0 of disk 3 in disk group OCRVOTE

This happens because 12c uses Direct NFS by default, and Direct NFS initiates connections from ports above 1024. On the other hand, the NFS server export has a default option – secure – which requires incoming connections to originate from ports below 1024:

secure  This option requires that requests originate on an Internet port less than IPPORT_RESERVED (1024). This option is on by default. To turn it off, specify insecure.

The solution is to add the insecure option to the export on the NFS server, remount the NFS share and then retry the above operation.
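On a Linux NFS server the export could then look like this in /etc/exports (the path and client specification are placeholders); re-export with exportfs -ra and remount the share on the cluster nodes:

/votedisk  *(rw,sync,insecure)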

For more information refer to:
12c GI Installation with ASM on NFS Disks Fails with ORA-15018 ORA-15072 ORA-15080 (Doc ID 1555356.1)

 


EMDIAG Repvfy 12c kit – basics

October 30th, 2013

This is the second post in the EMDIAG repvfy kit series, covering the basics of the tool. With the kit installed in the earlier post, it’s time to go through some basics before we start troubleshooting.

There are three main commands with repvfy:

  • verify – repository-wide verification
  • analyze – object-specific verification/analysis
  • dump – dump a specific repository object

As you can tell from the descriptions, verify runs against the whole repository and doesn’t require any arguments by default, while the analyze and dump commands require a specific object to be given. To get a list of all available commands of the kit, run repvfy -h1.

 

Verify command

Let’s make something clear at the beginning: the verify command runs repository-wide verification through many tests, which are first grouped into modules and then categorized into several levels. To get a list of available modules run repvfy -h4; there are more than 30 modules and I won’t go into detail on each, but the most useful are Agents, Plugins, Exadata, Repository and Targets. The list of levels can be found at the end of the post; it’s important to say that levels are cumulative and by default tests are run at level 2!

When investigating or debugging a problem with the repository, always start with the verify command. It’s a good starting point to run verify without any arguments: it goes through all modules, gives you a summary of whether certain problems (violations) are present and an initial look at the health of the repository, and from there you can start debugging a specific problem.

So here is how the verify output looked for my OEM repository:

[oracle@oem bin]$ ./repvfy verify

Please enter the SYSMAN password:

-- --------------------------------------------------------------------- --
-- REPVFY: 2013.1008     Repository: 12.1.0.3.0     29-Oct-2013 11:30:37 --
---------------------------------------------------------------------------
-- Module:                                          Test:   0, Level: 2 --
-- --------------------------------------------------------------------- --

verifyAGENTS
verifyASLM
verifyAVAILABILITY
1002. Disabled response metrics (16570376): 2
verifyBLACKOUTS
verifyCAT
verifyCORE
verifyECM
1002. Unregistered ECM metadata tables: 2
verifyEMDIAG
verifyEVENTS
verifyEXADATA
2001. Exadata plugin version mismatches: 5
verifyJOBS
verifyJVMD
verifyLOADERS
verifyMETRICS
verifyNOTIFICATIONS
verifyOMS
verifyPLUGINS
1003. Plugin metadata versions out of sync: 13
verifyREPOSITORY
verifyTARGETS
1021. Composite metric calculation with inconsistent dependant metadata versions: 3
2004. Targets without an ORACLE_HOME association: 2
2007. Targets with unpromoted ORACLE_HOME target: 2
verifyUSERS

The verify command can also be run with the -detail argument to get more details about the problems found. It also shows which test found each problem and what actions can be taken to correct it. That’s useful for another reason – it prints the target name and guid, which can be used for further analysis with the analyze and dump commands.

The command can also be run with the -level argument, starting at zero for fatal errors and going up to nine for minor errors and best practices; the list of levels can be found at the end of the post.
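For instance, to run all tests up to and including the internal checks for a single module, with full details (the module choice is arbitrary here):

[oracle@oem bin]$ ./repvfy verify targets -level 9 -detail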

 

Analyze command

The analyze command runs against a specific target, which can be specified either by its name or by its unique identifier (guid). To get a list of supported targets run repvfy -h5. The analyze command is very similar to the verify command, except that it runs against a specific target. Again, it can be run with the -level and -detail arguments, like this:

[oracle@oem bin]$ ./repvfy analyze exadata -guid 6744EED794F4CCCDBA79EC00332F65D3 -level 9

Please enter the SYSMAN password:

-- --------------------------------------------------------------------- --
-- REPVFY: 2013.1008     Repository: 12.1.0.3.0     29-Oct-2013 12:00:09 --
---------------------------------------------------------------------------
-- Module: EXADATA                                  Test:   0, Level: 9 --
-- --------------------------------------------------------------------- --

analyzeEXADATA
2001. Exadata plugin version mismatches: 1
6002. Exadata components without a backup Agent: 4
6006. Check for DB_LOST_WRITE_PROTECT: 1
6008. Check for redundant control files: 5

For that Exadata target we can see a few more problems found at level 9, in addition to the plugin version mismatch found earlier at level 2.

One of the next posts will be dedicated to troubleshooting and fixing problems in the Exadata module.

 

Dump command

The dump command is used to dump all the information about a specific repository object; like the analyze command, it expects either a target name or a target guid. For a list of supported targets run repvfy -h6.
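The invocation itself mirrors analyze – for example, reusing the guid from the analyze example above:

[oracle@oem bin]$ ./repvfy dump exadata -guid 6744EED794F4CCCDBA79EC00332F65D3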

I won’t show the output because it dumps all the details about the target – more than 2000 lines. If you run the dump command against the same target used in analyze, you get a ton of information – the targets associated with this Exadata (hosts, ILOMs, databases, instances), the list of monitoring agents, plugin versions, some address details and a long list of target alerts/warnings.

It may seem rather useless because it just dumps a lot of information, but it actually helped me identify the problem I had with the plugin version mismatch within the Exadata module.

 

Repository verification and object analysis levels:

0  - Fatal issues (functional breakdown)
 These tests highlight fatal errors found in the repository. These errors will prevent EM from
 functioning normally and should be addressed straight away.

1  - Critical issues (functionality blocked)

2  - Severe issues (restricted functionality)

3  - Warning issues (potential functional issue)
 These tests are meant as 'warning', to highlight issues which could lead to potential problems.

4  - Informational issues (potential functional issue)
 These tests are informational only. They represent best practices, potential issues, or just areas to verify.

5  - Currently not used

6  - Best practice violations
 These tests highlight discrepancies between the known best practices and the actual implementation
 of the EM environment.

7  - Purging issues (obsolete data)
 These tests highlight failures to clean up (all the) historical data, or problems with orphan data
 left behind in the repository.

8  - Failure Reports (historical failures)
 These tests highlight historical errors that have occurred.

9  - Tests and internal verifications
 These tests are internal tests, or temporary and diagnostics tests added to resolve specific problems.
 They are not part of the 'regular' kit, and are usually added while debugging or testing specific issues.

 

In the next post I’ll troubleshoot and fix the errors I had within the Availability module – Disabled response metrics.

 

For more information and examples refer to following notes:
EMDIAG Repvfy 12c Kit – How to Use the Repvfy 12c kit (Doc ID 1427365.1)
EMDIAG REPVFY Kit – Overview (Doc ID 421638.1)

 

EMDIAG repvfy blog series:
  • EMDIAG Repvfy 12c kit – installation
  • EMDIAG Repvfy 12c kit – basics
  • EMDIAG Repvfy 12c kit – troubleshoot Availability module
  • EMDIAG Repvfy 12c kit – troubleshoot Exadata module
  • EMDIAG Repvfy 12c kit – troubleshoot Plugins module
  • EMDIAG Repvfy 12c kit – troubleshoot Targets module

 


Why my EM12c is giving Metric evaluation error for Exadata cell targets?

October 25th, 2013

As part of my Cloud Control journey I encountered a strange problem where I got the following error for a few Exadata Storage Server (cell) targets:

Metric evaluation error start - oracle.sysman.emSDK.agent.fetchlet.exception.FetchletException: em_error=Failed to execute_exadata_response.pl ssh -q -o ConnectTimeout=60 -o BatchMode=yes -o StrictHostKeyChecking=no -o PreferredAuthentications=publickey -i /home/oracle/.ssh/id_dsa -l cellmonitor 10.141.8.68 cellcli -xml -e ' list cell attributes msStatus ':

Another symptom was that I received two emails from OEM – one saying that the cell and its services were up:

EM Event: Clear:exacel05.localhost.localdomain - exacel05.localhost.localdomain is Up. MS Status is RUNNING and Ping Status is SUCCESS.

and another one saying there was a Metric evaluation error for the same target:

EM Event: Critical:exacel05.localhost.localdomain - Metric evaluation error start - oracle.sysman.emSDK.agent.fetchlet.exception.FetchletException: em_error=Failed to execute_exadata_response.pl ssh -q -o ConnectTimeout=60 -o BatchMode=yes -o StrictHostKeyChecking=no -o PreferredAuthentications=publickey -i /home/oracle/.ssh/id_dsa -l cellmonitor 10.141.8.68 cellcli -xml -e ' list cell attributes msStatus ':

I have to say that the error didn’t come up by itself – it manifested after I had to redeploy the Exadata plugin on a few agents. If you’ve ever had to do this, you’d know that before removing the plugin from an agent you need to make sure the agent is not the primary monitoring agent for any Exadata targets. In my case a few of the agents were Monitoring Agents for the cells, and I had to swap them with the Backup Monitoring Agent so I would be able to redeploy the plugin on the primary monitoring agent.

After I redeployed the plugin I tried to revert to the initial configuration, but for some reason the configuration got messed up and I ended up with different agents monitoring different cell targets than at the beginning.

It turned out that one of the monitoring agents wasn’t able to connect to the cell, and that’s why I got the email notifications and the Metric evaluation errors for the cells. Although that’s not a real problem, it’s quite annoying to receive such alerts and to have all these targets with Metric collection error icons in OEM, or reported with status Down.

Let’s first check which are the monitoring agents for that cell target from the OEM repository:

SQL> select target_name, target_type, agent_name, agent_type, agent_is_master
from MGMT$AGENTS_MONITORING_TARGETS
where target_name = 'exacel05.localhost.localdomain';

TARGET_NAME                      TARGET_TYPE     AGENT_NAME                         AGENT_TYPE AGENT_IS_MASTER
-------------------------------- --------------- ---------------------------------- ---------- ---------------
exacel05.localhost.localdomain   oracle_exadata  exadb03.localhost.localdomain:3872 oracle_emd               0
exacel05.localhost.localdomain   oracle_exadata  exadb02.localhost.localdomain:3872 oracle_emd               1

Looking at the cell’s secure log, we can see that one of the monitoring agents wasn’t able to connect because of failed publickey authentication:

Oct 23 11:39:54 exacel05 sshd[465]: Connection from 10.141.8.65 port 14594
Oct 23 11:39:54 exacel05 sshd[465]: Failed publickey for cellmonitor from 10.141.8.65 port 14594 ssh2
Oct 23 11:39:54 exacel05 sshd[466]: Connection closed by 10.141.8.65
Oct 23 11:39:55 exacel05 sshd[467]: Connection from 10.141.8.66 port 27799
Oct 23 11:39:55 exacel05 sshd[467]: Found matching DSA key: cf:99:0a:37:1a:e5:84:dc:a8:8a:b9:6f:0c:fd:05:c5
Oct 23 11:39:55 exacel05 sshd[468]: Postponed publickey for cellmonitor from 10.141.8.66 port 27799 ssh2
Oct 23 11:39:55 exacel05 sshd[467]: Found matching DSA key: cf:99:0a:37:1a:e5:84:dc:a8:8a:b9:6f:0c:fd:05:c5
Oct 23 11:39:55 exacel05 sshd[467]: Accepted publickey for cellmonitor from 10.141.8.66 port 27799 ssh2
Oct 23 11:39:55 exacel05 sshd[467]: pam_unix(sshd:session): session opened for user cellmonitor by (uid=0)

That’s confirmed by checking the ssh authorized_keys file, which also shows which monitoring agents were configured initially:

 [root@exacel05 .ssh]# grep exadb /home/cellmonitor/.ssh/authorized_keys | cut -d = -f 2
oracle@exadb03.localhost.localdomain
oracle@exadb04.localhost.localdomain

Another way to check which monitoring agents were configured initially is to check the snmpSubscriber attribute of that specific cell:

[root@exacel05 ~]# cellcli -e list cell attributes snmpSubscriber
((host=exadb03.localhost.localdomain,port=3872,community=public),(host=exadb04.localhost.localdomain,port=3872,community=public))

It’s obvious that exadb02 shouldn’t be monitoring this target – it should be exadb04 instead. I believe that when I redeployed the Exadata plugin this agent was no longer eligible to monitor Exadata targets and was replaced with another one, but that’s just a guess.

There are two solutions for that problem:

1. Move (relocate) target definition and monitoring to the correct agent:

I wasn’t able to find a way to do that through the OEM Console, so I used emcli for that purpose. Based on the MGMT$AGENTS_MONITORING_TARGETS query and the snmpSubscriber attribute I was able to find which agent was configured initially and which one had to be removed. Then I used emcli to relocate the monitoring of the target to the correct agent – the one which was configured initially:

[oracle@oem ~]$ emcli relocate_targets -src_agent=exadb02.localhost.localdomain:3872 -dest_agent=exadb04.localhost.localdomain:3872 -target_name=exacel05.localhost.localdomain -target_type=oracle_exadata -copy_from_src
Moved all targets from exadb02.localhost.localdomain:3872 to exadb04.localhost.localdomain:3872

2. Reconfigure the cell to use the new monitoring agent:

Add the current monitoring agent’s ssh public key to the cell’s authorized_keys:

Place the oracle user’s DSA public key (/home/oracle/.ssh/id_dsa.pub) from exadb02 into exacel05:/home/cellmonitor/.ssh/authorized_keys.
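A quick way to do that is ssh-copy-id – a sketch, assuming password authentication to the cell is still available for the initial copy:

[oracle@exadb02 ~]$ ssh-copy-id -i /home/oracle/.ssh/id_dsa.pub cellmonitor@exacel05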

Then change the cell’s snmpSubscriber attribute as well:

[root@exacel05~]# cellcli -e "alter cell snmpSubscriber=((host='exadb03.localhost.localdomain',port=3872,community=public),(host='exadb02.localhost.localdomain',port=3872,community=public))"
Cell exacel05 successfully altered
[root@exacel05~]# cellcli -e list cell attributes snmpSubscriber
((host=exadb03.localhost.localdomain,port=3872,community=public),(host=exadb02.localhost.localdomain,port=3872,community=public))

After that, the status of the Exadata Storage Server (cell) target in OEM came up and the metrics were fine again.

 
