ASM MEMBER disks showing as CANDIDATE

Some time ago I spent a few hours troubleshooting an issue where some 20 ASM member disks all went to either CANDIDATE or PROVISIONED status after the servers were rebooted. That’s odd, because if the disks had really been removed from their groups, they would appear as FORMER, as described in the ASM documentation.

Some context: this happened in a RAC cluster running Oracle 10.2.0.5 on 64-bit Linux. A handful of the disks were new to the cluster, having been added to the servers a few days earlier.

When the servers were restarted, the ASM instances failed to mount the disk groups automatically as they normally would. The errors logged in the ASM alert file were:

Sun Abc 01 01:01:01 CST 0101
ERROR: no PST quorum in group 2: required 2, found 0
Sun Abc 01 01:01:01 CST 0101
NOTE: cache dismounting group 2/0xDB297CEB (DATA_1)
NOTE: dbwr not being msg'd to dismount
ERROR: diskgroup DATA_1 was not mounted
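
When there are many disk groups, it can help to pull these quorum errors out of the alert log programmatically. The following is only a minimal sketch, assuming the message text looks exactly like the line shown above; the function name is hypothetical and the wording of the message may differ between Oracle versions.

```python
import re

# Matches the alert log line shown above, e.g.
# "ERROR: no PST quorum in group 2: required 2, found 0"
PST_RE = re.compile(
    r"ERROR: no PST quorum in group (\d+): required (\d+), found (\d+)"
)


def find_pst_quorum_errors(log_text):
    """Return a (group_number, required, found) tuple for each quorum error."""
    return [tuple(int(x) for x in m.groups()) for m in PST_RE.finditer(log_text)]
```

Feeding it the contents of the alert log returns one tuple per affected group, which makes it easy to see which group numbers found fewer PST copies than they required.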

A subsequent query against v$asm_disk showed HEADER_STATUS values of CANDIDATE or PROVISIONED:

GROUP_NUMBER DISK_NUMBER MODE_ST STATE      TOTAL_MB    FREE_MB DISKNAME     PATH                          HEADER_STATUS
------------ ----------- ------- -------- ---------- ---------- ------------ ----------------------------- -------------
           0           5 ONLINE  NORMAL       276210          0              /oradata/asm/datalun01        PROVISIONED
           0           2 ONLINE  NORMAL       276210          0              /oradata/asm/datalun02        PROVISIONED
           0           6 ONLINE  NORMAL       276210          0              /oradata/asm/datalun03        CANDIDATE
           0           7 ONLINE  NORMAL       276210          0              /oradata/asm/datalun04        CANDIDATE

But v$asm_diskgroup returned nothing (no rows selected).

The disks were visible to ASM, so the asm_diskstring parameter was correct. Disk permissions on Linux were also fine.

The Oracle-supplied kfod utility also showed the disks as CANDIDATE:

$ kfod a='/oradata/asm/*' di=all _asm_a=FALSE n=TRUE op=DISKS status=TRUE
276210 CANDIDATE /oradata/asm/datalun01
276210 CANDIDATE /oradata/asm/datalun02
...
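
With a couple of dozen disks, scanning this listing by eye gets tedious. Here is a small sketch that parses kfod output and lists the disks not reported as MEMBER. It assumes each disk line has exactly the three whitespace-separated fields shown above (size in MB, status, path); the kfod output layout can vary with version and options, and both function names are hypothetical.

```python
def parse_kfod_lines(output):
    """Parse lines of the form '<size_mb> <status> <path>' into dicts."""
    disks = []
    for line in output.splitlines():
        parts = line.split()
        # Skip blank lines, banners, and anything not starting with a size.
        if len(parts) != 3 or not parts[0].isdigit():
            continue
        disks.append(
            {"size_mb": int(parts[0]), "status": parts[1], "path": parts[2]}
        )
    return disks


def non_member_disks(disks):
    """Return the paths of disks whose reported status is not MEMBER."""
    return [d["path"] for d in disks if d["status"] != "MEMBER"]
```

The kfod output can be captured into a string (for example with subprocess) and passed to parse_kfod_lines; in the situation described here, every disk that was supposed to belong to a group would show up in the non-member list.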

But then a direct read from the disk with the strings command showed actual content:

$ strings /oradata/asm/datalun01 | head
ORCLDISK
DATA_1_0000
DATA_1
DATA_1_0000
...
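
The same read-only check can be scripted across many devices. The sketch below only looks for the ORCLDISK marker that strings printed above. It assumes the marker sits somewhere in the first 4096 bytes of the device (an assumption, not a documented offset), it never writes to the disk, and the function name is hypothetical. It needs read permission on the device.

```python
def has_asm_disk_marker(path, nbytes=4096):
    """Return True if the first nbytes of the device contain b'ORCLDISK'.

    Read-only check; the 4096-byte window is an assumption for illustration.
    """
    with open(path, "rb") as f:
        return b"ORCLDISK" in f.read(nbytes)
```

A disk that ASM reports as CANDIDATE or PROVISIONED but that still carries the marker, as in my case, is a hint that the header content is present and ASM is simply not accepting it.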

So I knew the data was there and reachable; ASM just was not seeing it the way it was supposed to. After searching the web and the Oracle Support site, I found nothing definitive on this. I then opened a Service Request with Oracle and got a quick response.

The Oracle representative had me build the kfed utility (it’s available by default from 11.1 on, but can be built on 10.2). With it we fixed the disk checksums, which made the disks appear as MEMBER again, and then we mounted the disk groups. For some reason the disks' internal checksums had gone bad, and so ASM was not recognizing them.

Then we ran “alter diskgroup data_1 check all norepair;” on every disk group just to make sure. All of its output goes to the ASM alert log. Fortunately there were no further issues with the groups, so the database was opened shortly after.

I’ll refrain from writing the exact commands used to fix the issue here, as they write to the disks in an unusual manner that I believe should only be used under Oracle Support supervision.

While writing this blog post, however, I found this forum thread in which “alp” found his own way out of a similar situation on completely different hardware and OS. Use at your own discretion.

The Oracle Support Note “ASM REACTING TO PARTITION ERRORS [ID 1062954.1]” also describes similar conditions and mentions the recent addition of new disks, but does not give a definitive root cause for the issue.

I hope this information, although incomplete, helped you in some way.
