11g: handling error ORA-00752 – lost write of a data block in the PRIMARY database

With the recent (11gR1) addition of lost write detection techniques, Oracle has one more tool to help avoid data losses.

Prerequisites to it are having a physical standby database applying logs and the parameters enabled in both the primary and the standby databases.

Lost writes may happen both in primary and in standby databases. Although both kinds of messages are raised on the standby database, it is very important to be aware of the two different situations and react accordingly.

The easier-to-solve issue is when the lost write happened in the standby database itself. The error message in this case presents error code ORA-00600 [3020], and the solution is to restore a good backup of the affected datafile to the standby.

When the error message includes the ORA-00752 error code, however, it means that the primary database lost a write. In this case, having the datafile copied from primary to standby will force log apply to continue – thereby relieving the symptoms, but will not solve the underlying issue. Your primary will still have at least one block that lost a write, requiring additional work.

Below is a test case of the error messages in the alert log of a standby database:
Thu Abc 01 01:01:01 0101
STANDBY REDO APPLICATION HAS DETECTED THAT THE PRIMARY DATABASE
LOST A DISK WRITE OF BLOCK 2751496, FILE 8
NO REDO AT OR AFTER SCN 8373189 CAN BE USED FOR RECOVERY.
Thu Abc 01 01:01:01 0101
Slave exiting with ORA-752 exception
Errors in file /oracle/diag/rdbms/mydb_dr/mydb2/trace/mydb2_pr03_30502.trc:
ORA-00752: recovery detected a lost write of a data block
ORA-10567: Redo is inconsistent with data block (file# 8, block# 2751496, file offset is 1065418752 bytes)
ORA-10564: tablespace USER_INDEX
ORA-01110: data file 8: '+DATA_1/mydb_dr/datafile/user_index.318.761516453'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 28737
Thu Abc 01 01:01:01 0101
Shutting down instance (abort)

Behold, suffering a lost write does not necessarily result in a corrupt block. Traditional corruption verification techniques may not find it. It may just mean that data which the user believes was written and committed to the database is in fact not there.

Oracle documentation will indicate a failover as the solution. That is not always the best solution, as a failover will lose all the transactions that happened after the lost write was detected.

In the above example, as the affected block belonged to an index, in order to solve the issue I copied the datafile from primary to standby and recreated the affected index in the production database.

If the lost write happens to be in a table, solving it gets complicated, depeding on what transactions occurred and how long it took for the lost write to be discovered. So in fact, for this instance I don’t have any suggestions. It would have to be troubleshooted case-by-case.

More information can be found in Oracle Documentation with this link: (Enabling Lost Write Detection)

This link has additional information also (Recovering From Lost-Write Errors on a Primary Database)

Advertisements
This entry was posted in Availability, Oracle, Troubleshoot and tagged , , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s