Some notes on handling Oracle corruption

Version : 1.1
Date : 28/10/2010
By : Albert van der Sel
Type of doc : It's just a few notes on Oracle Block and File corruption in very simple words. It's no more than "entry level".
For who : For anyone who would like a short orientation on the subject.




Although you do not need to be an expert, it is assumed that you have at least a "reasonable" level of knowledge of Oracle.

So what can you expect to find in this note?

Chapter 1 just describes the creation of a 'SALES' database which will be used to simulate several sorts of corruption.
Chapter 2 will demonstrate a few examples of a soft and hard corruption of data blocks.
Chapter 3 actually is the "body" of this note. It will show you how to repair an object with corrupt blocks.


Contents:

Chapter 1. Creating the example database, and getting a hex editor.

Chapter 2. Examples of soft and hard block corruption (let's do it ourselves).

- 2.1. Soft block corruption.
- 2.2. Soft corruption of a Database block, and the DB_BLOCK_CHECKING init.ora parameter.
- 2.3. Analyzing Tables and Files for corruption (using "dbv" and "ANALYZE TABLE").
- 2.4. Hard corruption of a Database block (writing all zero's).

Chapter 3. Repairing objects with corrupt blocks.

- 3.1. The DBMS_REPAIR.SKIP_CORRUPT_BLOCKS() procedure (quick fix)
- 3.2. The DBMS_REPAIR.FIX_CORRUPT_BLOCKS() procedure.
- 3.3. The ORA-1578 and ORA-8103 errors.
- 3.4. Using the SET EVENTS parameter.
- 3.5. Querying "around" bad blocks.
- 3.6. Some remarks on the structure of a block.
- 3.7. Using tools and statements to "dump" blocks.



Chapter 1. Creating the example database, and getting a hex editor.


1.1 Creating a Database and some Objects:


If you have installed Oracle 10g/11g (or 12c), you might already have an Instance (and database) running.
If not, I would like you to create a database, preferably with the name "sales".
The database name is really not so important, but if you need to create a database anyway (in order to practice
this note), you may as well call it "sales". Furthermore, if possible, place the database files in "c:\oradata\sales".
Here too, the exact location of the files is not really important. If you have placed the database files elsewhere,
just remember to "translate" my location (which is "c:\oradata\sales") to whatever you use.

But, in case you still need to create an Instance/database, just start the "Database Configuration Wizard",
which you can find in your Windows Start menu, or just open a cmd prompt and enter "dbca".
using the "Database Configuration Wizard", create a database, and you can use pretty much use the defaults
as suggested by the Wizard.
Again, (if possible) except for the disklocation of the database files, and, (if possible) choose a database blocksize of 8192 bytes.
The blocksize, in principle, could be 4K, 8K, 16K, 64K etc.., but depending on the details of your storage, the Wizard will
not always allow you to pick another blocksize, other as what it will come up with.

Once the database is created, we will create a couple of Tablespaces and a few objects on those tablespaces.

Note: if you are familiar with SQL Server: An Oracle Tablespace resembles the function of a SQL Server "Filegroup".

There are many ways to create tablespaces and objects like tables. In 10g/11g, the web-based graphical "Enterprise Manager"
gives you almost complete control over the Instance and database. Here, we will just use the "prompt-oriented" sqlplus utility:

- Start a cmd prompt
- type "set ORACLE_SID=SALES" (without the ""). Ofcourse, if your database is named otherwise, then use that name.
- Enter: sqlplus /nolog
- Hopefully, an "SQL>" prompt appears. If not, the utility is probably not in your "environment" or "path".
- Now, we need to "connect" (or logon) to the Instance. Try this:

SQL> connect sys/password@sales as sysdba     (replace "password" with the true password of "sys")
Connected.
SQL>

Note: If the above logon did not work, try:
SQL> connect / as sysdba

Now, just type the following commands (or use copy/paste):

CREATE TABLESPACE SALES_DATA_1 DATAFILE 'C:\ORADATA\SALES\SALES_DATA_01.DBF' SIZE 50M
EXTENT MANAGEMENT LOCAL AUTOALLOCATE
SEGMENT SPACE MANAGEMENT AUTO;

CREATE TABLESPACE SALES_DATA_2 DATAFILE 'C:\ORADATA\SALES\SALES_DATA_02.DBF' SIZE 50M
EXTENT MANAGEMENT LOCAL AUTOALLOCATE
SEGMENT SPACE MANAGEMENT AUTO;

CREATE TABLESPACE SALES_INDEX_1 DATAFILE 'C:\ORADATA\SALES\SALES_INDEX_01.DBF' SIZE 50M
EXTENT MANAGEMENT LOCAL AUTOALLOCATE
SEGMENT SPACE MANAGEMENT AUTO;

CREATE USER TESTUSER IDENTIFIED BY TESTUSER
DEFAULT TABLESPACE SALES_DATA_1
TEMPORARY TABLESPACE TEMP;

GRANT CONNECT TO TESTUSER;
GRANT DBA TO TESTUSER;

/* we will not bother about a surplus of "rights" to the "testuser" in this test database */
/* Now, let's see if our testuser can connect to the database: */

SQL> connect testuser/testuser@sales
Connected.

SQL> select name from v$database;
NAME
---------
SALES

So, we have a couple of tablespaces and a user who can connect, and who can create objects like tables and indexes.

Let's connect as the testuser again, and create a few tables. Note that the EMPLOYEE1 and EMPLOYEE2 tables
are created on different tablespaces.

CREATE TABLE EMPLOYEE1
(
EMP_ID NUMBER(6) NOT NULL,
EMP_NAME VARCHAR2(20) NOT NULL,
SALARY NUMBER(7,2)
)
TABLESPACE SALES_DATA_1;

CREATE TABLE EMPLOYEE2
(
EMP_ID NUMBER(6) NOT NULL,
EMP_NAME VARCHAR2(20) NOT NULL,
SALARY NUMBER(7,2)
)
TABLESPACE SALES_DATA_2;


/* Now, our testuser will insert 9999 records into the EMPLOYEE1 table: */

declare
  i number := 1;
begin
  while i < 10000 loop
    insert into EMPLOYEE1
    values (i, 'harry', 2500);

    i := i + 1;
  end loop;
  commit;
end;
/

/* Next, check if we have those records in the EMPLOYEE1 table: */

set linesize 1000
set pagesize 1000

SELECT * FROM EMPLOYEE1;

/* This should return 9999 records. */

We are almost ready with setting up our SALES database. As a last step, I will create a 'third' data tablespace,
which is "dictionary managed".
You might try the next statement too, but on 10g/11g it will almost certainly fail, because by default the SYSTEM tablespace
is locally managed, and in that case dictionary managed tablespaces cannot be created.
If you get "ORA-12913: Cannot create dictionary managed tablespace", there is nothing to worry about.
Only in a few cases will I refer to a dictionary managed tablespace.
Later, we will explain the difference between a locally and a dictionary managed tablespace.

CREATE TABLESPACE SALES_DATA_3 DATAFILE 'C:\ORADATA\SALES\SALES_DATA_03.DBF' SIZE 50M
EXTENT MANAGEMENT DICTIONARY
default storage (
initial 512K
next 512K
minextents 1
pctincrease 0 )
minimum extent 512K
logging
online
permanent;

/* The upper statement very likely fails in your situation. Don't worry about it. */



1.2 Getting a Hex editor for Windows:

Next, to emulate several crashes, we need a hex editor, by which we can open a database file,
and "corrupt" the file manually.
There are several choices here. You might consider the free utility XVI32, which you can download from
the site below.

http://www.chmaas.handshake.de/delphi/freeware/xvi32/xvi32.htm

You only have to download the zip file and extract it. Then you are ready to run it.
There exist many alternatives too. Just "google" for "hex editor".
But I really recommend xvi32, due to its simplicity and capabilities.


Chapter 2. Examples of soft and hard block corruption.

2.1. Soft block corruption.

Our "testuser" inserted 9999 records into the EMPLOYEE1 table. This table is located in the "SALES_DATA_1" tablespace,
which corresponds to the file "c:\oradata\sales\sales_data_01.dbf".
Now, say that the first couple of blocks (each 8192 bytes) are reserved for system purposes; then it's fair to "gamble" that block number 20
is allocated to the EMPLOYEE1 table. Since 20x8192=163840 in decimal, that would be 28000 in hexadecimal.

Note: just use the Windows calculator, switch to "scientific mode", and type 20x8192. When the result has appeared, click the "Hex" radio control.
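
Note: instead of gambling, you can also ask Oracle directly which file and blocks hold the EMPLOYEE1 rows.
The small sketch below (run it as testuser while the database is still open) uses the standard DBMS_ROWID functions;
the block numbers on your system may of course differ from mine.

SELECT DBMS_ROWID.ROWID_RELATIVE_FNO(rowid) AS file_no,
       DBMS_ROWID.ROWID_BLOCK_NUMBER(rowid) AS block_no,
       COUNT(*)                             AS rows_in_block
FROM   EMPLOYEE1
GROUP  BY DBMS_ROWID.ROWID_RELATIVE_FNO(rowid), DBMS_ROWID.ROWID_BLOCK_NUMBER(rowid)
ORDER  BY block_no;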

Let's "edit" that file, while the database is closed. Startup "sqlplus", logon as "sys" (as sysdba) and shutdown the instance:

SQL> shutdown immediate

After a short while, the database is closed, meaning that no Oracle process is accessing any database file,
and all processes are terminated (the database files are "cold").

- Startup xvi32.
- Open the file "c:\oradata\sales\sales_data_01.dbf".
- Navigate (scroll) to hex address 28000.

I noticed (in my case) that in the 19th block (just "before" address 28000), there is a "harry" listed.
Actually, this is the "last" harry in that block. Now, I navigated to that string,
and replaced "harry" with "HENRY".
This is an example of a "soft" corruption. Physically, the block is sound, but its contents no longer match the checksum stored in its header.


Save the file, and close xvi32.

Let's start the instance, and open the SALES database. Start sqlplus (note that you probably should set the "ORACLE_SID=sales" variable first)
and start up the instance:

C:\> set ORACLE_SID=sales
C:\> sqlplus /nolog
SQL> connect / as sysdba
connected
SQL> startup
..
SQL> connect testuser/testuser
connected
SQL> SELECT * FROM EMPLOYEE1;

ERROR:
ORA-01578: ORACLE data block corrupted (file # 5, block # 19)
ORA-01110: data file 5: 'C:\ORADATA\SALES\SALES_DATA_01.DBF'

Ok, the block number is just what we expected it to be. Oracle discovered a checksum mismatch on that block.
Actually, the block is just physically sound, but the "administration" of that block is not right.

Did you notice that Oracle didn't complain at all at startup about the corrupt block?
This is because it's just an ordinary data block. We can't expect Oracle to check every block at boot time.
Only when a process tries to access the block is the error generated.
Note: Of course many "special" blocks are checked, such as the headers of the database files.

2.2. Soft corruption of a Database block, and the DB_BLOCK_CHECKING init.ora parameter.

If it becomes clear that a table has a corrupt block, you might think of a specific init.ora parameter: DB_BLOCK_CHECKING.
As we will soon find out, it will not help us SELECTING a table with a soft corrupt block.
Anyway, let's test it.

The parameter can take the values:

DB_BLOCK_CHECKING = { FALSE | OFF | LOW | MEDIUM | TRUE | FULL }

As you can see, you can set the parameter to FALSE. Then no block checking is performed for blocks in user tablespaces.
Although it is recommended to set it to "TRUE" for production systems, let's see what happens if we explicitly set it
to "FALSE" for the SALES database.
I assume that you know that the spfile.ora (or init.ora) contains a number of "settings" that the instance reads
at startup. You can edit the "init.ora" file accordingly, but most 10g/11g systems use the binary "spfile" variant,
which you can alter using "ALTER SYSTEM SET parameter = value SCOPE={BOTH|MEMORY|SPFILE}".

So here we go:

SQL> alter system set DB_BLOCK_CHECKING=FALSE SCOPE=BOTH;

System altered

SQL> shutdown immediate /* let's do a restart of the instance anyway */
..
SQL> startup
..
SQL> show parameter DB_BLOCK_CHECKING; /* let's check the current value */

NAME TYPE VALUE
------------------------------------ ----------- --------------------
db_block_checking string FALSE
SQL>
..
SQL> SELECT * FROM TESTUSER.EMPLOYEE1;

ERROR:
ORA-01578: ORACLE data block corrupted (file # 5, block # 19)
ORA-01110: data file 5: 'C:\ORADATA\SALES\SALES_DATA_01.DBF'

So, setting this parameter to FALSE does not help in selecting all rows from a table, if a block is logically corrupt.
Actually, the main purpose of DB_BLOCK_CHECKING is to have Oracle verify the internal consistency of a block whenever
it is changed in memory (before it is written to disk), which is a good thing.
But it's not an "easy escape" that lets you fully select a table while a block is corrupt.


2.3 Analyzing Tables and Files for corruption.

Before we repair anything, let's explore a few "well known" Oracle sanity checks.

1. The "ANALYZE TABLE_NAME VALIDATE STRUCTURE" statement.

Connect as sys (sysdba) and try the analyze command on the TESTUSER.EMPLOYEE1 table.

SQL> connect / as sysdba
Connected.

SQL> ANALYZE TABLE TESTUSER.EMPLOYEE1 VALIDATE STRUCTURE;
analyze table TESTUSER.EMPLOYEE1 VALIDATE STRUCTURE
*
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 5, block # 19)
ORA-01110: data file 5: 'C:\ORADATA\SALES\SALES_DATA_01.DBF'

Here, the "analyze" command comes up with exactly the same errormessage. You see? It could be a very usable tool
to check your tables once in a while.
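
Note: if the table has indexes, you can also add the CASCADE keyword (as far as I know this is standard syntax),
so that the indexes are validated against the table as well:

SQL> ANALYZE TABLE TESTUSER.EMPLOYEE1 VALIDATE STRUCTURE CASCADE;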

2. The "dbv" OS prompt utility .

The ANALYZE TABLE statement is used while you are "in" Oracle, that is, you have a sqlplus session to the database,
and you use it just as you would use any other SQL statement.

A check from the Operating System is possible too. For this, Oracle has the "dbv" utility available.
It's very easy to use. Take a look at the following example:

C:\oradata\sales>dbv file=SALES_DATA_01.DBF blocksize=8192

DBVERIFY - Verification starting : FILE = SALES_DATA_01.DBF
Page 19 is marked corrupt
Corrupt block relative dba: 0x01400013 (file 5, block 19)
Bad check value found during dbv:
Data in bad block:
type: 6 format: 2 rdba: 0x01400013
last change scn: 0x0000.000c9749 seq: 0x1 flg: 0x06
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0x97490601
check value in block header: 0xe3b6
computed block checksum: 0x43c

DBVERIFY - Verification complete

Total Pages Examined : 6400
Total Pages Processed (Data) : 27
Total Pages Failing (Data) : 0
Total Pages Processed (Index): 0
Total Pages Failing (Index): 0
Total Pages Processed (Other): 12
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 6360
Total Pages Marked Corrupt : 1
Total Pages Influx : 0
Highest block SCN : 825161 (0.825161)

3. Some v$ views

The above commands (ANALYZE TABLE and the dbv utility) are tools by which you can check objects actively. You issue the statement or command, and the check is done right away.

Some "v$" views will register corruption as well, like for example "V$DATABASE_BLOCK_CORRUPTION".
The only problem often seen here, is that those views will only receive input if some Oracle component
have found the corruption, and not by you (or a userprocess) issueing some statement.
For example, if you would make RMAN backups, and RMAN encounters a corrupt block, then it will be "registered",
and from then on only, it will be visible in some of those v$ views.
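
As a minimal sketch (the RMAN command below is the 10g/11g "BACKUP VALIDATE" form; the exact options may differ
a bit per version), you could let RMAN scan the datafile, and then query the view from sqlplus:

RMAN> BACKUP VALIDATE CHECK LOGICAL DATAFILE 5;

SQL> SELECT FILE#, BLOCK#, BLOCKS, CORRUPTION_TYPE
     FROM   V$DATABASE_BLOCK_CORRUPTION;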


In this section, we have seen how to soft corrupt a database block, and we have seen the infamous Oracle error
message "ORA-01578: ORACLE data block corrupted (file # n, block # m)".
We have also seen two Oracle methods that enable us to check tables and files for corruption.

2.4 Hard corruption of a Database block.

This time, we will truly destroy a few blocks, that is, fill a region with "all zeros".
We will start somewhere in the middle of a block, then cross a block boundary, and continue
to write zeros up to somewhere in the middle of that next block.

First, we will start with a "fresh" situation. That is, we will drop the testuser (and all of his objects),
and then we will drop the tablespaces "SALES_DATA_1" and "SALES_DATA_2".
Then we will create those tablespaces again, create the testuser again, create the tables again, and we will
use the script that inserts 9999 records into the EMPLOYEE1 table.
It's not much work, since we have all the necessary statements in Chapter 1.

Now you might argue that dropping the user with "DROP USER TESTUSER CASCADE" would be enough.
Indeed, DROP USER... CASCADE should drop the user, with all of its objects included.
Well, not exactly. This will be explained later.
So, please do the following:

- from the "C:\>" prompt, set the ORACLE_SID=sales (or whatever name your database uses)
- Start sqlplus /nolog
- connect / as sysdba (or use "connect sys/password@sales as sysdba")
then use statements as shown below:

SQL> connect / as sysdba
Connected.
SQL> DROP USER TESTUSER CASCADE;

User dropped.

SQL> DROP TABLESPACE SALES_DATA_1 INCLUDING CONTENTS;

Tablespace dropped.

SQL> DROP TABLESPACE SALES_DATA_2 INCLUDING CONTENTS;

Tablespace dropped.

Next, delete the physical files from the system. For that, use explorer or use the cmd prompt:

C:\oradata\sales> del sales_data_*

Now, using the statements of chapter 1:
- create those tablespaces (SALES_DATA_1 and SALES_DATA_2) again,
- create the testuser again,
- create the tables again (as user TESTUSER), and insert 9999 records into the EMPLOYEE1 again.

This is a fresh start!

If you executed the above, it's again (obviously) "fair" to "assume" that block number 20
is allocated to the EMPLOYEE1 table.
Again, since 20x8192=163840 in decimal, that would be 28000 in hexadecimal.

Now, let's shut down the database, start xvi32, and open the "C:\ORADATA\SALES\SALES_DATA_01.DBF" database file.

Scroll down to hex address 27EEE (within the 19th block), and start entering "zeros", all the way to, for example,
hex address 28176. Here, we have destroyed the end of the 19th block, crossed the boundary, and destroyed
the header of the 20th block. This is hard corruption.

Save the file, and close xvi32.

Let's start the instance, and open the SALES database. Start sqlplus and startup the instance:

C:\> set ORACLE_SID=sales
C:\> sqlplus /nolog
SQL> connect / as sysdba
connected
SQL> startup
..
SQL> connect testuser/testuser
connected
SQL> SELECT * FROM EMPLOYEE1;

ERROR:
ORA-01578: ORACLE data block corrupted (file # 5, block # 19)
ORA-01110: data file 5: 'C:\ORADATA\SALES\SALES_DATA_01.DBF'

Just as before, at the moment Oracle accesses the first corrupt block, it bails out, throwing the same
familiar message as in section 2.1.
Note that from the message alone, you cannot determine whether a block is soft or hard corrupted.

Of course, we can use the "ANALYZE TABLE" SQL statement again, and we can try the "dbv" utility.
Let's try the latter. Exit sqlplus and open a cmd window.
Navigate to the C:\ORADATA\SALES directory. Enter the familiar dbv command:

C:\oradata\sales> dbv file=SALES_DATA_01.DBF blocksize=8192

DBVERIFY - Verification starting : FILE = SALES_DATA_01.DBF
Page 19 is influx - most likely media corrupt
Corrupt block relative dba: 0x01400013 (file 5, block 19)
Fractured block found during dbv:
Data in bad block:
type: 6 format: 2 rdba: 0x01400013
last change scn: 0x0000.000cb030 seq: 0x1 flg: 0x06
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0x00000000
check value in block header: 0xefa5
computed block checksum: 0x9678

Page 20 is influx - most likely media corrupt
Corrupt block relative dba: 0x01400014 (file 5, block 20)
Fractured block found during dbv:
Data in bad block:
type: 0 format: 0 rdba: 0x00000000
last change scn: 0x0000.00000000 seq: 0x0 flg: 0x00
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0xb0300601
check value in block header: 0x0
block checksum disabled

DBVERIFY - Verification complete

Total Pages Examined : 6400
.. (some records omitted)
Total Pages Marked Corrupt : 2
Total Pages Influx : 2
Highest block SCN : 831536 (0.831536)


This time, the dbv utility informs you about the "hard" corruption. Note the "Page 19 is influx - most likely media corrupt" message.
A similar message is shown for block 20.
This is very different from the dbv output as shown in section 2.3 (the soft corrupted block).


Chapter 3. Repairing objects with corrupt blocks.

3.1. Repairing tables with the DBMS_REPAIR.SKIP_CORRUPT_BLOCKS() procedure.

This method, using DBMS_REPAIR.SKIP_CORRUPT_BLOCKS(), can be regarded as "the quick fix".

The DBMS_REPAIR package contains a number of subprograms that enable you to detect, skip, or repair corrupt blocks
in tables and indexes.
On many occasions, you will probably use the "skip" functionality, after you have found that "repair" is not quite
the repair you hoped it to be.

Why would a repair not work?

- In section 2.1, we have seen an example of a "soft" corruption. Here, we replaced a character string (harry)
with another character string (HENRY) in the same place in that block. It only led to a different
checksum of that block. Such soft corruptions are indeed (in principle) repairable, and thus the block(s) might be recovered.

- In section 2.4, I hard corrupted a few blocks, by placing all zeros in the footer of one block, crossing over
to the next block, and destroying that block's header and most of its data.
That is not repairable, because the information is truly lost.
Only an RMAN block restore could help us here, or possibly one of Oracle's fabulous FLASHBACK options.

We will study two procedures from the package:

(1): DBMS_REPAIR.SKIP_CORRUPT_BLOCKS()
(2): DBMS_REPAIR.FIX_CORRUPT_BLOCKS()

Personally, I favour the "SKIP" procedure over the "FIX" procedure.
In case of "hard" or "soft" corruption, where one block or a few blocks are affected, you can ensure
that Oracle simply skips them (!), so that the table can be properly accessed, and that will allow you to copy the data
over to a new copy of that table.
In "real life" circumstances, this actually is a very nice solution that indeed works.

Remember, as you have seen in Chapter 2, if you query a table with a corrupt block, the query gets interrupted when the
corrupt block is accessed. So, it's difficult to make a full copy of the data of all "good" blocks.

If only one or two blocks are inaccessible, this usually represents just a tiny bit of data, compared to all the data
that's stored in that table.
That's why the "skip corrupted" procedure is so valuable in case of corrupted blocks. This one really works, and quickly.

Let's try it.

This procedure enables or disables the skipping of corrupt blocks during index and table scans of the specified object.
On most occasions, you want Oracle "to skip" corrupt blocks, so that queries can proceed without interruption.
In this section, we are going to "enable" the skipping of bad blocks.
Thus, what we are going to do here is tell Oracle to skip over corrupt blocks, which will enable you to create
a "good" copy of the table, reading all good blocks, while ONLY skipping the bad block(s).

The full syntax of the procedure is shown below. It takes 4 input parameters, of which the last 2 can be left at their defaults.

exec dbms_repair.skip_corrupt_blocks(SCHEMA_NAME, OBJECT_NAME, OBJECT_TYPE, SKIP_FLAG);

Here, the SCHEMA_NAME can be read as the "owner" of the object (like testuser, harry etc..).
Formally, a SCHEMA is not the same as an "owner", but in practice it usually is.
The OBJECT_NAME most often is the name of a table or an index.
The OBJECT_TYPE most often is "table" or "index". The default is "table".
The SKIP_FLAG tells Oracle to skip, or not to skip, corrupt blocks. The default is "skip".
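
For completeness, the same call written with named parameters looks like the sketch below (the constants used are the
documented DBMS_REPAIR defaults; use dbms_repair.noskip_flag if you want to switch the skipping off again later):

BEGIN
  dbms_repair.skip_corrupt_blocks(
    schema_name => 'TESTUSER',
    object_name => 'EMPLOYEE1',
    object_type => dbms_repair.table_object,
    flags       => dbms_repair.skip_flag);   -- or dbms_repair.noskip_flag to disable skipping
END;
/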

Remember what happened in section 2.4, where we ended up with two blocks in the TESTUSER.EMPLOYEE1 table being hard corrupted?
When one would query that table, this happened:

C:\> set ORACLE_SID=sales
C:\> sqlplus /nolog
SQL> connect / as sysdba
connected
SQL> startup
..
SQL> SELECT * FROM TESTUSER.EMPLOYEE1;

ERROR:
ORA-01578: ORACLE data block corrupted (file # 5, block # 19)
ORA-01110: data file 5: 'C:\ORADATA\SALES\SALES_DATA_01.DBF'

Let's "fix" the table, using "dbms_repair.skip_corrupt_blocks":

SQL> connect / as sysdba
connected

SQL> exec dbms_repair.skip_corrupt_blocks('TESTUSER', 'EMPLOYEE1');

PL/SQL procedure successfully completed.

SQL> SELECT * FROM TESTUSER.EMPLOYEE1;

All rows from all good blocks are shown.

Don't forget the quotes around the SCHEMA_NAME and OBJECT_NAME (thus: 'TESTUSER', 'EMPLOYEE1').
The above action has made the EMPLOYEE1 table accessible again!

Of course, it is suggested that you now create a copy table with all good data (which usually will be "the bulk" of the data).
So, in general, that would be a statement like:

CREATE TABLE salvage_table
AS SELECT * FROM corrupt_table;

Now you have saved the "bulk" of the table rows, because usually there is only one (or at most a couple of) bad blocks.
So, making the table "alive" again, and copying the good rows to another table, is quite a reasonable solution.
Of course, you will miss a couple of rows (those that resided on the bad blocks).
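
As a quick sanity check after the copy, you could do something like the sketch below (the SKIP_CORRUPT column of
DBA_TABLES shows whether skipping is currently enabled for the table; compare the count with the 9999 rows we inserted):

SQL> SELECT COUNT(*) FROM TESTUSER.EMPLOYEE1;

SQL> SELECT SKIP_CORRUPT FROM DBA_TABLES
     WHERE  OWNER = 'TESTUSER' AND TABLE_NAME = 'EMPLOYEE1';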

CONCLUSION:

DBMS_REPAIR.SKIP_CORRUPT_BLOCKS(SCHEMA_NAME,TABLE_NAME)

is a very reasonable method to access a table with corrupt blocks, and copy the content to a new table.


  • But remember, this procedure is best to be used if you want to copy all "good" rows from the "bad" table, to a new table.

  • If you want to keep using the current table, it's better to use the method of section 3.2.


  • Note: If Oracle is skipping corrupt block(s), you are also able to export (using 'exp' or 'expdp') the table to a file.


    3.2 The DBMS_REPAIR.FIX_CORRUPT_BLOCKS() procedure:

    If you are already content with the method of section 3.1, because it's a method to save the good rows of a table,
    you may skip this section, and go on to section 3.3.

    In case of block corruption, we might consider using the DBMS_REPAIR.FIX_CORRUPT_BLOCKS() procedure. Note the expression "FIX" in
    the procedure name.
    I am afraid that it will be a bit of a disappointment, because you might have thought that it could repair soft corruption.
    In many circumstances, the result is that the procedure will "mark" or "flag" soft or hard corrupted block(s),
    and then consider them to be "repaired".

    In case of a hard corrupted block (as in section 2.4), the data is truly gone (remember, we wrote "all zeros").
    So, in all honesty, we can't expect Oracle to repair that (that is, get the rows back that were on that block).

    In case of a soft corrupted block (as in section 2.1), in some circumstances it would be reasonable to expect that the block
    can be repaired. For example, if only the checksum of a block is not right, it should be repairable.
    Using the Oracle DBMS_REPAIR procedures alone, you won't really repair the block.
    But, in conjunction with the DUMP_ORPHAN_KEYS() and ADMIN_TABLES() procedures, you might get useful info
    about indexes that are affected, and freelists which should be rebuilt. This might help in the overall process.
    The idea then is that you want to continue using that object, and these additional procedures can help
    you realize that (by providing info about indexes, freelists etc..).

    But again, with respect to purely salvaging table data, you won't be better off compared to the method of section 3.1.

    The basic idea is this:

    The process requires two administrative tables to hold a list of the corrupt block(s) and of the index keys pointing
    to those blocks. These are created as follows, using the DBMS_REPAIR.ADMIN_TABLES() procedure:

    BEGIN
      Dbms_Repair.Admin_Tables (
        table_name => 'REPAIR_TABLE',
        table_type => Dbms_Repair.Repair_Table,
        action     => Dbms_Repair.Create_Action,
        tablespace => 'SALES_DATA_1');

      Dbms_Repair.Admin_Tables (
        table_name => 'ORPHAN_KEY_TABLE',
        table_type => Dbms_Repair.Orphan_Table,
        action     => Dbms_Repair.Create_Action,
        tablespace => 'SALES_DATA_1');
    END;
    /

    The above statements only build the admin tables.

    Now that the administration tables are built, we can check the table using the DBMS_REPAIR.CHECK_OBJECT() procedure.
    This procedure behaves much like the "ANALYZE TABLE" command, in that it parses through the whole table, checking each block.

    SET SERVEROUTPUT ON /* needed for more output on your screen */

    DECLARE
      no_corrupt INT;
    BEGIN
      no_corrupt := 0;
      Dbms_Repair.Check_Object (
        schema_name       => 'TESTUSER',
        object_name       => 'EMPLOYEE1',
        repair_table_name => 'REPAIR_TABLE',
        corrupt_count     => no_corrupt);
      Dbms_Output.Put_Line('number of blocks corrupt: ' || TO_CHAR(no_corrupt));
    END;
    /


    After running the above statement, the corrupt block(s) have been detected, but they are not marked as corrupt yet.
    Next, the DBMS_REPAIR.FIX_CORRUPT_BLOCKS() procedure can be used to mark the blocks as corrupt, allowing them
    to be skipped when queries are run.

    SET SERVEROUTPUT ON
    DECLARE
      v_fixed INT;
    BEGIN
      v_fixed := 0;
      Dbms_Repair.Fix_Corrupt_Blocks (
        schema_name       => 'TESTUSER',
        object_name       => 'EMPLOYEE1',
        object_type       => Dbms_Repair.Table_Object,
        repair_table_name => 'REPAIR_TABLE',
        fix_count         => v_fixed);
      Dbms_Output.Put_Line('number of blocks fixed: ' || to_char(v_fixed));
    END;
    /

    At this phase, corrupt table blocks have been located, and all of them are "marked". Now, if you know that
    the table has indexes, you can check all index entries, searching for "orphaned keys", meaning those index entries
    that point to the lost rows on the bad table blocks.
    This can be done using the DBMS_REPAIR.DUMP_ORPHAN_KEYS() procedure.
    Now, suppose the EMPLOYEE1 table has an index called "INDEX_EMP_NAME"; you could then run a statement as follows:

    SET SERVEROUTPUT ON
    DECLARE
      v_number_orphans INT;
    BEGIN
      v_number_orphans := 0;
      Dbms_Repair.Dump_Orphan_Keys (
        schema_name       => 'TESTUSER',
        object_name       => 'INDEX_EMP_NAME',
        object_type       => Dbms_Repair.Index_Object,
        repair_table_name => 'REPAIR_TABLE',
        orphan_table_name => 'ORPHAN_KEY_TABLE',
        key_count         => v_number_orphans);
      Dbms_Output.Put_Line('orphan key count: ' || to_char(v_number_orphans));
    END;
    /

    Of course, since we have the admin tables "REPAIR_TABLE" and "ORPHAN_KEY_TABLE", it is very instructive
    to view their contents.
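
    For example, you could use queries as sketched below (the column names are the ones I believe Oracle generates
    for these tables; a quick DESCRIBE will show the exact definitions). After that, rebuilding an affected index
    is an obvious next step:

    SQL> SELECT OBJECT_NAME, BLOCK_ID, MARKED_CORRUPT, CORRUPT_DESCRIPTION FROM REPAIR_TABLE;

    SQL> SELECT * FROM ORPHAN_KEY_TABLE;

    SQL> ALTER INDEX TESTUSER.INDEX_EMP_NAME REBUILD;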

    Exercise:

    Now, here is an exercise: you might try the above described procedure yourself!

    Again, you should start with a "fresh" situation. Then you will create (again) a soft corruption in the 19th block
    of the EMPLOYEE1 table, exactly as is described in section 2.1.

    SQL> connect / as sysdba
    Connected.
    SQL> DROP USER TESTUSER CASCADE;

    User dropped.

    SQL> DROP TABLESPACE SALES_DATA_1 INCLUDING CONTENTS;

    Tablespace dropped.

    SQL> DROP TABLESPACE SALES_DATA_2 INCLUDING CONTENTS;

    Tablespace dropped.

    Next, delete the physical files from the system. For that, use explorer or use the cmd prompt:

    C:\oradata\sales> del sales_data_*

    Now, using the statements of chapter 1:
    - create those tablespaces (SALES_DATA_1 and SALES_DATA_2) again,
    - create the testuser again,
    - create the tables again (as user TESTUSER), and insert 9999 records into the EMPLOYEE1 again.
    - Also create an index on the EMPLOYEE1 table called "INDEX_EMP_NAME", using:
    CREATE INDEX INDEX_EMP_NAME ON EMPLOYEE1(EMP_NAME) TABLESPACE SALES_INDEX_1;

    Then, connect as sys (sysdba) and shutdown the database. Now, as described in section 2.1, start xvi32,
    open the "c:\oradata\sales\sales_data_01.dbf" file, navigate to the 19th block (just "before" hex 28000),
    and change one of the "last" harry's into HENRY. Then save the file and close xvi32.
    Next, startup the database.

    Now, if you want, execute the various DBMS_REPAIR subprocedures, as described above, yourself.


    3.3 The ORA-01578 and ORA-08103 errors.

    ORA-01578:

    Block corruption really will not occur "often". But when it does happen, you will see
    the well-known ORA-01578 (or ORA-1578) error message.
    In all the examples above, we indeed have seen this particular message.

    ORA-08103:

    Usually, the message is not so shocking, because it merely means this: ORA-08103: object no longer exists.

    In general, we can distinguish three cases:

    (1):
    An object was indeed missing. Such an event could happen, for example, if some process tries to access the object
    while another process drops or moves it.
    These do not have to be interactive processes. It could be batch processes, where one process wants to perform
    a drop and re-create (of a table, index etc..), while another process is not 'aware' of that fact.
    So, in this case, the message is just stating reality: the object does not exist (for some process, at some time),
    and is thus not alarming (except for the fact that the programming logic is not right).

    (2):
    Over several Oracle versions, a few bugs have raised this message too, effectively generating a "false" alarm.
    For example:
    - PQ may signal a false ORA-8103 message under a high load.
    - ORA-8103/ORA-1410 from concurrent INSERT / export on ASSM tables

    (3):
    Notwithstanding the above facts, we should always take this message very seriously.
    If you see it (for example, a user reports it, or you find it in a log), and you know for sure that the object exists,
    then you might have run into a "bug". Or, indeed, some form of logical corruption is bothering your system.

    If you see the message in production, and no obvious reason can be found (like some batch that indeed re-creates the object
    at the time that another process tries to access it), and you cannot find any filed bug report, then I think it needs immediate attention.

    I consider this message more serious than the ordinary "block corrupt" error message (which is ORA-01578).
    That is, if you cannot explain it.
    For production systems, I think you should immediately contact Oracle Support.

    As for other options:
    You could try to re-create the object. I have seen myself that a "SELECT *" sometimes still works, while other operations fail.


    3.4 Using the SET EVENTS parameter.

    Again, we return to the usual corrupt block message, which is ORA-01578.
    There is another "quick fix", which might be important in some cases.

    Oracle provides many diagnostic tools for debugging the RDBMS. One "traditional" method is that certain tracing options,
    or a certain type of "behaviour" of the Instance, can be turned on using the SET EVENTS statement.
    In many cases this results in "diagnostic information" (focused on a certain event) getting dumped to trace files.
    Other "SET EVENTS" commands will just change the behaviour of the RDBMS in a certain way.

    There are several ways to activate an "event", for example:

    1. Using the ALTER SESSION SET EVENTS 'eventsnumber' statement. This will affect your current session only.

    2. Using the ALTER SYSTEM SET EVENTS 'eventsnumber' statement. This is an "Instance" wide change.

    3. Editing the init.ora startup file (if you do not use a spfile.ora). This is an "Instance" wide change (see the example below).
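
    For the third option, the line in the init.ora would look like the example below (this is the traditional
    "event=" parameter syntax). And to switch the behaviour off again in your session, you can use the 'off' form:

    event = "10231 trace name context forever, level 10"

    SQL> ALTER SESSION SET EVENTS '10231 trace name context off';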


    Now, the "behaviour" we are searching for, is that Oracle will skip "corrupt" blocks, and that it
    allows you select the whole table again (except for the corrupt block).

    In a way, this method is not much different from what we have seen in section 3.1,
    using the DBMS_REPAIR.SKIP_CORRUPT_BLOCKS procedure.

    If it works, you again have a method to SELECT the whole table, while just skipping the corrupt blocks.
    This means that you can create a new table from the damaged one, just as was shown in section 3.1.

    Let's try it.

    First, let's start again with a "fresh" situation. That is, drop the testuser (and all of his objects),
    and then drop the tablespaces "SALES_DATA_1" and "SALES_DATA_2".
    Then create those tablespaces again, create the testuser again, create the tables again, and use the script
    that inserts 9999 records into the EMPLOYEE1 table.
    It's not much work, since we have all the necessary statements in Chapter 1.
    Then use XVI32 to soft corrupt the 19th block (change a "harry" into "HENRY"), exactly as described in section 2.1.

    If you have done all of the above, logon as "testuser", and try the following:

    SQL> SELECT * FROM EMPLOYEE1;

    (many records skipped...)

    ..
    2415 harry 2500
    2416 harry 2500
    2417 harry 2500

    ERROR:
    ORA-01578: ORACLE data block corrupted (file # 5, block # 19)
    ORA-01110: data file 5: 'C:\ORADATA\SALES\SALES_DATA_01.DBF'

    2820 rows selected.

    Of course, we face the dreaded ORA-01578 error again.
    Next, try the following ALTER SESSION SET EVENTS command:

    SQL> ALTER SESSION SET EVENTS '10231 TRACE NAME CONTEXT FOREVER, LEVEL 10';

    SQL> SELECT * FROM EMPLOYEE1;

    (many records skipped...)
    9997 harry 2500
    9998 harry 2500
    9999 harry 2500

    9596 rows selected.

    This time, no error is produced, and your query just continues and is not halted.
    You are now able to select the whole table again, allowing you to salvage all 'good' rows to a new table.

    Note: If Oracle is one of your main focus points, and you are not too familiar with the SET EVENTS statement,
    you really should spend one of these rainy days digging further into this very useful diagnostic tool.


    3.5 Querying "around" bad blocks.

    In this method, you just use SQL queries to obtain all good rows from the object, while only skipping
    the bad block(s).
    It's not guaranteed that it will work in all cases, but in many it will.
    The following excellent article can be viewed as a true "classic" on the subject, and it will explain to you
    all you need to know.

    http://web.inter.nl.net/users/T.Koppelaars/ora01578.rtf

    The technique described in the article relies on rowids and some other sophisticated insights.

    I have nothing to add to the content of the above article. Maybe I can spend just a few words on a special case.
    This special case is illustrated by the following example. Here we try to avoid accessing the table at all,
    and we want Oracle to ONLY use a "covering index", in a very simple context.

    You know that the EMPLOYEE1 table has the following DDL:

    CREATE TABLE EMPLOYEE1
    (
    EMP_ID NUMBER(6) NOT NULL,
    EMP_NAME VARCHAR2(20) NOT NULL,
    SALARY NUMBER(7,2)
    )
    TABLESPACE SALES_DATA_1;

    Now, suppose we have an index on that table like this one:

    CREATE INDEX INDEX_EMP ON EMPLOYEE1(EMP_NAME, SALARY) TABLESPACE SALES_INDEX_1;

    In a sense, you may view an index as a sort of "mini" table of the real table object.
    The chosen key(s) of the index correspond to the same columns of the table.
    I agree that this is a very opportunistic example, because the index here covers almost all table columns
    except for the "EMP_ID" column.
    But I only want to convey the basic idea here.

    Now, it's just a fact that if you only query a selection of the table columns which are in the index as well,
    the chances are pretty good that Oracle will only access the index, and not access the table at all.

    So, that means you can retrieve all of the table's data that is covered by the index.
    This includes the data (from the index) which corresponds to the data (from the table) that is on the corrupt block(s).
    Again, and it's very important to stress this fact, it is only the data (only those columns) included
    in the index.

    The theory is somewhat illustrated by the figure below:


    And you know? You can even ask Oracle "nicely" to use the index, by using a "Query Hint".
    But even without the "hint", if you only query columns which are all covered by the index, Oracle will
    almost certainly use that index without accessing the table at all.
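
    As a sketch (the hint and the table alias below are just illustrative), such a query could look like this:

    SQL> SELECT /*+ INDEX(e INDEX_EMP) */ EMP_NAME, SALARY
         FROM   EMPLOYEE1 e;

    Since EMP_NAME is declared NOT NULL, every row is present in the index, so Oracle can answer this query from the index alone.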

    I agree that the example above certainly has its flaws. But I hope the general idea is clear.

    Exercise: try to make the above scenario work, and see if you are indeed able to retrieve all EMP_NAME and SALARY
    values, using the index alone.


    3.6 Some remarks on the structure of a block.

    You probably know that "segments", like tables and indexes, are built from database "blocks" (or "pages").
    Usually, a DBA selects a suitable blocksize for the entire database, or for individual tablespaces.
    Typical blocksizes are 4K, 8K, 32K etc..

    Oracle's block structure is quite complex. In such a block, there are many specific "fields",
    each with a very specific purpose to the database engine.
    Below is a very schematic, and very simplified, representation of a Data Block:




    We can see a "Header" part, with a number of fixed fields, like the "DBA" (database block address), or the "blocktype".
    The blocktype identifies what sort of block we are dealing with.
    Ofcourse, most blocks in a database file are "data blocks" (type 06), for tables and indexes. But many "administrative" blocks
    exists as well, like the "segment header", which is a very special block for each table or index.

    Another interesting field is the RDBA, or the "relative database block address". This is a 4-byte address which most tools,
    or dump commands, will show in hexadecimal format. Fortunately, some tools or dump commands also use a more understandable
    notation, like (file_id, blocknumber). So, for example, (7,137) would mean block number 137 in file number 7.
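
    If you only have the hexadecimal rdba (as dbv printed it in Chapter 2), you can also let Oracle do the conversion.
    A small sketch, using the standard DBMS_UTILITY functions:

    SQL> SELECT dbms_utility.data_block_address_file(to_number('01400013','XXXXXXXX'))  AS file_no,
                dbms_utility.data_block_address_block(to_number('01400013','XXXXXXXX')) AS block_no
         FROM   dual;

    This should return file 5, block 19, which matches the "0x01400013 (file 5, block 19)" line in the dbv output of Chapter 2.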

    Next to the header, a "transactional layer" is present. Since multiple processes can access a block, it is for example
    necessary to store transaction id's. Next, the "Table Directory" contains information about the table object
    which has rows stored in this particular block. Next, the "Row Directory" contains information about the actual rows in the block,
    including addresses for each row piece. Following that is a certain amount of "free space" which can be used for new rows.
    Finally, we have space for the actual data, that is, the rows of the table or index object.

    Since we have "xvi32" as our Hex editor, let's take a look at how Block 20 in "c:\oradata\sales\sales_data_01.dbf"
    looks like. Later we will see more tools and options to view contents of blocks, but let's stick to xvi32 for now.

    Make sure the instance is closed. Next, start xvi32 and open "c:\oradata\sales\sales_data_01.dbf".
    Now, navigate to hex address 28000. Here we will see the "start" of block 20.



    Do you notice that the block starts with "06", meaning that this is just a data block? Next, the format appears to be "A2",
    then we see the "00 00" of the unused "spare" fields. After that, the block's DBA follows, etc... etc...
    So, in my database, all data blocks start with "06 A2 00 00".

    This way, you could walk through the header, or even the entire block. This is not what we will do right now, since we first need
    to see what other tools exist to view the contents of database blocks.


    3.7 Using tools and commands to dump and view blocks.

    3.7.1 A few words on the "ALTER SYSTEM DUMP DATAFILE filenumber BLOCK blocknumber" statement:

    From a sqlplus session, you can use the following "ALTER SYSTEM.." statement to dump the contents of a block to a trace file.
    This trace file will be created in the USER_DUMP_DEST directory.

    ALTER SYSTEM DUMP DATAFILE filenumber BLOCK blocknumber;

    or:

    ALTER SYSTEM DUMP DATAFILE filenumber BLOCK MIN first_block_number BLOCK MAX last_block_number;

    So, using a connection to our SALES database, we could for example try this command:

    SQL> ALTER SYSTEM DUMP DATAFILE 5 BLOCK 17;

    The choice of numbers shown above is due to the fact that we know that the TESTUSER.EMPLOYEE1 table lives in the
    first few dozens of blocks in the "c:\oradata\sales\sales_data_01.dbf" file, which in my case is datafile number "5".
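
    You can also dump a small range of blocks in one statement, which is handy when a corruption spans a block
    boundary (as in section 2.4). For example:

    SQL> ALTER SYSTEM DUMP DATAFILE 5 BLOCK MIN 17 BLOCK MAX 25;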

    Note 1: how to look up the file id's?

    You probably know how to look up the file numbers of the datafiles. Anyway, I will list one method here:
    In my SALES database, we can see this:

    SQL> set linesize 1000
    SQL> set pagesize 1000

    SQL> SELECT FILE#, NAME FROM v$datafile;

    FILE# NAME
    1 C:\ORADATA\TEST10G\SYSTEM01.DBF
    2 C:\ORADATA\TEST10G\UNDOTBS01.DBF
    3 C:\ORADATA\TEST10G\SYSAUX01.DBF
    4 C:\ORADATA\TEST10G\USERS01.DBF
    5 C:\ORADATA\SALES\SALES_DATA_01.DBF
    6 C:\ORADATA\SALES\SALES_DATA_02.DBF
    7 C:\ORADATA\SALES\SALES_INDEX_01.DBF

    By the way, you may also use the full path of the datafile in the "dump" command, like for example:

    SQL> ALTER SYSTEM DUMP DATAFILE 'C:\ORADATA\SALES\SALES_DATA_01.DBF' BLOCK 17;

    Note 2: Where are the trace files stored?

    "Traditionally" (in Oracle up to 10g included), the init.ora/spfile.ora parameter "USER_DUMP_DEST",
    specifies some location on our filesystem where Oracle will "dump" tracefiles from userprocesses.
    You can find that location using:

    select value from v$parameter where name = 'user_dump_dest';

    This will show you the "USER_DUMP_DEST" directory. For example, on my 10g installation on Windows, it's:

    C:\oracle\product\10.2\admin\sales\udump

    Sometimes there can be a lot of ".trc" files in that place. Of course, the timestamp of the files will tell
    you which is your trace, but you can also use the query below to find the "process id" (spid) of your session.
    Then, just look for the trace file whose name carries that number, and your dump should be easy to identify.

    select spid from v$session s, v$process p
    where p.addr = s.paddr and s.audsid = sys_context('userenv','sessionid');


    Let's return to dumping a block to a trace file. So, what can we expect to see?
    Above, we already dumped block 17 of database file 5 to a trace file in the "USER_DUMP_DEST" directory.
    Such files are just plain ascii text files which you can view using your favorite text editor.
    Here is some partial output of my trace:

    73E6600 0000A206 01400011 000F6019 06010000 [......@..`......]
    73E6610 0000DF10 00000001 0000BBAE 000F5FFC [............._..]
    73E6620 00000000 01320002 01400009 00190003 [......2...@.....]
    73E6630 000001F1 008001DE 0067006A 00002193 [........j.g..!..]
    73E6640 000F6019 00000000 00000000 00000000 [.`..............]

    .. many rows skipped..

    73E6CD0 0303012C 052E35C2 72726168 1AC20279 [,....5..harry...]
    73E6CE0 0303012C 052D35C2 72726168 1AC20279 [,....5-.harry...]
    73E6CF0 0303012C 052C35C2 72726168 1AC20279 [,....5,.harry...]
    73E6D00 0303012C 052B35C2 72726168 1AC20279 [,....5+.harry...]
    73E6D10 0303012C 052A35C2 72726168 1AC20279 [,....5*.harry...]
    .. etc

    You are probably not very impressed by what you can "see" in this output.
    There is of course much more to say on how best to interpret those trace files.

    Now, just to make you curious:

    -- Other variants of the "ALTER SYSTEM DUMP" statement that might be of interest:

    By the way: The "ALTER SYSTEM DUMP.." statement has more variants.
    For example, you can dump blocks from redologs as well. For example:

    ALTER SYSTEM DUMP LOGFILE 'C:\ORADATA\SALES\redo05.dbf' DBA MIN 5 17221 DBA MAX 5 17250;
    ALTER SYSTEM DUMP LOGFILE 'C:\ORADATA\SALES\redo05.dbf' TIME MIN 528314633;

    (DBA: database block address like we have used before, for example "datafile 5" and "blocknumber 17221")

    -- Using the SET EVENTS statement:

    Again, there is much to learn from the "SET EVENTS" statement. For example take a look at this:

    -- Dump control file contents:
    alter session set events 'immediate trace name CONTROLF level 10'

    -- Dump file headers:
    alter session set events 'immediate trace name FILE_HDRS level 10'

    -- Dump redo log headers:
    alter session set events 'immediate trace name REDOHDR level 10'

    I hope that you practice this note on a test system. If so, try the above SET EVENTS commands and take a look at the trace files.


    3.7.2 A few words on the Oracle supplied "BBED" utility.

    The BBED (Block Browser and EDitor) utility is a tool from Oracle itself.
    However, it looks as if it is not present on all Oracle versions (and platforms), which is somewhat "strange".

    It's a very powerful tool, and just as its name implies, you can view and modify database blocks.

    For the Windows platform, it's an executable, ready for use. On Unix, you first need to link it before you can use it.
    On Windows, I found it in Oracle 8i EE Release 3 (v 8.1.7) and older versions, but, for example, I could not find it in 10g or 11g.
    So, let's try that "old" one. But the utility is password protected, so let's find that password first.



    Using xvi32, opening "bbed.exe", and searching for the string "password", the password is quickly found.
    It's "blockedit". By the way, the password is "all over the Internet" as well.
    So let's start it:

    C:\ora8i\bin>BBED
    Password:

    BBED: Release 2.0.0.0.0 - Limited Production on Thu Oct 28 12:03:36 2010

    (c) Copyright 2000 Oracle Corporation. All rights reserved.

    ************* !!! For Oracle Internal Use only !!! ***************

    BBED>

    The banner that is shown is somewhat intimidating, warning us that usage is for "Oracle Internal Use only".
    This is not to be taken lightly. Of course, on test systems you can do what you like,
    but in production, never use it without Oracle Support guidance.

    There exists an excellent article describing the use of BBED, including some amazing examples.
    You can find it at this url:

    http://orafaq.com/papers/dissassembling_the_data_block.pdf

    If you have browsed through that document, you probably agree that I don't need to spend any more words on bbed.


    For now, I conclude this simple note. I hope you have liked it!


    What about regular Backup and Recovery techniques?

    Having an optimal backup/recovery policy in place is the best defence against any kind of corruption,
    disk errors, and all sorts of other faults (like human error etc..).

    Originally, I had planned a Chapter 4, dealing with Backup and Recovery techniques.
    Now, I think it's better to create a separate document on that subject, and thus not include it in a note about "block corruption".