Anotomy of an RMAN TSPITR restore which I didn’t do

Wow, that’s a handful — “restore which I didn’t do“, alright then why mention it? Well, here’s why. Imagine you get an email with a title “PROJECT-NAME – QA Schema Rollback?” You quickly read through it and realize that a developer messed up a very important heavily used schema on a QA system for a project that’s about to go live. A quick analysis of the situation reveals that a delete was issued on three tables setup with “cascade constraints” clause and it wiped out data from 11 detail tables. Why would you do that is another question, which I took up with the development team but I digress …

What do you do?

First thing that came to my mind was “flashback“! Sounds good, but what do you flashback? A schema? No such thing. A table? Maybe, but what about referential integrity in 11 detail tables? That’s OK it can be solved like this — I flashback the master tables first and then the 11 detail tables, it might work … Lets check the pre-reqs — undo_management=AUTO (check), undo_retention = 900 (not good — it’s been over 15 minutes). Alright, forget about flashing back anything (mental note to myself — increase undo_retention to few hours).

What’s next?

The next thing is point in time recovery. Quick check shows that the schema is fully contained in a single tablespace — very good, we can do a tablespace level recovery instead of the whole database. And since we are on 10gR2 how about fully automated RMAN TSPITR? Sounds good, lets see what we need for this to work:

  • The target instance, containing the tablespace to be recovered
  • The Recovery Manager client
  • The control file and (optional) recovery catalog, used for the RMAN repository records of backup activity
  • Archived redo logs and backup sets from the target database, which are the source of the reconstructed tablespace.
  • The auxiliary instance, an Oracle database instance used in the recovery process to perform the actual work of recovery.
  • The target time, the point in time or SCN that the tablespace will be left at after TSPITR
  • The recovery set, which consists of the datafiles containing the tablespaces to be recovered;
  • The auxiliary set, which includes datafiles required for TSPITR of the recovery set which are not themselves part of the recovery set. The auxiliary set typically includes:
  • The auxiliary destination, an optional location on disk which can be used to store any of the auxiliary set datafiles, control files and online logs of the auxiliary instance during TSPITR. Files stored here can be deleted after TSPITR is complete.

A handful list, but doable. Alright, suppose I have all this (I did) and suppose I want to go through it, how does it work? Here’s how:

To perform TSPITR of the recovery set using RMAN and an automated auxiliary instance, you carry out the preparations for TSPITR described in “Planning and Preparing for TSPITR”, and then issue the RECOVER TABLESPACE command, specifying, at a minimum, the tablespaces of the recovery set and the target time for the point-in-time recovery, and, if desired, an auxiliary destination as well. RMAN then carries out the following steps:

  1. If there is no connection to an auxiliary instance, RMAN creates the auxiliary instance, starts it up and connects to it.
  2. Takes the tablespaces to be recovered offline in the target database
  3. Restores a backup control file from a point in time before the target time to the auxiliary instance
  4. Restores the datafiles from the recovery set and the auxiliary set to the auxiliary instance. Files are restored either in locations you specify for each file, or the original location of the file (for recovery set files) or in the auxiliary destination (for auxiliary set files, if you used the AUXILIARY DESTINATION argument of RECOVER TABLESPACE)
  5. Recovers the restored datafiles in the auxiliary instance to the specified time
  6. Opens the auxiliary database with the RESETLOGS option
  7. Exports the dictionary metadata about objects in the recovered tablespaces to the target database
  8. Shuts down the auxiliary instance
  9. Issues SWITCH commands on the target instance, so that the target database control file now points to the datafiles in the recovery set that were just recovered at the auxiliary instance.
  10. Imports the dictionary metadata from the auxiliary instance to the target instance, allowing the recovered objects to be accessed.
  11. Deletes all auxiliary set files.

At that point the TSPITR process is complete. The recovery set datafiles are returned to their contents at the specified point in time, and belong to the target database.

The first thing that should jump at you here is item# 2 — “Takes the tablespaces to be recovered offline in the target database” because that’s when the real downtime will begin. Why? Because realize that while some of the schema’s data might have been lost, at least in this particular case it’s not caused a compete outage yet, we still have a website connected to it and functioning. But as soon as I take the tablespace offline it will be a true outage and I will be under the gun to deliver it ASAP. ASAP == interruptions and stress, which as you know inevitably leads to mistakes. And mistakes made while performing RMAN’s TSPITR could cause quite serious issues, much bigger then you’ve started with.

Here’s one scenario where it’s very easy to make a mistake. Lest say the developer tells you that the schema was corrupted at t1. You go through with RMAN TSPITR and realize that he was wrong and that you need to redo RMAN TSPITR using t1-15 minutes. Can you do that? ………. The answer is Not unless you are using recovery catalog, here’s why:

Assume that you run TSPITR on a tablespace, and then bring the tablespace online at time t. Backups of the tablespace created before time t are no longer usable for recovery with a current control file. You cannot run TSPITR again on this tablespace to recover it to any time less than or equal to time t, nor can you use the current control file to recover the database to any time less than or equal to t. Therefore, you must back up the recovered tablespace as soon as TSPITR is complete.

It is extremely important that you choose the right target time or SCN for your TSPITR. As noted already, once you bring a tablespace online after TSPITR, you cannot use any backup from a time earlier than the moment you brought the tablespace online. In practice, this means that you cannot make a second attempt at TSPITR if you choose the wrong target time the first time, unless you are using a recovery catalog. (If you have a recovery catalog, however, you can perform repeated TSPITRs to different target times.)

For example, assume that you are not using a recovery catalog, and you run TSPITR on a tablespace, and then bring the tablespace online at 5PM on Friday. Backups of the tablespace created before 5PM Friday are no longer usable for recovery with a current control file. You cannot run TSPITR again on this tablespace with a target time earlier than 5PM Friday, nor can you use the current control file to recover the database to any time earlier than 5PM Friday. Your only option will be point-in-time recovery of your entire database using a restored control file.

Now that’s a bit of a problem isn’t it? What this means is that you might actually have just ONE SHOOT at getting it right. You miss it and you might be done, at least with using RMAN TSPITR, you can still recover full database using backup control file from before time t.

Considering all of these nuances it quickly became apparent to me that perhaps RMAN TSPITR is not the right solution in this case. It sure sounded great on paper. What’s not to like here, just one command like this:

RECOVER TABLESPACE users, tools 
     UNTIL LOGSEQ 1300 THREAD 1
      AUXILIARY DESTINATION '/disk1/auxdest';

Solves your issue! But it’s not as simple, it’s not just one command, there’s a lot more to this and even if you are willing to take the risk there are other limitations you need to be aware of. If any of the following exists in the tablespace to be recovered, then RMAN TSPITR will not work:

  • tables with VARRAY
  • nested tables
  • external tables
  • Snapshot logs and snapshot tables
  • undo or rollback segments
  • objects owned by SYS

You’d know if you had any Snapshot logs or tables because DBA’s typically set those up, the same thing applies to RBS/UNDO/SYS objects. Checking the rest would be easy:

SQL> col data_type format a30
SQL> r
  1  select count(*),DATA_TYPE from dba_tab_columns where owner='SCHEMA-IN-TABLESPACE'
  2* group by DATA_TYPE

  COUNT(*) DATA_TYPE
---------- ------------------------------
         2 LONG RAW
        72 NUMBER
         1 CLOB
        20 DATE
       200 VARCHAR2
         2 BLOB

6 rows selected.

SQL> select count(*) from dba_nested_tables where owner='SCHEMA-IN-TABLESPACE'
  2  ;

  COUNT(*)
----------
         0

SQL> select count(*) from dba_external_tables where owner='SCHEMA-IN-TABLESPACE'
  2  ;

  COUNT(*)
----------
         0

SQL>

You also have to consider that all the while you are thinking and researching, the database is still being accessed, and it’s entirely possible that changes are being made to the schema in question. If you restore it to the time in the past all of these changes will lost, so you’d have to actually take a quick backup of the schema as it is now before you perform RMAN TSPITR just in case you’ll need anything from it …

Also realize that all of this information you have to either know my heart or sift through very quickly because a decision needs to be made as to how to solve this issue. And unless you have recently practiced RMAN TSPITR on this specific tablespace in this specific database a conservative DBA, which all DBAs should be, would never make a decision to use RMAN TSPITR. I didn’t — I used RMAN backup set to recover the database on another host, exported the schema in question and imported it as another schema into the original database so that developers could reconstruct lost rows from it. My solution, while not the fastest was the sure thing and that’s what our job is all about.

October 7, 2009

Posted In: Operations, RMAN

Tags: ,