DisCopy


Saturday 16 July 2011

Adaptive Server Auto Recovery Scenarios...


Online Recovery and Recovery Fault Isolation
Adaptive Server features Recovery Fault Isolation (RFI) to enhance the granularity of recovery. This discussion surveys various recovery options and focuses on the difference between traditional online recovery and RFI. We use examples to show how RFI can be used to enhance the recovery scenarios and how it can help to avoid extensive downtime due to recovery problems.
Recovery can mean several things:
·         Online recovery is the process by which the database is brought into a consistent state after the server is restarted.
·         Recovery from backups means restoring a corrupt database by loading a database dump, then applying transaction log dumps to the database to bring it back to a consistent state.
·         Finally, although Adaptive Server does not offer this functionality, recovery can also refer to recovering specific objects from a dump or other offline storage and restoring only that object rather than the database as a whole. While this technique is outside the scope of the current discussion, it may be useful in some of the recovery processes outlined below.
Online Recovery Concepts
Online recovery brings the database to a consistent state after you restart the server.
During routine Adaptive Server operation, all changes to the database are written first to the log and then to the data pages on disk. Log pages are written to disk when the transaction completes, that is, when it commits. However, because changed pages are also written to disk whenever a checkpoint occurs, or for other reasons prior to commit, changes can reach the log or data pages as part of an as-yet-incomplete transaction. If the server fails after an uncommitted transaction is written to the log but before the transaction completes, online recovery reads the log and ensures that no uncommitted changes remain reflected in the database. Likewise, online recovery ensures that changes recorded in the log for committed transactions that have not yet been flushed to disk are applied to the data pages and written to disk.
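For illustration, a checkpoint can also be requested manually; the command below (the database name is hypothetical) flushes all changed pages of the current database to disk:
1> use your_database
2> go
1> checkpoint
2> go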
Prior to RFI, online recovery was an all-or-nothing proposition. If recovery failed due to some corruption, there was no way to partially recover the database and leave the corrupt portion offline. The preferable option was to restore the database from backups. However, if backups were not available or time constraints made it difficult to go through the extensive procedures necessary to restore through backups, customers often used an undocumented and risky procedure, referred to as "suiciding the log", to skip recovery and get the database back on line.

Recovery(Restore) from Backups
The traditional recovery option, when online recovery fails, is to restore the database from dumps, and incrementally apply transaction logs to bring the restored database back to the most current possible state. This is the best solution for restoring to an absolutely consistent state after corruption. It often brings the database to a state of consistency to within seconds of the point of original failure.
However, the drawback with this traditional approach is that the recovery granularity is at the level of the transaction dump. If a transaction causing corruption is dumped, the traditional method means loading a database dump and applying transaction log dumps up to, but not including, the transaction dump containing the offending transaction. This can result in hours of lost transactions.
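For reference, the backups involved are ordinary database and transaction dumps; a minimal sketch, with hypothetical file names:
1> dump database your_database to "/backups/your_database.dmp"
2> go
1> dump transaction your_database to "/backups/your_database_tran1.dmp"
2> go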

Suiciding the Log
While suiciding the log can usually bring the server back online, it also frequently results in both physical and logical corruption in the database, because it bypasses the normal process of rolling back incomplete transactions in the log (and rolling forward completed transactions) that takes place during online recovery. Frequently, the resulting corruption is not encountered until a later time, and the connection with the earlier log suicide is not always recognized.

Note: Log suicide is highly risky, and is not recommended except in extreme cases.

Recovery With RFI
ASE implements Recovery Fault Isolation (RFI), an online recovery feature that provides finer granularity in recovery by allowing partial recovery. RFI can isolate corruption encountered during recovery to the corrupt pages. This enables you to restore database integrity by isolating and repairing corruption on a page-by-page (and, consequently, object-by-object) basis without having to restore the entire database and transaction logs from backups.
While RFI can only define granularity at the page or database level, corruption is typically corrected at the object level with, for example, an entire index being recreated or an entire table being replaced.

Using RFI
RFI allows partial recovery only on user objects. If it encounters corruption on system tables, recovery fails for the entire database.
RFI allows the database administrator (DBA) to select the granularity of recovery for each user database. The choices are:
·         Mark the whole database suspect on any recovery failure. This is the default behavior and it is how recovery worked in previous versions.
·         Set the number of pages that can be offlined during recovery and still allow the database to be opened. The DBA can decide whether the partially recovered database is to be open for updates or for reads only.
Finally, the DBA can set the database to be marked suspect on any recovery failure, then change the setting to recover all but the corrupt pages. In this mode users cannot access the database, enabling the DBA to determine the appropriate course of action and proceed accordingly.
There is a significant difference between RFI's page-level and database-level granularity. Database granularity mandates that every transaction in the database either be completed and rolled forward, or be rolled back with all of its changes backed out. In either case the database is in a logically consistent state at the end of recovery. In short, recovery is all or nothing. Any interruption in recovery that makes this impossible causes recovery to fail entirely, and the only sure way to guarantee a consistent state is to restore from backups. This can be problematic, depending on how many backups are available, their validity, and how recent they are.
Page-level granularity, on the other hand, allows the server to offline corrupt pages in a transaction while onlining other pages. Since recovery has not been able to complete and verify the transactions that touch the offline pages, those transactions are left only partially available, while all other transactions are completely recovered as usual. There is no way to determine whether transactions that involve offlined pages are complete except by manual examination.
If, for example, a transaction changes rows on three pages and the changes on two pages are written to disk before the server stops, recovery would normally assure that the third page also was written to disk. If, however, recovery marks as suspect the page to which the third update is to be made, there is no way to determine whether the transaction is complete or incomplete; that is, whether all three pages are updated or only the first two. A transaction in this state is deemed to be partially available, as the changes to the first two pages are available while the change to the third page is unavailable, and it is not known whether it was changed.
At another level, consider a case where a page from a specific table is marked offline and subsequent work depends on that page only implicitly, that is, it is assumed that business rules hold without explicitly coded referential checks. If the code explicitly checked for the offline data, an error would be raised; but if it does not, and the work proceeds with only an implicit dependence on the offline pages (which cannot be restored to a consistent state), logical inconsistencies can result in the database. This is yet another reason we recommend that all dependencies between data be explicitly coded via declared referential constraints, triggers, or existence checks.
It is important to understand that while it is possible to bring corrupt pages online, doing so without first repairing the pages will result in logical and data inconsistency. When restoring a database by repairing offline pages (or by restoring objects to which the offlined pages belonged), therefore, the DBA must explicitly determine the degree to which logical consistency of the database may be suspect according to business rules and coding practices. Of course, restoring the database from a database backup and incremental transaction backups assures both the logical and physical integrity of the database through the last successful load of a transaction dump.
It is also important to run dbcc tablealloc or dbcc indexalloc with the fix option on any objects with suspect pages because the allocation information for these objects is also suspect. 
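A minimal sketch of those checks, assuming a hypothetical user table named orders whose suspect pages belong to the table itself and to an index with index ID 2:
1> use your_database
2> go
1> dbcc tablealloc(orders, full, fix)
2> go
1> dbcc indexalloc(orders, 2, full, fix)
2> go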


What To Do When Online Recovery Fails
The options for recovering from a failure in online recovery, in order from most desirable to least, are:
·         Restoring from Backups
·         Partial online recovery using RFI
·         Suiciding the Log.

Restoring from Backups
Prior to RFI, this was the only option if recovery failed, the database could not be repaired, and suicide of the log was not desirable. It is still the preferred option for recovering the database after failure during online recovery if a) the entire database is marked suspect due to thresholds being exceeded, or b) system table(s) are corrupt. It is also the preferred method whenever the absolute need for physical and logical consistency overrides all other concerns. It is highly recommended that you run dbcc checks prior to and following a dump, to ensure that the backup is valid.
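A minimal sketch of the load sequence, with hypothetical dump file names; transaction dumps are applied in the order they were taken, and the database is brought online only after the last load:
1> load database your_database from "/backups/your_database.dmp"
2> go
1> load transaction your_database from "/backups/your_database_tran1.dmp"
2> go
1> load transaction your_database from "/backups/your_database_tran2.dmp"
2> go
1> online database your_database
2> go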

Partial online recovery using RFI
Implementing RFI gives the DBA many more choices in the event of failure during online recovery. Before opting for log suicide, consider these advantages of RFI over log suicide:
1.      Isolated pages are known and can be examined. You can thus make an informed decision on whether to repair the faults or restore from backups.
o    If the isolated pages belong to an index, the corruption can often be fixed by dropping and recreating the index (see the sketch after this list).
o    If the isolated pages are data pages, the data can sometimes be recovered via other means. You can also leave the pages safely offline; transactions that explicitly depend on their presence will fail until they are made available.
o    Pages referenced in recovery that are marked suspect, but are subsequently deallocated further along in the recovery process, are assumed to have been properly written for the earlier transaction and are taken off the suspect list, thus making the corruption for that page "self-healing".
2.      You can set a threshold to determine how many page faults are acceptable before the whole database is left unrecovered.
3.      You can make the database available to users while conducting repairs. The database can be configured to allow updates or to allow read-only access.
4.      Faults on system table pages still cause recovery to fail for the entire database, so corruption of the system catalogs is never silently ignored.
5.      You can implement a limited form of suicide recovery by disregarding all or some of the suspect pages and onlining them even if they are corrupt. The suicide is limited in the sense that only transactions associated with those pages are suspect. Recovery rolls forward (or back) other transactions in the log properly.
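As a sketch of the index case in the first point above, assume the suspect pages belong to a hypothetical nonclustered index orders_cust_idx on a table orders, and that the database has been opened for updates:
1> sp_listsuspect_page your_database
2> go
1> use your_database
2> go
1> drop index orders.orders_cust_idx
2> go
1> create nonclustered index orders_cust_idx on orders(cust_id)
2> go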
Implementing Recovery with RFI
The default granularity of recovery is at the database level. Take the following steps to implement page level granularity:
1.      Check or implement page granularity on desired databases using the sp_setsuspect_granularity stored procedure:
sp_setsuspect_granularity [dbname [, {"database" | "page"} [, "read_only"]]]
If you set the granularity to page level, you have the option to set the database to read_only mode when recovery detects suspect pages. By default, all available pages are accessible for both reads and writes.
Wherever possible, use the read_only mode. If a query attempts to access an offline page, the server raises error messages 12716 and 12717 regardless of whether the database is read_only. For more information on these errors see the chapter titled "Error Message Writeups."
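For example, to request page-level granularity with read-only access on a hypothetical database:
1> sp_setsuspect_granularity your_database, "page", "read_only"
2> go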
2.      Set the threshold for escalating page level granularity to database granularity using the sp_setsuspect_threshold stored procedure:
sp_setsuspect_threshold [dbname [, threshold]]
Once the number of offlined pages reaches this threshold value, recovery marks the entire database suspect. The default threshold value is 20 pages. Setting it much higher is unlikely to be useful, since 20 corrupt pages very likely indicates corruption at a level that cannot be effectively repaired.
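For example, to escalate to database-level granularity once five pages have been marked suspect (database name hypothetical):
1> sp_setsuspect_threshold your_database, 5
2> go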
3.      Bring the suspect pages or database online. You can print a list of the pages or databases that are suspect after recovery using the sp_listsuspect_db and sp_listsuspect_page stored procedures:
sp_listsuspect_db
sp_listsuspect_page [dbname]
You can bring these pages or the database online using the sp_forceonline_db or sp_forceonline_page stored procedures:
sp_forceonline_db dbname
  {"sa_on" | "sa_off" | "all_users"}

sp_forceonline_page dbname, pagenumber
  {"sa_on" | "sa_off" | "all_users"}
The sa_on and sa_off options toggle the database or page online and offline while allowing access only to users with the sa_role enabled. This permits the DBA to examine and repair the suspect database or pages without other users being able to access them.

Note: The all_users option is irreversible; it makes the database or page available to all users. If no repairs have been made, this may result in some level of logical inconsistency.
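Putting these together, a minimal sketch (database name and page number hypothetical) that lists the suspect pages, gives sa_role users access to one page for repair, and then releases it to all users:
1> sp_listsuspect_page your_database
2> go
1> sp_forceonline_page your_database, 387, "sa_on"
2> go
/* examine and repair the page or the object it belongs to, then... */
1> sp_forceonline_page your_database, 387, "all_users"
2> go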

Suiciding the Log

RFI eliminates most of the need for suiciding the log. The two most common reasons for suiciding the log in the past were:
1.      No backups are available or the backups are too old.
2.      Insufficient time to restore.
There should never, of course, be a situation where backups are unavailable or too old. Unfortunately, that is all too often the case, either because the dumps are bad or because of poor planning. In such situations, suiciding the log may be the only recourse. Aside from those situations, however, you should never consider log suicide a viable option.
DBAs should test all backup and restore procedures before relying on them. If you attempt to load a dump onto the original database and it completes only partially, you will have eliminated the possibility of using that database again, and may even have eliminated the final chance to recover data by suiciding the log. The classic log suicide procedure looks roughly like the following transcript: it bypasses recovery, rebuilds an empty log, and then restores the database's original status.
1> use master
2> go
1> select count(*) from your_database..syslogs
2> go

-----------
some number

1> sp_configure "allow updates",1
2> go

1> begin tran
2> go

/* Note the status value returned by the next select; it must be restored after the log rebuild. */
1> select saved_status = status from sysdatabases where name = "your_database"
2> go
/* -32768 sets the "bypass recovery" status bit for this database. */
1> update sysdatabases set status = -32768 where name = "your_database"
2> go
1> commit tran
2> go
1> shutdown
2> go
/* Restart the server; with the status bit set it will bypass recovery for this database. Then: */
1> dbcc rebuild_log (your_database, 1, 1)
2> go
1> use your_database
2> go
1> select count(*) from syslogs
2> go

-----------
1

1> begin tran
2> go
/* Substitute the numeric status value noted from the earlier select for saved_status. */
1> update sysdatabases set status = saved_status where name = "your_database"
2> go
(1 row affected)
1> commit tran
2> go
/* Turn direct updates to system tables back off. */
1> sp_configure "allow updates", 0
2> go
1> shutdown
2> go
