Making sense of ASH TIME_WAITED Units

It’s widely believed that ASH / v$active_session_history.TIME_WAITED is presented in microseconds. It sure seems like it, but something is amiss as my small test shows. Using two sessions I created a test table in one, inserted a row and then from another session tried to select the same row for update:

| session A > SQL> create table ash_time_waited_test (t number);
| session A > 
| session A > Table created.
| session A > 
| session A > SQL> insert into ash_time_waited_test values(1);
| session A > 
| session A > 1 row created.
| session A > 
| session A > SQL> commit;
| session A > 
| session A > Commit complete.
| session A > 
| session A > SQL> update ash_time_waited_test set t=2 where t=1;
| session A > 
| session A > 1 row updated.

| session B > SQL> !date
| session B > Thu Jul 30 18:20:15 PDT 2009
| session B > 
| session B > SQL> select * from ash_time_waited_test for update;

| session A > 
| session A > SQL> commit;
| session A > 
| session A > Commit complete.
| session A > 
| session A > SQL>
| session A > 

| session B > 
| session B >          T
| session B > ----------
| session B >          2
| session B > 
| session B > SQL> !date
| session B > Thu Jul 30 18:20:38 PDT 2009
| session B > 

What I found next was rather strange — the sum(TIME_WAITED) from ASH (v$active_session_history) did not add up to the total time session A kept the lock on the test table:

First we get the SQL_ID for session B (the waiter) so we can tie into ASH.SQL_ID later:

SQL> select sql_id from v$sql where sql_text = 'select * from ash_time_waited_test for update';

SQL_ID
-------------
fpm89fyrsf6gv

Now having the SQL_ID in hand we can dig into ASH to see what waits our session experienced:

SQL> col SAMPLE_TIME format a25
SQL> col event format a30
SQL> select sample_time,EVENT,WAIT_TIME,TIME_WAITED from v$active_session_history
  2  where sql_id = 'fpm89fyrsf6gv'
  3  order by sample_time;

SAMPLE_TIME               EVENT                           WAIT_TIME TIME_WAITED
------------------------- ------------------------------ ---------- -----------
30-JUL-09 06.20.25.751 PM enq: TX - row lock contention           0      489233
30-JUL-09 06.20.26.761 PM enq: TX - row lock contention           0      489244
30-JUL-09 06.20.27.761 PM enq: TX - row lock contention           0      489242
30-JUL-09 06.20.28.761 PM enq: TX - row lock contention           0      489238
30-JUL-09 06.20.29.761 PM enq: TX - row lock contention           0      489244
30-JUL-09 06.20.30.761 PM enq: TX - row lock contention           0      489242
30-JUL-09 06.20.31.761 PM enq: TX - row lock contention           0      489244
30-JUL-09 06.20.32.761 PM enq: TX - row lock contention           0      489253
30-JUL-09 06.20.33.761 PM enq: TX - row lock contention           0      489233
30-JUL-09 06.20.34.761 PM enq: TX - row lock contention           0      489245
30-JUL-09 06.20.35.761 PM enq: TX - row lock contention           0      489245

11 rows selected.

SQL>

Based on the output above we know that the first sample that trapped this wait occurred at “06.20.25.751 PM” and the last one at “06.20.35.761 PM”. The difference (delta) between these two timestamps is just over 10 seconds:

SQL> select e.SAMPLE_TIME - s.SAMPLE_TIME
  2  from v$active_session_history s,
  3       v$active_session_history e
  4  where s.sample_id = (select min(sample_id) from v$active_session_history where sql_id = 'fpm89fyrsf6gv')
  5    and e.sample_id = (select max(sample_id) from v$active_session_history where sql_id = 'fpm89fyrsf6gv')
  6  ;

E.SAMPLE_TIME-S.SAMPLE_TIME
---------------------------------------------------------------------------
+000000000 00:00:10.010

SQL>

So far so good. Now lets see what we get by summing up the TIME_WAITED:

SQL> select sum(TIME_WAITED) from v$active_session_history
  2  where sql_id = 'fpm89fyrsf6gv'
  3  ;

SUM(TIME_WAITED)
----------------
         5381663

Interesting? If the ASH.TIME_WAITED is measured in microseconds then the total wait time = 5381663/1000000 = 5.381663 seconds and as we know it’s not correct. In fact, we can prove this using another metric, the ELAPSED_TIME_* from DBA_HIST_SQLSTAT:

SQL> select ELAPSED_TIME_TOTAL, ELAPSED_TIME_DELTA,
  2  APWAIT_TOTAL, APWAIT_DELTA,
  3  CCWAIT_TOTAL, CCWAIT_DELTA
  4  from DBA_HIST_SQLSTAT
  5  where sql_id = 'fpm89fyrsf6gv'
  6  ;

ELAPSED_TIME_TOTAL ELAPSED_TIME_DELTA APWAIT_TOTAL APWAIT_DELTA CCWAIT_TOTAL CCWAIT_DELTA
------------------ ------------------ ------------ ------------ ------------ ------------
          11110172           11110172     11106973     11106973          254          254

SQL>

DBA_HIST_SQLSTAT is dead on — with Elapsed Time Total = 11110172/1000000 = 11.110172 seconds … So what’s happening with ASH then? Why such a large > 2 times discrepancy?

I have my theory — I think that because ASH sampling is taken every second half of that seconds is actually taken to process the data and for this reason the TIME_WAITED gets truncated. Whatever the reason is, one thing is for sure — ASH.TIME_WAITED, at least in this particular case, cannot be trusted.

The final question … how do you then use ASH? How do you arrive at a more or less accurate metrics, specifically the TIME_WAITED? One way is this — we know what ASH sampling takes place every second, so instead of relying on the buggy TIME_WAITED values we can rely on the actual number of times that a SQL_ID appeared in ASH … If a SQL_ID was trapped by ASH 10 consecutive times with WAIT_TIME=0 (Waiting) we then know for sure that it waited for at least 10 seconds …

UPDATE Aug 12, 2009

Graham Wood from Oracle explains the issue with ASH: TIME_WAITED in his paper Sifting through the ASHes (see Page 35). He also lists some other “ASH data gotcha’s” on page 27. What really caught my attention through, was page 36, which explains “ASH vs AWR/Statpack” and it specifically says that for “Counts/occurrence data” analysis you should go to AWR not ASH.

And it really makes sense, after all, ASH is sampled every second and while it sounds like “often enough” it really isn’t. All you have to do is run sql_trace on a busy session and you’ll see that there’s a lot that goes on between the 1 second intervals — all of it would be missed by ASH. ASH is great at putting spotlight on the “big issues”, not so great for subtle, sub second waits which could and do add up to a considerable wait time.

Bottom line for me is that there’s no substitute for the good, old sql_trace, but overall, ASH is a huge improvement over what we had or rather didn’t have before. It definitely replaces my own monitoring scripts which I’ve used for many years now, sampling v$session_wait where wait_time=0 every 5 seconds.

Sat: January 2 2016: Update!

I am building a new monitoring framework that:

  1. Samples Oracle ASH every N minutes and captures the live heartbeat of the database.
  2. Allows you to define thresholds on top of this sampling and notifies you if something is amiss.
  3. Saves all this data for historical purposes so that when AWR is wiped out you have solid performance metrics years later.
  4. Monitors Oracle LOCKS.
  5. Monitors Oracle SPACE.
  6. Delivers this functionality in an easy to deploy (single binary) distribution that only takes a minute to install on a new host.

Does this sound like something you’d like in your shop? If yes – then I’d like to hear from you – sign up for the mailing list below to start the discussion and get the private beta as soon as it’s ready (no SPAM here I guarantee it!):

EvenTorEx Mailing List

July 31, 2009

Posted In: Tuning

Tags: , , , , , ,