Oracle ASH Monitoring

I have over 60 Oracle diagnostic SQL scripts in my arsenal.  A lot of them hit ASH and AWR.  I’ve tested these scripts on a high traffic web-site backed by an 6 node Oracle RAC Cluster.  Everyone of these scripts saved my day at one point, but there is one script that stands out from them all – Oracle ASH Top Waits Script: h1.sql.

Oracle ASH Top Waits

This little gem is the #1 thing I reach for when I am asked to troubleshoot a performance problem that was reported hours ago. For example lets say it’s 9:30am and I get a call from a dev saying that her APEX web application is hanging – she suspects a locking issue. All I need to know at this point is when the problem was first reported – armed with the start time (lets say 7:00am) I simply do this:

sqlplus / as sysdba
@h1 0700 0930 0 -1 -1

What’s happening here? There are 5 parameters:

1: start HHMI     [0700 = 7:00am]
2: end HHMI       [0930 = 9:30am]
3: days back      [0 = today; or 7 = seven days back]
4: instance       [1 = INST_ID=1, give -1 for all]
5: service_hash   [1225919510 = dba_services.name_hash, give -1 for all]

and here’s what I get back:

Oracle ASH Top Waits Script output sample

Can you spot the problem? It’s “SQL*Net message from dblink”, there are no locks – it’s a simple problem of a badly written distributed query that is waiting on remote DB. That was easy! One other hidden benefit of this script is that it saves it’s output in a table under a RUN_ID (in this case RUN_ID=81) allowing you to compare the output of two RUN_IDs and clearly spot the differences in values grouped by WAIT EVENT. This is extremely valuable especially when someone says – “this used to work last week!”, you simply do this:

sqlplus / as sysdba
@h1 0700 0930 7 -1 -1

The “7” in the third parameter instructs the script to look 7 days back for the same time frame (7-9:30am). The output of the above report will have it’s own RUN_ID=82 (next in sequence) and you can now compare the two using a special script h1d.sql like so:

@h1d 81 82

We didn’t need OEM or any GUI apps to get to the bottom of the problem – all because the diagnostics data is already in AWR tables and is available to us directly from command line / sqlplus. Years ago – we’d have to sample v$session_wait to get similar diagnostics, in fact, I wrote a complete monitoring system that utilized such technique. But now, Oracle built this into the core engine in a form of Active Session History (ASH) that automatically samples this data every second with practically no overhead to the database! That is an incredible level of instrumentation available to us and it would be a shame not to utilize it beyond what the OEM reports are capable of.

Note however that ASH sampling is short lived – only 1/10th of it’s sampling is saved in AWR based on some internal thresholds Oracle came up with. Now imagine that we:

  1. Take a script like h1.sql and refactor it to use ASH instead of AWR because ASH samples are at higher fidelity (AWR only get’s 1/10th of ASH data).
  2. Run this script every 3 minutes to capture the live heartbeat of the database.
  3. Define your own thresholds on top of this sampling and get notified if something is amiss.
  4. Save all this data for historical purposes so that even if AWR gets wiped out you have solid performance metrics years later.
  5. Wrap this all up in an easy to deploy (single binary) distribution which only takes a minute to install on a new host.

Are you interested? Does this sound like something you’d like in your shop? If yes – then I’d like to get some feedback from you because I am building it! Sign up for the EvenTorEx Mailing List to start the discussion and get the private beta as soon as it’s ready (no SPAM here I guarantee it!):

EvenTorEx Mailing List

December 2, 2015

Posted In: Operations, Scripts

Tags: , ,

Making sense of ASH TIME_WAITED Units

It’s widely believed that ASH / v$active_session_history.TIME_WAITED is presented in microseconds. It sure seems like it, but something is amiss as my small test shows. Using two sessions I created a test table in one, inserted a row and then from another session tried to select the same row for update:

| session A > SQL> create table ash_time_waited_test (t number);
| session A > 
| session A > Table created.
| session A > 
| session A > SQL> insert into ash_time_waited_test values(1);
| session A > 
| session A > 1 row created.
| session A > 
| session A > SQL> commit;
| session A > 
| session A > Commit complete.
| session A > 
| session A > SQL> update ash_time_waited_test set t=2 where t=1;
| session A > 
| session A > 1 row updated.

| session B > SQL> !date
| session B > Thu Jul 30 18:20:15 PDT 2009
| session B > 
| session B > SQL> select * from ash_time_waited_test for update;

| session A > 
| session A > SQL> commit;
| session A > 
| session A > Commit complete.
| session A > 
| session A > SQL>
| session A > 

| session B > 
| session B >          T
| session B > ----------
| session B >          2
| session B > 
| session B > SQL> !date
| session B > Thu Jul 30 18:20:38 PDT 2009
| session B > 

What I found next was rather strange — the sum(TIME_WAITED) from ASH (v$active_session_history) did not add up to the total time session A kept the lock on the test table:

First we get the SQL_ID for session B (the waiter) so we can tie into ASH.SQL_ID later:

SQL> select sql_id from v$sql where sql_text = 'select * from ash_time_waited_test for update';

SQL_ID
-------------
fpm89fyrsf6gv

Now having the SQL_ID in hand we can dig into ASH to see what waits our session experienced:

SQL> col SAMPLE_TIME format a25
SQL> col event format a30
SQL> select sample_time,EVENT,WAIT_TIME,TIME_WAITED from v$active_session_history
  2  where sql_id = 'fpm89fyrsf6gv'
  3  order by sample_time;

SAMPLE_TIME               EVENT                           WAIT_TIME TIME_WAITED
------------------------- ------------------------------ ---------- -----------
30-JUL-09 06.20.25.751 PM enq: TX - row lock contention           0      489233
30-JUL-09 06.20.26.761 PM enq: TX - row lock contention           0      489244
30-JUL-09 06.20.27.761 PM enq: TX - row lock contention           0      489242
30-JUL-09 06.20.28.761 PM enq: TX - row lock contention           0      489238
30-JUL-09 06.20.29.761 PM enq: TX - row lock contention           0      489244
30-JUL-09 06.20.30.761 PM enq: TX - row lock contention           0      489242
30-JUL-09 06.20.31.761 PM enq: TX - row lock contention           0      489244
30-JUL-09 06.20.32.761 PM enq: TX - row lock contention           0      489253
30-JUL-09 06.20.33.761 PM enq: TX - row lock contention           0      489233
30-JUL-09 06.20.34.761 PM enq: TX - row lock contention           0      489245
30-JUL-09 06.20.35.761 PM enq: TX - row lock contention           0      489245

11 rows selected.

SQL>

Based on the output above we know that the first sample that trapped this wait occurred at “06.20.25.751 PM” and the last one at “06.20.35.761 PM”. The difference (delta) between these two timestamps is just over 10 seconds:

SQL> select e.SAMPLE_TIME - s.SAMPLE_TIME
  2  from v$active_session_history s,
  3       v$active_session_history e
  4  where s.sample_id = (select min(sample_id) from v$active_session_history where sql_id = 'fpm89fyrsf6gv')
  5    and e.sample_id = (select max(sample_id) from v$active_session_history where sql_id = 'fpm89fyrsf6gv')
  6  ;

E.SAMPLE_TIME-S.SAMPLE_TIME
---------------------------------------------------------------------------
+000000000 00:00:10.010

SQL>

So far so good. Now lets see what we get by summing up the TIME_WAITED:

SQL> select sum(TIME_WAITED) from v$active_session_history
  2  where sql_id = 'fpm89fyrsf6gv'
  3  ;

SUM(TIME_WAITED)
----------------
         5381663

Interesting? If the ASH.TIME_WAITED is measured in microseconds then the total wait time = 5381663/1000000 = 5.381663 seconds and as we know it’s not correct. In fact, we can prove this using another metric, the ELAPSED_TIME_* from DBA_HIST_SQLSTAT:

SQL> select ELAPSED_TIME_TOTAL, ELAPSED_TIME_DELTA,
  2  APWAIT_TOTAL, APWAIT_DELTA,
  3  CCWAIT_TOTAL, CCWAIT_DELTA
  4  from DBA_HIST_SQLSTAT
  5  where sql_id = 'fpm89fyrsf6gv'
  6  ;

ELAPSED_TIME_TOTAL ELAPSED_TIME_DELTA APWAIT_TOTAL APWAIT_DELTA CCWAIT_TOTAL CCWAIT_DELTA
------------------ ------------------ ------------ ------------ ------------ ------------
          11110172           11110172     11106973     11106973          254          254

SQL>

DBA_HIST_SQLSTAT is dead on — with Elapsed Time Total = 11110172/1000000 = 11.110172 seconds … So what’s happening with ASH then? Why such a large > 2 times discrepancy?

I have my theory — I think that because ASH sampling is taken every second half of that seconds is actually taken to process the data and for this reason the TIME_WAITED gets truncated. Whatever the reason is, one thing is for sure — ASH.TIME_WAITED, at least in this particular case, cannot be trusted.

The final question … how do you then use ASH? How do you arrive at a more or less accurate metrics, specifically the TIME_WAITED? One way is this — we know what ASH sampling takes place every second, so instead of relying on the buggy TIME_WAITED values we can rely on the actual number of times that a SQL_ID appeared in ASH … If a SQL_ID was trapped by ASH 10 consecutive times with WAIT_TIME=0 (Waiting) we then know for sure that it waited for at least 10 seconds …

UPDATE Aug 12, 2009

Graham Wood from Oracle explains the issue with ASH: TIME_WAITED in his paper Sifting through the ASHes (see Page 35). He also lists some other “ASH data gotcha’s” on page 27. What really caught my attention through, was page 36, which explains “ASH vs AWR/Statpack” and it specifically says that for “Counts/occurrence data” analysis you should go to AWR not ASH.

And it really makes sense, after all, ASH is sampled every second and while it sounds like “often enough” it really isn’t. All you have to do is run sql_trace on a busy session and you’ll see that there’s a lot that goes on between the 1 second intervals — all of it would be missed by ASH. ASH is great at putting spotlight on the “big issues”, not so great for subtle, sub second waits which could and do add up to a considerable wait time.

Bottom line for me is that there’s no substitute for the good, old sql_trace, but overall, ASH is a huge improvement over what we had or rather didn’t have before. It definitely replaces my own monitoring scripts which I’ve used for many years now, sampling v$session_wait where wait_time=0 every 5 seconds.

Sat: January 2 2016: Update!

I am building a new monitoring framework that:

  1. Samples Oracle ASH every N minutes and captures the live heartbeat of the database.
  2. Allows you to define thresholds on top of this sampling and notifies you if something is amiss.
  3. Saves all this data for historical purposes so that when AWR is wiped out you have solid performance metrics years later.
  4. Monitors Oracle LOCKS.
  5. Monitors Oracle SPACE.
  6. Delivers this functionality in an easy to deploy (single binary) distribution that only takes a minute to install on a new host.

Does this sound like something you’d like in your shop? If yes – then I’d like to hear from you – sign up for the mailing list below to start the discussion and get the private beta as soon as it’s ready (no SPAM here I guarantee it!):

EvenTorEx Mailing List

July 31, 2009

Posted In: Tuning

Tags: , , , , , ,