This is the first entry of this blog, just to check that everything works and to tune the layout. The blog will cover Oracle and databases, in particular tuning and internals.
Of I/O Latency, Skew and Histograms 1/2
This is a performance investigation of an Oracle database and its interaction with storage.
Case: DB operations are slow during backups. The system runs an OLTP-like workload and is IO bound. Most reads go via indexes, typically single-block accesses, with few full scans. The system still has plenty of free CPU cycles.
First analysis: single block read times jump to very high values when backups start.
Tools: Use AWR data to plot db file sequential read wait time over an interval of about 3 weeks. Large spikes can be seen during full DB backups; spikes are also seen daily when the incremental backup starts (see graph below).
Graph: average time spent on db file sequential read events, aggregated per hour and plotted as a function of time. Large spikes of IO latency can be seen during full backups; smaller spikes during daily incremental backups.
Conclusions from the first analysis: users' most important queries are IO latency-bound and latency jumps up during backups; this explains the slowdown seen by the users.
Drill down: We want to understand why IO is slow during backup. Is it some sort of 'interference' of the sequential IO of the backup with random IO of the users? Is it an Oracle problem or a storage problem?
Main idea: use event histogram values reported by Oracle instrumentation to study skew on IO access time and find more about the root cause of the issue. Investigations: see part 2 of this blog entry.
Techniques and references: To analyze AWR data I have built a set of custom queries against dba_hist_* views and used them inside the excellent Perfsheet by Tanel Poder.
This is the text of the query used, among others, to generate the graph above:
-- system_event with rate and time
select cast(sn.begin_interval_time as date) beg_time,
ss.dbid,
sn.instance_number,
ss.event_name,
ss.wait_class,
ss.total_waits,
ss.time_waited_micro,
ss.total_waits - lag(ss.total_waits) over (partition by ss.dbid,ss.instance_number,ss.event_id order by sn.snap_id nulls first) Delta_waits,
ss.time_waited_micro - lag(ss.time_waited_micro) over (partition by ss.dbid,ss.instance_number,ss.event_id order by sn.snap_id nulls first) Delta_timewaited,
extract(hour from END_INTERVAL_TIME-begin_interval_time)*3600
+ extract(minute from END_INTERVAL_TIME-begin_interval_time)* 60
+ extract(second from END_INTERVAL_TIME-begin_interval_time) DeltaT_sec,
round((ss.total_waits - lag(ss.total_waits) over (partition by ss.dbid,ss.instance_number,ss.event_id order by sn.snap_id nulls first)) /
(extract(hour from END_INTERVAL_TIME-begin_interval_time)*3600
+ extract(minute from END_INTERVAL_TIME-begin_interval_time)* 60
+ extract(second from END_INTERVAL_TIME-begin_interval_time)),2 ) Waits_per_sec,
round((ss.time_waited_micro - lag(ss.time_waited_micro) over (partition by ss.dbid,ss.instance_number,ss.event_id order by sn.snap_id nulls first)) /
(extract(hour from END_INTERVAL_TIME-begin_interval_time)*3600
+ extract(minute from END_INTERVAL_TIME-begin_interval_time)* 60
+ extract(second from END_INTERVAL_TIME-begin_interval_time)),2 ) Rate_timewaited, -- time_waited_microsec/clock_time_sec
round((ss.time_waited_micro - lag(ss.time_waited_micro) over (partition by ss.dbid,ss.instance_number,ss.event_id order by sn.snap_id nulls first)) /
nullif(ss.total_waits - lag(ss.total_waits) over (partition by ss.dbid,ss.instance_number,ss.event_id order by sn.snap_id nulls first),0),2) Avg_wait_time_micro
from dba_hist_system_event ss,
dba_hist_snapshot sn
where
%FILTER%
and sn.snap_id = ss.snap_id
and sn.dbid = ss.dbid
and sn.instance_number = ss.instance_number
and sn.begin_interval_time between %FROM_DATE% and %TO_DATE%
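As an illustration (values below are examples of mine, not from the original graph), the placeholders could be filled in as follows to focus on single-block reads over a chosen time window:
-- %FILTER%    : ss.event_name = 'db file sequential read'
-- %FROM_DATE% : to_date('2013-06-01', 'yyyy-mm-dd')
-- %TO_DATE%   : to_date('2013-06-21', 'yyyy-mm-dd')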
Of I/O Latency, Skew and Histograms 2/2
Case: This is part 2/2 of a performance investigation. In the previous part we saw how a sudden increase of single block read latency harms performance of the production database, and how this appears consistently when backups are taken.
Drill-down questions: what causes the high latency for single block reads during backup? Why does the sequential IO of the backup cause random reads to slow down? Are the disks saturating? Is it an Oracle issue or more a storage issue? Finally, what can be done about it?
Oracle event histograms: Oracle instrumentation of wait events provides additional information in the form of histograms. Wait duration is divided into intervals and for each interval we can see how many wait events have happened. Historical data is collected and visible in AWR.
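For reference, the same histogram can also be queried live (cumulative since instance startup); a minimal query before moving to the AWR-based data used for the graphs:
select inst_id, wait_time_milli, wait_count
  from gv$event_histogram
 where event = 'db file sequential read'
 order by inst_id, wait_time_milli;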
AWR data: Graphs of data from DBA_HIST_EVENT_HISTOGRAM for the event db file sequential read are reported below. In particular we can see that reads of 1 ms or less, that is reads served by the SSD cache of the storage, go to zero while backups are taken. Moreover, waits of long duration (between 64 and 128 ms, which is a very long time for single block reads) show a peak at the same time.
Graph 1: "Number of single block reads of 64 ms or more (that is, very slow)" plotted as a function of time. Large peaks coincide with full backups (every 2 weeks), smaller peaks with daily backups.
Graph 2: "Number of single block reads of 1 ms or less (that is, blocks read from the SSD cache of the storage)" plotted as a function of time. Reads from the cache can be seen to go to zero during full backups (every two weeks, see also Graph 1).
First conclusions and future work: IO becomes slow because backups effectively seem to flush the SSD cache. Also, very long response times are seen for single-block IO. There is work to do on the storage side!
What about actions on the DB? Moving the backup to the (active) Data Guard standby can help, as it would effectively remove the workload that is the source of contention.
Techniques and references: To analyze AWR data I have built a set of custom queries against dba_hist_* views and used them inside the excellent Perfsheet by Tanel Poder.
Moreover I'd like to reference James Morle's blog.
This is the text of the query used, among others, to generate the graph above:
--event histogram with wait_count rate and time
select cast(sn.begin_interval_time as date) beg_time,
eh.dbid,
sn.instance_number,
eh.event_name,
eh.wait_time_milli,
eh.wait_count,
eh.wait_count - lag(eh.wait_count) over (partition by eh.dbid,eh.instance_number,eh.event_id,eh.wait_time_milli order by sn.snap_id nulls first) Delta_wait_count,
extract(hour from END_INTERVAL_TIME-begin_interval_time)*3600
+ extract(minute from END_INTERVAL_TIME-begin_interval_time)* 60
+ extract(second from END_INTERVAL_TIME-begin_interval_time) DeltaT_sec,
round((eh.wait_count - lag(eh.wait_count) over (partition by eh.dbid,eh.instance_number,eh.event_id,eh.wait_time_milli order by sn.snap_id nulls first)) /
(extract(hour from END_INTERVAL_TIME-begin_interval_time)*3600
+ extract(minute from END_INTERVAL_TIME-begin_interval_time)* 60
+ extract(second from END_INTERVAL_TIME-begin_interval_time)),2 ) Rate_wait_count_per_bin
from dba_hist_event_histogram eh,
dba_hist_snapshot sn
where
%FILTER%
and sn.snap_id = eh.snap_id
and sn.dbid = eh.dbid
and sn.instance_number = eh.instance_number
and sn.begin_interval_time between %FROM_DATE% and %TO_DATE%
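As an illustration (my example, not the exact filter behind the graphs): WAIT_TIME_MILLI is the upper bound of each histogram bucket, so the 'reads of 64 ms or more' of Graph 1 correspond to the buckets labelled 128 and above, and a possible %FILTER% substitution is:
-- %FILTER% : eh.event_name = 'db file sequential read' and eh.wait_time_milli >= 128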
Performance Metrics Views
Topic: GV$ views of the 'metrics family' and how they can help in tuning + links to scripts I use.
Oracle has plenty of instrumentation: statistics counters and wait event data are the bread and butter of Oracle monitoring and tuning.
However counters are typically incremented during the life of the instances/sessions, so delta values are often needed to make sense of data for analysis. AWR reports are a way to do that for instance-wide data.
A class of GV$ views that can help with faster, 'online' monitoring is what I call the 'metric family'. Of those, the ones I find myself using most often are gv$sysmetric, gv$eventmetric and gv$iofuncmetric.
The great thing about metrics is that they report values accumulated over short periods of time (15 and 60 seconds are typical). Effectively this provides the needed deltas.
Still, one minor problem: the metrics report a lot of data, more than is normally needed, so filtering for the metrics of interest is helpful.
Below are a few examples of scripts that I have written and use for this (the scripts can be found here).
Examples:
sysmetric_details.sql reports on a dozen metrics and lists values per instance (in RAC).
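The script itself is at the link above; as a minimal sketch of the idea (not the actual sysmetric_details.sql, and metric names may vary by version), one can filter gv$sysmetric for a few metrics of interest:
select inst_id, metric_name, round(value, 1) metric_value, metric_unit, begin_time, end_time
  from gv$sysmetric
 where group_id = 2  -- 60-second metric group
   and metric_name in ('Host CPU Utilization (%)',
                       'Average Synchronous Single-Block Read Latency',
                       'Physical Read Total IO Requests Per Sec')
 order by inst_id, metric_name;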
GV$IOFUNCMETRIC is new in 11gR2 and provides details on the type of IO the system is doing. An example is below; this time iometric.sql aggregates over all the instances of a RAC.
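Again only as a sketch (not the actual iometric.sql), GV$IOFUNCMETRIC can be aggregated across all RAC instances roughly like this:
select function_name,
       sum(small_read_megabytes + large_read_megabytes)   read_mb,
       sum(small_write_megabytes + large_write_megabytes) write_mb,
       sum(small_read_reqs + large_read_reqs)             read_reqs,
       sum(small_write_reqs + large_write_reqs)           write_reqs
  from gv$iofuncmetric
 group by function_name
 order by read_mb desc;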
Conclusions: gv$ views of the metric family provide a very useful set of tools for reactive performance analysis. They provide a quick view of key metrics, events consumed and IO done.
Limitations: views of the metric family don't provide a session-based view or a SQL-based view. They are often more useful for finding the symptoms rather than the root causes of performance issues.
V$EVENT_HISTOGRAM_METRIC
V$EVENT_HISTOGRAM_METRIC does not exist (at least up to 11.2.0.3), however it could be handy! The idea is to combine the detailed information of event histograms with the ease of use of the metric views. Below is an example of how this can be implemented with a sample script, along with a reference to the script's code.
Example: measure the latency of single block reads for an OLTP database, as exposed by the Oracle wait event interface as 'db file sequential read'. This is used to investigate a case of performance degradation where storage behaviour is involved.
Step 1: collect GV$EVENT_HISTOGRAM data in a period of normal activity using the script ehm.sql
Step 2: Plot the data and analyze. Roughly 30% of the reads are accounted as being completed in less than 1 ms; those reads are served from the SSD cache of the storage. A latency peak is around 16 ms, which we can interpret as reads from physical disks (SATA disks at 7.2K rpm, hence not very fast).
Step 3: collect GV$EVENT_HISTOGRAM data during a period when applications report performance issues and plot data.
Analysis: The data measured in step 3 show that there are almost no blocks read from the SSD cache (that is, there is no peak around the 1 ms measurement point) and that physical reads have very high latency, definitely higher than normal (compare with the graph above).
Note: This is the same case as discussed in a previous blog entry; see there for further details and analysis of what triggers this type of behaviour in the storage: "Of latency, skew and histograms 2/2".
The point I want to make: event histogram data gives valuable information for troubleshooting. In particular I have found the study of the db file sequential read event useful. It is important to collect data samples over an interval of time of interest, as opposed to using raw data from gv$event_histogram, which are cumulative since instance(s) startup.
The script @ehm.sql can be found at this link.
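The actual ehm.sql is at the link above; purely as an illustration of the delta technique it implements (my sketch, assuming execute privilege on dbms_lock), one could sample gv$event_histogram twice and print the per-bucket differences:
set serveroutput on
declare
  type t_counts is table of number index by pls_integer;
  v_before t_counts;
  cursor c is
    select wait_time_milli, sum(wait_count) cnt
      from gv$event_histogram
     where event = 'db file sequential read'
     group by wait_time_milli
     order by wait_time_milli;
begin
  for r in c loop
    v_before(r.wait_time_milli) := r.cnt;   -- first snapshot
  end loop;
  dbms_lock.sleep(60);                      -- measurement interval, in seconds
  for r in c loop                           -- second snapshot: print per-bucket deltas
    dbms_output.put_line('<= ' || r.wait_time_milli || ' ms: ' ||
      (r.cnt - case when v_before.exists(r.wait_time_milli)
                    then v_before(r.wait_time_milli) else 0 end) || ' waits');
  end loop;
end;
/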
SQL Patch and Force Match
Topic: A discussion on how to create a SQL patch with force_match=true in Oracle 11g, and related topics.
SQL patches have recently saved the day for me in a production issue where a given SQL statement had suddenly changed execution plan, causing IO overload (a full scan instead of an index-based read for a high-load statement). 11g allows a quick fix in such situations (as in: stop the fire and buy time to find a more stable solution): a set of hints can be added to a given query via a SQL patch.
The official Oracle documentation does not have many details on the topic (up to 11.2.0.3). Oracle's optimizer blog and Dominic Brooks have very good posts on the topic though.
I add here a few additional details that I have researched and found useful:
1) Recap (from the references listed above): how to create a SQL patch from the command line
begin
sys.dbms_sqldiag_internal.i_create_patch
(sql_text => '..put SQL here..',
hint_text => '..put hints here..',
name => 'my patch name');
end;
/
2) What if I have a sql_id instead of the sql_text?
SQL> var c clob
SQL> exec select sql_fulltext into :c from v$sqlstats where sql_id='...' and rownum=1;
begin
sys.dbms_sqldiag_internal.i_create_patch
(sql_text => :c,
hint_text => '..put hints here..',
name => 'my patch name');
end;
/
3) What if I want to create a SQL patch with FORCE_MATCH = TRUE? (force match will additionally target all SQL statements that have the same text after normalizing literal values to bind variables).
-- create sql patch
-- allows to set FORCE_MATCH to TRUE, which is currently not possible with dbms_sqldiag_internal.i_create_patch (11.2.0.3)
-- run as sys
-- this is undocumented stuff, handle with care
DECLARE
sql_text clob := '....put sql here...';
hints varchar2(1000) :='....put hints here...';
description varchar2(100):='my patch description';
name varchar2(100) :='my patch name';
output varchar2(100);
sqlpro_attr SYS.SQLPROF_ATTR;
BEGIN
sqlpro_attr := SYS.SQLPROF_ATTR(hints);
output := SYS.DBMS_SQLTUNE_INTERNAL.I_CREATE_SQL_PROFILE(
SQL_TEXT => sql_text,
PROFILE_XML => DBMS_SMB_INTERNAL.VARR_TO_HINTS_XML(sqlpro_attr),
NAME => name, DESCRIPTION => description,
CATEGORY => 'DEFAULT',
CREATOR => 'SYS',
VALIDATE => TRUE,
TYPE => 'PATCH',
FORCE_MATCH => TRUE,
IS_PATCH => TRUE);
dbms_output.put_line(output);
END;
/
4) Additional considerations:
What should I put as hints in the SQL PATCH? I would normally use the full outline content from an execution plan that I have tested and found OK. Mileage may vary. To print out the outline one can use:
select * from table(dbms_xplan.display('PLAN_TABLE',null,'OUTLINE'));
Drop and enable/disable SQL patches with: DBMS_SQLDIAG.DROP_SQL_PATCH and DBMS_SQLDIAG.ALTER_SQL_PATCH
Diagnostics: dba_sql_patches. That view is based on tables shared with SQL profiles and SQL baselines. Underlying tables of interest for baselines, profiles and SQL patches are: sys.sqlobj$data, sys.sqlobj$auxdata, sys.sql$text, sys.sqlobj$
V$SQL.SQL_PATCH when not null reports the name of the sql patch that has been used to parse that particular child cursor.
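A couple of illustrative follow-up commands for the steps just mentioned (using the example patch name from above; verify the parameter order against the package specification on your version):
-- which child cursors were parsed using a SQL patch
select sql_id, child_number, sql_patch from v$sql where sql_patch is not null;
-- disable, then drop, the patch
exec dbms_sqldiag.alter_sql_patch('my patch name', 'STATUS', 'DISABLED');
exec dbms_sqldiag.drop_sql_patch('my patch name');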
SQL Signature, Text Normalization and MD5 Hash
Topic: A discussion on how Oracle computes SQL signature using MD5 hashing and on how SQL text is normalized before computing SQL signatures.
Introduction and warm-up:
SQL statements that are cached in the library cache are associated with a hash value, which is visible in various forms across a few V$ views; in 11gR2 for example: V$DB_OBJECT_CACHE.FULL_HASH_VALUE, V$SQL.SQL_ID, V$SQL.HASH_VALUE. From the work of Tanel, Marcin and Slavik we learn that the full hash is an MD5 hash in hexadecimal, sql_id is a base-32 representation of the trailing 8 bytes of the same MD5 hash, and the good old hash_value is a decimal representation of the trailing 4 bytes of the MD5 hash.
Below is an example to illustrate how one can easily reproduce Oracle's hash values from MD5 calculations (see also the blog entries mentioned above for more details):
--11gR2 hash from library cache, in 10g use x$kglob.kglnahsv
SQL> select hash_value, to_char(hash_value,'xxxxxxxx') hex_hash_value, full_hash_value from GV$DB_OBJECT_CACHE where name='select 1 from dual';
HASH_VALUE HEX_HASH_VALUE FULL_HASH_VALUE
----------- -------------- --------------------------------
2866845384 aae096c8 7d4dc9b423f0bcfb510272edaae096c8
The same value can be calculated with md5_with_chr0.sql, a simple script based on DBMS_OBFUSCATION_TOOLKIT.MD5. Note chr(0) is added to the sql text before hashing.
SQL> @md5_with_chr0 'select 1 from dual'
CALCULATED_FULL_HASH
--------------------------------
7d4dc9b423f0bcfb510272edaae096c8 --> same value as full_hash_value above
SQL signature calculation and SQL text normalization
SQL signature is another type of hashing for SQL statements, used by Oracle to support the plan management features: SQL profiles, SQL patches and SQL plan baselines.
1) SQL statements are 'normalized'. Oracle performs transformations on the text before calculating the signature: among others, extra spaces are removed and case is ignored (see also the Oracle documentation on V$SQL). An example below illustrates this: two SQL statements that differ only in case and extra whitespace are shown to have the same SQL signature. Spoiler: normalization of more complex statements has some additional surprises, see below.
SQL> select DBMS_SQLTUNE.SQLTEXT_TO_SIGNATURE('select 1 from dual') signature_with_space, DBMS_SQLTUNE.SQLTEXT_TO_SIGNATURE('SELECT 1 FROM DUAL') signature_normalized_sql from dual;
SIGNATURE_WITH_SPACE SIGNATURE_NORMALIZED_SQL
-------------------- ------------------------
12518811395313535686 12518811395313535686
2) Force matching is an additional option/feature of SQL signatures. It is used to match the text of SQL that uses literals: Oracle performs an additional transformation, that is, literals are transformed into system-generated names for bind variables before hashing. An example below uses a literal (the value 1). The signature is converted to hexadecimal in this example, as we will use that in what follows.
select to_char(DBMS_SQLTUNE.SQLTEXT_TO_SIGNATURE('SELECT 1 FROM DUAL'),'xxxxxxxxxxxxxxxx') exact_match_signature_hex,
to_char(DBMS_SQLTUNE.SQLTEXT_TO_SIGNATURE('SELECT 1 FROM DUAL',force_match=>1),'xxxxxxxxxxxxxxxx') force_match_signature_hex from dual;
EXACT_MATCH_SIGNATURE_HEX FORCE_MATCH_SIGNATURE_HEX
------------------------- -------------------------
adbbc0a2f3c68ac6 9289f992520d5a86
3) Computing SQL signatures directly with MD5. The script md5.sql computes MD5 hashes of its input (note it does not add a trailing chr(0)). We can reproduce the values obtained before and therefore confirm which transformations Oracle applies when calculating signatures.
SQL> @md5 'SELECT 1 FROM DUAL'
CALCULATED_FULL_HASH
--------------------------------
d9f6df623a086c51adbbc0a2f3c68ac6 -> see exact_match in (2) above
SQL> @md5 'SELECT :"SYS_B_0" FROM DUAL'
CALCULATED_FULL_HASH
--------------------------------
a995b29240a442869289f992520d5a86 -> see force_match in (2) above
4) Additional investigations: when there are lists of columns or tables, Oracle transforms "," into " , " (space, comma, space), so in this case whitespace may actually be added to the original statement rather than removed, as we saw in point (1) above. An example is below:
SQL> select to_char(DBMS_SQLTUNE.SQLTEXT_TO_SIGNATURE('SELECT ID,ID FROM DUAL,DUAL',force_match=>1),'xxxxxxxxxxxxxxxx') force_match_signature_hex from dual;
FORCE_MATCH_SIGNATURE
---------------------
65f0b8b3ef8a341
SQL> @md5 'SELECT ID , ID FROM DUAL , DUAL'
CALCULATED_FULL_HASH
--------------------------------
4202d4045dbff1b4065f0b8b3ef8a341 --> the trailing 8 bytes match the force_match signature above
Conclusions
The signature of SQL statements is calculated by Oracle using MD5 on the normalized SQL text. Normalization steps include removing extra white-space characters and converting the text to upper case. Additional normalization concerns comma-separated lists of objects: the item separator is transformed into ' , ' (space, comma, space). The MD5 hash calculated on the normalized SQL text is truncated to its last 8 bytes and displayed in V$ views as a decimal number (the SQL signature).
In the case of force-matching signatures, literals are additionally replaced with :"SYS_B_<n>" (where <n> is a decimal number, see the example above), much like when using cursor_sharing=force.
Note: a trailing chr(0) is added by Oracle when calculating hash_values/sql_id but not when calculating SQL signatures.
md5.sql and md5_with_chr0.sql scripts can be found at this link.
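The linked scripts are not reproduced in this post; a minimal sketch of the same calculation (my own illustration, using DBMS_OBFUSCATION_TOOLKIT, which is deprecated in later releases in favour of DBMS_CRYPTO) should reproduce the exact-match value shown in point (3) above:
set serveroutput on
declare
  l_text varchar2(4000) := 'SELECT 1 FROM DUAL';  -- normalized text; append chr(0) to mimic md5_with_chr0.sql instead
  l_hash raw(16);
begin
  l_hash := dbms_obfuscation_toolkit.md5(input => utl_raw.cast_to_raw(l_text));
  dbms_output.put_line(lower(rawtohex(l_hash)));  -- expected: d9f6df623a086c51adbbc0a2f3c68ac6
end;
/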
Hash Collisions in Oracle: SQL Signature and SQL_ID
Topic: Discussion and implementation of a method for finding hash collisions for sql_id and SQL signature in Oracle.
Introduction:
MD5 hashing is used by Oracle to compute sql_id and the SQL signature (see also a previous blog post). Those hash values are normally used as if they were a unique representation of a given SQL statement. However, collisions (two different statements with the same hash value) can happen or, as is the case in this post, will happen!
The underlying math (a short discussion of the birthday attack): let's say you walk into a room and meet 30 people. You shake hands with the first person you see and you disclose each other's birthdays: there is about 1 chance in 365 that that particular person has the same birthday as you. If you repeat this process with the rest of the 30 people in the room, the chance of a 'positive event' increases but still stays below 10%. However, when all the people in the room repeat the procedure of shaking hands and sharing birthday details, the probabilities add up quickly (there are N(N-1)/2 different handshakes possible in a room of N people) and it turns out that two people sharing a birthday is likely even in such a small group. See the link for the details.
The main point is that to find collisions by brute force on a hash function of n bits we don't need to generate O(2^n) hash values but rather the square root of that, O(2^(n/2)). In particular this means that the complexity of finding hash collisions for 32-bit and 64-bit hash functions by brute force is down to numbers that are easily computable with current hardware. sql_id and signatures are indeed 64-bit hashes, as discussed here, so let's give it a try!
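As a quick sanity check of these numbers (an approximation of mine, not from the original post), the probability of at least one collision among N random n-bit hashes is roughly 1 - exp(-N^2 / 2^(n+1)):
-- ~10 billion random 64-bit hashes already give a high chance of at least one collision
select round(1 - exp(-power(1e10, 2) / power(2, 65)), 2) p_collision from dual;
-- result: about 0.93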
Warm-up, collisions on hash_value:
The hash_value of a given SQL statement (as shown in V$SQL.HASH_VALUE for example) is a 32-bit hash. Collisions can be found using about 100K random hashes. A simple algorithm using Oracle and PL/SQL is:
- Take a couple of different SQL statements and append to them a randomly-generated string
- Compute the MD5 hash and take the number of bytes that are needed (4 bytes in this case)
- Insert the hashes in a table
- Repeat ~100K times
- When finished look in the output table for duplicates in the hash value
- If duplicates are found, those are the hash_value collisions
SQL> @hash_birthday_sql hashtab 10 4
SQL> @find_dup hashtab
-- these 2 statements have the same hash_value (=3577689134)
SQL> select owner, index_name from dba_indexes where rownum<=9 --CTUUewnYYPJXFuFyAzcMtXTVNmPtNgAV;
SQL> select owner, table_name from dba_tables where rownum<=10 --ooXUUTNfekKkRZGlCJfiPlxKcoCfpDAh;
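For illustration, here is a hypothetical sketch of the core loop described in the list above (this is not the author's hash_birthday_sql or find_dup script; table and column names are made up):
create table hashtab_demo (hash_hex varchar2(8), sql_text varchar2(200));

declare
  l_sql varchar2(200);
  l_md5 varchar2(32);
begin
  for i in 1 .. 100000 loop
    -- random 32-character suffix appended as a comment, as in the examples above
    l_sql := 'select owner, table_name from dba_tables where rownum<=10 --'
             || dbms_random.string('a', 32);
    -- hash_value derives from md5(sql_text || chr(0)); keep the trailing 4 bytes
    l_md5 := lower(rawtohex(dbms_obfuscation_toolkit.md5(
                 input => utl_raw.cast_to_raw(l_sql || chr(0)))));
    insert into hashtab_demo values (substr(l_md5, -8), l_sql);
  end loop;
  commit;
end;
/

-- duplicate hash_hex values correspond to hash_value collisions
select hash_hex, count(*) from hashtab_demo group by hash_hex having count(*) > 1;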
Scaling up to 64-bit hash: sql_id collisions
The same algorithm used for hash_value collisions can be used for sql_id, although we need to scale up the calculation: the hash is now 8 bytes long and we need ~10 billion random hashes to find a few meaningful collisions. Running SQL> @hash_birthday_sql hashtab 500000 8 would take about 60 hours for such a calculation (mileage may vary). A faster way is to split the execution into smaller chunks and parallelize it across multiple CPUs and possibly RAC nodes. See the link here for an example on 2 RAC nodes with parallelism 10 on each.
Results: the following two SQL statements are different but have the same sql_id (ayr58apvbz37z) and hash_value (1992264959).
SQL> select owner, index_name from dba_indexes where rownum<=9 --BaERRzEYqyYphBAvEbIrbYYDKkemLaib;
SQL> select owner, table_name from dba_tables where rownum<=10 --XhiidvehudXqDpCMZokNkScXlQiIUkUq;
--confirm this by running:
SQL> select sql_text from v$sql where sql_id='ayr58apvbz37z';
Another 64-bit hash: SQL signature collisions
A variation on the theme above is to find statements that have a SQL signature collision. The signature is used by SQL profiles, baselines and SQL patches. It is calculated in a similar but not identical way to sql_id; in particular, the SQL text is normalized before hashing (see this link for additional details). The base script is hash_birthday_signature.sql; again, parallel execution can be used to speed up the computation.
Results: the following two SQL statements are different but have the same signature (6284201623763853025), note sql_id values are different.
SQL> SELECT TABLE_NAME FROM USER_TABLES --JO7HWQAPNFYEALIOA3N5ODTR46INWY6N;
SQL> SELECT INDEX_NAME FROM USER_INDEXES --PAOZY8VWGDHDS9NICVREGTR22J6C24DQ;
--confirm this by running:
select sql_text from v$sql where force_matching_signature=6284201623763853025;
How does Oracle cope with collisions
In a few tests I have seen dbms_xplan throwing ORA-01422 when displaying plans for sql_id values with collisions. There may be other packages/tools that assume sql_id values have no collisions, and they may throw errors.
There are a few more potential problems with SQL signature collisions. In particular, a source of potential trouble is that one of the main (and shared) underlying tables for SQL plan management (baselines, profiles, patches), SYS.SQL$TEXT, which stores the mapping of sql_text to signature, has a unique index on the signature. However, we have seen that it is quite unlikely to get a signature collision by pure chance (i.e. in normal usage of the DB).
Conclusions:
This post reports on an exercise to compute collisions for the sql_id and SQL signatures of Oracle statements using the Oracle database and PL/SQL. We have seen how some simple math, the so-called birthday attack, can help, and how an implementation of this in Oracle is quite easy up to a hash length of 64 bits. We also report on the findings, notably collisions of sql_id and SQL signature.
Additional comments: a collision on full_hash_value (as reported in v$db_object_cache) could not be calculated with this method, as it is a 128-bit hash and the brute-force method would require resources beyond a 'reasonable scale'. However, since 2005 there are known methods to find collisions for MD5 hashes on the full 128 bits that exploit weaknesses in the algorithm (that's why MD5 should not be used for security purposes any more). Those methods could be used to find full_hash_value collisions.
PS: SQL and scripts mentioned in this article can be found here
Kerberos Authentication and Proxy Users
Topic: a discussion of Kerberos authentication in Oracle and of the use of proxy users together with Kerberos-authenticated accounts.
Introduction: Kerberos authentication allows connecting to Oracle without specifying username/password credentials; the authentication is done externally. Kerberos is already in widespread use in large environments (for example for Windows domain accounts or for an AFS file system in Linux), so it is a good candidate.
Proxy authentication allows connecting to a target DB user via another DB user (the proxy user). For example, we can authorize a developer to connect to the application owner account using his/her own credentials (so there is no need to expose the application user's password).
When the two features (Kerberos and proxy authentication) are combined, we have a means to connect to any given DB user via a Kerberos-authenticated user, i.e. without specifying a password at connect time.
Example of client-side configuration (Linux)
- These tests have been performed using Oracle server and client 11.2.0.3 for Linux
- See below for details of server-side configuration
- Create oracle user and grant connection privileges:
grant connect to "LUCA@MYDOMAIN.COM" ;
...add tablespace config and other privileges if needed..
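For completeness, the creation of such an externally-identified user typically looks like the following (a sketch; adjust the principal to your Kerberos realm):
create user "LUCA@MYDOMAIN.COM" identified externally;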
- configure sqlnet.ora for the oracle client. This can be in $ORACLE_HOME/network/admin, $TNS_ADMIN or in $HOME/.sqlnet.ora depending on the scope of the configuration.
SQLNET.AUTHENTICATION_SERVICES=(KERBEROS5)
SQLNET.KERBEROS5_CONF=/etc/krb5.conf # need to have kerberos client configured
SQLNET.AUTHENTICATION_KERBEROS5_SERVICE=orasrv #this needs to match DB server config
SQLNET.KERBEROS5_CONF_MIT=TRUE
- Request a Kerberos ticket for your principal if not already present: kinit luca
- at this point you'll need to type the password when prompted
- it's also possible to use the Oracle utility okinit; in this case, for our configuration we need to run: okinit -e 23 luca
- when using DB links, forwardable tickets are needed (kinit -f)
- Connect to the DB with Kerberos credentials. In sqlplus use this syntax: sqlplus /@MYDB
- show user will show LUCA@MYDOMAIN.COM
- If instead you get ORA-12638: Credential retrieval failed, you need to troubleshoot. Unfortunately there are many possible causes for this error, starting with bugs and version incompatibilities (see also the notes below).
Proxy users and Kerberos used together
Suppose we need to authorize the user luca@mydomain.com to connect to the production application schema, to which I would normally connect with: sqlplus myappuser/mysecretpass@MYDB. With Kerberos and proxy authentication used together, we can do the following:
- Authorize the user to use proxy authentication:
- alter user myappuser grant connect through "LUCA@MYDOMAIN.COM";
- Check the configuration with select * from PROXY_USERS;
- Setup and test Kerberos authentication for MYDB, see steps above
- Connect to the production user with this method (Kerberos + proxy user).
- The syntax in sqlplus is: sqlplus [myappuser]/@MYDB
- show user from sqlplus will show myappuser as result
- With this method the password for myappuser has never been used. Only the password for the kerberos user LUCA@MYDOMAIN.COM has been used when requesting the ticket with kinit
- One can still connect to myappuser with the original username/password credentials too; this is just an additional path to it
Notes on client-side configuration (Windows)
- Kerberos authentication will work with the ticket granted by the domain at users logon
- okinit, oklist etc are missing from instant client but are also not needed to make this work
- do not use client 11.2.0.1 if possible.
- you need to log on as a domain user
- alternatively, if you want to use this from a local account or with a different Kerberos user, just run cmd.exe with 'run as' and specify a domain user's credentials
- check with klist that you have the ticket for the principal "LUCA" in this example
- Create or copy over krb5.conf from a Linux machine (for example from DB server)
- Edit sqlnet.ora in %ORACLE_HOME%\network\admin (or %TNS_ADMIN%) adding the following:
SQLNET.AUTHENTICATION_SERVICES=(KERBEROS5)
SQLNET.KERBEROS5_CONF= ....put path here....\krb5.conf
SQLNET.KERBEROS5_CONF_MIT=TRUE
SQLNET.AUTHENTICATION_KERBEROS5_SERVICE=orasrv # match value on server
SQLNET.KERBEROS5_CC_NAME=OSMSFT: # special windows setting
- connect to the DB with kerberos credentials
- same as the case of Linux above.
- In sqlplus use this syntax sqlplus /@MYDB
- This has been tested with a Windows 7 client and Oracle 11.2.0.3 instant client 32 bit
How to configure the DB server for kerberos authentication
This is described in the Oracle documentation, so I'll just put a few notes here.
- Ask the Kerberos administrator to add a new service principal. It's best to use a short name, for example orasrv, as used in the examples above
- Create the keytab file and install on the DB server node(s) with correct permissions for the oracle user
- for Scientific Linux cern-config-keytab can be used
- Configure $ORACLE_HOME/network/admin/sqlnet.ora on all nodes and add the following (note there are a few differences with the client-only configuration discussed above):
SQLNET.AUTHENTICATION_KERBEROS5_SERVICE=orasrv
SQLNET.KERBEROS5_KEYTAB=/etc/krb5.keytab.orasrv # see keytab generation step above
SQLNET.KERBEROS5_CONF=/etc/krb5.conf
SQLNET.KERBEROS5_CONF_MIT=TRUE
Notes:
Setting up Kerberos authentication requires just a few steps and is often quite simple. However, there are many possible sources of issues that may need troubleshooting in a new installation, ranging from bugs in certain Oracle versions to incompatibilities with encryption levels. I have tested with Oracle client 11.2.0.3, both on Linux and on Windows 7, against Oracle server 11.2.0.3 (Linux 64 bit). Kerberos authentication requires the Advanced Security option. This support note can be a useful starting point for troubleshooting if needed: Master Note For Kerberos Authentication [ID 1375853.1]
A (quite) different technique also aiming at reducing the need for hard-coded credentials is the Secure External Password Store. This is described in the Oracle documentation and in the following support note: Using The Secure External Password Store [ID 340559.1]
Conclusions: We have discussed how to set up Kerberos authentication for Oracle users, both as a standalone method and in conjunction with proxy authentication. These techniques can help reduce the need for hard-coded credentials when accessing Oracle database services.
Purging Cursors From the Library Cache Using Full_hash_value
Introduction: Purging cursors from the library cache is a useful technique to keep handy for troubleshooting. Oracle has introduced a procedure to do this in version 11, with backports to 10g. Besides the Oracle documentation, this has been covered by several blogs already (including Kerry Osborne, Harald van Breederode, Martin Widlake, Martin Bach), by Oracle support (note 457309.1 for example) and by the actual package file in $ORACLE_HOME/rdbms/admin/dbmspool.sql
Most of the examples and discussions in the links above utilize the following syntax:
SQL> exec sys.dbms_shared_pool.purge('&address, &hash_value', 'c')
What's new in 11.2:
A new (overloaded) version of dbms_shared_pool.purge is available in 11.2 that allows purging statements identified by their full_hash_value. One advantage compared to the previous method is that the full_hash_value is a property of a given SQL statement and does not depend on the memory address of the (parent) cursor. Note this has been tested on 11.2.0.3, 64 bit for Linux.
Example:
myapp_user_SQL> select /*MYTEST*/ sysdate from dual; -- the test SQL statement that we want to flush in the following
admin_user_SQL> select a.FULL_HASH_VALUE from V$DB_OBJECT_CACHE a where name='select /*MYTEST*/ sysdate from dual';
-- find full_hash_value to be used in the next step
-- in this example the full_hash_value is 98d0f8fcbddf4095175e36592011cc2c
admin_user_SQL> exec sys.dbms_shared_pool.purge(HASH=>'98d0f8fcbddf4095175e36592011cc2c',namespace=>0,heaps=>1)
Additional info:
full_hash_value is a 128-bit MD5 hash of the SQL statement; more details on full_hash_value at this link. A few methods to find the full_hash_value from different inputs are listed below:
- find full_hash_value from cache, query v$db_object_cache
- select a.FULL_HASH_VALUE from V$DB_OBJECT_CACHE a where name='select /*MYTEST*/ sysdate from dual';
- find full_hash_value from hash_value
- select full_hash_value from v$db_object_cache where hash_value=538037292
- find full_hash_value from sql_id
- find hash_value from sql_id using DBMS_UTILITY.SQLID_TO_SQLHASH
- select full_hash_value from v$db_object_cache where hash_value= DBMS_UTILITY.SQLID_TO_SQLHASH('1frjqb4h13m1c');
- compute full_hash_value from SQL text
- use the script md5_with_chr0 discussed at this link
Conclusions:
We have discussed a method to purge cursors from the library cache that uses the full_hash_value of the cursor instead of the address and hash_value, which is the more common approach (and the only one documented in previous versions). The method discussed here is available in 11.2.
Listener.ora and Oraagent in RAC 11gR2
Topic: An investigation of a few details of the implementation of listeners in 11gR2, including the configuration of listener.ora in RAC and the role of the cluster process 'oraagent'.
11gR2 comes with several important changes and improvements to the clusterware in general and in particular to the way listeners are managed. While the listener process is still the 'good old' tnslsnr process (on Linux and Unix), it is now started from the grid home (as opposed to the database oracle home). Moreover, listeners are divided into two classes, node listeners and scan listeners, although the same binary is used for both functions. There are many more details and, instead of covering them here, I'd rather reference this excellent review: Markus Michalewicz's presentation at TCOUG.
Oraagent takes good care of our listeners
- Node listeners and scan listeners are configured at the clusterware level, for example with srvctl and the configuration is propagated to the listeners accordingly
- this integration is more consistently enforced in 11gR2 than in previous versions
- The oraagent process spawned by crsd takes care of our listeners in terms of configuration and monitoring
- note that there is normally a second oraagent on the system which is spawned by ohasd and does not seem to come into play here
- Notably in the listener.ora file we can see which configuration lines are being taken care of automatically by oraagent, as they are marked with the following comment: # line added by Agent
- experiment on a test DB: delete one or more of the listener.ora configuration lines and restart the listener (for example with srvctl stop listener; srvctl start listener). The original configuration will reappear in listener.ora and the manually modified listener.ora will be renamed (with a timestamp suffix)
- The agent creates and maintains the file: endpoints_listener.ora
- this file is there for backward compatibility, see docs and/or support site for more info.
- experiment on test: delete the file and restart the listener; oraagent will recreate the file.
- Oraagent log can be found at: $GRID_HOME/log/{nodename}/agent/crsd/oraagent_oracle/oraagent_oracle.log
- Oraagent monitors the functioning of each listener
- from the log file entries (see above about the location of the oraagent log file) we can see that each listener is monitored with a frequency of 60 seconds
- Oraagent also comes into action at instance startup, when the instance is started with srvctl (as opposed to an instance started 'manually' from sqlplus), and sets the LOCAL_LISTENER parameter dynamically (this is done with an alter system command, and only if the parameter has not been set in the spfile).
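To illustrate the first point, here are a few srvctl commands that display and control the clusterware-managed listener configuration (a sketch; run from the grid infrastructure environment):
srvctl config listener        # show the node listener configuration stored in the clusterware
srvctl config scan_listener   # show the scan listener configuration
srvctl status listener        # show on which nodes the node listeners are running
srvctl stop listener; srvctl start listener   # restart the node listeners (used in the experiments below)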
Dynamic listening endpoints
- Where are the TCP/IP settings of my listeners in listener.ora?
- Only IPC endpoints are listed in listener.ora. This is at first puzzling: where are the TCP settings that in previous versions were listed in listener.ora? Read on!
- Notes:
- endpoint = the address or connection point to the listener. The best-known endpoint in the Oracle DB world is TCP, port 1521
- dynamic listening endpoint and dynamic service registration are both concepts related to listener activities, but they are distinct.
- Oraagent connects to the listener via IPC and activates the TCP (TCPS, etc) endpoints as specified in the clusterware configuration
- experiment on test: $GRID_HOME/bin/lsnrctl stop listener; $GRID_HOME/bin/lsnrctl start listener; Note that the latter command starts only the IPC endpoint. However oraagent is posted at listener startup and activates the rest of the endpoints (notably listening on the TCP port); this can be seen for example by running the following a few seconds after listener restart: $GRID_HOME/bin/lsnrctl status listener (which will list all the active endpoints)
- Another way to put it: the endpoints for the listener in RAC 11gR2 are configured dynamically; TCP (TCPS, etc.) endpoints are activated at runtime by oraagent
- this is indicated in listener.ora by the new (undocumented) parameters ENABLE_GLOBAL_DYNAMIC_ENDPOINT_{LISTENER_NAME}=ON (see the example listener.ora excerpt after this list)
- Experiment on test on disabling dynamic listening endpoint:
- stop the listener: $GRID_HOME/bin/lsnrctl stop listener
- edit listener.ora and set ENABLE_GLOBAL_DYNAMIC_ENDPOINT_LISTENER=OFF
- start the listener: $GRID_HOME/bin/lsnrctl start listener
- check listener.ora and check that the parameter edited above has not changed, that is ENABLE_GLOBAL_DYNAMIC_ENDPOINT_LISTENER=OFF
- in this case the TCP endpoint will not be started, that is the listener will be listening only on IPC. Check with: $GRID_HOME/bin/lsnrctl status listener
- note: if we try to do the same exercise by stopping and starting the listener with srvctl, as would be the typical way to do it, we will see that the parameter ENABLE_GLOBAL_DYNAMIC_ENDPOINT_LISTENER in listener.ora will be set again to ON. This will activate dynamic listening endpoints again, which is something we want in production, but not for this exercise!
- Note: the examples have been tested on 11.2.0.3 RAC for Linux. ORACLE_HOME needs to be set to point to the grid home (and TNS_ADMIN if used at all, needs to be set to $GRID_HOME/network/admin)
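To tie this together, here is the kind of content one can expect to find in an agent-managed listener.ora on a cluster node (an illustrative excerpt; listener names depend on the configuration, and the exact lines are written by oraagent as discussed above):
LISTENER=(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER))))        # line added by Agent
LISTENER_SCAN1=(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER_SCAN1))))        # line added by Agent
ENABLE_GLOBAL_DYNAMIC_ENDPOINT_LISTENER=ON        # line added by Agent
ENABLE_GLOBAL_DYNAMIC_ENDPOINT_LISTENER_SCAN1=ON        # line added by Agent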
I don't need to edit listener.ora any more, or do I?
- We have seen above that the most important configurations related to listener.ora are automated via oraagent and the clusterware in general
- There are additional listener.ora parameters that are not managed by the clusterware in 11gR2 and need to be configured in listener.ora in case we want to use them
- notably the COST (Class of Secure Transports) parameters needed to protect against a security vulnerability, detailed at Oracle Security Alert for CVE-2012-1675
- The steps for the configuration are very well described in support note 1340831.1. Here we just mention that for a 2-node RAC the following parameters will be needed in listener.ora (note parameters taken from listener.ora on 11.2.0.3 RAC for Linux):
SECURE_REGISTER_LISTENER = (IPC,TCP)
SECURE_REGISTER_LISTENER_SCAN1 = (IPC,TCPS)
SECURE_REGISTER_LISTENER_SCAN2 = (IPC,TCPS)
WALLET_LOCATION = (SOURCE =(METHOD = FILE)(METHOD_DATA =(DIRECTORY = ..put here directory of wallet..)))
Conclusions
The handling of listeners in 11gR2 RAC has become much more integrated with the clusterware, and for most RAC configurations there is not much need to play with listener.ora any more. At a closer look, however, new parameters and overall some new rules of the game in the 11gR2 implementation are revealed, most notably the use of dynamic endpoint registration by the clusterware.
↧
Recursive Subquery Factoring, Oracle SQL and Physics
Introduction: this post is about the use of recursive subquery factoring in Oracle to find numerical solutions to the equations of classical mechanics, for demonstration purposes. This technique will be introduced using 4 examples of growing complexity. In particular, by the end of the post we will be able to compute the length of the (sidereal) year using the laws of physics and SQL!
Recursive subquery factoring (aka recursive common table expressions) is a new feature of Oracle 11gR2 that has already been used for business and for play: see this post on solving sudokus with Oracle and this other post. The computational strength of the feature comes from the fact that it makes recursion easily available from within SQL and so opens the door to natural ways of implementing several algorithms for numerical computation.
Additional note: the amount of physics and maths used has been kept to a minimum and simplified where possible, hopefully without introducing too many mistakes in the process. Links are provided in the text for additional investigations on the topic.
Example 1: projectile in gravity field
The motion is in 1 dimension, say along the x axis. Our first task is to find a representation of the state of the system with a structure that can be computed in the database. A natural choice is to use the tuple (t, x, v, a), where t is time, x is the position along the 1-dimensional axis of motion, v is the velocity and a is the acceleration.
Newton's second law (F=ma) gives us the rule of the game when we want to compute the state of the system subject to an external force. Using calculus this can be written as: dx/dt = v and dv/dt = a = F(x,t)/m.
Next we need a method to find numerical (approximate) solutions. We can use the Euler integration method, which basically means advancing the state in small time steps dT as follows: x(t+dT) = x(t) + v(t)*dT and v(t+dT) = v(t) + a(t)*dT.
Another important point is that we need to know the initial conditions for the state tuple, in particular initial position and velocity. In the example we will take x(t=0)=0 m and v(t=0)=100 m/sec. The force is the gravitational pull at the surface of the Earth: F = -m*g, where g=9.81 m/sec^2. Note that the mass cancels out in our equations. This example models the motion of a projectile that we shoot vertically into the air and observe as it rises up to about 500 meters and then falls back down; all this happens in about 20.5 seconds. The SQL used and a graph of the results are here below:
define dT=0.1
define g=9.81
define maxT=20
-- SQL to compute the motion of a projectile subject to gravity
WITH data(t,x,v,a) AS (
SELECT cast(0 as binary_double) t, cast(0 as binary_double) x, cast (100 as binary_double) v, cast(-&g as binary_double) a FROM dual
UNION ALL
SELECT t+&dT, x+v*&dT, v+a*&dT, -&g FROM data WHERE t < &maxT
)
SELECT t,x,v,a FROM data;
Example 2: Harmonic oscillator
In this example we investigate the motion of a mass attached to a spring. We expect the mass to oscillate around a central point (x=0).
For greater accuracy in calculations we use a different integration method: the Verlet integration method (see also this link). The equation for the acceleration is: a = F/m =-K*x and initial conditions are: x(t=0)=1 m and v(t=0)=0 m/sec. Moreover we take K= 0.1 sec^-2. See below the SQL used for the calculation and a graph of the results.
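For reference, the Verlet-style update implemented in the SQL below, with time step dT and a(x) = -K*x, is:
x(t+dT) = x(t) + v(t)*dT + 0.5*a(t)*dT^2
a(t+dT) = -K*x(t+dT)
v(t+dT) = v(t) + 0.5*(a(t) + a(t+dT))*dT
(in the SQL the new acceleration used inside the velocity update is approximated as -K*(x(t)+v(t)*dT)).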
define dT=0.1
define K=0.1
define maxT=20
-- SQL to compute the motion of a harmonic oscillator
WITH data(t,x,v,a) AS (
SELECT cast(0 as binary_double) t, cast(1 as binary_double) x, cast (0 as binary_double) v, cast(-&K*1 as binary_double) a FROM dual
UNION ALL
SELECT t+&dT, x+v*&dT+0.5*a*&dT*&dT, v+0.5*(a-&K*(x+v*&dT))*&dT, -&K*(x+v*&dT+0.5*a*&dT*&dT) FROM data WHERE t < &maxT
)
SELECT t,x,v,a FROM data;
Example 3: the Earth orbiting the Sun
The motion of the Earth-Sun system is a problem of 2 bodies moving in space. With a 'standard trick' this can be reduced to a problem of 1 body and 2 spatial variables, which represent the (vector) distance of the Earth from the Sun in the plane of the orbit. Our description of the system will use the following tuple: (t,x,vx,ax,y,vy,ay), that is time, position, velocity and acceleration for a 2-dimensional problem in the (x,y) plane. The equation for the force is Newton's law of universal gravitation. Another important point, again, is to use the correct initial conditions. These can be taken from astronomical observations: x(t=0) = 152098233 km (also known as the aphelion distance) and v(t=0) = 29.291 km/sec (credits to this link and this other link). We will again use the Verlet integration method as in example 2 above. Note, for ease of computation, time and space will be re-scaled so that t=1 means 1000 sec and x=1 means 1 km (same for the y axis). The SQL used is pasted here below, as well as a graph of the computed results, that is our approximate calculation of the Earth's orbit.
-- length unit = 1 km
-- time unit = 1000 sec
define dT=.1
define aph=152098233
define GM=132712838618000000
-- note this is GM Sun + Earth (reduced mass correction)
define v_aph=29291
define maxT=40000
-- SQL to compute the trajectory of the Earth around the Sun
WITH data(step, t,x,y,vx,vy,ax,ay) AS (
SELECT 0 step, cast(0 as binary_double) t, cast(&aph as binary_double) x, cast(0 as binary_double) y,
cast(0 as binary_double) vx, cast(&v_aph as binary_double) vy,
cast(-&GM/power(&aph,2) as binary_double) ax, cast(0 as binary_double) ay FROM dual
UNION ALL
SELECT step+1, t+&dT, x+vx*&dT+0.5*ax*&dT*&dT, y+vy*&dT+0.5*ay*&dT*&dT,
vx+0.5*(ax-&GM*(x+vx*&dT)/power(x*x+y*y,1.5))*&dT,
vy+0.5*(ay-&GM*(y+vy*&dT)/power(x*x+y*y,1.5))*&dT,
-&GM*(x+vx*&dT+0.5*ax*&dT*&dT)/power(power(x+vx*&dT,2)+power(y+vy*&dT,2),1.5),
-&GM*(y+vy*&dT+0.5*ay*&dT*&dT)/power(power(x+vx*&dT,2)+power(y+vy*&dT,2),1.5)
FROM data WHERE t < &maxT
)
SELECT t,x,y,vx,vy,ax,ay FROM data WHERE mod(step,100)=0; -- output only one point every 100
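As a side note, one simple way to get the computed orbit out of sql*plus and into a plotting tool is to spool the output to a csv file (just a sketch; the file name and formatting options are illustrative):
set pagesize 0 linesize 300 colsep ',' trimspool on
spool earth_orbit.csv
-- run the query above
spool off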
Example 4: Compute the length of the sidereal year using the equations of motion of the Earth around the Sun and SQL
As a final example, and an 'application' of the techniques above, we can use SQL to compute the number of days in a year (or rather in a sidereal year, see the link for additional details). We find a result that agrees with the measured value to 6 significant digits. This is an interesting result considering that it is obtained with just a few lines of SQL!
-- This is a wrapper on the equation discussed in example 3 above
-- SQL to compute the period of revolution of the Earth around the Sun, that is the number of seconds in 1 sideral year
select round(t*1000/(3600*24),3) days_in_sideral_year from (
WITH data(step, t,x,y,vx,vy,ax,ay) AS (
SELECT 0 step, cast(0 as binary_double) t, cast(&aph as binary_double) x, cast(0 as binary_double) y,
cast(0 as binary_double) vx, cast(&v_aph as binary_double) vy,
cast(-&GM/power(&aph,2) as binary_double) ax, cast(0 as binary_double) ay FROM dual
UNION ALL
SELECT step+1, t+&dT, x+vx*&dT+0.5*ax*&dT*&dT, y+vy*&dT+0.5*ay*&dT*&dT,
vx+0.5*(ax-&GM*(x+vx*&dT)/power(x*x+y*y,1.5))*&dT,
vy+0.5*(ay-&GM*(y+vy*&dT)/power(x*x+y*y,1.5))*&dT,
-&GM*(x+vx*&dT+0.5*ax*&dT*&dT)/power(power(x+vx*&dT,2)+power(y+vy*&dT,2),1.5),
-&GM*(y+vy*&dT+0.5*ay*&dT*&dT)/power(power(x+vx*&dT,2)+power(y+vy*&dT,2),1.5)
FROM data WHERE t < &maxT
)
SELECT step,t,y,lead(y) over(order by step) next_y FROM data
) where y<0 and next_y>0;
DAYS_IN_SIDERAL_YEAR
--------------------
365.256
Conclusions:
We have discussed in 4 examples the usage of Oracle SQL, and in particular of recursive subquery factoring, applied to solve selected cases of differential equations of classical mechanics. Despite the simplifications and approximations involved, the technique has allowed us to compute the length of the sidereal year with good precision. This method can be extended to make calculations for more complex systems, such as systems of many particles.
↧
How to Turn Off Adaptive Cursor Sharing, Cardinality Feedback and Serial Direct Read
Scope: New features and improvements on the SQL (cost-based) execution engine are great. However sometimes the need comes to turn some of those new features off. Maybe it's because of a bug that has appeared, or simply because we just want consistency, especially after an upgrade. This is often the case when upgrading an application that has already been tuned with heavy usage of hints for critical SQL. The discussion of pros and cons of using hints is a very interesting topic, but outside the scope of this entry.
When this is relevant: using a full set of hints (an outline) normally provides consistent SQL execution for a given Oracle version (and set of session/instance parameter values). However this is not guaranteed to be stable across upgrades. The reason is that additional functionality of the cost-based optimization engine can play a role. This is indeed the case when moving from 10g to 11g, and in particular to 11gR2. It's a good idea to investigate the new features and also to keep a note aside on how to disable them in case of need.
Three relevant 11g features of cost-based optimization and SQL execution:
Cardinality feedback. This feature, enabled by default in 11.2, is intended to improve plans for repeated executions. See support note 1344937.1 and the documentation for additional info. You can disable cardinality feedback with an underscore parameter, which can be set at session or instance level. Usual caveats on setting underscore parameters apply:
alter session set "_OPTIMIZER_USE_FEEDBACK"=FALSE;
This can also be done with a hint, therefore at statement level, using the opt_param syntax: /*+ opt_param('_OPTIMIZER_USE_FEEDBACK','FALSE') */
Adaptive cursor sharing. This was introduced in 11gR1 to address issues related to bind variable peeking. See support note 740052.1 and documentation for details. For the purposes of this doc, if you want to get consistent behavior with 10g and overall if you want to disable this feature, you may want to add the hint:
/*+ NO_BIND_AWARE */.
According to support note 11657468.8 adaptive cursor sharing can be disabled by setting the following 2 parameters (say at session level): _optimizer_adaptive_cursor_sharing = false, _optimizer_extended_cursor_sharing_rel = "none"
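As a sketch, those two settings at session level would look as follows (the same caveats about underscore parameters apply; test before use):
alter session set "_optimizer_adaptive_cursor_sharing"=false;
alter session set "_optimizer_extended_cursor_sharing_rel"=none;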
Serial direct read. This feature, introduced in 11gR2, concerns SQL execution rather than cost-based optimization strictly speaking; it still makes sense to discuss it here as it can change the behavior of full scans, by forcing the use of direct reads instead of 'normal' cache-based access. There are many interesting details to this feature that are outside the scope of this discussion. To revert to 10g behavior, and overall to disable the feature, you need to set the following parameter (there is no hint for this unfortunately, as this feature is not directly linked to the CBO):
alter session set "_SERIAL_DIRECT_READ"=never;
Additional info and comments:
1. Cost-based optimization can be switched to preserve the behavior of an older version, say for example 10.2.0.5, either with a SQL hint: /*+ OPTIMIZER_FEATURES_ENABLE('10.2.0.5') */ or by setting the parameter OPTIMIZER_FEATURES_ENABLE at session (or system) level.
However, as discussed above, setting this parameter sometimes is not enough to ensure that a given SQL will execute in exactly the same way in the new system as it did in the older version. In particular, with reference to the 3 examples above, adaptive cursor sharing and cardinality feedback are turned off by reverting CBO compatibility to a 10g version, although the behavior of serial direct read is not affected by this hint (or any other hint for that matter).
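Putting it together, here is a sketch of session-level settings that approximate 10.2.0.5 behavior on 11.2 for the features discussed above (to be tested carefully before use in your environment):
alter session set optimizer_features_enable='10.2.0.5';
alter session set "_serial_direct_read"=never;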
2. How to get a full list of available hints? query V$SQL_HINT in Oracle 11.2 or higher
3a. Example of a logon trigger that can be used to set session-level parameters:
create or replace trigger change_session_parameters
AFTER LOGON ON mydbuser.SCHEMA --customize username, replace "mydbuser" with relevant user name
BEGIN
execute immediate 'alter session set "_serial_direct_read"=never';
END;
/
3b. Alternative trigger definition
create or replace TRIGGER trace_trig
AFTER LOGON
ON DATABASE
DECLARE
sqlstr1 VARCHAR2(100) := 'alter session set "_serial_direct_read"=never';
BEGIN
IF (USER IN ('USERNAME1', 'USERNAME2')) THEN -- customize user names as needed
execute immediate sqlstr1;
END IF;
END trace_trig;
/
4. Example of how to get the outline (full set of hints) for a given query:
select * from table(dbms_xplan.display_cursor('sql_id',null,'OUTLINE')); -- edit sql_id with actual value
↧
On Command-Line DBA
Topic: A review of the command-line monitoring scripts that I currently use for Oracle.
Command-line is still very useful for Oracle database administration, troubleshooting and performance monitoring. One of the first things that I do when I want to work with a given Oracle database is to connect as a privileged user with Sql*plus and get information on the active sessions with a script. I will run the script a few times to get an idea of what's happening 'right now' and get a possible starting point for more systematic tuning/troubleshooting if needed.
I liken this approach to examining a few pictures from a remote webcam and trying to find out what's happening out there!.. The method can be powerful, although with some limitations. One of its advantages is that it is easy to implement in Oracle! One of the main sources of information for this type of monitoring is GV$SESSION, which has been available in Oracle for ages. More recently GV$ACTIVE_SESSION_HISTORY has proven very useful, and I quite like the additional improvements introduced there in 11gR2.
Examples: One of my favorites is a script I have called top.sql (not a very original name, I know :) which queries gv$session (RAC-aware) and prints out what I currently find the most interesting details (for my environment) on active sessions that I can fit in roughly 200 characters. Another script that does a similar job taking data from v$active_session_history is ash.sql, together with 'its brothers' ash2.sql and ash4.sql, which are meant for RACs of 2 and 4 nodes respectively.
One of the points that I have found most time-consuming when writing such monitoring scripts has to do with deciding which information I wanted to display and how to pack it in a relatively short line. I have also found that with each major Oracle version I want to refresh my scripts with a few changes. I have recently done that when moving from 10g to 11gR2.
Notable info displayed by top.sql: SQL_DT=calculated as round((sysdate-sql_exec_start)*24*3600,1); CALL_DT=last_call_et; W_dT=seconds in wait (if waiting) or else seconds since last wait; OBJ#=object number currently under operation; TR=does the session have an open transaction?
Notable info displayed by ash.sql: EXEC_PLAN_LN#_OBJ#=which step of the execution plan is the session executing? at which line and on which object?; "DB%,CPU" percentage active in terms of db time and cpu time; "R,W_IOP"=read and write IOPS; "R,W_MBP"= read and write throughput in MB/s; "PGA,TEMP_"=pga and temporary space usage.
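To give an idea of the flavor of this kind of script, here is a minimal sketch of an active-sessions query in the spirit of top.sql (this is not the actual script, just a starting point):
-- minimal 'active sessions' query from gv$session (RAC aware)
select inst_id, sid, serial#, username, sql_id, event, wait_class,
       round((sysdate - sql_exec_start)*24*3600,1) sql_dt,
       last_call_et call_dt,
       seconds_in_wait w_dt,
       row_wait_obj# obj#
from gv$session
where status = 'ACTIVE'
  and type = 'USER'
order by inst_id, sid;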
Scripts details: I have packaged some of the scripts that I have written and that I find useful in a zip file which is available at this link (see there for the most recent version of the Monitoring Scripts file). Note that the scripts are tested on 11gR2, some of them will also work on 10g.
Monitoring Class
Script Name | Description |
---|---|
top.sql | List with selected details of the active sessions, from gv$session (RAC aware) |
top_10g.sql | Active sessions details, from gv$session, 10g version (RAC aware) |
ash.sql | List with selected details of the active sessions, from v$active_session_history (one instance only) |
ash_h.sql | Usage @ash_h n#_sec. Active sessions details for the last n#_sec seconds, from v$active_session_history (current instance only) |
ash2.sql, ash4.sql | Active sessions details, from gv$active_session_history (for RAC 2 and 4 nodes) |
Drill-down Class
Script Name | Description |
---|---|
ash_sess.sql | Usage: @ash_sess inst_id sid n#_sec. Prints details from gv$active_session_history for a given session (RAC aware) |
explain.sql | Usage: @explain sql_id. Prints detailed explain plan info for given sql_id from library cache (current instance only) |
explain_awr.sql | Usage: @explain_awr sql_id. Prints detailed explain plan info for given sql_id from AWR (rac aware) |
kill.sql | Usage @kill sid serial# (current instance only) |
sessinfo.sql | Usage example: @sessinfo sid=xxx. Prints info on sessions that match the search pattern (RAC aware) |
sql.sql | Usage: @sql sql_id. Prints sql text for given sql_id (RAC aware) |
Workload Characterization Class
Script Name | Description |
---|---|
ash_top.sql | Usage @ash_top n#_sec. Aggregates ash data over n#_sec |
ehm.sql | Usage example: @ehm 15 db%sequential. Event history metric data, helps finding skew in wait times, for example to study random reads latency skew (RAC aware) |
ehm_local.sql | Single-node version of ehm.sql: event history metric data, helps finding skew in wait events (current instance only) |
eventmetric.sql | Wait event metric details for top wait events (RAC aware) |
filemetric.sql | File metric details for top-usage files (RAC aware) |
iometric.sql, iometric_details.sql | IO metric details (RAC aware) |
jobs_running.sql | List running jobs details (RAC aware) |
load.sql | Displays OS-load values (RAC aware) |
longops.sql, longops_10g.sql, longops_details.sql | Info on long running operations from gv$session_longops (RAC aware) |
monitor.sql, monitor_details | Display information from statements instrumented by sql monitoring |
monitor_plan | Usage @monitor_plan sql_monitor_key. Drill down from monitor.sql on the details of the execution plan with sql monitoring data per execution plan line |
obj.sql | Usage: @obj object_id. Find details for a given object id, or data_object_id |
sysmetric.sql, sysmetric_details.sql | Selected system metric details (RAC aware) |
transactions.sql | List of open transactions and crash recovery (RAC aware) |
Miscellaneous
Script Name | Description |
---|---|
binds.sql | Usage: @binds sql_id. Helps in finding bind values for a given sql_id (RAC aware) |
kill_select.sql | Usage example: @kill_select "sql_id='xxxx'". Helps in preparing scripts for killing large number of sessions (current instance only) |
locks.sql, locks_10g.sql, locks_details.sql | Scripts to help investigate locks and blocking enqueue issues (RAC aware) |
login.sql | Sql*plus startup/config script for my environment (windows client version 11.2.0.3) |
Conclusions: command-line scripts can be of great help for monitoring and performance troubleshooting of Oracle databases. Of particular interest are queries to print the details of active sessions, the SQL they execute and their wait events details. This makes for a quick and easy entry point to troubleshooting and to get an idea of what's happening on the DB server. More systematic troubleshooting may be needed beyond that. Command-line scripts are easy to modify and customize, many DBAs indeed have their own tool-set. In this post I have exposed some of the scripts I have written and still find useful: download from this link.
A few links to end this post. First of all the excellent session snapper script by Tanel Poder. In addition a mention to a few cool tools that seem to be a sort of bridge (missing link? :) between command-line and graphical-interface approaches: MOATS by Tanel Poder and Adrian Billington , SQL Dashboard by Jagjeet Singh and topaas by Marcin Przepiorowski.
↧
Active Data Guard and UKOUG 2012
I am looking forward to participating again in the UKOUG annual conference (and also to attending Sunday's OakTable event). This is for me a great opportunity to meet and discuss with many passionate Oracle experts who regularly attend the conference and also to get up-to-date with the latest technology and news from the Oracle community.
My colleague Marcin and I have just finished preparing our presentation on the subject of Active Data Guard. I have enjoyed the work even though, as it is often the case, this has taken quite some effort from both of us to reach the level of details we wanted.
Our goal with this presentation is to share our thoughts and experience on Active Data Guard as we have seen it mature in our production environment in the last 10 months. We will discuss, with examples, what we think are the great points of the technology for our use cases and what we have learned from production deployment. What has worked.. and what has not worked!
If you read this and are going to participate in the UKOUG conference, you are welcome to come to our presentation on December 4th at 15:10 in Hall 5!
Slides can be downloaded from this link.
↧
AWR Analytics and Oracle Performance Visualization with PerfSheet4
Topic: This post describes PerfSheet4, a tool for performance analysis aimed at streamlining access and visualization of Oracle's AWR data. The tool is aimed at DBAs and Oracle performance analysts. PerfSheet4 is a spin-off of previous original work by Tanel Poder, rewritten and integrated with additional functionality for AWR analysis and with important changes to the user interface.
Context: There is much information in the counters and metrics of Oracle's AWR that can be of substantial help for troubleshooting and for capacity planning. Besides the standard AWR report, time-based analysis is often very useful. However this type of access is made slow and tedious by a list of tasks, such as fetching data, calculating incremental changes and producing interesting graphs. This normally requires developing or downloading a bunch of scripts and adding some visualization techniques on top. See for example Karl Arao's scripts and blog.
Tanel Poder has already demonstrated a few years ago that the pivot chart functionality of Excel can be an excellent platform for performance data visualization including AWR (and Perfstat) data. See his work on Perfsheet 2.0 and PerfSheet3.
What's new: The main drive to develop PerfSheet4 was to (try to) make the user interface as simple as possible, without compromising the general nature of the tool. A set of pre-defined queries for common AWR analyses comes with the tool. The plotting functionality will generate custom graphs for such data, an easy entry point to analysis.
If direct access to the DB is not possible, data can be extracted from the DB with the scripts supplied in the sql*plus_scripts directory and then loaded as csv. Similarly, export to csv can be used to archive data for later analysis.
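As an illustration of the kind of extraction involved (this is a sketch, not the actual script shipped with the tool), AWR wait event data can be spooled to csv with sql*plus along these lines:
set pagesize 0 linesize 500 colsep ',' trimspool on
spool awr_system_event.csv
select snap_id, dbid, instance_number, event_name, wait_class, total_waits, time_waited_micro from dba_hist_system_event;
spool off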
PerfSheet4 is a free tool and is available for download from this link. The tool has been developed and tested on Excel 2010 against Oracle 11gR2. See here below a snapshot of the user interface with some additional details.
Why Excel: Pivot tables and charts are a great feature of Excel that makes it quick and easy to analyze multidimensional data. This can be of great help when troubleshooting problems in production, for example when there is the need to try out different ideas and data combinations.
Why pre-defined graphs: This feature is again to make life easier for the user. The idea is to automate the tedious task of setting up the pivot chart fields and provide an entry point to get to the needed graph.
Available queries: version 3.2p beta comes with the following queries (and pre-defined graphs):
Query name | Short description |
---|---|
Wait events | Extracts data from dba_hist_system_event, computes delta values between snapshots and rates (i.e. delta values divided over delta time) |
System statistics | Extracts data from dba_hist_sysstat, computes delta values between snapshots and rates (i.e. delta values divided over delta time) |
Stats per service | Extracts data from dba_hist_service_stat, computes delta values between snapshots and rates (i.e. delta values divided over delta time) |
Workload data | Extracts data from dba_hist_sysmetric_summary, metrics give details on the usage of system resources over time ranges |
IO wait latency | Extracts data from dba_hist_event_histogram for io related events, computes delta values between snapshots and rates (i.e. delta values divided over delta time) |
Top 5 wait per instance | Extracts data from dba_hist_system_event and dba_hist_sysstat, computes delta values and rates (delta value over delta time) and selects top 5 non idle events for each instance |
Wait events per class | Extracts data from dba_hist_system_event, computes delta values and rates (delta value over delta time) aggregating over wait class |
Future work: Stabilize PerfSheet4 against bugs and for different execution environments; for this, feedback from other users would be very much appreciated. Possibly add more pre-defined queries (for AWR analytics, but not necessarily only that). Add a post with examples of how analyses performed with PerfSheet4 have been useful for my job on various occasions. In the meantime, if you find the tool useful, or find bugs, or have ideas for possible improvements, I'll be interested to know!
Acknowledgments: Many thanks to Tanel Poder, the co-author of the tool.
↧
Testing Lost Writes with Oracle and Data Guard
Topic: This post is about lost writes in Oracle, and in particular about techniques to reproduce and investigate the effects of lost writes in a test environment and with Data Guard
Motivations: Imagine this scenario: a production system has two standbys to protect against disaster and to load balance read-only load (with Active Data Guard (ADG)). A lost write happens in the primary and remains unnoticed for a few days. Finally the block that suffered a lost write is updated again. Both standbys stop applying redo throwing ORA-600 [3020] (also known as stuck recovery). The primary DB keeps working fine, although it is logically corrupted by the lost write.
You are the DBA in charge of fixing this, what would you do?
The importance of testing: I hope that the example above illustrates that lost writes can generate quite complex recovery scenarios and, overall, a few headaches for support DBAs. In this post I illustrate a few techniques and examples that can be used to test the effects of lost writes in Oracle and therefore prepare in case a real-world issue strikes. Of particular interest will be to test the effect of lost writes in an environment with (Active) Data Guard.
We need to cover some ground first on techniques and how to setup the test. But first a definition.
Lost writes: "A data block lost write occurs when an I/O subsystem acknowledges the completion of the block write, while in fact the write did not occur in the persistent storage" (from support note 1302539.1). Lost writes can be caused by faulty storage, but also anything in between our data in RAM (in the buffer cache, for example) and disk storage can potentially cause similar effects, this list includes also potential Oracle bugs.
Digression on techniques.
1. A useful technique that we will need in the following is the ability to read and write a single block from Oracle data files (in a test environment). For databases on filesystems (and also DBs on NFS) dd is the tool for this job (I am using the Linux platform as reference). Examples:
read one 8KB block from filesystem (block 134 in this example):
dd if=testlostwrite.dbf bs=8192 count=1 skip=134 of=blk134.dmp conv=notrunc
write one 8KB block to filesystem (block 134 in this example):
dd of=testlostwrite.dbf bs=8192 count=1 seek=134 if=blk134.dmp conv=notrunc
Note when writing we must use conv=notrunc or else we will end up with an unusable output file. Note also the syntax for specifying the block offset, skip is used for input files (reads) and seek for output files (writes).
How to read and write single blocks on ASM data files. One possibility is to take a backup copy of the datafile with RMAN, edit it with dd (as detailed above), then (with rman again) restore the backup copy. With a little knowledge of ASM internals, more direct ways to access files in ASM are available: one can find the position of the block (and its mirror copies if relevant) and then use dd to read/write data directly. Currently my preferred way is slightly different and it exploits the dbms_diskgroup package. This is an undocumented package (see again the link above on ASM internals for some additional details) although it is extensively used by Oracle's asmcmd utility. I have packaged the dbms_diskgroup.read and dbms_diskgroup.write calls into a small utility written in perl (that I called asmblk_edit, follow this link to download asmblk_edit). Similar ideas can be found also in support note 603962.1.
The following example illustrates using the utility asmblk_edit to read and write block number 134 from and to a data files stored in ASM:
read one 8KB block from an Oracle data file in ASM (block 134 in this example):
./asmblk_edit -r -s 134 -a +TESTDB_DATADG1/TESTDB/datafile/testlostwrite.353.810578639 -f blk134.dmp
write one 8KB block to an Oracle data file in ASM (block 134 in this example):
./asmblk_edit -w -s 134 -a +TESTDB_DATADG1/TESTDB/datafile/testlostwrite.353.810578639 -f blk134.dmp
2. Another technique that we need is quite straightforward and allows us to find the offset of the Oracle block that we want to read/write to for our tests.
The example here below shows how to find the block number where data is stored, using rowid:
SQL> select rowid, dbms_rowid.ROWID_BLOCK_NUMBER(rowid), a.* from testlosttable a;
Incidentally finding the block number for an index leaf block can be done with the following (using the undocumented function sys_op_lbid):
SQL> select rowid rowid_table_from_index_leaf, sys_op_lbid(18297477,'L',t.rowid) index_leaf_rowid from testlosttable t --note: 18297477 in this example is the data_object_id of the index I am examining, edit with the actual number as needed
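The data_object_id used in the query above can be found, for example, as follows (the index name refers to the test case created later in this post):
SQL> select data_object_id from dba_objects where object_name = upper('i_testlosttable') and object_type = 'INDEX';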
3. Last but not least, we need a way to modify data blocks 'under the nose of Oracle', in particular we want to make sure we flush/invalidate the relevant cached data and metadata. The method we will use in the following is:
- Offline the tablespace where data resides (this flushes dirty blocks and invalidates cache entries)
- Perform the read/write modifications to the block, with dd or asmblk_edit, as needed
- Online the tablespace again before next usage
A basic recipe to reproduce the effects of a lost write.
We can now put together the ideas and techniques described above into a working example aimed at reproducing the effects of a lost write in Oracle:
SQL> create bigfile tablespace testlostwrite datafile '..{path to datafile}../testlostwrite.dbf' size 10m;
SQL> create table testlosttable (id number, payload varchar2(100)) tablespace testlostwrite ;
SQL> create index i_testlosttable on testlosttable (id) tablespace testlostwrite ;
SQL> insert into testlosttable values (10,'aaaaaaaaaaaaaaaaaaaaaaaaa');
SQL> insert into testlosttable values (20,'bbbbbbbbbbbbbbbbbbbbbbbbb');
SQL> commit;
SQL> alter tablespace testlostwrite offline;
-- take a copy of the block containing the table rows (block 134 in this example), e.g. with:
-- ./asmblk_edit -r -s 134 -a +TESTDB_DATADG1/TESTDB/datafile/testlostwrite.353.810578639 -f blk134.dmp
SQL> alter tablespace testlostwrite online;
SQL> insert into testlosttable values (30,'cccccccccccccccccccccccccc');
SQL> commit;
SQL> alter tablespace testlostwrite offline;
-- simulate the lost write by writing the previously saved copy of block 134 back into the data file, e.g. with:
-- ./asmblk_edit -w -s 134 -a +TESTDB_DATADG1/TESTDB/datafile/testlostwrite.353.810578639 -f blk134.dmp
SQL> alter tablespace testlostwrite online;
SQL> select /*+ INDEX_FFS(a)*/ id from testlosttable a where id is not null;
SQL> select /*+ FULL(a)*/ id from testlosttable a;
SQL> -- these 2 queries return different results in our test with a lost write.. and only one of them is correct!
We can now proceed with a very interesting test: insert another row of data into the table, for example:
SQL> insert into testlosttable values (40,'ddddddddddddddddddddddd');
SQL> commit;
What we should notice at this point is that Oracle keeps working fine and no errors are shown to the user. With the techniques discussed above we can easily show that this new row has been inserted into block 134 (the block with a lost write). Let's postpone further investigations to a later paragraph and for now just note that Oracle cannot easily find out that we have suffered a lost write.
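We can verify where the new row has landed with the rowid technique of point 2 above, for example:
SQL> select dbms_rowid.rowid_block_number(rowid) block_number, id from testlosttable where id=40;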
An example of the effects of lost writes with Standby (Data Guard).
A standby database provides a copy of the database that Oracle can use to detect lost writes (i.e. the standby provides a reference of 'good data', in first approximation). Let's see how we can test this in practice:
ORA-00600: internal error code, arguments: [3020], [10], [134], [134], [], [], [], [], [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 10, block# 134, file offset is 1097728 bytes)
ORA-10564: tablespace TESTLOSTWRITE
ORA-01110: data file 10: '+STDBY_DATADG1/TESTDB/datafile/testlostwrite.347.810578685'
ORA-10561: block type 'TRANSACTION MANAGED DATA BLOCK', data object# 18299147
Note: additional information is dumped in the trace files of MRP and of the recovery slaves
Oracle 11g and lost write protection.
In our previous example to reproduce lost write effects, Oracle will only throw an error (ORA-600 [3020], aka stuck recovery) when a DML operation is performed on the primary DB against a block that had suffered a lost write. This means that in practice lost writes may remain silently unnoticed in the primary for an unlimited period of time. A new feature of 11g can be used to make Oracle look for lost writes also in blocks used for queries (in the primary). This is what we can do to use it:
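A minimal sketch of the setup (see the documentation for the full procedure) is to set DB_LOST_WRITE_PROTECT on both the primary and the standby, for example:
SQL> alter system set db_lost_write_protect=typical scope=both sid='*';  -- run on the primary and on the standby(s)
With this in place the primary logs block-read records into the redo stream; when the standby applies that redo it compares them with its own copy of the blocks, so a lost write on the primary is reported at the standby (ORA-752), as in the excerpt below: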
Hex dump of (file 10, block 134) in trace file {path..}/standby_pr09_26471.trc
Reading datafile '+STDBY_DATADG1/TESTDB/datafile/testlostwrite.347.810578685' for corruption at rdba: 0x00000086 (file 10, block 134)
Read datafile mirror 'STDBY_DATADG1_0000' (file 10, block 134) found same corrupt data (logically corrupt)
Read datafile mirror 'STDBY_DATADG1_0011' (file 10, block 134) found same corrupt data (logically corrupt)
NO REDO AT OR AFTER SCN 6367748422450 CAN BE USED FOR RECOVERY.
ORA-00752: recovery detected a lost write of a data block
ORA-10567: Redo is inconsistent with data block (file# 10, block# 134, file offset is 1097728 bytes)
ORA-10564: tablespace TESTLOSTWRITE
ORA-01110: data file 10: '+STDBY_DATADG1/TESTDB/datafile/testlostwrite.347.810578685'
ORA-10561: block type 'TRANSACTION MANAGED DATA BLOCK', data object# 18299538
Note: additional information is dumped in the trace files of MRP and of the recovery slaves
Digression on DB_LOST_WRITE_PROTECT.
A comment: from the discussion above we see that the extra checks triggered by setting DB_LOST_WRITE_PROTECT are definitely an improvement over the 10g behavior, although this mechanism does not provide complete protection against lost writes; it only gives us a higher probability that a lost write will be found.
Motivations: Imagine this scenario: a production system has two standbys to protect against disaster and to load balance read-only load (with Active Data Guard (ADG)). A lost write happens in the primary and remains unnoticed for a few days. Finally the block that suffered a lost write is updated again. Both standbys stop applying redo throwing ORA-600 [3020] (also known as stuck recovery). The primary DB keeps working fine, although it is logically corrupted by the lost write.
You are the DBA in charge of fixing this, what would you do?
The importance of testing: I hope that the example above illustrates that lost write can generate quite complex recovery scenarios and overall a few headaches to support DBAs. In this post I illustrate a few techniques and examples that can be used to test the effects of lost writes in Oracle and therefore prepare in case of a real-world issue strikes. Of particular interest will be to test the effect of lost writes in an environment with (Active) Data Guard.
We need to cover some ground first on techniques and how to setup the test. But first a definition.
Lost writes: "A data block lost write occurs when an I/O subsystem acknowledges the completion of the block write, while in fact the write did not occur in the persistent storage" (from support note 1302539.1). Lost writes can be caused by faulty storage, but also anything in between our data in RAM (in the buffer cache, for example) and disk storage can potentially cause similar effects, this list includes also potential Oracle bugs.
Digression on techniques.
1. A useful technique that we will need in the following is the ability to read and write a single block from Oracle data files (in a test environment). For databases on filesystems (and also DBs on NFS) dd is the tool for this job (I am using the Linux platform as reference). Examples:
read one 8KB block from filesystem (block 134 in this example):
dd if=testlostwrite.dbf bs=8192 count=1 skip=134 of=blk134.dmp conv=notrunc
write one 8KB block to filesystem (block 134 in this example):
dd of=testlostwrite.dbf bs=8192 count=1 seek=134 if=blk134.dmp conv=notrunc
Note that when writing we must use conv=notrunc or else we will end up with an unusable output file. Note also the syntax for specifying the block offset: skip is used for input files (reads) and seek for output files (writes).
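The bs value must match the database block size (8KB in this example); a quick way to confirm it from SQL*plus:
SQL> show parameter db_block_size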
How to read and write single blocks on ASM data files. One possibility is to take a backup copy of the datafile with RMAN, edit it with dd (as detailed above), then restore the backup copy (again with RMAN). With a little knowledge of ASM internals, more direct ways to access files in ASM are available: one can find the position of the block (and its mirror copies, if relevant) and then use dd to read/write data directly. Currently my preferred way is slightly different and exploits the dbms_diskgroup package. This is an undocumented package (see again the link above on ASM internals for some additional details), although it is extensively used by Oracle's asmcmd utility. I have packaged the dbms_diskgroup.read and dbms_diskgroup.write calls into a small utility written in Perl (which I called asmblk_edit, follow this link to download asmblk_edit). Similar ideas can also be found in support note 603962.1.
The following example illustrates using the utility asmblk_edit to read and write block number 134 from and to a data file stored in ASM:
read one 8KB block from an Oracle data file in ASM (block 134 in this example):
./asmblk_edit -r -s 134 -a +TESTDB_DATADG1/TESTDB/datafile/testlostwrite.353.810578639 -f blk134.dmp
write one 8KB block to an Oracle data file in ASM (block 134 in this example):
./asmblk_edit -w -s 134 -a +TESTDB_DATADG1/TESTDB/datafile/testlostwrite.353.810578639 -f blk134.dmp
2. Another technique that we need is quite straightforward and allows us to find the offset of the Oracle block that we want to read/write to for our tests.
The example here below shows how to find block number where data is stored, using rowid:
SQL> select rowid, dbms_rowid.ROWID_BLOCK_NUMBER(rowid), a.* from testlosttable a;
Incidentally finding the block number for an index leaf block can be done with the following (using the undocumented function sys_op_lbid):
SQL> select rowid rowid_table_from_index_leaf, sys_op_lbid(18297477,'L',t.rowid) index_leaf_rowid from testlosttable t --note: 18297477 in this example is the data_object_id of the index I am examining, edit with the actual number as needed
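If needed, the data_object_id of an index can be looked up in dba_objects; a minimal sketch, using the index name from the test case below:
SQL> select object_name, data_object_id from dba_objects where object_name = 'I_TESTLOSTTABLE' and object_type = 'INDEX';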
3. Last but not least, we need a way to modify data blocks 'under the nose of Oracle', in particular we want to make sure we flush/invalidate the relevant cached data and metadata. The method we will use in the following is:
- Offline the tablespace where data resides (this flushes dirty blocks and invalidates cache entries)
- Perform the read/write modifications to the block, with dd or asmblk_edit, as needed
- Online the tablespace again before next usage
A basic recipe to reproduce the effects of a lost write.
We can now put together the ideas and techniques described above into a working example aimed at reproducing the effects of a lost write in Oracle:
SQL> create bigfile tablespace testlostwrite datafile '..{path to datafile}../testlostwrite.dbf' size 10m;
SQL> create table testlosttable (id number, payload varchar2(100)) tablespace testlostwrite ;
SQL> create index i_testlosttable on testlosttable (id) tablespace testlostwrite ;
SQL> insert into testlosttable values (10,'aaaaaaaaaaaaaaaaaaaaaaaaa');
SQL> insert into testlosttable values (20,'bbbbbbbbbbbbbbbbbbbbbbbbb');
SQL> commit;
SQL> select rowid, dbms_rowid.ROWID_BLOCK_NUMBER(rowid), a.* from testlosttable a;
--note: this will allow to find the block_id where data resides, let's say it's block 134
SQL> alter tablespace testlostwrite offline;
-- read the block, either with dd or with asmblk_edit, and create a backup copy of it. Example:
-- ./asmblk_edit -r -s 134 -a +TESTDB_DATADG1/TESTDB/datafile/testlostwrite.353.810578639 -f blk134.dmp
SQL> alter tablespace testlostwrite online;
SQL> insert into testlosttable values (30,'cccccccccccccccccccccccccc');
SQL> commit;
SQL> alter tablespace testlostwrite offline;
-- write the block back, either with dd or with asmblk_edit, from the previously created backup copy. Example:
-- ./asmblk_edit -w -s 134 -a +TESTDB_DATADG1/TESTDB/datafile/testlostwrite.353.810578639 -f blk134.dmp
SQL> alter tablespace testlostwrite online;
SQL> -- our database has now a lost write in the table testlosttable block 134
The effect of the lost write on the table is that the row with id=30 has disappeared from the table. However the entry with id=30 is still visible in the index. This can be confirmed with the 2 queries reported below. Note that in the case of normal operations (i.e. no lost writes) the 2 queries would both return three rows; this is not the case here because of our manual editing of the table block.
SQL> select /*+ INDEX_FFS(a)*/ id from testlosttable a where id is not null;
SQL> select /*+ FULL(a)*/ id from testlosttable a;
SQL> -- these 2 queries return different results in our test with a lost write.. and only one of them is correct!
SQL> insert into testlosttable values (40,'ddddddddddddddddddddddd');
SQL> commit;
What we should notice at this point is that Oracle keeps working fine and no errors are shown to the user. With the techniques discussed above we can easily show that this new row has been inserted into block 134 (the block with a lost write). Let's postpone further investigations to a later paragraph and for now just note that Oracle cannot easily find out that we have suffered a lost write.
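For example, a minimal check that the new row has indeed landed in block 134, reusing the dbms_rowid technique shown earlier:
SQL> select id, dbms_rowid.rowid_block_number(rowid) block_number from testlosttable where id = 40;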
An example of the effects of lost writes with Standby (Data Guard).
A standby database provides a copy of the database that Oracle can use to detect lost writes (i.e. the standby provides a reference copy of 'good data', in first approximation). Let's see how this works in practice:
- We start by going through the same steps described above and we create a lost write in the primary.
- Note that at this point the standby has no knowledge that something has gone wrong in the primary.
- Now we can go ahead and run DML against the block that has suffered the lost write
- i.e. we insert the row with id=40 as detailed above.
- When the standby database tries to apply the redo (change vector) to the block with a lost write, it will compare SCN numbers and find that 'something is wrong'. MRP will stop and throw ORA-600 [3020] (stuck recovery):
ORA-00600: internal error code, arguments: [3020], [10], [134], [134], [], [], [], [], [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 10, block# 134, file offset is 1097728 bytes)
ORA-10564: tablespace TESTLOSTWRITE
ORA-01110: data file 10: '+STDBY_DATADG1/TESTDB/datafile/testlostwrite.347.810578685'
ORA-10561: block type 'TRANSACTION MANAGED DATA BLOCK', data object# 18299147
Note: additional information is dumped in the trace files of MRP and of the recovery slaves
Oracle 11g and lost write protection.
In our previous example to reproduce lost write effects, Oracle will only throw an error (ORA-600 [3020], aka stuck recovery) when a DML operation is performed on the primary DB against a block that had suffered a lost write. This means that in practice lost writes may remain silently unnoticed in the primary for an unlimited period of time. A new feature of 11g can be used to make Oracle look for lost writes also in blocks used for queries (in the primary). This is what we can do to use it:
- we set DB_LOST_WRITE_PROTECT = TYPICAL (or FULL if we prefer) on the primary database. This will cause the generation of additional redo entries when Oracle performs physical reads.
- we set DB_LOST_WRITE_PROTECT = TYPICAL on the standby; this will make MRP and its recovery slaves check for lost writes using the extra information in the redo log stream.
- If we hit a lost write, MRP and its slaves will stop and throw ORA-752 (recovery detected a lost write of a data block):
Hex dump of (file 10, block 134) in trace file {path..}/standby_pr09_26471.trc
Reading datafile '+STDBY_DATADG1/TESTDB/datafile/testlostwrite.347.810578685' for corruption at rdba: 0x00000086 (file 10, block 134)
Read datafile mirror 'STDBY_DATADG1_0000' (file 10, block 134) found same corrupt data (logically corrupt)
Read datafile mirror 'STDBY_DATADG1_0011' (file 10, block 134) found same corrupt data (logically corrupt)
STANDBY REDO APPLICATION HAS DETECTED THAT THE PRIMARY DATABASE
LOST A DISK WRITE OF BLOCK 134, FILE 10
NO REDO AT OR AFTER SCN 6367748422450 CAN BE USED FOR RECOVERY.
ORA-00752: recovery detected a lost write of a data block
ORA-10567: Redo is inconsistent with data block (file# 10, block# 134, file offset is 1097728 bytes)
ORA-10564: tablespace TESTLOSTWRITE
ORA-01110: data file 10: '+STDBY_DATADG1/TESTDB/datafile/testlostwrite.347.810578685'
ORA-10561: block type 'TRANSACTION MANAGED DATA BLOCK', data object# 18299538
Note: additional information is dumped in the trace files of MRP and of the recovery slaves
Digression on DB_LOST_WRITE_PROTECT.
A comment: from the discussion above we see that the extra checks triggered by setting DB_LOST_WRITE_PROTECT are definitely an improvement over the 10g behavior, although this mechanism does not provide complete protection against lost writes: it only gives us a higher probability that a lost write will be found.
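For reference, the parameter is dynamic and can be set and verified online; a minimal sketch (the scope clause assumes an spfile is in use):
SQL> alter system set db_lost_write_protect=typical scope=both;
SQL> show parameter db_lost_write_protect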
How about the impact in production of this parameter? Extra redo entries are generated on the primary: they are called block read redo (BRR). We can directly investigate BRR entries with 2 simple techniques: dumping redo (BRR corresponds to layer 23 opcode 2) and querying v$mystat or v$sysstat (we will look for stats with 'lost write' in their name).
Example SQL:
SQL> alter system dump logfile '...{path to storage}.../thread_2_seq_1816.510.810661967' layer 23 opcode 2;
SQL> select name, sum(value) from v$mystat se, v$statname n where n.statistic#=se.statistic# and (n.name like '%lost write%' or name like '%physical read%') group by name;
The size of BRR entries in the redo stream varies, as redo dumps show, and a few optimizations can come into play (such as batching several entries in one redo record). Based on a few observations of a production system I estimate that on average we can expect about 30 bytes of extra redo generated by BRR for each physical block read performed by the database, although mileage may vary and the impact of the parameter should definitely be tested before applying it to a busy production system! Another observation based on testing is that direct read operations (for example reads for parallel query) do not generate BRR.
On a standby DB, setting db_lost_write_protect to typical (or full) implicitly sets the underscore parameter _db_lost_write_checking (to the value 2 on my test system) and the MRP slaves will take on the extra work of checking for lost writes by comparing the SCNs in BRR entries with the SCN in the block headers. This extra work implies additional physical reads on the standby. This can be seen in v$sysstat (statistic name = 'recovery blocks read for lost write detection' and also the related statistic 'recovery blocks skipped lost write checks').
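A quick way to look at these counters on the standby (statistic names as quoted above):
SQL> select name, value from v$sysstat where name like 'recovery blocks%lost write%';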
Example of BRR entry from a logfile dump:
CHANGE #5 TYP:0 CLS:4 AFN:10 DBA:0x00000086 OBJ:18299943 SCN:0x05ca.9b6e95f9 SEQ:1 OP:23.2 ENC:0 RBL:1
Block Read - afn: 10 rdba: 0x00000086 BFT:(1024,134) non-BFT:(0,134)
scn: 0x05ca.9b6e95f9 seq: 0x01
flags: 0x00000004 ( ckval )
Note: checks for lost writes based on the SCN from BRR entries are also performed during media recovery (for example with the recover database command). Therefore even if Data Guard is not available, one can use a simple restore of the database from backup to validate BRR records and search for lost writes.
Analysis and troubleshooting.
A good starting point for troubleshooting lost writes in a production system is support note 1265884.1 "Resolving ORA-752 or ORA-600 [3020] During Standby Recovery".
However now we have a test environment with a lost write and we can have some fun investigating it and developing techniques for analysis and troubleshooting. All this with the great advantage, compared to a real-life case, that now we know for sure what the root cause of the problem is!
- Investigate the block contents on primary and standby by dumping the block contents to a trace file. The idea is to compare the contents of the primary and standby. Of particular interest will be the SCN of the last change to the block and also SCNs in the ITL list. This is an easy operation to perform. Example:
SQL> alter system dump datafile 10 block 134;
SQL> alter session set events 'immediate trace name set_tsn_p1 level <ts#+1>'; -- where ts# is the tablespace number
SQL> alter session set events 'immediate trace name buffer level <decimal rdba>'; --the rdba is 134 in our example
- We can also investigate the contents of the block with a lost write using SQL. This has the advantage of allowing the use of flashback query, which is particularly useful for reading the details of the block on the primary database: the current content of the block on the primary may not be what we want, as we are interested in a consistent image at the point in time (SCN) of the error message on the standby (the current SCN of the standby). Example:
SQL> set num 16
SQL> select ora_rowscn, rowid, dbms_rowid.rowid_row_number(a.rowid) row_number, a.* from testlosttable as of scn 6367748219413 a where rowid like 'ABFzkJAAAAAAACG%'; -- edit values for scn and rowid, use SQL below to find the values to use
Find rowid of the block with lost write (block 134 of file 10 and object_id=18299145). Actually what we need is just the first 15 characters of the rowid (the last three characters are the row_number inside the block). Example:
SQL> select substr(DBMS_ROWID.ROWID_CREATE(rowid_type =>1, object_number =>18299145, relative_fno =>0, block_number =>134, row_number =>0),1,15) rowid_prefix from dual;
Find current SCN (at the standby): SQL> select current_scn from v$database;
- Dump all the relevant logfiles searching for entries related to the block with the lost write (in our example block 134 of file 10). The dump will include transaction details and, most notably, the redo marking the time when DBWR wrote the given block (this info is stored in block written redo, BWR). If the parameter db_lost_write_protect is set to typical or full, the redo dump will also show details of the block read redo (see BRR discussed above). For further info on logfile dumps see also Julian Dyke's website. Example:
SQL> alter system dump logfile '...{path to storage}.../thread_2_seq_1816.510.810661967' DBA MIN 10 134 DBA MAX 10 134; -- edit file number and block number as needed
- Perform log mining to find the SQL of all the transactions for the affected block. Identify the relevant redo logs to mine first. Example:
SQL> BEGIN
SYS.DBMS_LOGMNR.ADD_LOGFILE(LogFileName=>'...{path to storage}.../thread_2_seq_1816.510.810661967',Options=>SYS.DBMS_LOGMNR.NEW);
SYS.DBMS_LOGMNR.START_LOGMNR(Options=> SYS.DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG);
END;
/
SQL> SELECT scn,sql_redo FROM SYS.V_$LOGMNR_CONTENTS WHERE data_obj#=18299145 and row_id like 'ABFzkJAAAAAAACG%'; -- calculate rowid with dbms_rowid package as detailed above
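Once done, the LogMiner session can be closed with the standard call:
SQL> exec SYS.DBMS_LOGMNR.END_LOGMNR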
- Other analysis that can be done in case we have indexes on the table: when one or more indexes are present we can read data from the index and compare the results with what we have in the table. Example of reading from the index:
SQL> select rowid rowid_table_from_index_leaf, id, sys_op_lbid(18299146,'L',t.rowid) index_leaf_rowid from testlosttable t where rowid like 'ABFzkJAAAAAAACG%';
--note: update 18299146 with the data_object_id of the index of interest
-- in this example this is the data_object_id of I_TESTLOSTTABLE
Another standard check for corruption between index and table is to run analyze table ... validate structure, although in my experience this can be quite time consuming and does not necessarily add more information to the analysis. Example:
SQL> analyze table testlosttable validate structure cascade online;
Actions that we can take to restore the services
Suppose that our analysis has confirmed that a lost write happened and also that we have the details of what 'is lost'. We now need to fix the corrupted block on the primary and restore the service on the standby. In particular, if we have an Active Data Guard with an SLA, the latter may be quite an urgent action. Hopefully we also have some idea of what the root cause was and a way to fix it in the future.
One possible action is to fail over to the standby. This action plan, however, may prove to be unacceptable in many circumstances, given the potential for data loss it implies. A failover would likely not be acceptable if the primary database has continued working and accepting users' transactions since the time (SCN) of the incident that generated the ORA-600 (or ORA-752).
Another possibility is to use our knowledge of the lost transactions gathered in the analysis phase to run SQL actions to 'fix the primary'. This has to be evaluated case by case. In some circumstances we can also get away with just dropping and recreating the object with a lost write. In our simple example of a lost insert on table testlosttable, the action to perform on the primary is:
SQL> alter index i_testlosttable rebuild online;
SQL> insert into testlosttable values (30,'cccccccccccccccccccccccccc');
SQL> commit;
What about fixing the standby? We can get the recovery on the standby unstuck by allowing it to corrupt the block with a lost write (and fix it later, as detailed below). Example:
SQL> alter system set db_lost_write_protect=none; --temporarily disable lost write check if needed
SQL> alter database recover automatic standby database allow 1 corruption;
SQL> --wait till the redo that caused the error message has been applied
SQL> alter database recover cancel;
SQL> -- restart normal Data Guard operations. An example for ADG:
SQL> alter system set db_lost_write_protect=typical;
SQL> alter database open read only;
SQL> alter database recover managed standby database nodelay using current logfile disconnect;
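A quick way to verify that managed recovery has restarted and is applying redo (standard view in 11g):
SQL> select process, status, thread#, sequence# from v$managed_standby where process like 'MRP%';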
At this point all is OK except for one corrupted block on the standby. How do we restore the corrupted block on the standby? This depends on the case: we may simply rebuild the object on the primary, which will also fix the problem on the standby, or we can copy the datafile over from the primary to the standby.
In our example the corrupted block is block 134 of file 10 and we have an Active Data Guard in real-time apply. We can use automatic block media recovery (ABMR) to fix it. In my tests ABMR is attempted but does not really work against the corrupted block 134; I can work around this by zeroing out the block. This is an example (intended to be used on test databases):
./asmblk_edit -w -s 134 -a +STDBY_DATADG1/TESTDB/datafile/testlostwrite.347.810578685 -f zeroblock
where zeroblock file is created with: dd if=/dev/zero bs=8192 count=1 of=zeroblock
If my standby were on a filesystem I could have used:
dd if=/dev/zero of=testlostwrite.dbf bs=8192 count=1 seek=134 conv=notrunc
If we now query the table (testlosttable in our example), when Oracle reaches the zeroed block it will fetch a copy from production (see the manual for the details of configuring ABMR). This happens in a transparent way for the user issuing the query; the operation is logged in the alert log of the instance (the 2 lines here below appeared twice in my test using 11.2.0.3):
Automatic block media recovery requested for (file# 10, block# 134)
Automatic block media recovery successful for (file# 10, block# 134)
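A possible double-check after the repair is to verify that no corrupt blocks remain listed in the standard view v$database_block_corruption (whether the zeroed block is registered there at all depends on how the corruption was detected):
SQL> select * from v$database_block_corruption;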
Oracle's OTN demo video on lost write
Short videos with demos on testing lost writes and automatic block media recovery in a Data Guard environment can be found on OTN.
See also support document "Best Practices for Corruption Detection, Prevention, and Automatic Repair - in a Data Guard Configuration" [ID 1302539.1]
Conclusions and future work
This article illustrates a simple example of how to reproduce and investigate the effects of a lost write in a test environment, intended as a training exercise for DBAs.
I would like to end with four points that I still find unsettling when thinking about how lost writes can affect an Oracle setup configured for high availability:
- Finding the root cause of a lost write can prove to be very time consuming: was it caused by the storage, by an Oracle bug, or by something in between? If the issue cannot be reproduced you can easily find yourself in the middle of a finger-pointing discussion between vendors.
- Having a standby is of great help for discovering lost writes. However once a lost write is found the redo apply stops: the recovery is stuck and throws error ORA-600 [3020] or ORA-752 if we try to restart it. A support DBA most likely will have to analyze the situation and decide what to do to fix the issue (for example perform a failover, or rather go through a process similar to the one described in this article). Moreover if one or more Active Data Guards are used for critical read-only activity, there is time pressure to restart the redo apply.
- How can we check that our primary and standby databases are 'in sync', that is, that there are no lost writes waiting to 'explode' like time bombs? I am not aware of a utility that could do such a check. This is a very interesting topic, probably material for another post. A brief discussion of this problem and possible solutions can be found at this link and also here.
- Database restores can be affected by lost writes too. For example the restore of a database backup can fail (get stuck on applying redo) because of a lost write that has happened in production after the latest datafile backup. This has potential impacts on the disaster recovery strategy.
Note: the examples reported in this article have been tested against Oracle 11.2.0.3 64 bit for Linux (RHEL5).
↧
↧
Oracle Events' Latency Visualization and Heat Maps in SQL*plus
Topic: This post is about a technique for Oracle performance tuning, the use of heat maps to investigate wait event latency (and in particular I/O-related latency). This post also discusses a SQL*plus-based script/tool I have developed to help with this type of monitoring and performance drill-down (OraLatencyMap).
Context: Oracle exposes latency data for the wait event interface in V$EVENT_HISTOGRAM. This gives an additional dimension to drill down performance data for analysis and troubleshooting. In an older blog post I described an example of troubleshooting a storage issue for an OLTP (RAC) database by investigating the histogram data of the 'db file sequential read' wait event. In that context I had also developed and discussed ehm.sql, a simple PL/SQL script to collect and display data from GV$EVENT_HISTOGRAM.
What's new: An excellent article by Brendan Gregg, "Visualizing System Latency", Communications of the ACM, July 2010 has inspired me to develop an enhanced version of ehm.sql. The idea is to display in real time data of current and historical values of selected wait event histograms. In particular I/O-related events such as db file sequential read and log file sync make excellent candidates for this type of analysis. Moreover those events are relevant in a context that is familiar to me, that is drilling down issues with OLTP performance and access to storage.
As Brendan shows in his article, I/O latency data fits naturally to heat map representation, where time is on the horizontal axis, latency buckets are on the vertical axis and the quantity to display (for example number of waits or time waited) is displayed as color (hence the name heat map).
The tool: OraLatencyMap is a tool I have developed to help extract and represent event histogram data in a heat map. It is intended to be lightweight and 'DBA-friendly'. It is implemented in PL/SQL, it does not require the creation of any DB objects and it runs under SQL*plus. OraLatencyMap requires a terminal supporting ANSI escape codes (for example PuTTY, MobaXterm, xterm; notably it does not run under Windows' cmd.exe). Making SQL*plus behave like a monitoring tool requires jumping through a few hoops. Credits to the experts who have shared their results in this area and thus made my job much easier here, in particular: Tanel Poder (for moats, sqlplus and color, snapper, etc.), Adrian Billington (moats) and Marcin Przepiorowski (topaas).
Example 1: study of db file sequential read
See here below a screen shot of a putty terminal where I ran SQL*plus and @OraLatencyMap. The script samples GV$EVENT_HISTOGRAM roughly every 3 seconds and displays 2 heat maps. The top heat map gives information on the number of waits per second on each latency bucket. The bottom heat map instead represents the estimated wait time per latency bucket. The two graphs represent the same type of information but with 2 different 'points of view'.
SQL> @OraLatencyMap
This type of data is useful when investigating single block read latency issues in OLTP systems, for example. I'll leave a discussion of the details and limitations of this approach for another time; I'll just point out that, among other things, it's important to make sure the system is not starved of CPU in order to make sense of the data (CPU data not shown here). When reading the heat map I typically focus on 3 areas: one is the low-latency area (1 ms), which gives info on what are most likely reads from the storage cache; the second is the area around the 16 and 32 ms buckets, most likely representing physical reads from 'rotating disks'; the third, very interesting, area to watch is the high-latency area (>100 ms), that is the area of "IO latency outliers", which can be a sign of problems with the storage, for example. Note that OraLatencyMap is a drill-down tool based on Oracle instrumentation, so the interpretation of the results, especially when extended to storage, will often need additional data from the specific context being investigated.
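For reference, the view sampled by the script can also be queried directly; a minimal sketch for the same event:
SQL> select inst_id, wait_time_milli, wait_count from gv$event_histogram where event = 'db file sequential read' order by inst_id, wait_time_milli;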
Example 2: study of log file sync
This example is about visualizing the latency of log file sync. This can be useful when drilling down commit latency issues. This is a critical area for many OLTP systems and the troubleshooting is often not easy: storage performance, server CPU starvation and unexpected behavior by LGWR, among others can all potentially cause problems in this area.
This screen shot was taken from a putty window, this time with a white background.
SQL> @OraLatencyMap_event 3 "log file sync"
Conclusions
Latency investigations of Oracle wait events give an additional and powerful dimension to performance tuning and troubleshooting. This is particularly useful for (synchronous) I/O-related wait events such as db file sequential read and log file sync. Latency heat maps are particularly suited to visualizing IO latency (see also Brendan Gregg's article). The first version of a simple SQL*plus script (OraLatencyMap) to collect and visualize event histogram data as heat maps has been discussed.
OraLatencyMap is available for download from http://canali.web.cern.ch/canali/resources.htm
↧
OraLatencyMap v1.1 and Testing with SLOB 2
Topic: OraLatencyMap v1.1 is an updated version of a performance tool aimed at collecting and displaying Oracle wait event histogram data as latency heat maps. We will also briefly discuss an example of the usage of OraLatencyMap in the context of storage testing.
OraLatencyMap v1.1 is now available with a few new features and bug fixes (v1.0 is described here). Many thanks to all who have tried it already and left a note either on the blog or twitter.
The main new feature in v1.1 is an advanced mode allowing for a few more parameters and customizations: the number of samples displayed in the map, the number of latency buckets and the possibility to limit data collection to a subset of instances (this is relevant for RAC).
Another new feature is that we now display the maximum value of the calculated sum of the displayed values (i.e. the sum of N# of wait events per second and the sum of time waited). This is intended to help with identifying the peak performance values (for example maximum number of IOPS).
README:
OraLatencyMap, a performance widget to visualize Oracle I/O latency using Heat Maps
Luca.Canali@cern.ch, v1.1, May 2013
Credits: Brendan Gregg for "Visualizing System Latency", Communications of the ACM, July 2010, Tanel Poder (snapper, moats, sqlplus and color), Marcin Przepiorowski (topaas)
Notes: These scripts need to be run from sqlplus from a terminal supporting ANSI escape codes.
Tested on 11.2.0.3, Linux.
Run from a privileged user (select on v$event_histogram and execute on dbms_lock.sleep)
How to start:
sqlplus / as sysdba
SQL> @OraLatencyMap
More examples:
SQL> @OraLatencyMap_event 3 "log file sync"
SQL> @OraLatencyMap_advanced 5 "db file sequential read" 12 80 "and inst_id=1"
Output: 2 latency heat maps of the given wait event
The top map represents the number of waits per second and per latency bucket
The bottom map represents the estimated time waited per second and per latency bucket
with the advanced script it is possible to customize sampling time, event name, screen size
moreover in RAC, the default is to aggregate histogram data over all nodes, but this is customizable too
Scope: Performance investigations of event latency. For example single block read latency with OraLatencyMap.sql
Related: OraLatencyMap_advanced.sql -> this is the main script for generic investigation of event latency with heat maps
OraLatencyMap_event.sql -> another script based on OraLatencyMap_advanced
OraLatencyMap_internal.sql -> the slave script where all the computation and visualization is done
OraLatencyMap_internal_loop.sql -> the slave script that runs several dozens of iterations of the tool's engine
OraLatencyMap and storage testing with SLOB
OraLatencyMap was originally written for troubleshooting and drilling down issues with production DBs. I find that OraLatencyMap can also be of help in the context of storage testing (say, for example, when installing a new system or evaluating a new storage infrastructure). SLOB 2 by Kevin Closson is a solid reference and overall a great tool for testing storage with Oracle, in particular for testing random I/O activity. Therefore I have used SLOB to drive the workload for the examples here below.
The outline of this simple test: (1) generate test data with SLOB, (2) run the SLOB test for read-only random IO with increasing load values, (3) run OraLatencyMap while the test is running (focus on IOPS and latency values).
The picture here below shows the output of OraLatencyMap taken during 4 different runs of SLOB with increasing load (see also slob.conf below and the annotations on the graph).
The measured workload is almost entirely dominated by wait events of the type "db file sequential read", that is, random single-block reads.
We can see that by increasing the load (number of concurrent SLOB sessions) we can drive more IOPS out of our storage. At the same time we observe that the latency is increasing with increasing load.
How to read IOPS with OraLatencyMap? The sum of the number of waits per second is the metric to look at. I have copied measured values for IOPS as annotations in the figure here below.
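As a cross-check of the IOPS figures read from the heat map, the system statistics can also be sampled directly; a minimal sketch (it assumes select on v$sysstat and execute on dbms_lock.sleep, as listed in the README above):
SQL> select value from v$sysstat where name = 'physical read total IO requests';
SQL> exec dbms_lock.sleep(60)
SQL> select value from v$sysstat where name = 'physical read total IO requests';
SQL> -- (second value - first value) / 60 gives the average read IOPS over the interval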
The storage system under test is a simple JBOD configuration of 2 storage arrays with 12 SAS 10K rpm disks per array. The storage is connected to the servers via Fiber Channel (8 Gbps). The database is Oracle 11.2.0.3 for Linux x86_64 with ASM. Storage is allocated on a normal redundancy disk group built with 23 disks from the 2 storage arrays.
Why is this useful? First of all it's a quick and easy way to start investigations of the storage. Single block random read latency is very important for many OLTP applications. We can therefore learn about the latency we can expect from the storage at different loads. We can learn about the maximum IOPS, and overall see the behavior at storage saturation for this type of workload.
Note also that after each run SLOB 2 produces a series of reports (including AWR and iostat) with much more information on the workload than what is available by just observing the OraLatencyMap output.
Coming back to the example of the JBOD configuration we can see from the figure below that the measured values for IOPS are consistent with expectations: each disk delivering ~200 IOPS. This is consistent with other measurements previously done on the same system, see also this presentation. The measured latency is in the range of 4-8 ms for low load and starts to increase considerably when we start to drive the disks closer to maximum IOPS, also something that is expected.
SQL> @OraLatencyMap_advanced 10 "db file sequential read" 11 110 ""
A potential pitfall when testing storage is to run our tests with too little data and in particular to have test data that fit in the controller's cache. The figure here below shows just an example of that. The test data there were easily cached by the arrays (4 GB in total for this system). The net outcome is that we have very high figures for IOPS that just don't make sense with the number and type of disks we have.
Indeed the measured latency values confirm that we are mostly reading from cache: we see that the majority of the measured wait events are in the 1 ms latency bucket (wait time of 1 ms or less).
Note on the test: the main difference between this test and the test described above is in amount of data used. The SLOB parameter SCALE = 10000 for this test, SCALE = 1000000 for the test discussed above.
Comment: the example described here is quite basic, however it is the case that many storage arrays these days come with large amounts of SSD cache. It is important to understand/measure if test data fit in the cache to make sense of the results of the stress tests.
SQL> @OraLatencyMap
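A quick way to compare the amount of test data with the array cache size (assuming the SLOB tablespace is called SLOB, as in the setup example further below):
SQL> select round(sum(bytes)/1024/1024/1024) size_GB from dba_segments where tablespace_name = 'SLOB';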
Notes: slob.conf and other details regarding the tests. See SLOB 2 manual for more info on the meaning of the parameters.
slob.conf:
UPDATE_PCT=0
RUN_TIME=200
WORK_LOOP=0
SCALE=1000000 #for test N.2 this is scaled down to 10000
WORK_UNIT=256
REDO_STRESS=HEAVY
LOAD_PARALLEL_DEGREE=8
SHARED_DATA_MODULUS=0
How to create test data:
./setup.sh SLOB 128 #this needs about 1TB of space in the SLOB tablespace
Relevant init.ora parameters:
db_cache_size=50m
db_block_prefetch_limit = 0
db_block_prefetch_quota = 0
db_file_noncontig_mblock_read_count = 0
Conclusions
OraLatencyMap is a tool for measuring and displaying wait event histogram data in Oracle as latency heat maps. The tool can be used to troubleshoot production issues related to storage latency. OraLatencyMap can be of assistance when testing storage together with Oracle-based stress testing tools such as Kevin Closson's SLOB 2.
↧
DTrace Explorations of Oracle Wait Events on Linux and Solaris
DTrace is a great tool to measure and investigate latency for performance troubleshooting. DTrace is now coming to the Linux platform too and I would like to share a few tests I did with it. The following is meant to be a short technology exploration but hopefully it can also highlight what I believe is an area of great potential for performance investigations and troubleshooting of Oracle workloads, namely the integration of DTrace scripts and Oracle's wait events interface.
1 - Warmup: DTrace can probe OS calls coming from an Oracle workload. The following is a standard DTrace script to measure the latency of (synchronous) I/O calls. The code here is modified for Linux x86-64 and it works by tracing pread64 system calls. This makes sense when using ASM (without asmlib) and performing single block reads (e.g. waiting for db file sequential read events in Oracle). In Solaris we would trace pread calls instead. The use of the DTrace function quantize allows us to aggregate data into histograms, similar to Oracle's v$event_histogram. Finally the values are printed every 10 seconds (and the counters reset to zero). Simple, elegant, effective!
dtrace -n '
syscall::pread64:entry { self->s = timestamp; }
syscall::pread64:return /self->s/ { @pread["ns"] = quantize(timestamp -self->s); self->s = 0; }
tick-10s {
printa(@pread);
trunc(@pread);
}'
2 - DTrace meets Oracle. DTrace can also trace user processes, using the pid Provider. It can trace function calls, see which arguments are passed to them and what is returned. There is an enormous potential to use this to investigate performance issues with Oracle. Let's see here how we can "tap" into Oracle's wait event instrumentation to link DTrace to wait events.
With a little experimentation with the tools described below in the article we find that in 11.2.0.3 Oracle calls kews_update_wait_time with the elapsed time in microseconds as arg1. Moreover, when kskthewt is called, arg1 is the wait event number (to be resolved by querying v$event_name). We can write the following as a test to print all wait events for a given Oracle process (the OS process id is 1737 in this example):
# this is for Oracle 11.2.0.3, Linux and Solaris: trace os process pid=1737 (edit os pid as relevant)
dtrace -n '
pid1737:oracle:kews_update_wait_time:entry {
self->ela_time = arg1;
}
pid1737:oracle:kskthewt:entry {
printf("event#= %d, time(mu sec) = %d", arg1, self->ela_time);
}'
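The event# printed by the script can be mapped to an event name with a query against v$event_name, for example (146 here is just an illustrative value, taken from the sample output further below):
-- map the wait event number printed by DTrace to its name
select event#, name, wait_class
from v$event_name
where event# = 146;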
In the following example we use the same technique to produce wait event histograms for two Oracle processes (2349 and 2350 in the example below). The measured time is in microseconds, therefore improving granularity compared to Oracle's v$event_histogram data, which uses millisecond buckets:
# this is for Oracle 11.2.0.3, Linux and Solaris (edit os pids as relevant)
dtrace -n '
pid2349:oracle:kews_update_wait_time:entry,pid2350:oracle:kews_update_wait_time:entry {
self->ela_time = arg1;
}
pid2349:oracle:kskthewt:entry,pid2350:oracle:kskthewt:entry {
@event_num[arg1] = quantize(self->ela_time);
}
tick-10s {
printa(@event_num);
trunc(@event_num);
}'
Sample output:
1 294099 :tick-10s
146 (edited: db file sequential read)
value ------------- Distribution ------------- count
128 | 0
256 |@ 14
512 |@@@@@ 120
1024 |@@@@@@@@ 182
2048 |@@@@@@@@@@@@@@ 320
4096 |@@@@@@@@@@@@@ 302
8192 | 5
16384 | 4
32768 | 0
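For comparison, the millisecond-bucket histogram that Oracle itself maintains for the same event can be read from v$event_histogram, for example:
-- Oracle's own histogram for the event, bucketed in milliseconds
select event, wait_time_milli, wait_count
from v$event_histogram
where event = 'db file sequential read'
order by wait_time_milli;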
3 - DTrace can query V$ views. DTrace's ability to snoop into user processes, and therefore into Oracle's engine, extends also to internal data structures. In the following example we use DTrace to read v$session data from the memory structure underlying the X$KSUSE fixed table.
The address for the relevant 'row' of X$KSUSE is taken from arg1 of kslwt_end_snapshot. After that we read the memory directly, triggered by the execution of kskthewt (other choices are possible). The memory offsets for the X$KSUSE fields of interest are port-specific and can be found with a debugger. Below are examples for Linux and Solaris.
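As a complementary sketch, the column layout of X$KSUSE can also be listed from the fixed-table dictionary (X$KQFTA and X$KQFCO, queried as SYS); whether those offsets line up exactly with the base address returned by kslwt_end_snapshot is an assumption to verify with the debugger:
-- list column names, offsets and sizes for the X$KSUSE fixed table (run as SYS)
select c.kqfconam column_name, c.kqfcooff col_offset, c.kqfcosiz size_bytes
from x$kqfco c, x$kqfta t
where c.kqfcotab = t.indx
and t.kqftanam = 'X$KSUSE'
order by c.kqfcooff;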
# this is for Oracle 11.2.0.3, Linux, tested on OEL 6.4 x86-64 (edit os pid as relevant)
dtrace -n '
pid1737:oracle:kslwt_end_snapshot:return {
self->ksuse =arg1;
}
pid1737:oracle:kskthewt:entry {
KSUSESEQ = *(uint16_t *) copyin(self->ksuse+5536,2);
KSUSEOPC = *(uint16_t *) copyin(self->ksuse+5538,2);
KSUSEP1 = *(uint64_t *) copyin(self->ksuse+5544,8);
KSUSEP2 = *(uint64_t *) copyin(self->ksuse+5552,8);
KSUSEP3 = *(uint64_t *) copyin(self->ksuse+5560,8);
KSUSTIM = *(uint32_t *) copyin(self->ksuse+5568,4);
SQL_HASH = *(uint32_t *) copyin(self->ksuse+5796,4);
printf("seq=%u, event#=%u, p1=%u, p2=%u, p3=%u, sql_hash=%u, time=%u", KSUSESEQ, KSUSEOPC, KSUSEP1, KSUSEP2, KSUSEP3, SQL_HASH, KSUSTIM);
}'
Sample output:
dtrace: description 'pid1737:oracle:kslwt_end_snapshot:return ' matched 2 probes
FUNCTION:NAME
kskthewt:entry seq=6634, event#=146, p1=4, p2=1345795, p3=1, sql_hash=334560939, time=10290
kskthewt:entry seq=6635, event#=445, p1=0, p2=215, p3=245, sql_hash=334560939, time=12
kskthewt:entry seq=6636, event#=197, p1=4, p2=1345796, p3=124, sql_hash=334560939, time=8499
kskthewt:entry seq=6637, event#=197, p1=4, p2=606850, p3=126, sql_hash=334560939, time=9898
kskthewt:entry seq=6640, event#=352, p1=1650815232, p2=1, p3=0, sql_hash=334560939, time=481
kskthewt:entry seq=6641, event#=348, p1=1650815232, p2=1, p3=0, sql_hash=1029988163, time=12
From the sample output above we can see that DTrace is able to read X$KSUSE (or at least several fields of it). This opens the possibility of collecting data with extended filters, performing aggregations, etc. For example we can use this technique to filter all db file sequential read events for a given file number, for a given I/O wait time threshold, and/or for a given sql_hash value. The possibilities are many and worth further exploration.
Note for db file sequential read: p1 = file number, p2 = block number, p3 = number of blocks (see also v$event_name).
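The parameter meanings can be double-checked directly in v$event_name, for example:
-- p1/p2/p3 meanings for the event
select name, parameter1, parameter2, parameter3
from v$event_name
where name = 'db file sequential read';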
Same script, but this time for Solaris:
# this is for Oracle 11.2.0.3, tested on Solaris 11 x86-64
dtrace -p 1737 -n '
pid$target:oracle:kslwt_end_snapshot:return {
self->ksuse =arg1;
}
pid$target:oracle:kskthewt:entry {
KSUSESEQ = *(uint16_t *) copyin(self->ksuse+5592,2);
KSUSEOPC = *(uint16_t *) copyin(self->ksuse+5594,2);
KSUSEP1 = *(uint64_t *) copyin(self->ksuse+5600,8);
KSUSEP2 = *(uint64_t *) copyin(self->ksuse+5608,8);
KSUSEP3 = *(uint64_t *) copyin(self->ksuse+5616,8);
KSUSEPOBJ = *(uint64_t *) copyin(self->ksuse+6232,8);
KSUSTIM = *(uint32_t *) copyin(self->ksuse+5624,4);
printf("seq=%u, event#=%u, p1=%u, p2=%u, p3=%u, obj#=%u, time=%u", KSUSESEQ, KSUSEOPC, KSUSEP1, KSUSEP2, KSUSEP3, KSUSEPOBJ, KSUSTIM);
}'
Tools, resources and credits: A great tool for investigating Oracle calls with DTrace is digger by Alexander Anokhin and Brendan Gregg. It works under Solaris (I could not make it work under the Linux port yet) and allows tracing of all function calls and their arguments. Digger makes it quite easy to see, for example, where Oracle instruments wait events, by tracing: ./digger.sh -F -p <os pid> -f '*wt*'
Another great resource for ideas on how to trace Oracle functions is Frits Hoogland's blog, in particular the article Profiling of Oracle function calls.
Tanel Poder has made a great video on DTracing the Oracle query execution engine, available on Enkitec TV.
Brendan Gregg is a reference when talking about DTrace and systems performance tuning, see his website and blog. See also Oracle's DTrace User Guide.
Paul Fox is the author of a port of DTrace for Linux (the other port being Oracle's, of course): GitHub repository.
DTrace lab environment for Oracle: I have used a test environment built with VirtualBox and installed with Oracle Linux 6.4. As the DTrace port for these tests I have used dtrace4linux.
Note: in my Linux-based DTrace test environment I often get the error message "dtrace: processing aborted: Abort due to systemic unresponsiveness". This can be worked around by adding the -w flag when running DTrace scripts, that is by allowing destructive actions.
It's also beneficial to have a Solaris test environment, where DTrace is more mature. In my tests this was installed under VirtualBox too.
Conclusions
DTrace is a powerful tool for exploring systems performance. In addition, DTrace can explore and snoop inside the workings of user-space programs. This opens the door to many uses for Oracle database investigations, troubleshooting and performance tuning. This short article just scratches the surface of what is possible when combining DTrace with Oracle's wait event interface. The author believes this is an area that deserves further exploration. DTrace for Linux is available and there are currently two ports; the Linux port used for these tests was sufficient in functionality but not yet ready for production usage.
↧