一次library cache pin故障的解决过程

浏览数：24 / 时间：2015年06月20日

在dbsnake 上看到的这篇文章，转过来。主要还是学习解决问题的一个思路。这个往往比问题的解决更重要。

原文链接如下：

http://dbsnake.com/2010/06/solve-library-cache-pin.html

内容如下：

今天接到同事的电话，说他的一个存储过程已经run了一个多小时了，还在继续run，他觉得极不正常，按道理说不应该run这么长时间。

我说那我去看一下吧。

这个库是一个AIX上的10.2.0.4，我采集了一下问题时间段的AWR报告：

Begin Snap:	13302	11-Jun-10 12:00:56	109	4.7
End Snap:	13303	11-Jun-10 13:00:02	97	4.9
Elapsed:		59.10 (mins)
DB Time:		113.98 (mins)

Top 5事件为：

Event	Waits	Time(s)	Avg Wait(ms)	% Total Call Time	Wait Class
library cache pin	1,252	3,656	2,920	53.5	Concurrency
library cache lock	989	2,890	2,922	42.3	Concurrency
CPU time		219		3.2
db file sequential read	5,694	12	2	.2	User I/O
log file parallel write	1,467	11	8	.2	System I/O

从AWR报告结果里我们可以看出在出问题的时间段，系统在经历严重的library cache pin以及library cache lock等待。

从Load Profile中我们又可以看到：

	Per Second	Per Transaction
Parses:	12.83	65.83
Hard parses:	0.05	0.25

也就是说导致上述library cache pin和library cache lock的并不是hard parse。

对于library cache pin等待来说，AWR报告的作用有限，最有效的方式就是找到持有library cache pin以及等待library cache pin的session，然后看看他们在做什么：

SQL> SELECT s.sid, kglpnmod "Mode", kglpnreq "Req", SPID "OS Process"

2 FROM v$session_wait w, x$kglpn p, v$session s ,v$process o

3 WHERE p.kglpnuse=s.saddr

4 AND kglpnhdl=w.p1raw

5 and w.event like ‘%library cache pin%‘

6 and s.paddr=o.addr

7 /

SID Mode Req OS Process

---------- ---------- ---------- ------------

396 0 2 6381970

341 2 0 4092132

363 0 2 3514690

304 0 2 3977478

354 0 3 3137874

20 rows selected

其中:

KGLPNHDL --- Library Cache Handle Address

KGLPNADR --- Library Cache Pin Address.

KGLPNSES --- 识别锁住此pin 的session

KGLPNREQ --- Pin 请求

KGNMODE --- Pin 锁

我那位run存储过程的同事所在的session是396，从上述结果里我们可以看出来396现在想以Share模式（即Req=2）去持有library cache pin，同时现在持有上述library cache pin的是session 341，且341的持有模式也是Share（即Mode=2）。

本来Share和Share是可以共享的，但不幸的是在396之前，session 354想以Exclusive模式（即Req=3）去持有上述library cache pin，这直接导致了396需要处于等待的Queue中，同时处于Queue中的还有363和304。

我为什么这么说呢，因为oracle对library cache pin的解释中有一句非常经典的话：

An X request (3) will be blocked by any pins held S mode (2) on the object.
An S request (2) will be blocked by any X mode (3) pin held, or may queue behind some other X request.

所以从AWR报告和上述查询结果中我们可以得出如下结论：

1、我那位run存储过程的同事为什么run了1个多小时还没有run完是因为这个存储过程正在经历严重的library cache pin等待；

2、而为什么会导致严重的library cache pin等待是因为session 341和354联手达到了这一效果，即341以Share模式持有library cache pin，接着354想以Exclusive模式持有，这直接导致所有的后续请求全部被处于等待的Queue中。也就是说341阻塞了354，而354又间接阻塞了396。

既然知道了原因，那我们去看一下session 341在做什么事情：

SQL> select decode(sql_hash_value,0,prev_hash_value,sql_hash_value) from v$session where sid=341;

DECODE(SQL_HASH_VALUE,0,PREV_H

------------------------------

784727971

间隔10秒钟后再次执行：

SQL> select decode(sql_hash_value,0,prev_hash_value,sql_hash_value) from v$session where sid=341;

DECODE(SQL_HASH_VALUE,0,PREV_H

------------------------------

784727971

间隔10秒钟后再次执行：

SQL> select decode(sql_hash_value,0,prev_hash_value,sql_hash_value) from v$session where sid=341;