Oracle SQL CPU占用高
Oracle数据库经常会遇到CPU利用率很高的情况,这种时候大都是数据库中存在着严重性能低下的SQL语句,这种SQL语句大大的消耗了CPU资源,导致整个系统性能低下。当然,引起严重性能低下的SQL语句的原因是多方面的,具体的原因要具体的来分析,下面通过一个实际的案例来说明如何来诊断和解决CPU利用率高的这类问题。
操作系统:solairs8
数据库:Oracle9.2.0.4
问题描述:现场工程师汇报数据库非常慢,几乎所有应用操作均无法正常进行。
首先登陆主机,执行top发现CPU资源几乎消耗殆尽,存在很多占用CPU很高的进程,而内存和I/O都不高,具体如下:
last
pid: 26136; load averages: 8.89, 8.91, 8.12
216
processes: 204 sleeping, 8 running, 4 on cpu
CPU states: 0.6%
idle, 97.3% user, 1.8% kernel, 0.2% iowait, 0.0% swap
Memory:
8192M real, 1166M free, 14M swap in use, 8179M swap free
PID
USERNAME THR PRI NICE SIZE RES STATE TIME CPU COMMAND
25725
oracle 1 50 0 4550M 4508M cpu2 12:23 11.23% oracle
25774
oracle 1 41 0 4550M 4508M run 14:25 10.66% oracle
26016 oracle
1 31 0 4550M 4508M run 5:41 10.37% oracle
26010 oracle 1 41 0
4550M 4508M run 4:40 9.81% oracle
26014 oracle 1 51 0 4550M
4506M cpu6 4:19 9.76% oracle
25873 oracle 1 41 0 4550M 4508M
run 12:10 9.45% oracle
25723 oracle 1 50 0 4550M 4508M run
15:09 9.40% oracle
26121 oracle 1 41 0 4550M 4506M cpu0 1:13
9.28%
oracle
于是先查看数据库的告警日志ALERT文件,并没有发现有什么错误存在,日志显示数据库运行正常,排除数据库本身存在问题。
然后查看这些占用CPU资源很高的Oracle进程究竟是在做什么操作,使用如下SQL语句:
select
sql_text,spid,v$session.program,process
from
v$sqlarea,v$session,v$process
where
v$sqlarea.address=v$session.sql_address
and
v$sqlarea.hashvalue=v$session.sql_hash_value
and
v$session.paddr=v$process.addr
and v$process.spid in
(PID);
用top中占用CPU很高的进程的PID替换脚本中的PID,得到相应的Oracle进程所执行的SQL语句,发现占用CPU资源很高的进程都是执行同一个SQL语句:
SELECT
d.domainname,d.mswitchdomainid,
a.SERVICEID,a.SERVICECODE,a.USERTYPE,a.STATUS,a.NOTIFYSTATUS,to_char(a.DATECREATED,‘yyyy-mm-dd
hh24:mi:ss‘) DATECREATED,VIPFLAG,STATUS2,CUSTOMERTYPE,CUSTOMERID FROM service a,
gatewayloc b, subbureaunumber c, mswitchdomain d WHERE b.mswitchdomainid =
d.mswitchdomainid and b.gatewaysn = c.gatewaysn AND a.ServiceCode like
c.code||‘%‘ and a.serviceSpecID=1 and a.status!=‘4‘ and a.status!=‘10‘ and
a.servicecode like ‘010987654321%‘ and
SubsidiaryID=999999999
基本上可以肯定是这个SQL引起了系统CPU资源大量被占用,那究竟是什么原因造成这个SQL这么大量占用CPU资源呢,我们先来看看数据库的进程等待事件都有些什么:
SQL>
select sid,event,p1,p1text from v$session_wait;
SID EVENT P1
P1TEXT
----------
----------------------------------------------------------------
12 latch free 4.3982E+12 address
36 latch free 4.3982E+12
address
37 latch free 4.3982E+12 address
84 latch free 4.3982E+12 address
102 latch free 4.3982E+12
address
101 latch free 4.3982E+12 address
85 latch free 4.3982E+12 address
106 latch free 4.3982E+12
address
155 latch free 4.3982E+12 address
151 latch free 4.3982E+12 address
149 latch free 4.3982E+12
address
147 latch free 4.3982E+12 address
1 pmon timer 300 duration
从上面的查询我们可以看出,大都是latch
free的等待事件,然后接着查一下这些latch的等待都是什么进程产生的:
SQL> select spid from
v$process where addr in
(select paddr from v$session where sid
in(84,102,101,106,155,151));
SPID
------------
25774
26010
25873
25725
由此看出latch
free这个等待事件导致了上面的那个SQL语句都在等待,占用了大量的CPU资源。我们来看看究竟主要是那种类型的latch的等待,根据下面的SQL语句:
SQL>
SELECT latch#, name, gets, misses, sleeps
FROM
v$latch
WHERE sleeps>0
ORDER BY
sleeps;
LATCH# NAME GETS MISSES
SLEEPS
----------
----------------------------------------------------------------
15 messages 96876 20 1
159 library cache pin allocation
407322 43 1
132 dml lock allocation 194533 213
2
4 session allocation 304897 48
3
115 redo allocation 238031 286
4
17 enqueue hash chains 277510 85
5
7 session idle bit 2727264 314
16
158 library cache pin 3881788 5586
58
156 shared pool 2771629 6184
662
157 library cache 5637573 25246
801
98 cache buffers chains 1722750424 758400
109837
由上面的查询可以看出最主要的latch等待是cache buffers
chains,这个latch的等待表明数据库存在单独的BLOCK的竞争这些latch,我们来看这个latch存在的子latch及其对应的类型:
SQL>
SELECT addr, latch#, gets, misses, sleeps
FROM
v$latch_children
WHERE
sleeps>0
and latch# = 98
ORDER BY sleeps desc;
ADDR LATCH# GETS MISSES
SLEEPS
---------------- ---------- ---------- ----------
----------
000004000A3DFD10 98 10840661 82891
389
000004000A698C70 98 159510 2
244
0000040009B21738 98 104269771 34926
209
0000040009B227A8 98 107604659 35697
185
000004000A3E0D70 98 5447601 18922
156
000004000A6C2BD0 98 853375 7
134
0000040009B24888 98 85538409 25752
106
……………
接着我们来查看sleep较多的子latch对应都有哪些对象:
SQL>
select distinct a.owner,a.segment_name,a.segment_type
from
dba_extents a,
(select
dbarfil,dbablk
from x$bh
where
hladdr in
(select addr
from (select
addr
from v$latch_children
order by sleeps desc)
where rownum < 5))
b
where a.RELATIVE_FNO = b.dbarfil
and
a.BLOCK_ID <= b.dbablk and a.block_id + a.blocks >
b.dbablk;
OWNER SEGMENT_NAME
SEGMENT_TYPE
---------------------------------------------------------------------------
TEST
I_SERVICE_SERVICESPECID INDEX
TEST I_SERVICE_SUBSIDIARYID
INDEX
TEST SERVICE TABLE
TEST MSWITCHDOMAIN
TABLE
TEST I_SERVICE_SC_S
INDEX
…………………
我们看到在开始的那个SQL语句中的几个对象都有包括在内,于是来看看开始的那个SQL的执行计划:
SQL>
set autotrace trace explain
SQL>SELECT
d.domainname,d.mswitchdomainid,
a.SERVICEID,a.SERVICECODE,a.USERTYPE,a.STATUS,a.NOTIFYSTATUS,to_char(a.DATECREATED,‘yyyy-mm-dd
hh24:mi:ss‘) DATECREATED,VIPFLAG,STATUS2,CUSTOMERTYPE,CUSTOMERID FROM service a,
gatewayloc b, subbureaunumber c, mswitchdomain d WHERE b.mswitchdomainid =
d.mswitchdomainid and b.gatewaysn = c.gatewaysn AND a.ServiceCode like
c.code||‘%‘ and a.serviceSpecID=1 and a.status!=‘4‘ and a.status!=‘10‘ and
a.servicecode like ‘010987654321%‘ and
SubsidiaryID=999999999;
Execution
Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE
1 0 NESTED
LOOPS
2 1 NESTED LOOPS
3 2 NESTED
LOOPS
4 3 TABLE ACCESS (FULL) OF
‘SUBBUREAUNUMBER‘
5 3 TABLE ACCESS (BY INDEX ROWID) OF
‘GATEWAYLOC‘
6 5 INDEX (UNIQUE SCAN) OF ‘PK_GATEWAYLOC‘
(UNIQUE)
7 2 TABLE ACCESS (BY INDEX ROWID) OF
‘MSWITCHDOMAIN‘
8 7 INDEX (UNIQUE SCAN) OF ‘PK_MSWITCHDOMAIN‘
(UNIQUE)
9 1 TABLE ACCESS (BY INDEX ROWID) OF
‘SERVICE‘
10 9 AND-EQUAL
11 10 INDEX
(RANGE SCAN) OF ‘I_SERVICE_SERVICESPECID‘ (NON
-UNIQUE)
12 10 INDEX (RANGE SCAN) OF
‘I_SERVICE_SUBSIDIARYID‘ (NON-
UNIQUE)
根据开始查到的引起latch
free等待中的对象和SQL语句的执行计划,觉得SERVICE表上的索引有问题,似乎存在了过多的扫描,于是将同样的SQL语句在别的地市的同样的数据库上执行一下,查看相应的执行计划:
SQL>
set autotrace trace explain
SQL>SELECT
d.domainname,d.mswitchdomainid,
a.SERVICEID,a.SERVICECODE,a.USERTYPE,a.STATUS,a.NOTIFYSTATUS,to_char(a.DATECREATED,‘yyyy-mm-dd
hh24:mi:ss‘) DATECREATED,VIPFLAG,STATUS2,CUSTOMERTYPE,CUSTOMERID FROM service a,
gatewayloc b, subbureaunumber c, mswitchdomain d WHERE b.mswitchdomainid =
d.mswitchdomainid and b.gatewaysn = c.gatewaysn AND a.ServiceCode like
c.code||‘%‘ and a.serviceSpecID=1 and a.status!=‘4‘ and a.status!=‘10‘ and
a.servicecode like ‘010987654321%‘ and
SubsidiaryID=999999999;
Execution
Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE
1 0 TABLE ACCESS (BY
INDEX ROWID) OF ‘SERVICE‘
2 1 NESTED LOOPS
3 2 NESTED LOOPS
4 3 NESTED LOOPS
5 4
TABLE ACCESS (FULL) OF ‘SUBBUREAUNUMBER‘
6 4 TABLE ACCESS (BY
INDEX ROWID) OF ‘GATEWAYLOC‘
7 6 INDEX (UNIQUE SCAN) OF
‘PK_GATEWAYLOC‘ (UNIQUE)
8 3 TABLE ACCESS (BY INDEX ROWID) OF
‘MSWITCHDOMAIN‘
9 8 INDEX (UNIQUE SCAN) OF ‘PK_MSWITCHDOMAIN‘
(UNIQUE)
10 2 INDEX (RANGE SCAN) OF ‘I_SERVICE_SC_S‘
(NON-UNIQUE)
对比两个执行计划,发现索引I_SERVICE_SERVICESPECID和I_SERVICE_SUBSIDIARYID是不应该走的,于是又对比了两个地方SERVICE表上的索引个数:
SQL>
select index_name from user_indexes where
table_name=‘SERVICE‘;
INDEX_NAME
------------------------------
I_SERVICE_ACCOUNTNUM
I_SERVICE_CID
I_SERVICE_DATEACTIVATED
I_SERVICE_PRICEPLANID
I_SERVICE_SC_S
I_SERVICE_SERVICECODE
I_SERVICE_SERVICESPECID
I_SERVICE_SUBSIDIARYID
PK_SERVICE_SID
SQL>
select index_name from user_indexes where
table_name=‘SERVICE‘;
INDEX_NAME
------------------------------
I_SERVICE_ACCOUNTNUM
I_SERVICE_CID
I_SERVICE_DATEACTIVATED
I_SERVICE_SC_S
I_SERVICE_SERVICECODE
PK_SERVICE_SID
发现存在问题的数据库中的SERVICE表上不知道怎么多出了I_SERVICE_PRICEPLANID、I_SERVICE_SERVICESPECID
、I_SERVICE_SUBSIDIARYID三个索引,而这些索引就是导致了开始那个SQL语句用了不该用的索引,引起latch
free等待和CPU占用很高的罪魁祸首,于是删除了那三个索引,重新执行相应的SQL语句,很快就得出了结果,CPU的利用率也马上下降为正常了,观察结果如下:
last
pid: 26387; load averages: 1.61, 1.38, 1.21
195
processes: 194 sleeping, 1 on cpu
CPU states: 96.2% idle, 1.6%
user, 1.7% kernel, 0.5% iowait, 0.0% swap
Memory: 8192M real,
1183M free, 14M swap in use, 8179M swap free
PID USERNAME THR
PRI NICE SIZE RES STATE TIME CPU COMMAND
26383 oracle 1 59 0
4550M 4506M sleep 0:12 4.52% oracle
409 root 15 59 0 7168K
7008K sleep 173.1H 0.53% picld
25653 oracle 1 59 0 4550M 4508M
sleep 2:12 0.48% oracle
26384 root 1 59 0 2800K 1912K cpu2
0:00 0.21% top-3.5b8-sun4u
25569 oracle 1 59 0 4550M 4508M
sleep 0:12 0.09% oracle
25717 oracle 1 59 0 4550M 4507M sleep
0:07 0.05% oracle
25571 oracle 1 59 0 4550M 4507M sleep 0:10
0.04% oracle
25681 oracle 1 59 0 4550M 4508M sleep 0:10 0.04%
oracle
25544 oracle 1 58 0 4554M 4501M sleep 0:14 0.03%
oracle
25703 oracle 1 59 0 4550M 4506M sleep 0:23 0.03%
oracle
………………
对于CPU利用率过高的情况,如果是SQL语句性能比较低下引起的基本上都可以按照这个思路来诊断和解决问题,当然具体问题还得具体分析,解决问题的方法也有很多种,这里不过是抛砖引玉一下,只要能最终达到我们解决问题的目的就可以了。
转载ItEye上的:
http://blog.csdn.net/fffygapl/article/details/7617089
http://blog.csdn.net/rhys_oracle/article/details/11990935
学习:http://blog.csdn.net/rhys_oracle/article/details/11953127
郑重声明:本站内容如果来自互联网及其他传播媒体,其版权均属原媒体及文章作者所有。转载目的在于传递更多信息及用于网络分享,并不代表本站赞同其观点和对其真实性负责,也不构成任何其他建议。