Hi There,
We have an ETL process that we kindly need your help with. The process has been running since Sunday, transferring data from one server to another via a remote query over a database link. It was running fine until last night, when it appeared to stop working and/or the session started idling, doing nothing.
Here are some tests that we did to figure out what's going on:
1. When looking at the session I/O, we noticed that it is not changing:
etl_user@datap> select sess_io.sid,
2 sess_io.block_gets,
3 sess_io.consistent_gets,
4 sess_io.physical_reads,
5 sess_io.block_changes,
6 sess_io.consistent_changes
7 from v$sess_io sess_io, v$session sesion
8 where sesion.sid = sess_io.sid
9 and sesion.username is not null
10 and sess_io.sid=301
11 order by 1;
  SID BLOCK_GETS CONSISTENT_GETS PHYSICAL_READS BLOCK_CHANGES CONSISTENT_CHANGES
----- ---------- --------------- -------------- ------------- ------------------
  301  388131317        97721268       26687579     223052804             161334
Elapsed: 00:00:00.01
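One extra check we are planning here, to confirm the counters really are static rather than just moving slowly, is to see how long the session's current call has been running (our assumption being that LAST_CALL_ET keeps climbing while the statement sits on one wait):
-- how long has the current call been going, and what is it waiting on?
select sid, status, event, seconds_in_wait, last_call_et
from   v$session
where  sid = 301;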
2. Check that there is nothing blocking the session:
etl_user@datap> select * from v$lock where sid=301;
ADDR KADDR SID TY ID1 ID2 LMODE REQUEST CTIME BLOCK
-------- -------- ---------- -- ---------- ---------- ---------- ---------- ---------- ----------
684703F0 6847041C 301 DX 35 0 1 0 45237 0
684714C4 684714F0 301 AE 199675 0 4 0 260148 0
619651EC 6196521C 301 TM 52733 0 3 0 45241 0
67F86ACC 67F86B0C 301 TX 458763 52730 6 0 45241 0
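We are also thinking of double-checking the blocking situation directly from v$session rather than decoding v$lock by hand; as far as we understand, BLOCKING_SESSION_STATUS/BLOCKING_SESSION should be populated if anything were actually holding us up:
-- should return NO HOLDER / null if nothing is blocking SID 301
select sid, blocking_session_status, blocking_session, event
from   v$session
where  sid = 301;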
3. Check if the session is still valid:
etl_user@datap> select status from v$session where sid=301;
STATUS
--------
ACTIVE
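To see what the session is actually attached to, we also intend to pull its current (or previous) SQL; the join below is just our sketch and assumes the cursor is still in the shared pool:
-- outer join so we still get a row even if the cursor has been aged out
select s.sid, s.sql_id, s.prev_sql_id, q.sql_text
from   v$session s, v$sql q
where  s.sid = 301
and    q.sql_id (+) = s.sql_id
and    q.child_number (+) = 0;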
4. Check if there is anything in long ops that has not completed:
etl_user@datap> SELECT SID, SERIAL#, opname, SOFAR, TOTALWORK,
2 ROUND(SOFAR/TOTALWORK*100,2) COMPLETE, TIME_REMAINING/60
3 FROM V$SESSION_LONGOPS
4 WHERE
5 TOTALWORK != 0
6 AND SOFAR != TOTALWORK
7 order by 1;
no rows selected
Elapsed: 00:00:00.00
5. Check if there is anything in long ops for the session:
etl_user@datap> r
1* select SID,SOFAR,TOTALWORK,START_TIME,LAST_UPDATE_TIME,TIME_REMAINING,MESSAGE from V$SESSION_LONGOPS where sid=301
SID SOFAR TOTALWORK START_TIM LAST_UPDA TIME_REMAINING MESSAGE
----- ---------- ---------- --------- --------- -------------- ------------------------------------------------------------------------------
301 0 0 22-JUL-12 22-JUL-12 Gather Table's Index Statistics: Table address_etl : 0 out of 0 Indexes done
Elapsed: 00:00:00.00
This is a bit odd! This particular step actually completed successfully on the 22nd of July, and we don't know why it is still showing in long ops. Any ideas?
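In case it helps, the next thing we plan to look at is the full long-ops history for the session ordered by LAST_UPDATE_TIME, to see whether this really is just a stale entry left over from the 22nd (the column list is simply our guess at what matters):
-- most recently updated long operations for the session first
select opname, target, sofar, totalwork, start_time, last_update_time, message
from   v$session_longops
where  sid = 301
order  by last_update_time desc;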
6. Looking at the SQL and what it is actually doing:
etl_user@datap> select a.sid, a.value session_cpu, c.physical_reads,
2 c.consistent_gets,d.event,
3 d.seconds_in_wait
4 from v$sesstat a,v$statname b, v$sess_io c, v$session_wait d
5 where a.sid= &p_sid_number
6 and b.name = 'CPU used by this session'
7 and a.statistic# = b.statistic#
8 and a.sid=c.sid
9 and a.sid=d.sid;
Enter value for p_sid_number: 301
old 5: where a.sid= &p_sid_number
new 5: where a.sid= 301
  SID SESSION_CPU PHYSICAL_READS CONSISTENT_GETS EVENT                          SECONDS_IN_WAIT
----- ----------- -------------- --------------- ------------------------------ ---------------
  301     1966595       26687579        97721268 SQL*Net message from dblink              45792
Elapsed: 00:00:00.03
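We also want to confirm that the session has been stuck on this one wait rather than cycling through different waits; our understanding is that v$session_wait_history keeps the last 10 waits per session, so something like this should show whether anything else has happened recently:
-- last 10 waits recorded for the session
select seq#, event, wait_time, p1, p2, p3
from   v$session_wait_history
where  sid = 301
order  by seq#;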
7. We looked at the remote DB where the data resides, and we noticed that the remote session is also just waiting on the network (SQL*Net message from client):
SYS@destp> select a.sid, a.value session_cpu, c.physical_reads,
2 c.consistent_gets,d.event,
3 d.seconds_in_wait
4 from v$sesstat a,v$statname b, v$sess_io c, v$session_wait d
5 where a.sid= &p_sid_number
6 and b.name = 'CPU used by this session'
7 and a.statistic# = b.statistic#
8 and a.sid=c.sid
9 and a.sid=d.sid;
Enter value for p_sid_number: 388
old 5: where a.sid= &p_sid_number
new 5: where a.sid= 390
SID SESSION_CPU PHYSICAL_READS CONSISTENT_GETS EVENT SECONDS_IN_WAIT
---------- ----------- -------------- --------------- ---------------------------------------------------------------- ---------------
390 136 0 7605 SQL*Net message from client 46101
SYS@destp>
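On the remote side (as SYS on destp) we are planning a similar sanity check, i.e. how long the remote session's current call has been running and what it last executed (again, just our sketch):
-- remote session's current call duration and last SQL
select sid, status, last_call_et, sql_id, prev_sql_id, event, seconds_in_wait
from   v$session
where  sid = 390;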
We have had an issue in the past where the connection was dropped by the network when the process ran for a few days, so we added the following to the sqlnet.ora and listener.ora files:
sqlnet.ora:
SQLNET.EXPIRE_TIME = 1
SQLNET.INBOUND_CONNECT_TIMEOUT = 6000
listener.ora:
INBOUND_CONNECT_TIMEOUT_LISTENER = 6000
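From our reading of the Net Services documentation (not something we have applied yet, so please correct us if this is off track), the send/receive timeouts below might also be worth considering, on the theory that a genuinely dead connection would then at least error out instead of hanging forever:
sqlnet.ora (being considered only, values in seconds):
SQLNET.SEND_TIMEOUT = 600
SQLNET.RECV_TIMEOUT = 600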
What else can we do and/or investigate further to work out the root cause of the problem, and maybe help resolve it? We don't want to just stop and restart the process, as it has already been running for a few days. We have had a chat with the infrastructure team and they have assured us that there have been no network outages.
Also, the alert logs for both instances (local and remote) show no errors whatsoever!
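In the meantime, before deciding whether to kill anything, the one thing we plan to keep monitoring is the open transaction itself; our assumption is that if any real work were still happening, USED_UBLK/USED_UREC in v$transaction would keep moving:
-- the session's open transaction and its undo usage
select t.xidusn, t.xidslot, t.xidsqn, t.status, t.start_time,
       t.used_ublk, t.used_urec
from   v$transaction t, v$session s
where  s.sid = 301
and    t.addr = s.taddr;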
Your input is highly appreciated.
Thanks