Understanding Linux Load Average – Part 3
Posted by Harald van Breederode on May 28, 2012
In part 1 we performed a series of experiments to explore the relationship between CPU utilization and the Linux load average. We concluded that the load average is influenced by processes running on or waiting for the CPU. Based on the experiments in part 2 we concluded that processes performing disk I/O also influence the load average on a Linux system. In this posting we will do another experiment to find out whether the Linux load average is also affected by processes performing network I/O.
Network I/O and load average
To check if a correlation exists between processes performing network I/O and the load average, we will start 10 processes generating network I/O on an otherwise idle system and collect various performance-related statistics using the sar command. Note: My load-gen script uses the ping command to generate network I/O.
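The load-gen script itself is not listed here (feel free to send me an email if you want a copy), but a minimal sketch of what its network part could look like is shown below. Everything in this sketch except the use of ping is an assumption made for illustration; the real script may do things differently.

#!/bin/bash
#
# Minimal sketch of a "load-gen network"-style helper. This is not the
# actual load-gen script; the name, options and use of flood pings are
# assumptions for illustration. The only given is that ping generates
# the network I/O.
#
# Usage: net-load.sh <number-of-processes>

COUNT=${1:-10}

for i in $(seq 1 "$COUNT")
do
  # Flood-ping localhost in the background so all traffic stays on the
  # loopback interface; -f normally requires root, -q keeps it quiet.
  ping -f -q localhost >/dev/null 2>&1 &
done

echo "Starting $COUNT network load processes."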
$ load-gen network 10
Starting 10 network load processes.
$ sar -n DEV 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com)        04/21/2012

09:38:01 PM     IFACE   rxpck/s   txpck/s       rxbyt/s       txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
09:38:31 PM        lo  88953.60  88953.60  135920963.87  135920963.87      0.00      0.00      0.00
09:38:31 PM      eth1      0.13      0.17         11.33         62.33      0.00      0.00      0.00
09:38:31 PM      eth0      0.00      0.00          0.00          0.00      0.00      0.00      0.00

09:38:31 PM     IFACE   rxpck/s   txpck/s       rxbyt/s       txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
09:39:01 PM        lo  89295.13  89295.13  136442626.93  136442626.93      0.00      0.00      0.00
09:39:01 PM      eth1      0.03      0.03          2.60         48.33      0.00      0.00      0.00
09:39:01 PM      eth0      0.00      0.00          0.00          0.00      0.00      0.00      0.00

09:39:01 PM     IFACE   rxpck/s   txpck/s       rxbyt/s       txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
09:39:31 PM        lo  89364.38  89364.38  136548566.91  136548566.91      0.00      0.00      0.00
09:39:31 PM      eth1      0.10      0.10          7.34         47.30      0.00      0.00      0.03
09:39:31 PM      eth0      0.00      0.00          0.00          0.00      0.00      0.00      0.00

09:39:31 PM     IFACE   rxpck/s   txpck/s       rxbyt/s       txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
09:40:01 PM        lo  89410.80  89410.80  136619365.60  136619365.60      0.00      0.00      0.00
09:40:01 PM      eth1      0.03      0.03          2.60         48.33      0.00      0.00      0.00
09:40:01 PM      eth0      0.00      0.00          0.00          0.00      0.00      0.00      0.00

09:40:01 PM     IFACE   rxpck/s   txpck/s       rxbyt/s       txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
09:40:31 PM        lo  89502.30  89502.30  136759314.53  136759314.53      0.00      0.00      0.00
09:40:31 PM      eth1      0.23      0.27         20.60         59.33      0.00      0.00      0.00
09:40:31 PM      eth0      0.00      0.00          0.00          0.00      0.00      0.00      0.00

09:40:31 PM     IFACE   rxpck/s   txpck/s       rxbyt/s       txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
09:41:01 PM        lo  89551.52  89551.52  136834718.24  136834718.24      0.00      0.00      0.00
09:41:01 PM      eth1      0.03      0.03          2.60         48.35      0.00      0.00      0.00
09:41:01 PM      eth0      0.00      0.00          0.00          0.00      0.00      0.00      0.00

Average:        IFACE   rxpck/s   txpck/s       rxbyt/s       txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
Average:           lo  89346.27  89346.27  136520905.51  136520905.51      0.00      0.00      0.00
Average:         eth1      0.09      0.11          7.85         52.33      0.00      0.00      0.01
Average:         eth0      0.00      0.00          0.00          0.00      0.00      0.00      0.00
The above output shows that the lo interface sent and received almost 90 thousand packets per second, amounting to roughly 136 million bytes of traffic per second. The other two interfaces had virtually no traffic at all, because my network load processes are pinging localhost. Let’s have a look at the CPU utilization before taking a look at the run-queue utilization and Load Average.
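As a side note: if sar is not available, roughly the same counters can be read straight from the kernel. Something along these lines should do, although the exact columns differ from the sar output:

$ cat /proc/net/dev    # cumulative per-interface packet and byte counters
$ ip -s link show lo   # the same counters for the loopback interface only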
$ sar -P ALL -u 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com)        04/21/2012

09:38:01 PM     CPU   %user   %nice  %system  %iowait  %steal   %idle
09:38:31 PM     all   13.90    0.00    86.10     0.00    0.00    0.00
09:38:31 PM       0   13.60    0.00    86.40     0.00    0.00    0.00
09:38:31 PM       1   14.17    0.00    85.83     0.00    0.00    0.00

09:38:31 PM     CPU   %user   %nice  %system  %iowait  %steal   %idle
09:39:01 PM     all   13.82    0.00    86.18     0.00    0.00    0.00
09:39:01 PM       0   13.30    0.00    86.70     0.00    0.00    0.00
09:39:01 PM       1   14.37    0.00    85.63     0.00    0.00    0.00

09:39:01 PM     CPU   %user   %nice  %system  %iowait  %steal   %idle
09:39:31 PM     all   13.84    0.00    86.16     0.00    0.00    0.00
09:39:31 PM       0   13.30    0.00    86.70     0.00    0.00    0.00
09:39:31 PM       1   14.37    0.00    85.63     0.00    0.00    0.00

09:39:31 PM     CPU   %user   %nice  %system  %iowait  %steal   %idle
09:40:01 PM     all   13.82    0.00    86.18     0.00    0.00    0.00
09:40:01 PM       0   14.10    0.00    85.90     0.00    0.00    0.00
09:40:01 PM       1   13.53    0.00    86.47     0.00    0.00    0.00

09:40:01 PM     CPU   %user   %nice  %system  %iowait  %steal   %idle
09:40:31 PM     all   13.75    0.00    86.25     0.00    0.00    0.00
09:40:31 PM       0   14.27    0.00    85.73     0.00    0.00    0.00
09:40:31 PM       1   13.20    0.00    86.80     0.00    0.00    0.00

09:40:31 PM     CPU   %user   %nice  %system  %iowait  %steal   %idle
09:41:01 PM     all   13.55    0.00    86.45     0.00    0.00    0.00
09:41:01 PM       0   13.83    0.00    86.17     0.00    0.00    0.00
09:41:01 PM       1   13.27    0.00    86.73     0.00    0.00    0.00

Average:        CPU   %user   %nice  %system  %iowait  %steal   %idle
Average:        all   13.78    0.00    86.22     0.00    0.00    0.00
Average:          0   13.73    0.00    86.27     0.00    0.00    0.00
Average:          1   13.82    0.00    86.18     0.00    0.00    0.00
On average the CPUs spent 14% of their time running code in user mode and 86% running code in kernel mode. This is because the Linux kernel has to work quite hard to handle this amount of network traffic. The question is, of course: what effect does this have on the Load Average?
$ sar -q 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com)        04/21/2012

09:38:01 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
09:38:31 PM        10       319      4.03      1.93      1.86
09:39:01 PM        10       319      6.46      2.72      2.12
09:39:31 PM        10       319      7.85      3.41      2.37
09:40:01 PM        10       319      8.69      4.04      2.61
09:40:31 PM        10       319      9.14      4.59      2.84
09:41:01 PM        10       313      9.55      5.12      3.07
Average:           10       318      7.62      3.63      2.48
The above sar output shows that the run-queue was constantly occupied by 10 processes and that the 1-minute Load Average slowly climbed towards 10, as one might expect by now ;-) This could be an indication that the Load Average is influenced by processes performing network I/O. But maybe the ping processes are using large amounts of CPU time and thereby forcing the Load Average up. To figure this out we will take a look at the top output.
top - 21:41:02 up 11:25,  1 user,  load average: 9.51, 5.19, 3.11
Tasks: 215 total,   9 running, 206 sleeping,   0 stopped,   0 zombie
Cpu(s): 13.6%us, 34.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi, 52.4%si,  0.0%st
Mem:   3074820k total,  2567640k used,   507180k free,   221652k buffers
Swap:  5144568k total,        0k used,  5144568k free,  1161696k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
30118 root      20   0  8124  768  640 R  1.1  0.0   0:00.32 ping
30121 root      20   0  8124  768  640 R  0.9  0.0   0:00.27 ping
30126 root      20   0  8124  772  640 R  0.5  0.0   0:00.15 ping
30127 root      20   0  8124  772  640 R  0.5  0.0   0:00.15 ping
30134 root      20   0  8124  772  640 R  0.4  0.0   0:00.13 ping
30135 root      20   0  8124  768  640 R  0.4  0.0   0:00.11 ping
30136 root      20   0  8124  764  640 R  0.4  0.0   0:00.11 ping
30139 root      20   0  8124  768  640 R  0.2  0.0   0:00.05 ping
27675 hbreeder  20   0 12864 1212  836 R  0.1  0.0   0:00.15 top
It is clear from the above output that the ping processes are not using huge amounts of CPU time at all, which eliminates CPU utilization as the driving force behind the Load Average. The output also reveals that the high CPU utilization is mainly caused by handling software interrupts: 52% in this case.
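For those who want to double-check this without staring at top, the sysstat package offers a couple of handy options. The commands below are a sketch of how one could verify it; the exact columns depend on the sysstat version in use:

$ pidstat -u -C ping 30 1           # CPU usage of every process whose name contains "ping"
$ mpstat -P ALL 30 1                # per-CPU breakdown including %irq and %soft
$ watch -d -n 1 cat /proc/softirqs  # NET_RX/NET_TX counters climbing confirms network softirqs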
Conclusion
Based on this experiment we can conclude that processes performing network I/O have an effect on the Linux Load Average. And based on the experiments in the previous two postings we concluded that processes running on, or waiting for, the CPU and processes performing disk I/O also have an effect on the Linux Load Average. Thus the three factors that drive the Load Average on a Linux system are processes that are on the run-queue because they (see the sketch after this list for a quick way to spot them):
- Run on, or are waiting for, the CPU
- Perform disk I/O
- Perform network I/O
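A quick way to see this on a live system: roughly speaking, the tasks that feed the Linux Load Average are the ones in process state R (running on or waiting for the CPU) and state D (uninterruptible sleep, usually I/O). A simple ps sketch along these lines shows how many of each there are and which commands they belong to:

$ ps -eo state= | sort | uniq -c              # count processes per state
$ ps -eo pid,state,comm | awk '$2 ~ /^[RD]/'  # list the tasks in state R or D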
Summary
The Linux Load Average is driven by the three factors mentioned above, but how does one interpret a Load Average that seems too high? The first step is to look at the CPU utilization. If this isn’t 100% and the Load Average is above the number of CPUs in the system, the Load Average is primarily driven by processes performing disk I/O, network I/O, or a combination of both. Finding the processes responsible for most of the I/O isn’t straightforward because there aren’t many tools available to assist you. A very useful tool is iotop, but it doesn’t seem to work on Oracle Linux 5; it does work on Oracle Linux 6, however. Another tool is atop, but it requires one or more kernel patches to be useful.
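Two more options worth trying, assuming the kernel has per-task I/O accounting enabled: recent sysstat versions ship pidstat, which can report per-process disk I/O, and iotop can be told to show only the processes that are actually doing I/O:

$ pidstat -d 30 6   # per-process read/write throughput, sampled like sar
$ iotop -o          # only show processes or threads currently doing I/O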
If the CPU utilization is 100% and the Load Average is above the number of CPUs in the system, the Load Average is either driven entirely by processes running on, or waiting for, the CPU, or by a combination of such processes and processes performing I/O (which could in turn be a combination of disk and network I/O). Using top is an easy way to verify whether CPU utilization is indeed solely responsible for the current Load Average or whether the other two factors play a role as well. Knowing your system helps a lot when it comes to troubleshooting performance problems, and taking performance baselines using sar is always a good thing to do.
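On Oracle Linux the sysstat package normally installs a cron job (/etc/cron.d/sysstat) that samples the system every 10 minutes and stores the results under /var/log/sa. Assuming that default setup is in place, historical data can be pulled from those daily files with the -f option; for example (the file name is the day of the month):

$ sar -q -f /var/log/sa/sa21      # run-queue and load average history
$ sar -n DEV -f /var/log/sa/sa21  # network statistics from the same data file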
-Harald
samirashukla said
Excellent article. Very easy to understand.
thanks a lot
Samir
Harald van Breederode said
Hi Samir,
Thanx for your positive feedback.
-Harald
Narendra said
Harald,
Thanks for all three articles explaining the load average so nicely. I especially liked the fact that you also provided details of how it should be measured and actual statistics to support the conclusions. Can you please also share the “load-gen” script? That would be icing on the cake!!!
Thanks again…
Harald van Breederode said
Hi Narendra,
Thanx for your positive comment.
Feel free to send me an email to request the load-gen scripts I used.
-Harald
Kris said
Awesome series Harald!
Harald van Breederode said
Thank you Kris.
-Harald
Frits Hoogland said
Harald, you might want to add that ‘sar’ is in the ‘sysstat’ package. Upon installation (using the sysstat rpm package on Linux), sar collects data every 10 minutes (via the configuration file /etc/cron.d/sysstat). This data is stored in /var/log/sa.
Harald van Breederode said
Hi Frits,
Thanx for your comment.
-Harald
jason smith said
Good stuff.
I see you’re using the UEK kernel. I’m particularly trying to find out why the UEK kernel shows better load averages than its RHEL counterpart. If you install Oracle Linux you get both kernels, and in 6 even the RHEL kernel performs better than it did back in 5. However, at least on our production systems, the UEK kernel reports much lower load averages.
Have you done any testing or know exactly why the UEK kernel shows better load average numbers vs the RHEL stock kernels?
Thanks,
Harald van Breederode said
Hi Jason,
No, I haven’t looked at the differences between the stock and UEK kernels. I doubt there is a change in how the load averages are calculated. I think the UEK reports lower load averages simply because it is a way more optimized kernel.
-Harald
jason smith said
“…because it is a way more optimized kernel.” – that’s exactly our explanation :)
But yeah, our load averages w/ UEK are significantly lower. I’ve run w/ the RHEL kernel and made hot /proc changes, e.g. to the CPU scheduler and I/O scheduler settings that Oracle defaults to, and seen differences in system behavior.
siva said
Hi ,
Excellent article. Could you please share the load-gen scripts?
siva
luciano said
Excellent article.
I have just one comment: in the third part it says “The above output also reveals that the high CPU utilization is mainly caused by handling software interrupts, 52% in this case.” According to the mpstat manual, system CPU time doesn’t take soft or hard interrupts into account.
Adrian said
Hello Harald,
very good article.
Just one interesting note.
The Load Average on Linux also includes processes in uninterruptible sleep states. So a high Load Average on Linux does not always reflect CPU or I/O activity.
Adrian
Harald van Breederode said
Hi Adrian,
Thanx for your comment. I didn’t know that. Can you give examples of when a process enters an uninterruptible sleep state? (It has been a long time since I knew exactly what was going on in the kernel ;-) Maybe I am able to demonstrate this behaviour.
-Harald
Stefan said
Hi Harald,
you have already demonstrated the behaviour. The uninterruptible state is used for short-term waits, and disk I/O falls into this category. Look for the section “process state codes” in the manpage for “ps” and you will find the state “D” for “Uninterruptible sleep (usually IO)”. And surely you have seen LGWR or other I/O-intensive processes in this state. So this has already been part of your analysis.
Regards,
Stefan
Maurilio said
Congratulations, it’s a great explanation of load average. Now it is all clear to me.
Thanks a lot.
Vasudevan Rao said
It was a very interesting article, explained in minute detail. It is very well explained in the Unix and Linux System Administration Handbook as well. Your article is indeed superb with the command output inline. The conclusion and summary parts are excellent. If possible, please post the load-gen script in your article, either as a bash or a Perl script.
Rupesh said
Awesome series, Sir. One of my friends said that his server has a load average of more than 500 and sometimes it goes up to 1000. Is such a load possible on an 8-core, 16 GB RAM server? Sorry for my English.
Harald van Breederode said
Hi Rupesh,
Thank you for your question. Yes, it is quite possible for the Load Average to be above 500 or even above 1000. It does not matter how many CPUs or how much memory your system has. During my research I managed to drive the Load Average above 1200 on a dual-core system with only 4 GBytes of memory…
-Harald
Leandro said
Amazing post, Sir! Really nice explanation, and also all the proof. Could you please send me the scripts to my email? Thanks in advance, Sir!
Leo.
Carlos Martinez said
Thank you sir, it was pretty helpful. If you can share the load-gen scripts it would be awesome. BTW, I had a case where there was a high load average (5 5 5) while the CPU %idle was around 97%, there was no swapping, physical memory was fine, and the disks were fine too; they had practically no utilization, although I found they were having little peaks of iowait, really very few. And then I found a couple of processes in D state in the top output. I suspect that what is causing the high load average are those processes in D state, but the weirdest thing is that even if I reboot the server those processes come back again :S
And actually those processes belong to the OS, or at least that’s what I think; the owner is root and the command names per process are PAL_EVENT ERROR RETRY PARSE CMPLT IDLE. This is happening on CentOS 6.5.
If you can give me a clue about this I would appreciate it a lot.
Thanks!
Carlos.