The Dutch Prutser's Blog

By: Harald van Breederode

  • Disclaimer

    The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Understanding Linux Load Average – Part 3

Posted by Harald van Breederode on May 28, 2012

In part 1 we performed a series of experiments to explore the relationship between CPU utilization and the Linux load average. We concluded that the load average is influenced by processes running on or waiting for the CPU. Based on the experiments in part 2 we concluded that processes performing disk I/O also influence the load average on a Linux system. In this posting we will perform another experiment to find out whether the Linux load average is also affected by processes performing network I/O.

Network I/O and load average

To check if a correlation exists between processes performing network I/O and the load average, we will start 10 processes generating network I/O on an otherwise idle system and collect various performance-related statistics using the sar command. Note: My load-gen script uses the ping command to generate network I/O.
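
The load-gen script itself is not listed here, but a minimal sketch of the network case could look like the one below. This is an illustration only, not the actual script, and it assumes that flood-pinging localhost (which requires root privileges) is an acceptable way to generate loopback traffic:

#!/bin/bash
# Illustrative sketch only -- not the actual load-gen script.
# Start N background flood pings against localhost to generate loopback network I/O.
COUNT=${1:-10}
echo "Starting $COUNT network load processes."
for i in $(seq 1 "$COUNT")
do
    ping -f -q localhost >/dev/null 2>&1 &
done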

$ load-gen network 10
Starting 10 network load processes.
$ sar -n DEV 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com) 	04/21/2012

09:38:01 PM     IFACE   rxpck/s   txpck/s   rxbyt/s   txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
09:38:31 PM        lo  88953.60  88953.60 135920963.87 135920963.87      0.00      0.00      0.00
09:38:31 PM      eth1      0.13      0.17     11.33     62.33      0.00      0.00      0.00
09:38:31 PM      eth0      0.00      0.00      0.00      0.00      0.00      0.00      0.00

09:38:31 PM     IFACE   rxpck/s   txpck/s   rxbyt/s   txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
09:39:01 PM        lo  89295.13  89295.13 136442626.93 136442626.93      0.00      0.00      0.00
09:39:01 PM      eth1      0.03      0.03      2.60     48.33      0.00      0.00      0.00
09:39:01 PM      eth0      0.00      0.00      0.00      0.00      0.00      0.00      0.00

09:39:01 PM     IFACE   rxpck/s   txpck/s   rxbyt/s   txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
09:39:31 PM        lo  89364.38  89364.38 136548566.91 136548566.91      0.00      0.00      0.00
09:39:31 PM      eth1      0.10      0.10      7.34     47.30      0.00      0.00      0.03
09:39:31 PM      eth0      0.00      0.00      0.00      0.00      0.00      0.00      0.00

09:39:31 PM     IFACE   rxpck/s   txpck/s   rxbyt/s   txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
09:40:01 PM        lo  89410.80  89410.80 136619365.60 136619365.60      0.00      0.00      0.00
09:40:01 PM      eth1      0.03      0.03      2.60     48.33      0.00      0.00      0.00
09:40:01 PM      eth0      0.00      0.00      0.00      0.00      0.00      0.00      0.00

09:40:01 PM     IFACE   rxpck/s   txpck/s   rxbyt/s   txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
09:40:31 PM        lo  89502.30  89502.30 136759314.53 136759314.53      0.00      0.00      0.00
09:40:31 PM      eth1      0.23      0.27     20.60     59.33      0.00      0.00      0.00
09:40:31 PM      eth0      0.00      0.00      0.00      0.00      0.00      0.00      0.00

09:40:31 PM     IFACE   rxpck/s   txpck/s   rxbyt/s   txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
09:41:01 PM        lo  89551.52  89551.52 136834718.24 136834718.24      0.00      0.00      0.00
09:41:01 PM      eth1      0.03      0.03      2.60     48.35      0.00      0.00      0.00
09:41:01 PM      eth0      0.00      0.00      0.00      0.00      0.00      0.00      0.00

Average:        IFACE   rxpck/s   txpck/s   rxbyt/s   txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
Average:           lo  89346.27  89346.27 136520905.51 136520905.51      0.00      0.00      0.00
Average:         eth1      0.09      0.11      7.85     52.33      0.00      0.00      0.01
Average:         eth0      0.00      0.00      0.00      0.00      0.00      0.00      0.00 

The above output shows that the lo interface sent and received almost 90 thousand packets per second, good for roughly 136 million bytes of traffic per second. The other two interfaces carried virtually no traffic at all, because my network load processes are pinging localhost. Let’s have a look at the CPU utilization before taking a look at the run-queue utilization and Load Average.
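
As a quick cross-check outside of sar, the same per-interface counters can also be read directly, for example:

$ ip -s link show lo       # packet and byte counters for the loopback interface
$ cat /proc/net/dev        # the same counters in raw form for all interfaces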

$ sar -P ALL -u 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com) 	04/21/2012

09:38:01 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:38:31 PM       all     13.90      0.00     86.10      0.00      0.00      0.00
09:38:31 PM         0     13.60      0.00     86.40      0.00      0.00      0.00
09:38:31 PM         1     14.17      0.00     85.83      0.00      0.00      0.00

09:38:31 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:39:01 PM       all     13.82      0.00     86.18      0.00      0.00      0.00
09:39:01 PM         0     13.30      0.00     86.70      0.00      0.00      0.00
09:39:01 PM         1     14.37      0.00     85.63      0.00      0.00      0.00

09:39:01 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:39:31 PM       all     13.84      0.00     86.16      0.00      0.00      0.00
09:39:31 PM         0     13.30      0.00     86.70      0.00      0.00      0.00
09:39:31 PM         1     14.37      0.00     85.63      0.00      0.00      0.00

09:39:31 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:40:01 PM       all     13.82      0.00     86.18      0.00      0.00      0.00
09:40:01 PM         0     14.10      0.00     85.90      0.00      0.00      0.00
09:40:01 PM         1     13.53      0.00     86.47      0.00      0.00      0.00

09:40:01 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:40:31 PM       all     13.75      0.00     86.25      0.00      0.00      0.00
09:40:31 PM         0     14.27      0.00     85.73      0.00      0.00      0.00
09:40:31 PM         1     13.20      0.00     86.80      0.00      0.00      0.00

09:40:31 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:41:01 PM       all     13.55      0.00     86.45      0.00      0.00      0.00
09:41:01 PM         0     13.83      0.00     86.17      0.00      0.00      0.00
09:41:01 PM         1     13.27      0.00     86.73      0.00      0.00      0.00

Average:          CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:          all     13.78      0.00     86.22      0.00      0.00      0.00
Average:            0     13.73      0.00     86.27      0.00      0.00      0.00
Average:            1     13.82      0.00     86.18      0.00      0.00      0.00
 

On average the CPUs spent 14% of their time running code in user mode and 86% running code in kernel mode. This is because the Linux kernel has to work quite hard to handle this amount of network traffic. The question is of course: what effect does this have on the Load Average?

$ sar -q 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com) 	04/21/2012

09:38:01 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
09:38:31 PM        10       319      4.03      1.93      1.86
09:39:01 PM        10       319      6.46      2.72      2.12
09:39:31 PM        10       319      7.85      3.41      2.37
09:40:01 PM        10       319      8.69      4.04      2.61
09:40:31 PM        10       319      9.14      4.59      2.84
09:41:01 PM        10       313      9.55      5.12      3.07
Average:           10       318      7.62      3.63      2.48
 

The above sar output shows that the run-queue was constantly occupied by 10 processes and that the 1-minute Load Average slowly climbed towards 10, as one might expect by now ;-). This could be an indication that the Load Average is influenced by processes performing network I/O, but maybe the ping processes are simply using large amounts of CPU time and thereby forcing the Load Average up. To figure this out we will take a look at the top output.

top - 21:41:02 up 11:25,  1 user,  load average: 9.51, 5.19, 3.11
Tasks: 215 total,   9 running, 206 sleeping,   0 stopped,   0 zombie
Cpu(s): 13.6%us, 34.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi, 52.4%si,  0.0%st
Mem:   3074820k total,  2567640k used,   507180k free,   221652k buffers
Swap:  5144568k total,        0k used,  5144568k free,  1161696k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
30118 root      20   0  8124  768  640 R  1.1  0.0   0:00.32 ping               
30121 root      20   0  8124  768  640 R  0.9  0.0   0:00.27 ping               
30126 root      20   0  8124  772  640 R  0.5  0.0   0:00.15 ping               
30127 root      20   0  8124  772  640 R  0.5  0.0   0:00.15 ping               
30134 root      20   0  8124  772  640 R  0.4  0.0   0:00.13 ping               
30135 root      20   0  8124  768  640 R  0.4  0.0   0:00.11 ping               
30136 root      20   0  8124  764  640 R  0.4  0.0   0:00.11 ping               
30139 root      20   0  8124  768  640 R  0.2  0.0   0:00.05 ping               
27675 hbreeder  20   0 12864 1212  836 R  0.1  0.0   0:00.15 top                
 

It is clear from the above output that the ping processes are not using huge amounts of CPU time at all, which eliminates their CPU consumption as the driving force behind the Load Average. The output also reveals that the high CPU utilization is mainly caused by handling software interrupts: 52% in this case.
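
Note that plain sar -u folds this interrupt handling time into %system, which is why it did not show up as a separate column earlier. If you want to see the software-interrupt time broken out while the load is running, mpstat (part of the same sysstat package as sar) reports it in its own %soft column, for example:

$ mpstat -P ALL 30 1       # per-CPU breakdown including %irq and %soft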

Conclusion

Based on this experiment we can conclude that processes performing network I/O have an effect on the Linux Load Average. Together with the experiments in the previous two postings, which showed that processes running on, or waiting for, the CPU and processes performing disk I/O also have an effect, this means that the three factors driving the Load Average on a Linux system are processes that are on the run-queue because they (see the example after this list):

  • Run on, or are waiting for, the CPU

  • Perform disk I/O

  • Perform network I/O
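
A quick way to spot the processes on the run-queue at any given moment is to list the tasks that are either runnable (state R) or in uninterruptible sleep (state D, usually I/O), for example:

$ ps -eo state,pid,comm | awk '$1 ~ /^[RD]/'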

Summary

The Linux Load Average is driven by the three factors mentioned above, but how does one interpret a Load Average that seems to be too high? The first step is to look at the CPU utilization. If this isn’t 100% and the Load Average is above the number of CPUs in the system, the Load Average is primarily driven by processes performing disk I/O, network I/O, or a combination of both. Finding the processes responsible for most of the I/O isn’t straightforward because there aren’t many tools available to assist you. A very useful tool is iotop, but it doesn’t seem to work on Oracle Linux 5; it does work on Oracle Linux 6. Another tool is atop, but it requires one or more kernel patches to be useful.
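
If neither iotop nor atop is an option, pidstat (shipped with more recent versions of the sysstat package, for example on Oracle Linux 6) can give a rough per-process disk I/O breakdown, provided the kernel has per-task I/O accounting enabled:

$ pidstat -d 30 6          # per-process read/write kB/s, six 30-second samples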

If the CPU utilization is 100% and the Load Average is above the number of CPUs in the system, the Load Average is driven either entirely by processes running on, or waiting for, the CPU, or by a combination of such processes and processes performing I/O (which in turn can be a mix of disk and network I/O). Using top is an easy way to verify whether CPU utilization alone is responsible for the current Load Average or whether the other two factors play a role as well. Knowing your system helps a lot when it comes to troubleshooting performance problems, and taking performance baselines using sar is always a good thing to do.
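
The data collected by the sysstat cron job ends up in /var/log/sa and can be replayed afterwards with sar -f, which makes comparing a problem period against a baseline straightforward. Assuming the default configuration, the run-queue and CPU figures for, say, the 21st of the month can be retrieved with:

$ sar -q -f /var/log/sa/sa21       # run-queue size and load averages
$ sar -u -f /var/log/sa/sa21       # CPU utilization for the same day
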
-Harald

24 Responses to “Understanding Linux Load Average – Part 3”

  1. excellent article. very easy to understand.
    thanks a lot
    Samir

  2. Narendra said

    Harald,

    Thanks for all three articles explaining the load average so nicely. I especially liked the fact that you also provided details of how it should be measured and actual statistics to support the conclusions. Can you please also share the “load-gen” script? That would be the icing on the cake!!!
    Thanks again…

    • Harald van Breederode said

      Hi Narendra,

      Thanx for your positive comment.
      Feel free to send me an email to request the load-gen scripts I used.
      -Harald

  3. Kris said

    Awesome series Harald!

  4. Harald, you might want to add that ‘sar’ is part of the ‘sysstat’ package. Upon installation (using the sysstat rpm package on Linux), sar collects data every 10 minutes (via the configuration file /etc/cron.d/sysstat). This data is stored in /var/log/sa.

  5. jason smith said

    Good stuff.

    I see you’re using the UEK kernel. I’m particularly trying to find out why the UEK kernel shows better Load Averages than its RHEL counterpart. If you install Oracle Linux you get both kernels, and in 6 even the RHEL kernel performs better than back in 5. However, at least on our production systems, the UEK kernel reports much lower load averages.

    Have you done any testing or know exactly why the UEK kernel shows better load average numbers vs the RHEL stock kernels?

    Thanks,

    • Harald van Breederode said

      Hi Jason,

      No, I haven’t looked at the differences between the stock and UEK kernels. I doubt there is a change in how the load averages are calculated. I think the UEK reports lower load averages simply because it is a way more optimized kernel.
      -Harald

      • jason smith said

        “…because it is a way more optimized kernel.” – that’s exactly our explanation :)

        but yeah, our load averages w/ UEK are significantly lower. I’ve run w/ the RHEL kernel and made hot /proc changes, for example to the CPU scheduler and I/O scheduler settings that Oracle defaults to, and seen differences in system behavior.

  6. siva said

    Hi ,

    Excellent article. Could you please share the load-gen scripts?

    siva

  7. luciano said

    Excellent article.

    I have just one comment: in the third part it says “The above output also reveals that the high CPU utilization is mainly caused by handling software interrupts, 52% in this case.” According to the mpstat manual, system CPU time doesn’t take soft or hard interrupts into account.

  8. Adrian said

    Hello Harald,

    very good article.

    Just one interesting note.
    The “Load Average” on Linux also includes processes in uninterruptible sleep states. So a high Load Average on Linux does not always reflect CPU or I/O activity.

    Adrian

    • Harald van Breederode said

      Hi Adrian,

      Thanx for your comment. I didn’t know that. Can you give examples of when a process enters an uninterruptible sleep state? (It was long ago that I knew exactly what was going on in the kernel ;-) Maybe I am able to demonstrate this behaviour.
      -Harald

      • Stefan said

        Hi Harald,

        you have already demonstrated the behaviour. The uninterruptible state is used for short-term waits, and disk I/O falls into this category. Look for the section “process state codes” in the manpage for “ps” and you will find the state “D” for “Uninterruptible sleep (usually IO)”. And surely you have seen LGWR or other I/O intensive processes in this state. So this has already been part of your analysis.

        Regards,
        Stefan

  9. [...] Understanding Linux Load Average – Part 3 [...]

  10. [...] 2. Understanding Linux Load Average. Thanks @jametong. Reference: part1 part2 part3 [...]

  11. Maurilio said

    Congratulations, it’s a great explanation of load average. Now it is all clear to me.

    Thanks a lot.

  12. Vasudevan Rao said

    It was a very interesting article, explained in minute detail. It was very well explained in the Unix and Linux System Administration Handbook as well. Your article is indeed superb with its command-line output. The conclusion and summary parts are excellent. If possible, please post the load-gen script in your article, either as a bash or a perl script.

  13. Rupesh said

    Awesome series, Sir. One of my friends said that his server has a load average of more than 500, and sometimes it goes up to 1000. Is such a load possible on an 8-core, 16 GB RAM server? Sorry for my English.

    • Harald van Breederode said

      Hi Rupesh,

      Thank you for your question. Yes, it is quite possible for the Load Average to be above 500 or even above 1000. It does not matter how many CPUs or how much memory your system has. During my research I managed to drive the Load Average above 1200 on a dual-core system with only 4 GBytes of memory…
      -Harald

  14. Leandro said

    Amazing post, Sir! Really nice explanation and also all the proofs. Could you please send me the scripts to my email? Thanks in advance, Sir!

    Leo.

  15. Carlos Martinez said

    Thank you sir, it was pretty helpful. If you can share the load-cpu scripts it would be awesome. By the way, I had a case with a high load average (5 5 5) while %idle was around 97%, no swapping, physical memory fine, and the disks were fine too: they had practically no utilization, although I found they were having little peaks of iowait, really very few. Then I found a couple of processes in the D state in the top output. I suspect that what is causing the high load average are those processes in D state, but the weirdest thing is that even if I reboot the server those processes come back again :S

    and actually those processes belong to the OS, or at least that is what I think; the owner is root and the command name per process is PAL_EVENT ERROR RETRY PARSE CMPLT IDLE. This is happening on CentOS 6.5.

    If you can give me a clue about this i would appreciate alot

    Thanks!

    Carlos.
