Linux: how to monitor the nofile limit


In a previous post I explained how to measure the number of processes that are generated when a fork() or clone() call checks the nproc limit. There is another limit in /etc/limits.conf – or in /etc/limits.d – that is displayed by ‘ulimit -n’. It’s the number of open files – ‘nofile’ – and here again we need to know what kind of files are counted.

nofile

‘nofile’ is another limit that may not be easy to monitor, because if you just count the ‘lsof’ output you will include a lot of lines which are not file descriptors. So how can we count the number of files descriptors in a process?

lsof

‘lsof’ is a utility that show all the open files. Let’s take an example:

I get the pid of my pmon process:

1
2
3
[oracle@VM211 ulimit]$ ps -edf | grep pmon
oracle   10586     1  0 19:21 ?        00:00:02 ora_pmon_DEMO
oracle   15494 15290  0 22:12 pts/1    00:00:00 grep pmon

 

And I list the open files for that process

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
[oracle@VM211 ulimit]$ lsof -p 10586
COMMAND     PID   USER   FD TYPE DEVICE  SIZE/OFF NAME
ora_pmon_ 10586 oracle  cwd  DIR  252,0      4096 /app/oracle/product/12.1/dbs
ora_pmon_ 10586 oracle  rtd  DIR  252,0      4096 /
ora_pmon_ 10586 oracle  txt  REG  252,0 322308753 /app/oracle/product/12.1/bin/oracle
ora_pmon_ 10586 oracle  mem  REG   0,17   4194304 /dev/shm/ora_DEMO_150175744_0
ora_pmon_ 10586 oracle  mem  REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_0
ora_pmon_ 10586 oracle  mem  REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_1
ora_pmon_ 10586 oracle  mem  REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_2
ora_pmon_ 10586 oracle  mem  REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_3
ora_pmon_ 10586 oracle  mem  REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_4
ora_pmon_ 10586 oracle  mem  REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_5
...
ora_pmon_ 10586 oracle  mem  REG  252,0   1135194 /app/oracle/product/12.1/lib/libskgxp12.so
ora_pmon_ 10586 oracle  mem  REG  252,0   6776936 /app/oracle/product/12.1/lib/libcell12.so
ora_pmon_ 10586 oracle  mem  REG  252,0     14597 /app/oracle/product/12.1/lib/libodmd12.so
ora_pmon_ 10586 oracle    0r CHR    1,3       0t0 /dev/null
ora_pmon_ 10586 oracle    1w CHR    1,3       0t0 /dev/null
ora_pmon_ 10586 oracle    2w CHR    1,3       0t0 /dev/null
ora_pmon_ 10586 oracle    3r CHR    1,3       0t0 /dev/null
ora_pmon_ 10586 oracle    4r REG  252,0   1233408 /app/oracle/product/12.1/rdbms/mesg/oraus.msb
ora_pmon_ 10586 oracle    5r DIR    0,3         0 /proc/10586/fd
ora_pmon_ 10586 oracle    6u REG  252,0      1544 /app/oracle/product/12.1/dbs/hc_DEMO.dat
ora_pmon_ 10586 oracle    7u REG  252,0        24 /app/oracle/product/12.1/dbs/lkDEMO_SITE1
ora_pmon_ 10586 oracle    8r REG  252,0   1233408 /app/oracle/product/12.1/rdbms/mesg/oraus.msb

I’ve removed hundreds of lines with FD=mem and size=4M. I’m in AMM with memory_target=800M and SGA is implemented in /dev/shm granules. With lsof, we see all of them. And with a large memory_target we can have thousands of them (even if granule becomes 16M when memory_target is larger than 1GB). But don’t worry, they don’t count in the ‘nofile’ limit. Only ‘real’ file descriptors are counted – those with a numeric FD.

So, if you want to know the processes that are near the limit, you can use the following:

1
2
3
4
5
6
7
8
9
10
11
[oracle@VM211 ulimit]$ lsof | awk '$4 ~ /[0-9]+[rwu -].*/{p[$1"t"$2"t"$3]=p[$1"t"$2"t"$3]+1}END{for (i in p) print p[i],i}' | sort -n | tail
15 ora_dmon_    10634   oracle
16 ora_dbw0_    10608   oracle
16 ora_mmon_    10626   oracle
16 ora_rsm0_    10722   oracle
16 tnslsnr      9785    oracle
17 automount    1482    root
17 dbus-daem    1363    dbus
20 rpc.mount    1525    root
21 ora_lgwr_    10610   oracle
89 master       1811    root

 

The idea is to filter the output of lsof and use awk to keep only the numeric file descriptors, and aggregate per process. Then, we sort them and show the highest counts. Here the Postfix master process has 89 files open. Then log writer follows.

You can get the same information from /proc filesystem where files handles are in /proc//fd:

for p in /proc/[0-9]* ; do echo $(ls $p/fd | wc -l) $(cat $p/cmdline) ; done | sort -n | tail
15 ora_dmon_DEMO
16 ora_dbw0_DEMO
16 ora_mmon_DEMO
16 ora_rsm0_DEMO
16 /app/oracle/product/12.1/bin/tnslsnrLISTENER-inherit
17 automount--pid-file/var/run/autofs.pid
17 dbus-daemon--system
20 rpc.mountd
21 ora_lgwr_DEMO
89 /usr/libexec/postfix/master

 

Same result, much quicker and more information about the process. This is the way I prefer, but remember that if you want to see all processes, you should be logged as root.

The proof

As I did for nproc, I have written a small C program that open files (passed as arguments) for a few seconds, so that I’m sure I’m counting the right things.

And I encourage to do the same on a test system and let me know if your result differs. Here is the source: openfiles.zip

First, I set my nofile limit to only 10

ulimit -n 10

 

Then, let’s open 7 files. In addition with stdin, stdout and stderr we will have 10 file handles:

1
2
3
4
5
6
7
8
[oracle@VM211 ulimit]$ ./openfiles myfile1.tmp myfile2.tmp myfile3.tmp myfile4.tmp myfile5.tmp myfile6.tmp myfile7.tmp &
open file 1 of 7 getrlimit nofile: soft=10 hard=10 myfile1.tmp
open file 2 of 7 getrlimit nofile: soft=10 hard=10 myfile2.tmp
open file 3 of 7 getrlimit nofile: soft=10 hard=10 myfile3.tmp
open file 4 of 7 getrlimit nofile: soft=10 hard=10 myfile4.tmp
open file 5 of 7 getrlimit nofile: soft=10 hard=10 myfile5.tmp
open file 6 of 7 getrlimit nofile: soft=10 hard=10 myfile6.tmp
open file 7 of 7 getrlimit nofile: soft=10 hard=10 myfile7.tmp

 

I was able to open those 7 files. Then I check lsof:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
[oracle@VM211 ulimit]$ lsof | grep openfiles
openfiles 21853    oracle  cwd       DIR  0,24    380928    9320 /tmp/ulimit
openfiles 21853    oracle  rtd       DIR 252,0      4096       2 /
openfiles 21853    oracle  txt       REG  0,24      7630    9494 /tmp/ulimit/openfiles
openfiles 21853    oracle  mem       REG 252,0    156928 1579400 /lib64/ld-2.12.so
openfiles 21853    oracle  mem       REG 252,0   1926800 1579401 /lib64/libc-2.12.so
openfiles 21853    oracle    0u      CHR 136,1       0t0       4 /dev/pts/1
openfiles 21853    oracle    1u      CHR 136,1       0t0       4 /dev/pts/1
openfiles 21853    oracle    2u      CHR 136,1       0t0       4 /dev/pts/1
openfiles 21853    oracle    3r      REG  0,24         0    9487 /tmp/myfile1.tmp
openfiles 21853    oracle    4r      REG  0,24         0    9488 /tmp/myfile2.tmp
openfiles 21853    oracle    5r      REG  0,24         0    9489 /tmp/myfile3.tmp
openfiles 21853    oracle    6r      REG  0,24         0    9490 /tmp/myfile4.tmp
openfiles 21853    oracle    7r      REG  0,24         0    9491 /tmp/myfile5.tmp
openfiles 21853    oracle    8r      REG  0,24         0    9492 /tmp/myfile6.tmp
openfiles 21853    oracle    9r      REG  0,24         0    9493 /tmp/myfile7.tmp

 

We see our 10 file handles and this proves that only numeric FD are counted when checking the nofile limit of 10. You see stdin, stdout, stderr as FD 0,1,2 and then my 7 files opened in read only.

Let’s try to open one more file:

1
2
3
4
5
6
7
8
9
10
[oracle@VM211 ulimit]$ ./openfiles myfile1.tmp myfile2.tmp myfile3.tmp myfile4.tmp myfile5.tmp myfile6.tmp myfile7.tmp myfile8.tmp
open file 1 of 8 getrlimit nofile: soft=10 hard=10 myfile1.tmp
open file 2 of 8 getrlimit nofile: soft=10 hard=10 myfile2.tmp
open file 3 of 8 getrlimit nofile: soft=10 hard=10 myfile3.tmp
open file 4 of 8 getrlimit nofile: soft=10 hard=10 myfile4.tmp
open file 5 of 8 getrlimit nofile: soft=10 hard=10 myfile5.tmp
open file 6 of 8 getrlimit nofile: soft=10 hard=10 myfile6.tmp
open file 7 of 8 getrlimit nofile: soft=10 hard=10 myfile7.tmp
open file 8 of 8 getrlimit nofile: soft=10 hard=10 myfile8.tmp
fopen() number 8 failed with errno=24

Here the limit is reached and the open() call returns error 24 (ENFILE) because we reached the nofile=10.

Threads

When counting the processes for the nproc limit, we have seen that threads must be counted as processes. For the nofile limit we don’t need to detail the threads because all threads share the file descriptor table.

Recommended values

Currently this is what is set on Oracle linux 6 for 11gR2 (in /etc/security/limits.conf):

1
2
oracle   soft   nofile    1024
oracle   hard   nofile    65536

 

For 12c, these are set in /etc/security/limits.d/oracle-rdbms-server-12cR1-preinstall.conf which overrides /etc/security/limits.conf:

1
2
oracle soft nofile 1024
oracle hard nofile 65536

 

Do you think it’s a bit low? Just for information, here is what is set in the ODA X4-2:

oracle soft nofile 131072

In any case, it is a good idea to check if you are reaching the limit and the above scripts on lsof or /proc should help for that.



Article Number: 332
Posted: Wed, Jul 25, 2018 2:04 PM
Last Updated: Wed, Jul 25, 2018 2:04 PM

Online URL: http://kb.ictbanking.net/article.php?id=332