Linux: how to monitor the nofile limit
In a previous post I explained how to count the processes that are taken into account when a fork() or clone() call checks the nproc limit. There is another limit in /etc/security/limits.conf (or in /etc/security/limits.d) that is displayed by ‘ulimit -n’: the number of open files, ‘nofile’. Here again we need to know what kind of files are counted.
nofile
‘nofile’ is another limit that may not be easy to monitor, because if you just count the lines of the ‘lsof’ output you will include many lines that are not file descriptors. So how can we count the number of file descriptors in a process?
lsof
‘lsof’ is a utility that shows all the open files. Let’s take an example:
I get the pid of my pmon process:
[oracle@VM211 ulimit]$ ps -edf | grep pmon
oracle   10586     1  0 19:21 ?        00:00:02 ora_pmon_DEMO
oracle   15494 15290  0 22:12 pts/1    00:00:00 grep pmon
And I list the open files for that process:
[oracle@VM211 ulimit]$ lsof -p 10586
COMMAND     PID   USER   FD   TYPE DEVICE  SIZE/OFF NAME
ora_pmon_ 10586 oracle  cwd    DIR  252,0      4096 /app/oracle/product/12.1/dbs
ora_pmon_ 10586 oracle  rtd    DIR  252,0      4096 /
ora_pmon_ 10586 oracle  txt    REG  252,0 322308753 /app/oracle/product/12.1/bin/oracle
ora_pmon_ 10586 oracle  mem    REG   0,17   4194304 /dev/shm/ora_DEMO_150175744_0
ora_pmon_ 10586 oracle  mem    REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_0
ora_pmon_ 10586 oracle  mem    REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_1
ora_pmon_ 10586 oracle  mem    REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_2
ora_pmon_ 10586 oracle  mem    REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_3
ora_pmon_ 10586 oracle  mem    REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_4
ora_pmon_ 10586 oracle  mem    REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_5
...
ora_pmon_ 10586 oracle  mem    REG  252,0   1135194 /app/oracle/product/12.1/lib/libskgxp12.so
ora_pmon_ 10586 oracle  mem    REG  252,0   6776936 /app/oracle/product/12.1/lib/libcell12.so
ora_pmon_ 10586 oracle  mem    REG  252,0     14597 /app/oracle/product/12.1/lib/libodmd12.so
ora_pmon_ 10586 oracle    0r   CHR    1,3       0t0 /dev/null
ora_pmon_ 10586 oracle    1w   CHR    1,3       0t0 /dev/null
ora_pmon_ 10586 oracle    2w   CHR    1,3       0t0 /dev/null
ora_pmon_ 10586 oracle    3r   CHR    1,3       0t0 /dev/null
ora_pmon_ 10586 oracle    4r   REG  252,0   1233408 /app/oracle/product/12.1/rdbms/mesg/oraus.msb
ora_pmon_ 10586 oracle    5r   DIR    0,3         0 /proc/10586/fd
ora_pmon_ 10586 oracle    6u   REG  252,0      1544 /app/oracle/product/12.1/dbs/hc_DEMO.dat
ora_pmon_ 10586 oracle    7u   REG  252,0        24 /app/oracle/product/12.1/dbs/lkDEMO_SITE1
ora_pmon_ 10586 oracle    8r   REG  252,0   1233408 /app/oracle/product/12.1/rdbms/mesg/oraus.msb
I’ve removed hundreds of lines with FD=mem and size=4M. I’m using AMM with memory_target=800M, so the SGA is implemented as granules in /dev/shm, and lsof shows all of them. With a large memory_target there can be thousands of them (even though the granule size becomes 16M when memory_target is larger than 1GB). But don’t worry, they don’t count toward the ‘nofile’ limit. Only ‘real’ file descriptors are counted: those with a numeric FD.
So, if you want to know the processes that are near the limit, you can use the following:
[oracle@VM211 ulimit]$ lsof | awk '$4 ~ /[0-9]+[rwu -].*/{p[$1"\t"$2"\t"$3]=p[$1"\t"$2"\t"$3]+1}END{for (i in p) print p[i],i}' | sort -n | tail
15 ora_dmon_    10634   oracle
16 ora_dbw0_    10608   oracle
16 ora_mmon_    10626   oracle
16 ora_rsm0_    10722   oracle
16 tnslsnr      9785    oracle
17 automount    1482    root
17 dbus-daem    1363    dbus
20 rpc.mount    1525    root
21 ora_lgwr_    10610   oracle
89 master       1811    root
The idea is to filter the output of lsof with awk to keep only the numeric file descriptors and aggregate them per process. Then we sort the counts and show the highest ones. Here the Postfix master process has 89 files open, followed by the log writer with 21.
You can get the same information from the /proc filesystem, where the file handles are listed in /proc/<pid>/fd:
for p in /proc/[0-9]* ; do echo $(ls $p/fd | wc -l) $(cat $p/cmdline) ; done | sort -n | tail
15 ora_dmon_DEMO
16 ora_dbw0_DEMO
16 ora_mmon_DEMO
16 ora_rsm0_DEMO
16 /app/oracle/product/12.1/bin/tnslsnrLISTENER-inherit
17 automount--pid-file/var/run/autofs.pid
17 dbus-daemon--system
20 rpc.mountd
21 ora_lgwr_DEMO
89 /usr/libexec/postfix/master
Same result, much quicker, and with more information about the process. This is the way I prefer, but remember that if you want to see all processes you need to be logged in as root.
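To see how close each process is to its own limit, the count from /proc/<pid>/fd can be set against the soft limit reported in /proc/<pid>/limits. The following is only a minimal sketch and not one of the original scripts: the function names fd_usage and fd_report are made up for illustration, and it assumes a Linux /proc where the ‘Max open files’ row of the limits file carries the soft limit in its fourth field.

```shell
# Hypothetical helper: print "<open fds> <soft nofile limit> <pid> <name>"
# for one process.
fd_usage() {
    pid=$1
    # Each entry in /proc/<pid>/fd is one open file descriptor.
    # For other users' processes the directory is only readable by root,
    # in which case the count comes out as 0.
    count=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    # Row format: "Max open files  <soft>  <hard>  files" (soft is field 4).
    soft=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits" 2>/dev/null)
    # Short command name, if this kernel exposes /proc/<pid>/comm.
    name=$(cat "/proc/$pid/comm" 2>/dev/null)
    echo "$count ${soft:-?} $pid ${name:-?}"
}

# Hypothetical helper: the ten processes with the most open descriptors.
fd_report() {
    for d in /proc/[0-9]*; do
        fd_usage "${d#/proc/}"
    done | sort -n | tail
}
```

Calling fd_report prints one line per process, so a count that approaches the second column stands out immediately.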
The proof
As I did for nproc, I have written a small C program that opens files (passed as arguments) and keeps them open for a few seconds, so that I’m sure I’m counting the right things.
I encourage you to do the same on a test system and let me know if your results differ. Here is the source: openfiles.zip
First, I set my nofile limit to only 10
ulimit -n 10
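Note that ‘ulimit -n’ without -S or -H lowers both the soft and the hard limit, and a non-root user cannot raise the hard limit again in the same shell. One way to experiment without affecting the login shell is to lower the limit inside a subshell. This is a small sketch; the function name limits_after_lowering is purely illustrative.

```shell
# Hypothetical helper: lower nofile inside a subshell and print the result.
# The parentheses start a subshell, so the calling shell keeps its own limits.
limits_after_lowering() {
    (
        # Without -S or -H, the ulimit builtin sets both soft and hard limits.
        ulimit -n "$1" || exit 1
        printf 'soft='; ulimit -Sn
        printf 'hard='; ulimit -Hn
    )
}
```

Calling limits_after_lowering 10 should report the same soft and hard values that the test program prints through getrlimit, while ‘ulimit -n’ in the calling shell stays unchanged.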
Then, let’s open 7 files. Together with stdin, stdout and stderr, that makes 10 file handles:
[oracle@VM211 ulimit]$ ./openfiles myfile1.tmp myfile2.tmp myfile3.tmp myfile4.tmp myfile5.tmp myfile6.tmp myfile7.tmp &
open file 1 of 7 getrlimit nofile: soft=10 hard=10 myfile1.tmp
open file 2 of 7 getrlimit nofile: soft=10 hard=10 myfile2.tmp
open file 3 of 7 getrlimit nofile: soft=10 hard=10 myfile3.tmp
open file 4 of 7 getrlimit nofile: soft=10 hard=10 myfile4.tmp
open file 5 of 7 getrlimit nofile: soft=10 hard=10 myfile5.tmp
open file 6 of 7 getrlimit nofile: soft=10 hard=10 myfile6.tmp
open file 7 of 7 getrlimit nofile: soft=10 hard=10 myfile7.tmp
I was able to open those 7 files. Then I check lsof:
[oracle@VM211 ulimit]$ lsof | grep openfiles
openfiles 21853 oracle  cwd    DIR   0,24  380928    9320 /tmp/ulimit
openfiles 21853 oracle  rtd    DIR  252,0    4096       2 /
openfiles 21853 oracle  txt    REG   0,24    7630    9494 /tmp/ulimit/openfiles
openfiles 21853 oracle  mem    REG  252,0  156928 1579400 /lib64/ld-2.12.so
openfiles 21853 oracle  mem    REG  252,0 1926800 1579401 /lib64/libc-2.12.so
openfiles 21853 oracle    0u   CHR  136,1     0t0       4 /dev/pts/1
openfiles 21853 oracle    1u   CHR  136,1     0t0       4 /dev/pts/1
openfiles 21853 oracle    2u   CHR  136,1     0t0       4 /dev/pts/1
openfiles 21853 oracle    3r   REG   0,24       0    9487 /tmp/myfile1.tmp
openfiles 21853 oracle    4r   REG   0,24       0    9488 /tmp/myfile2.tmp
openfiles 21853 oracle    5r   REG   0,24       0    9489 /tmp/myfile3.tmp
openfiles 21853 oracle    6r   REG   0,24       0    9490 /tmp/myfile4.tmp
openfiles 21853 oracle    7r   REG   0,24       0    9491 /tmp/myfile5.tmp
openfiles 21853 oracle    8r   REG   0,24       0    9492 /tmp/myfile6.tmp
openfiles 21853 oracle    9r   REG   0,24       0    9493 /tmp/myfile7.tmp
We see our 10 file handles, which proves that only numeric FDs are counted against the nofile limit of 10. You see stdin, stdout and stderr as FD 0, 1 and 2, and then my 7 files opened read-only.
Let’s try to open one more file:
[oracle@VM211 ulimit]$ ./openfiles myfile1.tmp myfile2.tmp myfile3.tmp myfile4.tmp myfile5.tmp myfile6.tmp myfile7.tmp myfile8.tmp
open file 1 of 8 getrlimit nofile: soft=10 hard=10 myfile1.tmp
open file 2 of 8 getrlimit nofile: soft=10 hard=10 myfile2.tmp
open file 3 of 8 getrlimit nofile: soft=10 hard=10 myfile3.tmp
open file 4 of 8 getrlimit nofile: soft=10 hard=10 myfile4.tmp
open file 5 of 8 getrlimit nofile: soft=10 hard=10 myfile5.tmp
open file 6 of 8 getrlimit nofile: soft=10 hard=10 myfile6.tmp
open file 7 of 8 getrlimit nofile: soft=10 hard=10 myfile7.tmp
open file 8 of 8 getrlimit nofile: soft=10 hard=10 myfile8.tmp
fopen() number 8 failed with errno=24
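Here the limit is reached and the fopen() call fails with errno 24 (EMFILE, ‘Too many open files’) because the process already holds nofile=10 file descriptors. Don’t confuse it with ENFILE (errno 23), which signals that the system-wide file table is full.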
Threads
When counting the processes for the nproc limit, we have seen that threads must be counted as processes. For the nofile limit we don’t need to look at individual threads, because all threads of a process share the same file descriptor table.
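This can be checked from /proc as well, because each thread has its own /proc/<pid>/task/<tid>/fd directory. The sketch below uses a made-up function name, thread_fd_counts, and assumes ordinary threads that share the descriptor table (which is what pthread_create gives you); for such a process every thread should report the same count when nothing is opening or closing files in the meantime.

```shell
# Hypothetical helper: print "<tid> <open fds>" for every thread of a process.
thread_fd_counts() {
    pid=$1
    for t in /proc/"$pid"/task/[0-9]*; do
        # Skip the unexpanded pattern if the process is gone or unreadable.
        [ -d "$t/fd" ] || continue
        echo "${t##*/} $(ls "$t/fd" 2>/dev/null | wc -l)"
    done
}
```

For example, thread_fd_counts 10586 would list each thread of the pmon process above with its descriptor count.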
Recommended values
Currently this is what is set on Oracle Linux 6 for 11gR2 (in /etc/security/limits.conf):
oracle soft nofile 1024
oracle hard nofile 65536
For 12c, these are set in /etc/security/limits.d/oracle-rdbms-server-12cR1-preinstall.conf which overrides /etc/security/limits.conf:
oracle soft nofile 1024
oracle hard nofile 65536
Do you think it’s a bit low? Just for information, here is what is set in the ODA X4-2:
oracle soft nofile 131072
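Since the same item can be set in several files, it helps to list every active nofile entry together with the file it comes from. Here is a minimal sketch with an illustrative function name, nofile_entries; it only prints the entries and does not decide which one wins. As far as I know, pam_limits reads limits.conf first and then the limits.d files in C-locale order, with later entries overriding earlier ones, so the order in which you pass the files matters when reading the output.

```shell
# Hypothetical helper: list non-comment nofile entries from the given files,
# in the order the files are passed, prefixed with the file name.
nofile_entries() {
    awk '$1 !~ /^#/ && $3 == "nofile" {print FILENAME ": " $1, $2, $3, $4}' "$@"
}
```

A typical call would be nofile_entries /etc/security/limits.conf /etc/security/limits.d/*.conf. Keep in mind that these files are applied at login, so an already running process keeps the limits shown in its own /proc/<pid>/limits.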
In any case, it is a good idea to check whether you are reaching the limit, and the lsof and /proc scripts above should help with that.