一次oom问题排查
问题描述
收到基础组件错误报警, 检查发现ES oom.
问题排查
10:21:50 vmtools 申请4kB(order=0) 内存触发了OOM .
Apr 11 10:21:50 kernel: ##vmtoolsd## invoked oom-killer: gfp_mask=0x280da, ##order=0##, oom_score_adj=0
Apr 11 10:21:50 kernel: vmtoolsd cpuset=/ mems_allowed=0
Apr 11 10:21:50 kernel: CPU: 6 PID: 867 Comm: vmtoolsd Not tainted 3.10.0-693.21.1.el7.x86_64 #1
Apr 11 10:21:50 kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
Apr 11 10:21:50 kernel: Call Trace:
Apr 11 10:21:50 kernel: [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
Apr 11 10:21:50 kernel: [<ffffffff816a9b90>] dump_header+0x90/0x229
Apr 11 10:21:50 kernel: [<ffffffff810ecec2>] ? ktime_get_ts64+0x52/0xf0
Apr 11 10:21:50 kernel: [<ffffffff8114140f>] ? delayacct_end+0x8f/0xb0
Apr 11 10:21:50 kernel: [<ffffffff8118a884>] oom_kill_process+0x254/0x3d0
Apr 11 10:21:50 kernel: [<ffffffff8118a32d>] ? oom_unkillable_task+0xcd/0x120
Apr 11 10:21:50 kernel: [<ffffffff8118a3d6>] ? find_lock_task_mm+0x56/0xc0
Apr 11 10:21:50 kernel: [<ffffffff8118b0c6>] out_of_memory+0x4b6/0x4f0
Apr 11 10:21:50 kernel: [<ffffffff816aa694>] __alloc_pages_slowpath+0x5d6/0x724
Apr 11 10:21:50 kernel: [<ffffffff811912a5>] __alloc_pages_nodemask+0x405/0x420
Apr 11 10:21:50 kernel: [<ffffffff811d8a75>] alloc_pages_vma+0xb5/0x200
Apr 11 10:21:50 kernel: [<ffffffff811b6c50>] handle_mm_fault+0xb60/0xfa0
Apr 11 10:21:50 kernel: [<ffffffff81333563>] ? number.isra.2+0x323/0x360
Apr 11 10:21:50 kernel: [<ffffffff816bb504>] __do_page_fault+0x154/0x450
Apr 11 10:21:50 kernel: [<ffffffff816bb835>] do_page_fault+0x35/0x90
Apr 11 10:21:50 kernel: [<ffffffff816b7768>] page_fault+0x28/0x30
Apr 11 10:21:50 kernel: [<ffffffff81336539>] ? copy_user_generic_unrolled+0x89/0xc0
Apr 11 10:21:50 kernel: [<ffffffff8122b369>] ? seq_read+0x2c9/0x3e0
Apr 11 10:21:50 kernel: [<ffffffff812756b0>] proc_reg_read+0x40/0x80
Apr 11 10:21:50 kernel: [<ffffffff812054ef>] vfs_read+0x9f/0x170
Apr 11 10:21:50 kernel: [<ffffffff812063bf>] SyS_read+0x7f/0xe0
Apr 11 10:21:50 kernel: [<ffffffff816c0715>] system_call_fastpath+0x1c/0x21
Apr 11 10:21:50 kernel: Mem-Info:
Apr 11 10:21:50 kernel: active_anon:2911137 inactive_anon:58987 isolated_anon:0#012 active_file:231 inactive_file:1006 isolated_file:193#012 unevictable:941692 dirty:28 writeback:0 unstable:0#012 slab_reclaimable:57650 slab_unreclaimable:23641#012 mapped:69758 shmem:206054 pagetables:15540 bounce:0#012 free:33111 free_pcp:150 free_cma:0
Apr 11 10:21:50 kernel: Node 0 DMA ##free:15840kB## min:64kB low:80kB ##high:96kB## active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Apr 11 10:21:50 kernel: lowmem_reserve[]: 0 2977 ##16015## 16015
Apr 11 10:21:50 kernel: Node 0 DMA32 ##free:61692kB## min:12544kB low:15680kB ##high:18816kB## active_anon:2234420kB inactive_anon:41828kB active_file:0kB inactive_file:28kB unevictable:526724kB isolated(anon):0kB isolated(file):0kB present:3129152kB managed:3048960kB mlocked:526724kB dirty:0kB writeback:0kB mapped:55232kB shmem:147496kB slab_reclaimable:70336kB slab_unreclaimable:44232kB kernel_stack:12192kB pagetables:9352kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:88 all_unreclaimable? yes
Apr 11 10:21:50 kernel: lowmem_reserve[]: 0 0 ##13037## 13037
Apr 11 10:21:50 kernel: Node 0 Normal ##free:54912kB min:54968kB## low:68708kB high:82452kB active_anon:9410128kB inactive_anon:194120kB active_file:924kB inactive_file:3996kB unevictable:3240044kB isolated(anon):0kB isolated(file):772kB present:13631488kB managed:13350504kB mlocked:3240044kB dirty:112kB writeback:0kB mapped:223800kB shmem:676720kB slab_reclaimable:160264kB slab_unreclaimable:50300kB kernel_stack:9600kB pagetables:52808kB unstable:0kB bounce:0kB free_pcp:600kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:650 all_unreclaimable? no
Apr 11 10:21:50 kernel: lowmem_reserve[]: 0 0 0 0
Apr 11 10:21:50 kernel: Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15840kB
Apr 11 10:21:50 kernel: Node 0 DMA32: 4174*4kB (UEM) 1746*8kB (UEM) 1143*16kB (UEM) 387*32kB (EM) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 61336kB
Apr 11 10:21:50 kernel: Node 0 Normal: 13786*4kB (UE) 19*8kB (U) 1*16kB (M) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 55312kB
Apr 11 10:21:50 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Apr 11 10:21:50 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Apr 11 10:21:50 kernel: 273226 total pagecache pages
Apr 11 10:21:50 kernel: 0 pages in swap cache
Apr 11 10:21:50 kernel: Swap cache stats: add 0, delete 0, find 0/0
Apr 11 10:21:50 kernel: Free swap = 0kB
Apr 11 10:21:50 kernel: Total swap = 0kB
Apr 11 10:21:50 kernel: 4194157 pages RAM
Apr 11 10:21:50 kernel: 0 pages HighMem/MovableOnly
Apr 11 10:21:50 kernel: 90315 pages reserved
此时node0 normal 4kB 是包含E类型的内存页的,为什么会申请失败,先计算一下?
UME, 分别 表示UNMOVABEL/RECLAIMABAL/MOVABLE)
1. vmtoolsd 申请4k(order=0)内存, 此时normal的free 内存是54912kB < min 54968kB,normal申请失败;
2. 转向DMA32 申请,high(18816kB)+ lowmem_reserve[2](13037*4kB)=70964kB > free(61692kB)不允许申请, DMA32申请失败;
3. 转向DMA申请,high(96kB)+ lowmem_reserve[2](16015*4kB)=64156kB > free(15840kB) 不允许申请,DMA申请失败
其中用##
标注的字段重点关注,用到的公式
normal : min < free 才能申请成功
high+lowmem_reserve[2] < free DMA32 DMA才能申请成功
申请顺序 normal -> DMA32 -> DMA
看一下各进程的内存使用情况
Apr 11 10:21:50 kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Apr 11 10:21:50 kernel: [ 641] 0 641 15651 5236 34 0 0 systemd-journal
Apr 11 10:21:50 kernel: [ 667] 0 667 66578 635 33 0 0 lvmetad
Apr 11 10:21:50 kernel: [ 669] 0 669 11231 477 23 0 -1000 systemd-udevd
Apr 11 10:21:50 kernel: [ 835] 0 835 13863 140 26 0 -1000 auditd
Apr 11 10:21:50 kernel: [ 858] 81 858 13504 263 30 0 -900 dbus-daemon
Apr 11 10:21:50 kernel: [ 859] 0 859 117945 1113 83 0 0 NetworkManager
Apr 11 10:21:50 kernel: [ 862] 999 862 159624 2479 60 0 0 polkitd
Apr 11 10:21:50 kernel: [ 866] 0 866 24902 401 40 0 0 VGAuthService
Apr 11 10:21:50 kernel: [ 867] 0 867 76271 482 57 0 0 vmtoolsd
Apr 11 10:21:50 kernel: [ 869] 0 869 166217 4091 136 0 0 rsyslogd
Apr 11 10:21:50 kernel: [ 873] 0 873 5406 143 15 0 0 irqbalance
Apr 11 10:21:50 kernel: [ 876] 0 876 31567 257 20 0 0 crond
Apr 11 10:21:50 kernel: [ 885] 998 885 5619 103 16 0 0 chronyd
Apr 11 10:21:50 kernel: [ 900] 0 900 27511 31 9 0 0 agetty
Apr 11 10:21:50 kernel: [ 1156] 0 1156 26499 261 54 0 -1000 sshd
Apr 11 10:21:50 kernel: [ 1162] 0 1162 141806 2682 96 0 0 tuned
Apr 11 10:21:50 kernel: [ 1368] 0 1368 9913 469 23 0 0 nginx
Apr 11 10:21:50 kernel: [ 2127] 0 2127 8873 79 17 0 0 gseMaster
Apr 11 10:21:50 kernel: [ 2131] 0 2131 469581 7598 81 0 0 agentWorker
Apr 11 10:21:50 kernel: [ 2291] 0 2291 6754 2299 18 0 0 basereport
Apr 11 10:21:50 kernel: [ 3256] 0 3256 28963 3832 21 0 0 node_exporter
Apr 11 10:21:50 kernel: [ 6759] 0 6759 6143 362 16 0 0 systemd-logind
Apr 11 10:21:50 kernel: [24650] 993 24650 1745496 85641 349 0 0 java
Apr 11 10:21:50 kernel: [112489] 2048 112489 28805 2815 18 0 0 mysqld_exporter
Apr 11 10:21:50 kernel: [63602] 992 63602 2937758 600771 1652 0 0 java
Apr 11 10:21:50 kernel: [65678] 0 65678 177274 1212 19 0 0 canal-cfg
Apr 11 10:21:50 kernel: [91161] 0 91161 3488124 455619 2405 0 0 vmstorage-prod
Apr 11 10:21:50 kernel: [30303] 0 30303 311149 98508 268 0 0 vmselect-prod
Apr 11 10:21:50 kernel: [45887] 991 45887 579175 94956 427 0 0 mongod
Apr 11 10:21:50 kernel: [126529] 2049 126529 182474 607 24 0 0 gogstash
Apr 11 10:21:50 kernel: [126535] 2049 126535 200459 3544 77 0 0 gogstash
Apr 11 10:21:50 kernel: [126536] 2049 126536 268933 17364 208 0 0 gogstash
Apr 11 10:21:50 kernel: [126540] 2049 126540 252132 8255 190 0 0 gogstash
Apr 11 10:21:50 kernel: [19058] 27 19058 1243415 132219 407 0 0 mysqld
Apr 11 10:21:50 kernel: [101875] 0 101875 180408 6846 55 0 0 vminsert-prod
Apr 11 10:21:50 kernel: [102791] 994 102791 44639 1496 31 0 0 redis-server
Apr 11 10:21:50 kernel: [77582] 0 77582 10085 612 23 0 0 nginx
Apr 11 10:21:50 kernel: [77583] 0 77583 10085 604 23 0 0 nginx
Apr 11 10:21:50 kernel: [77584] 0 77584 10085 636 23 0 0 nginx
Apr 11 10:21:50 kernel: [77585] 0 77585 10085 621 23 0 0 nginx
Apr 11 10:21:50 kernel: [77586] 0 77586 10085 608 23 0 0 nginx
Apr 11 10:21:50 kernel: [77587] 0 77587 10085 590 23 0 0 nginx
Apr 11 10:21:50 kernel: [77588] 0 77588 10085 613 23 0 0 nginx
Apr 11 10:21:50 kernel: [77589] 0 77589 10085 608 23 0 0 nginx
Apr 11 10:21:50 kernel: [16335] 0 16335 2272518 419842 991 0 0 java
Apr 11 10:21:50 kernel: [21482] 0 21482 181686 4520 39 0 0 fc-dqueue
Apr 11 10:21:50 kernel: [21511] 0 21511 183845 6005 43 0 0 fc-engine
Apr 11 10:21:50 kernel: [21534] 0 21534 184950 6094 47 0 0 fc-alert
Apr 11 10:21:50 kernel: [28651] 0 28651 202733 15175 84 0 0 fc-scrape
Apr 11 10:21:50 kernel: [14336] 0 14336 1001051 700273 1646 0 0 fc-insight
Apr 11 10:21:50 kernel: [16802] 996 16802 13376462 1012810 5283 0 0 java
Apr 11 10:21:50 kernel: [17051] 996 17051 13618 280 20 0 0 controller
Apr 11 10:21:50 kernel: [126835] 0 126835 45055 508 45 0 0 crond
Apr 11 10:21:50 kernel: [126838] 0 126838 45055 512 45 0 0 crond
Apr 11 10:21:50 kernel: [126839] 0 126839 45054 493 45 0 0 crond
Apr 11 10:21:50 kernel: Out of memory: Kill process 16802 (java) score 248 or sacrifice child
Apr 11 10:21:50 kernel: Killed process 17051 (controller) total-vm:54472kB, anon-rss:596kB, file-rss:524kB, shmem-rss:0kB
10:21:55 kthreadd申请16KB(order=2)内存, 触发oom. 这时 normal已经没有16KB的内存了, 需要看DMA32和DMA是否可以申请。
Apr 11 10:21:55 kernel: kthreadd invoked oom-killer: gfp_mask=0x3000d0, order=2, oom_score_adj=0
Apr 11 10:21:55 kernel: kthreadd cpuset=/ mems_allowed=0
Apr 11 10:21:55 kernel: CPU: 3 PID: 2 Comm: kthreadd Not tainted 3.10.0-693.21.1.el7.x86_64 #1
Apr 11 10:21:55 kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
Apr 11 10:21:55 kernel: Call Trace:
Apr 11 10:21:55 kernel: [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
Apr 11 10:21:55 kernel: [<ffffffff816a9b90>] dump_header+0x90/0x229
Apr 11 10:21:55 kernel: [<ffffffff810ecec2>] ? ktime_get_ts64+0x52/0xf0
Apr 11 10:21:55 kernel: [<ffffffff8114140f>] ? delayacct_end+0x8f/0xb0
Apr 11 10:21:55 kernel: [<ffffffff8118a884>] oom_kill_process+0x254/0x3d0
Apr 11 10:21:55 kernel: [<ffffffff8118a32d>] ? oom_unkillable_task+0xcd/0x120
Apr 11 10:21:55 kernel: [<ffffffff8118a3d6>] ? find_lock_task_mm+0x56/0xc0
Apr 11 10:21:55 kernel: [<ffffffff8118b0c6>] out_of_memory+0x4b6/0x4f0
Apr 11 10:21:55 kernel: [<ffffffff816aa694>] __alloc_pages_slowpath+0x5d6/0x724
Apr 11 10:21:55 kernel: [<ffffffff811912a5>] __alloc_pages_nodemask+0x405/0x420
Apr 11 10:21:55 kernel: [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
Apr 11 10:21:55 kernel: [<ffffffff8108859d>] copy_process+0x1dd/0x1970
Apr 11 10:21:55 kernel: [<ffffffff81089ee1>] do_fork+0x91/0x320
Apr 11 10:21:55 kernel: [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
Apr 11 10:21:55 kernel: [<ffffffff8108a196>] kernel_thread+0x26/0x30
Apr 11 10:21:55 kernel: [<ffffffff810b49f1>] kthreadd+0x2c1/0x300
Apr 11 10:21:55 kernel: [<ffffffff810b4730>] ? kthread_create_on_cpu+0x60/0x60
Apr 11 10:21:55 kernel: [<ffffffff816c055d>] ret_from_fork+0x5d/0xb0
Apr 11 10:21:55 kernel: [<ffffffff810b4730>] ? kthread_create_on_cpu+0x60/0x60
Apr 11 10:21:55 kernel: Mem-Info:
Apr 11 10:21:55 kernel: active_anon:2910986 inactive_anon:58989 isolated_anon:0#012 active_file:115 inactive_file:174 isolated_file:0#012 unevictable:941692 dirty:1 writeback:2 unstable:0#012 slab_reclaimable:57632 slab_unreclaimable:23712#012 mapped:69724 shmem:206062 pagetables:15519 bounce:0#012 free:32711 free_pcp:128 free_cma:0
Apr 11 10:21:55 kernel: Node 0 DMA free:15840kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Apr 11 10:21:55 kernel: lowmem_reserve[]: 0 2977 16015 16015
Apr 11 10:21:55 kernel: Node 0 DMA32 free:60112kB min:12544kB low:15680kB high:18816kB active_anon:2233580kB inactive_anon:41828kB active_file:0kB inactive_file:28kB unevictable:526724kB isolated(anon):0kB isolated(file):0kB present:3129152kB managed:3048960kB mlocked:526724kB dirty:0kB writeback:0kB mapped:55296kB shmem:147496kB slab_reclaimable:70336kB slab_unreclaimable:44456kB kernel_stack:12032kB pagetables:9344kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:184 all_unreclaimable? yes
Apr 11 10:21:55 kernel: lowmem_reserve[]: 0 0 13037 13037
Apr 11 10:21:55 kernel: Node 0 Normal free:55152kB min:54968kB low:68708kB high:82452kB active_anon:9410364kB inactive_anon:194128kB active_file:460kB inactive_file:668kB unevictable:3240044kB isolated(anon):0kB isolated(file):0kB present:13631488kB managed:13350504kB mlocked:3240044kB dirty:4kB writeback:8kB mapped:223600kB shmem:676752kB slab_reclaimable:160192kB slab_unreclaimable:50360kB kernel_stack:9600kB pagetables:52732kB unstable:0kB bounce:0kB free_pcp:404kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:32 all_unreclaimable? no
Apr 11 10:21:55 kernel: lowmem_reserve[]: 0 0 0 0
Apr 11 10:21:55 kernel: Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15840kB
Apr 11 10:21:55 kernel: Node 0 DMA32: 4382*4kB (UEM) 1747*8kB (UEM) 1152*16kB (UEM) 317*32kB (EM) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 60080kB
Apr 11 10:21:55 kernel: Node 0 Normal: 13849*4kB (UEM) 55*8kB (UEM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 55836kB
Apr 11 10:21:55 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Apr 11 10:21:55 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Apr 11 10:21:55 kernel: 272055 total pagecache pages
Apr 11 10:21:55 kernel: 0 pages in swap cache
Apr 11 10:21:55 kernel: Swap cache stats: add 0, delete 0, find 0/0
Apr 11 10:21:55 kernel: Free swap = 0kB
Apr 11 10:21:55 kernel: Total swap = 0kB
Apr 11 10:21:55 kernel: 4194157 pages RAM
Apr 11 10:21:55 kernel: 0 pages HighMem/MovableOnly
Apr 11 10:21:55 kernel: 90315 pages reserved
1 kthreadd 申请16kB连续内存(order=2),此时normal的free 内存是54968kB < min 55152kB,normal申请失败;
2 转向DMA32 申请,high(18816kB)+ lowmem_reserve[2](13037*4kB)=70964kB > free(60112kB)不允许申请, DMA32申请失败;
3 转向DMA申请,high(96kB)+ lowmem_reserve[2](16015*4kB)=64156kB > free(15840kB) 不允许申请,DMA申请失败
此时各进程的内存使用情况
Apr 11 10:21:55 kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Apr 11 10:21:55 kernel: [ 641] 0 641 15651 5271 34 0 0 systemd-journal
Apr 11 10:21:55 kernel: [ 667] 0 667 66578 635 33 0 0 lvmetad
Apr 11 10:21:55 kernel: [ 669] 0 669 11231 477 23 0 -1000 systemd-udevd
Apr 11 10:21:55 kernel: [ 835] 0 835 13863 164 26 0 -1000 auditd
Apr 11 10:21:55 kernel: [ 858] 81 858 13504 286 30 0 -900 dbus-daemon
Apr 11 10:21:55 kernel: [ 859] 0 859 117945 1113 83 0 0 NetworkManager
Apr 11 10:21:55 kernel: [ 862] 999 862 159624 2442 60 0 0 polkitd
Apr 11 10:21:55 kernel: [ 866] 0 866 24902 401 40 0 0 VGAuthService
Apr 11 10:21:55 kernel: [ 867] 0 867 76271 482 57 0 0 vmtoolsd
Apr 11 10:21:55 kernel: [ 869] 0 869 166218 4092 136 0 0 rsyslogd
Apr 11 10:21:55 kernel: [ 873] 0 873 5406 143 15 0 0 irqbalance
Apr 11 10:21:55 kernel: [ 876] 0 876 31567 257 20 0 0 crond
Apr 11 10:21:55 kernel: [ 885] 998 885 5619 103 16 0 0 chronyd
Apr 11 10:21:55 kernel: [ 900] 0 900 27511 31 9 0 0 agetty
Apr 11 10:21:55 kernel: [ 1156] 0 1156 26499 261 54 0 -1000 sshd
Apr 11 10:21:55 kernel: [ 1162] 0 1162 141806 2682 96 0 0 tuned
Apr 11 10:21:55 kernel: [ 1368] 0 1368 9913 469 23 0 0 nginx
Apr 11 10:21:55 kernel: [ 2127] 0 2127 8873 79 17 0 0 gseMaster
Apr 11 10:21:55 kernel: [ 2131] 0 2131 469582 7598 81 0 0 agentWorker
Apr 11 10:21:55 kernel: [ 2291] 0 2291 6754 2299 18 0 0 basereport
Apr 11 10:21:55 kernel: [ 3256] 0 3256 28963 3932 21 0 0 node_exporter
Apr 11 10:21:55 kernel: [ 6759] 0 6759 6143 374 16 0 0 systemd-logind
Apr 11 10:21:55 kernel: [24650] 993 24650 1745496 85641 349 0 0 java
Apr 11 10:21:55 kernel: [112489] 2048 112489 28805 2815 18 0 0 mysqld_exporter
Apr 11 10:21:55 kernel: [63602] 992 63602 2937758 600771 1652 0 0 java
Apr 11 10:21:55 kernel: [65678] 0 65678 177274 1212 19 0 0 canal-cfg
Apr 11 10:21:55 kernel: [91161] 0 91161 3488135 455496 2405 0 0 vmstorage-prod
Apr 11 10:21:55 kernel: [30303] 0 30303 311149 98508 268 0 0 vmselect-prod
Apr 11 10:21:55 kernel: [45887] 991 45887 579175 94953 427 0 0 mongod
Apr 11 10:21:55 kernel: [126529] 2049 126529 182474 607 24 0 0 gogstash
Apr 11 10:21:55 kernel: [126535] 2049 126535 200459 3544 77 0 0 gogstash
Apr 11 10:21:55 kernel: [126536] 2049 126536 268933 17364 208 0 0 gogstash
Apr 11 10:21:55 kernel: [126540] 2049 126540 252132 8255 190 0 0 gogstash
Apr 11 10:21:55 kernel: [19058] 27 19058 1243415 132226 407 0 0 mysqld
Apr 11 10:21:55 kernel: [101875] 0 101875 180408 6846 55 0 0 vminsert-prod
Apr 11 10:21:55 kernel: [102791] 994 102791 44639 1484 31 0 0 redis-server
Apr 11 10:21:55 kernel: [77582] 0 77582 10085 581 23 0 0 nginx
Apr 11 10:21:55 kernel: [77583] 0 77583 10085 603 23 0 0 nginx
Apr 11 10:21:55 kernel: [77584] 0 77584 10085 605 23 0 0 nginx
Apr 11 10:21:55 kernel: [77585] 0 77585 10085 641 23 0 0 nginx
Apr 11 10:21:55 kernel: [77586] 0 77586 10085 616 23 0 0 nginx
Apr 11 10:21:55 kernel: [77587] 0 77587 10085 593 23 0 0 nginx
Apr 11 10:21:55 kernel: [77588] 0 77588 10085 612 23 0 0 nginx
Apr 11 10:21:55 kernel: [77589] 0 77589 10085 626 23 0 0 nginx
Apr 11 10:21:55 kernel: [16335] 0 16335 2272518 419845 991 0 0 java
Apr 11 10:21:55 kernel: [21482] 0 21482 181686 4528 39 0 0 fc-dqueue
Apr 11 10:21:55 kernel: [21511] 0 21511 183845 6007 43 0 0 fc-engine
Apr 11 10:21:55 kernel: [21534] 0 21534 184950 6103 47 0 0 fc-alert
Apr 11 10:21:55 kernel: [28651] 0 28651 202733 15175 84 0 0 fc-scrape
Apr 11 10:21:55 kernel: [14336] 0 14336 1001051 700302 1646 0 0 fc-insight
Apr 11 10:21:55 kernel: [16802] 996 16802 13376462 1012801 5283 0 0 java
Apr 11 10:21:55 kernel: [126835] 0 126835 45054 490 45 0 0 crond
Apr 11 10:21:55 kernel: [126838] 0 126838 45054 493 45 0 0 crond
Apr 11 10:21:55 kernel: [126839] 0 126839 45054 493 45 0 0 crond
Apr 11 10:21:55 kernel: Out of memory: Kill process 16802 (java) score 248 or sacrifice child
Apr 11 10:21:55 kernel: Killed process 16802 (java) total-vm:53505848kB, anon-rss:3793240kB, file-rss:257964kB, shmem-rss:0kB
对比5s内进程的内存,看看是否有内存突增的进程. pid 为
641,835,858,862,869,2131,3256,6759,91161,45887,19058,102791,77582,77583,77584,77585,77586,77587,77588,77589,16335,21482,21511,21534,28651,14336,16802,17051,126835,126838
的进程内存有增加, 但是增加幅度最大的是pid为3256的node_exporter ,增幅为100kb
12:00:44 elasticsearch 申请4KB内存触发oom,
Apr 11 12:00:41 kernel: elasticsearch[n invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
Apr 11 12:00:43 kernel: elasticsearch[n cpuset=/ mems_allowed=0
Apr 11 12:00:44 kernel: CPU: 3 PID: 1649 Comm: elasticsearch[n Not tainted 3.10.0-693.21.1.el7.x86_64 #1
Apr 11 12:00:44 kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
Apr 11 12:00:44 kernel: Call Trace:
Apr 11 12:00:44 kernel: [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
Apr 11 12:00:44 kernel: [<ffffffff816a9b90>] dump_header+0x90/0x229
Apr 11 12:00:44 kernel: [<ffffffff810ecec2>] ? ktime_get_ts64+0x52/0xf0
Apr 11 12:00:44 kernel: [<ffffffff8114140f>] ? delayacct_end+0x8f/0xb0
Apr 11 12:00:44 kernel: [<ffffffff8118a884>] oom_kill_process+0x254/0x3d0
Apr 11 12:00:44 kernel: [<ffffffff8118a32d>] ? oom_unkillable_task+0xcd/0x120
Apr 11 12:00:44 kernel: [<ffffffff8118a3d6>] ? find_lock_task_mm+0x56/0xc0
Apr 11 12:00:44 kernel: [<ffffffff8118b0c6>] out_of_memory+0x4b6/0x4f0
Apr 11 12:00:44 kernel: [<ffffffff816aa694>] __alloc_pages_slowpath+0x5d6/0x724
Apr 11 12:00:44 kernel: [<ffffffff811912a5>] __alloc_pages_nodemask+0x405/0x420
Apr 11 12:00:44 kernel: [<ffffffff811d5a38>] alloc_pages_current+0x98/0x110
Apr 11 12:00:44 kernel: [<ffffffff81576a16>] skb_page_frag_refill+0x96/0xc0
Apr 11 12:00:44 kernel: [<ffffffff815777d3>] sk_page_frag_refill+0x23/0xd0
Apr 11 12:00:44 kernel: [<ffffffff815e4f83>] tcp_sendmsg+0x263/0xc30
Apr 11 12:00:44 kernel: [<ffffffff816109b9>] inet_sendmsg+0x69/0xb0
Apr 11 12:00:44 kernel: [<ffffffff815725ad>] sock_aio_write+0x15d/0x180
Apr 11 12:00:44 kernel: [<ffffffff81204ba3>] do_sync_write+0x93/0xe0
Apr 11 12:00:44 kernel: [<ffffffff81205785>] vfs_write+0x1c5/0x1f0
Apr 11 12:00:44 kernel: [<ffffffff8120649f>] SyS_write+0x7f/0xe0
Apr 11 12:00:44 kernel: [<ffffffff816c0991>] tracesys+0x9d/0xc3
Apr 11 12:00:44 kernel: Mem-Info:
Apr 11 12:00:44 kernel: active_anon:2922193 inactive_anon:58912 isolated_anon:0#012 active_file:164 inactive_file:27 isolated_file:72#012 unevictable:936747 dirty:44 writeback:29 unstable:0#012 slab_reclaimable:54958 slab_unreclaimable:22419#012 mapped:70287 shmem:206057 pagetables:13716 bounce:0#012 free:32270 free_pcp:24 free_cma:0
Apr 11 12:00:44 kernel: Node 0 DMA free:15840kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Apr 11 12:00:44 kernel: lowmem_reserve[]: 0 2977 16015 16015
Apr 11 12:00:44 kernel: Node 0 DMA32 free:58552kB min:12544kB low:15680kB high:18816kB active_anon:1940780kB inactive_anon:41804kB active_file:0kB inactive_file:24kB unevictable:839692kB isolated(anon):0kB isolated(file):0kB present:3129152kB managed:3048960kB mlocked:839692kB dirty:0kB writeback:0kB mapped:38592kB shmem:147504kB slab_reclaimable:62024kB slab_unreclaimable:39432kB kernel_stack:12128kB pagetables:9084kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:91 all_unreclaimable? yes
Apr 11 12:00:44 kernel: lowmem_reserve[]: 0 0 13037 13037
Apr 11 12:00:44 kernel: Node 0 Normal free:54688kB min:54968kB low:68708kB high:82452kB active_anon:9747992kB inactive_anon:193844kB active_file:656kB inactive_file:84kB unevictable:2907296kB isolated(anon):0kB isolated(file):288kB present:13631488kB managed:13350504kB mlocked:2907296kB dirty:176kB writeback:116kB mapped:242556kB shmem:676724kB slab_reclaimable:157808kB slab_unreclaimable:50212kB kernel_stack:10704kB pagetables:45780kB unstable:0kB bounce:0kB free_pcp:96kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1160 all_unreclaimable? yes
Apr 11 12:00:44 kernel: lowmem_reserve[]: 0 0 0 0
Apr 11 12:00:44 kernel: Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15840kB
Apr 11 12:00:44 kernel: Node 0 DMA32: 2340*4kB (UEM) 987*8kB (UEM) 620*16kB (UEM) 221*32kB (UEM) 218*64kB (UEM) 81*128kB (UEM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 58824kB
Apr 11 12:00:44 kernel: Node 0 Normal: 13832*4kB (UEM) 25*8kB (UEM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 55528kB
Apr 11 12:00:44 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Apr 11 12:00:44 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Apr 11 12:00:44 kernel: 271887 total pagecache pages
Apr 11 12:00:44 kernel: 0 pages in swap cache
Apr 11 12:00:44 kernel: Swap cache stats: add 0, delete 0, find 0/0
Apr 11 12:00:44 kernel: Free swap = 0kB
Apr 11 12:00:44 kernel: Total swap = 0kB
Apr 11 12:00:44 kernel: 4194157 pages RAM
Apr 11 12:00:44 kernel: 0 pages HighMem/MovableOnly
Apr 11 12:00:44 kernel: 90315 pages reserved
1 es申请4kB连续内存(order=0),此时normal的free 内存是54968kB < min 54688kB,normal申请失败;
2 转向DMA32 申请,high(18816kB)+ lowmem_reserve[2](13037*4kB)=70964kB > free(58552kB)不允许申请, DMA32申请失败;
3 转向DMA申请,high(96kB)+ lowmem_reserve[2](16015*4kB)=64156kB > free(15840kB) 不允许申请,DMA申请失败
Apr 11 12:00:44 kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Apr 11 12:00:44 kernel: [ 641] 0 641 15650 6231 35 0 0 systemd-journal
Apr 11 12:00:44 kernel: [ 667] 0 667 66578 635 33 0 0 lvmetad
Apr 11 12:00:44 kernel: [ 669] 0 669 11231 477 23 0 -1000 systemd-udevd
Apr 11 12:00:44 kernel: [ 835] 0 835 13863 142 26 0 -1000 auditd
Apr 11 12:00:44 kernel: [ 858] 81 858 13504 271 30 0 -900 dbus-daemon
Apr 11 12:00:44 kernel: [ 859] 0 859 117945 1100 83 0 0 NetworkManager
Apr 11 12:00:44 kernel: [ 862] 999 862 159624 2457 60 0 0 polkitd
Apr 11 12:00:44 kernel: [ 866] 0 866 24902 401 40 0 0 VGAuthService
Apr 11 12:00:44 kernel: [ 867] 0 867 76270 467 57 0 0 vmtoolsd
Apr 11 12:00:44 kernel: [ 869] 0 869 166217 4566 138 0 0 rsyslogd
Apr 11 12:00:44 kernel: [ 873] 0 873 5406 143 15 0 0 irqbalance
Apr 11 12:00:44 kernel: [ 876] 0 876 31567 258 20 0 0 crond
Apr 11 12:00:44 kernel: [ 885] 998 885 5619 147 16 0 0 chronyd
Apr 11 12:00:44 kernel: [ 900] 0 900 27511 31 9 0 0 agetty
Apr 11 12:00:44 kernel: [ 1156] 0 1156 26499 244 54 0 -1000 sshd
Apr 11 12:00:44 kernel: [ 1162] 0 1162 141806 2682 96 0 0 tuned
Apr 11 12:00:44 kernel: [ 1368] 0 1368 9913 468 23 0 0 nginx
Apr 11 12:00:44 kernel: [ 2127] 0 2127 8873 79 17 0 0 gseMaster
Apr 11 12:00:44 kernel: [ 2131] 0 2131 469581 7602 81 0 0 agentWorker
Apr 11 12:00:44 kernel: [ 2291] 0 2291 6754 1923 18 0 0 basereport
Apr 11 12:00:44 kernel: [ 3256] 0 3256 28963 3399 21 0 0 node_exporter
Apr 11 12:00:44 kernel: [ 6759] 0 6759 6143 381 16 0 0 systemd-logind
Apr 11 12:00:44 kernel: [24650] 993 24650 1745496 163672 501 0 0 java
Apr 11 12:00:44 kernel: [112489] 2048 112489 28805 2815 18 0 0 mysqld_exporter
Apr 11 12:00:44 kernel: [63602] 992 63602 2969796 616252 1707 0 0 java
Apr 11 12:00:44 kernel: [65678] 0 65678 177274 1290 19 0 0 canal-cfg
Apr 11 12:00:44 kernel: [91161] 0 91161 3514031 462173 2450 0 0 vmstorage-prod
Apr 11 12:00:44 kernel: [30303] 0 30303 329175 101500 298 0 0 vmselect-prod
Apr 11 12:00:44 kernel: [45887] 991 45887 581759 95156 436 0 0 mongod
Apr 11 12:00:44 kernel: [126529] 2049 126529 182474 607 24 0 0 gogstash
Apr 11 12:00:44 kernel: [126535] 2049 126535 200459 3535 77 0 0 gogstash
Apr 11 12:00:44 kernel: [126536] 2049 126536 268933 18937 208 0 0 gogstash
Apr 11 12:00:44 kernel: [126540] 2049 126540 252132 8203 190 0 0 gogstash
Apr 11 12:00:44 kernel: [19058] 27 19058 1243415 132322 407 0 0 mysqld
Apr 11 12:00:44 kernel: [101875] 0 101875 180408 6172 55 0 0 vminsert-prod
Apr 11 12:00:44 kernel: [102791] 994 102791 47967 1668 35 0 0 redis-server
Apr 11 12:00:44 kernel: [77582] 0 77582 10085 617 23 0 0 nginx
Apr 11 12:00:44 kernel: [77583] 0 77583 10085 618 23 0 0 nginx
Apr 11 12:00:44 kernel: [77584] 0 77584 10085 625 23 0 0 nginx
Apr 11 12:00:44 kernel: [77585] 0 77585 10085 645 23 0 0 nginx
Apr 11 12:00:44 kernel: [77586] 0 77586 10085 618 23 0 0 nginx
Apr 11 12:00:44 kernel: [77587] 0 77587 10085 590 23 0 0 nginx
Apr 11 12:00:44 kernel: [77588] 0 77588 10085 615 23 0 0 nginx
Apr 11 12:00:44 kernel: [77589] 0 77589 10085 613 23 0 0 nginx
Apr 11 12:00:44 kernel: [21482] 0 21482 181686 4999 39 0 0 fc-dqueue
Apr 11 12:00:44 kernel: [21511] 0 21511 183845 5901 43 0 0 fc-engine
Apr 11 12:00:44 kernel: [21534] 0 21534 184950 6506 50 0 0 fc-alert
Apr 11 12:00:44 kernel: [28651] 0 28651 202813 15435 88 0 0 fc-scrape
Apr 11 12:00:44 kernel: [14336] 0 14336 1170181 752852 1890 0 0 fc-insight
Apr 11 12:00:44 kernel: [129174] 0 129174 36424 609 75 0 0 sshd
Apr 11 12:00:44 kernel: [129176] 2048 129176 36424 384 72 0 0 sshd
Apr 11 12:00:44 kernel: [ 1104] 2048 1104 29008 470 13 0 0 bash
Apr 11 12:00:44 kernel: [ 1129] 0 1129 49924 441 52 0 0 sudo
Apr 11 12:00:44 kernel: [ 1130] 0 1130 48430 450 53 0 0 su
Apr 11 12:00:44 kernel: [ 1131] 0 1131 29176 671 14 0 0 bash
Apr 11 12:00:44 kernel: [ 1255] 0 1255 2272263 288765 734 0 0 java
Apr 11 12:00:44 kernel: [ 1291] 996 1291 4468402 988823 2881 0 0 java
Apr 11 12:00:44 kernel: [ 1580] 996 1580 13618 279 20 0 0 controller
Apr 11 12:00:44 kernel: [ 2000] 2048 2000 29008 475 13 0 0 bash
Apr 11 12:00:44 kernel: [ 2029] 2048 2029 40514 692 35 0 0 mysql
Apr 11 12:00:44 kernel: [ 2154] 2048 2154 29036 525 14 0 0 bash
与10:21:55的进程内存使用对比 增幅最大的是pid为24650的zookeeper,增幅为78,031KB(近两个小时,不算突增)。
使用率一直处于98% 波动 , 去掉cache+buffer ,使用率在76%-87%之间波动。如果流量增加或者添加指标,这里是存在容量风险的.
问题结论
由于系统free内存不足,触发OOM.
解决方案
临时方案
1 关闭thp ,减少浪费
echo never > /sys/kernel/mm/transparent_hugepage/defrag
echo never > /sys/kernel/mm/transparent_hugepage/enabled
2 降低es oom被kill 可能性
OOMScoreAdjust=-500
3 提高系统保留内存,提高系统内存碎片整理频率
sysctl -w vm.extfrag_threshold=200
sysctl -w vm.min_free_kbytes=135168 #134MB <-67MB
按照红帽文档建议,其中min_free_kbytes 设置不超过系统5% ,保险起见先调整为原来的2倍
调整后效果
cat /proc/zoneinfo
free高出min有近7k pages了
去掉buffer cache后,使用率在77%左右
长期方案
- 申请加内存 业务添加内存后,去掉buffer+cache 使用率到了39% ,泪流满面
- 添加组件(进程)级别的监控