There are many memory-management tunables in Linux; this article analyzes the lowmem_reserve_ratio parameter in detail.

System environment

  • Distribution: CentOS 7.5
  • Kernel version: 3.10.0-862.14.4.el7.x86_64
  • Processor: 40 CPUs (Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz)
  • Memory: 128 GB, two NUMA nodes

Official documentation

The official description of lowmem_reserve_ratio, from the kernel's Documentation/sysctl/vm.txt, reads:

For some specialised workloads on highmem machines it is dangerous for
the kernel to allow process memory to be allocated from the "lowmem"
zone.  This is because that memory could then be pinned via the mlock()
system call, or by unavailability of swapspace.

And on large highmem machines this lack of reclaimable lowmem memory
can be fatal.

So the Linux page allocator has a mechanism which prevents allocations
which _could_ use highmem from using too much lowmem.  This means that
a certain amount of lowmem is defended from the possibility of being
captured into pinned user memory.

(The same argument applies to the old 16 megabyte ISA DMA region.  This
mechanism will also defend that region from allocations which could use
highmem or lowmem).

The `lowmem_reserve_ratio' tunable determines how aggressive the kernel is
in defending these lower zones.

If you have a machine which uses highmem or ISA DMA and your
applications are using mlock(), or if you are running with no swap then
you probably should change the lowmem_reserve_ratio setting.

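In short, the mechanism keeps processes from using too much memory in the lower zones. It works as follows:

  • Every zone has a protection array. During allocation, the kernel adds the relevant protection entry to the zone's watermark (for example watermark[high]) and uses the sum to decide whether the allocation may proceed.
  • How each zone's protection array is computed depends on lowmem_reserve_ratio.

Next, let's look at how each zone's protection array is computed.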

How the zone protection array is computed

lowmem_reserve_ratio is an array whose value can be read from /proc/sys/vm/lowmem_reserve_ratio:

$ cat /proc/sys/vm/lowmem_reserve_ratio 
256     256     32

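The entries belong to the zones in order from the lowest upward (DMA, DMA32, Normal); there is no entry for the highest zone, because it is never the lower zone in any comparison. The default values are:

  • 256: if the zone is DMA or DMA32
  • 32: other zones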
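To make the index mapping concrete, here is a minimal sketch that parses the file contents and pairs each ratio with the zone it applies to. It assumes an x86_64 kernel with ZONE_DMA and ZONE_DMA32 and without HIGHMEM, as on the machine above; the helper names parse_ratio, read_ratio and ratio_by_zone are hypothetical.

```python
# Lowest zone first. Assumption: x86_64 kernel built with ZONE_DMA and
# ZONE_DMA32 and without HIGHMEM, as on the CentOS 7 machine above.
ZONE_NAMES = ("DMA", "DMA32", "Normal", "Movable")


def parse_ratio(text):
    """Turn the contents of /proc/sys/vm/lowmem_reserve_ratio into ints."""
    return [int(field) for field in text.split()]


def read_ratio(path="/proc/sys/vm/lowmem_reserve_ratio"):
    """Read the live values (only works on a Linux host)."""
    with open(path) as f:
        return parse_ratio(f.read())


def ratio_by_zone(ratios, zone_names=ZONE_NAMES):
    """Map each ratio to the lower zone it is used for.

    There is one ratio fewer than there are zones: entry i belongs to
    zone i, and the top zone never needs one.
    """
    if len(ratios) != len(zone_names) - 1:
        raise ValueError(
            f"expected {len(zone_names) - 1} ratios, got {len(ratios)}")
    return dict(zip(zone_names, ratios))


# Example with the values printed by cat above.
print(ratio_by_zone(parse_ratio("256     256     32")))
```

With the values shown above this prints {'DMA': 256, 'DMA32': 256, 'Normal': 32}; on a live system, read_ratio() can replace the literal string.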

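The kernel uses the lowmem_reserve_ratio array to compute how many pages each zone reserves. The result is also an array, which can be seen in /proc/zoneinfo (the sample below is the one from the kernel documentation, not from the test machine):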

Node 0, zone      DMA
  pages free     1355
        min      3
        low      3
        high     4
	:
	:
    numa_other   0
        protection: (0, 2004, 2004, 2004)
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  pagesets
    cpu: 0 pcp: 0
        :
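A minimal sketch for pulling these arrays out of /proc/zoneinfo text. It assumes the layout shown above (a "Node N, zone NAME" header followed later by a "protection: (...)" line for that zone); parse_protection is a hypothetical helper, not a kernel or library API.

```python
import re

# Assumed layout: a "Node N, zone NAME" header line, later followed by a
# "protection: (a, b, c, d)" line for that zone.
ZONE_HEADER = re.compile(r"^Node\s+(\d+),\s+zone\s+(\S+)")
PROTECTION = re.compile(r"^\s*protection:\s*\(([^)]*)\)")


def parse_protection(zoneinfo_text):
    """Return {(node, zone): [protection values]} from /proc/zoneinfo text."""
    result = {}
    current = None
    for line in zoneinfo_text.splitlines():
        header = ZONE_HEADER.match(line)
        if header:
            current = (int(header.group(1)), header.group(2))
            continue
        prot = PROTECTION.match(line)
        if prot and current is not None:
            result[current] = [int(v) for v in prot.group(1).split(",")]
    return result


sample = """\
Node 0, zone      DMA
  pages free     1355
        min      3
        low      3
        high     4
    numa_other   0
        protection: (0, 2004, 2004, 2004)
  pagesets
"""
print(parse_protection(sample))
```

For the sample this prints {(0, 'DMA'): [0, 2004, 2004, 2004]}. Keying by (node, zone) keeps zones with the same name on different NUMA nodes apart.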

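During allocation, these reserve values are added to the watermark to decide whether the request can be satisfied or whether free memory is too low and reclaim needs to start.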

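For example, suppose a request for Normal-zone pages (index = 2) tries to allocate from the DMA zone above, and the watermark in use is watermark[high]. The zone has pages_free = 1355, while watermark + protection[2] = 4 + 2004 = 2008 > pages_free, so the kernel considers free memory too low and does not allocate from this zone. If the request is itself a DMA request (index = 0), protection[0] = 0 is used instead, 1355 > 4 + 0, and the allocation is satisfied.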
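The check in this example can be sketched as below. This is a simplified, order-0 model under stated assumptions: zone_watermark_ok is a hypothetical name, and the kernel's real __zone_watermark_ok() also adjusts the watermark for allocation flags and checks higher-order free lists, which are omitted here.

```python
def zone_watermark_ok(free_pages, watermark, protection, classzone_idx):
    """Simplified order-0 version of the kernel's watermark test.

    classzone_idx is the highest zone the request may use (0 = DMA,
    2 = Normal on this machine). The zone is refused when free pages do
    not exceed watermark + protection[classzone_idx]. Flag-based
    watermark adjustments and higher-order checks are not modelled.
    """
    return free_pages > watermark + protection[classzone_idx]


dma_protection = [0, 2004, 2004, 2004]  # DMA zone sample from zoneinfo above
print(zone_watermark_ok(1355, 4, dma_protection, 2))  # Normal request
print(zone_watermark_ok(1355, 4, dma_protection, 0))  # DMA request
```

The first call returns False (1355 <= 2008) and the second True (1355 > 4), matching the two cases described above.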

zone[i]'s protection[j] is computed as follows:

(i < j):
  zone[i]->protection[j]
  = (total sums of managed_pages from zone[i+1] to zone[j] on the node)
    / lowmem_reserve_ratio[i];
(i = j):
   (should not be protected) = 0;
(i > j):
   (not necessary, but looks 0)

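As the rule shows, the reserve is the reciprocal of the ratio: 256 means 1/256, i.e. about 0.39% of the total managed pages of the higher zones on the node. To reserve more pages, use a smaller value; the minimum is 1 (1/1 -> 100%).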
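The rule translates directly into a short function. This is a sketch of the formula above, not the kernel's code; compute_protection is a hypothetical name, and the clamp of ratios below 1 to 1 mirrors the minimum just mentioned. The inputs are node 0's managed_pages from the worked example that follows.

```python
def compute_protection(managed_pages, ratios):
    """Compute zone[i]->protection[j] for all zones of one node.

    managed_pages: managed page count per zone, lowest zone first.
    ratios: lowmem_reserve_ratio, one entry per zone except the top one.
    Entries with i >= j are left at 0, as the rule specifies.
    """
    n = len(managed_pages)
    protection = [[0] * n for _ in range(n)]
    for i in range(n - 1):
        ratio = max(ratios[i], 1)  # ratios below 1 are treated as 1
        for j in range(i + 1, n):
            # managed pages of zones i+1 .. j on the same node
            higher = sum(managed_pages[i + 1:j + 1])
            protection[i][j] = higher // ratio  # integer division, as in C
    return protection


# Node 0 of the test machine: DMA, DMA32, Normal, Movable.
for row in compute_protection([3976, 354201, 15991024, 0], [256, 256, 32]):
    print(row)
```

This prints [0, 1383, 63848, 63848], [0, 0, 62464, 62464] and two rows of zeros, the same values the crash output in this article reports for node 0. Note that a zone's own managed_pages never enters its own protection array; only the higher zones count.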

Worked example

Applying this rule to my system gives the following protection arrays for each zone:

node  zone     managed_pages  protection[0]  protection[1]  protection[2]  protection[3]
0     DMA      3976           0              1383           63848          63848
0     DMA32    354201         -              0              62464          62464
0     NORMAL   15991024       -              -              0              0
0     MOVABLE  0              -              -              -              0
1     DMA      0              0              0              64508          64508
1     DMA32    0              -              0              64508          64508
1     NORMAL   16514229       -              -              0              0
1     MOVABLE  0              -              -              -              0

("-" marks entries with i > j, which the rule does not need; the kernel leaves them at 0.)

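The results can be verified with /proc/zoneinfo and the crash utility (the protection array in /proc/zoneinfo is the zone's lowmem_reserve field). On this kernel /proc/zoneinfo lists only populated zones, so grep returns just four lines (node 0 DMA, DMA32 and Normal, plus node 1 Normal), while crash shows the lowmem_reserve array of all eight zones: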

$ cat /proc/zoneinfo | grep protection
        protection: (0, 1383, 63848, 63848)
        protection: (0, 0, 62464, 62464)
        protection: (0, 0, 0, 0)
        protection: (0, 0, 0, 0)
crash> struct zone.lowmem_reserve  ffff88107ffd9000
  lowmem_reserve = {0, 1383, 63848, 63848}
crash> struct zone.lowmem_reserve  ffff88107ffd9800
  lowmem_reserve = {0, 0, 62464, 62464}
crash> struct zone.lowmem_reserve  ffff88107ffda000
  lowmem_reserve = {0, 0, 0, 0}
crash> struct zone.lowmem_reserve  ffff88107ffda800
  lowmem_reserve = {0, 0, 0, 0}
crash> struct zone.lowmem_reserve  ffff88207ffd6000
  lowmem_reserve = {0, 0, 64508, 64508}
crash> struct zone.lowmem_reserve  ffff88207ffd6800
  lowmem_reserve = {0, 0, 64508, 64508}
crash> struct zone.lowmem_reserve  ffff88207ffd7000
  lowmem_reserve = {0, 0, 0, 0}
crash> struct zone.lowmem_reserve  ffff88207ffd7800
  lowmem_reserve = {0, 0, 0, 0}

Impact of lowmem_reserve_ratio

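The analysis shows that lowmem_reserve_ratio determines how much memory the system reserves, and that the reserve is the reciprocal of the ratio. So if you want the system to reserve somewhat more memory, lower lowmem_reserve_ratio accordingly.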
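To put numbers on this, here is a small sketch, assuming 4 KiB pages, that converts one protection entry into MiB for a few ratio values. reserve_mib is a hypothetical helper; the input is node 0's Normal managed_pages from the table above, which is what the DMA32 zone's protection[2] is derived from.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on x86_64


def reserve_mib(higher_zone_pages, ratio, page_size=PAGE_SIZE):
    """Size in MiB of one protection entry: higher-zone pages / ratio."""
    pages = higher_zone_pages // max(ratio, 1)
    return pages * page_size / (1024 * 1024)


# DMA32's protection against Normal on node 0 (Normal managed_pages above).
for ratio in (256, 128, 64):
    print(ratio, round(reserve_mib(15991024, ratio), 1))
```

Halving the ratio roughly doubles the reserve: about 244, 488 and 976 MiB for 256, 128 and 64 here.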

In practice, lowmem_reserve_ratio rarely needs to be adjusted.

References