Exploring the Linux Kernel, Memory Management (Part 2): Memory Organization in Linux: Nodes, Zones, and Page Frames

Date: June 20, 2015

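This article is based mainly on Professional Linux Kernel Architecture (Section 3.2) and the Linux 3.18.3 kernel source.

Summary: This article describes the data structures at the heart of memory management: the node (pg_data_t), the zone (struct zone), and the page frame, i.e. the physical page (struct page), together with the basic concepts behind each structure.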

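1. Overview

Memory is divided into nodes. Each node is associated with one processor in the system and is represented in the kernel by pg_data_t.

Each node is further divided into zones, such as the DMA zone, the normal zone, and the high-memory zone.

The kernel enumerates the zone types as follows: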

enum zone_type {
#ifdef CONFIG_ZONE_DMA                                                                              
        ZONE_DMA,                                                                                   
#endif
#ifdef CONFIG_ZONE_DMA32                                                                            
        ZONE_DMA32,                                                                                 
#endif
        ZONE_NORMAL,                                                                                
#ifdef CONFIG_HIGHMEM                                                                               
        ZONE_HIGHMEM,
#endif
        ZONE_MOVABLE,
        __MAX_NR_ZONES
};

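ZONE_DMA: memory suitable for DMA. The size of this zone depends on the processor type; on IA-32 it is typically 16MB.

ZONE_DMA32: memory that is addressable with 32-bit addresses and suitable for DMA. The two DMA zones differ only on 64-bit systems; on 32-bit machines this zone is empty.

ZONE_NORMAL: the normal zone. It is guaranteed to exist on every architecture, but there is no guarantee that its address range is backed by actual physical memory.

ZONE_HIGHMEM: high memory, i.e. physical memory beyond the range that the kernel segment can map directly (for example above 896MB on IA-32, where the kernel address space cannot map all physical memory). 64-bit systems do not need high memory.

ZONE_MOVABLE: a pseudo zone used to prevent fragmentation of physical memory.

Each zone is associated with an array that organizes the physical pages (page frames) belonging to it. For every page frame the kernel allocates a struct page instance together with the management data it needs.

All memory nodes are kept on a list that the kernel can traverse. In older kernels this was a singly linked list; in 3.18.3 the nodes are walked with for_each_online_pgdat().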


2. Data Structures

(1) Nodes and node state

Node management

pg_data_t represents a node. It is defined as follows:

include/linux/mmzone.h 
typedef struct pglist_data {
        struct zone node_zones[MAX_NR_ZONES];
        struct zonelist node_zonelists[MAX_ZONELISTS];
        int nr_zones;
        struct page *node_mem_map;
        struct page_cgroup *node_page_cgroup;                                                       
        struct bootmem_data *bdata;                                                                 
        spinlock_t node_size_lock;                                                                  
        unsigned long node_start_pfn;                                                               
        unsigned long node_present_pages; /* total number of physical pages */                      
        unsigned long node_spanned_pages; /* total size of physical page                            
                                             range, including holes */                              
        int node_id;                                                                                
        wait_queue_head_t kswapd_wait;                                                              
        wait_queue_head_t pfmemalloc_wait;                                                          
        struct task_struct *kswapd;     /* Protected by                                             
                                           mem_hotplug_begin/end() */                               
        int kswapd_max_order;
        enum zone_type classzone_idx;                                                               
        spinlock_t numabalancing_migrate_lock;
        unsigned long numabalancing_migrate_next_window;
        unsigned long numabalancing_migrate_nr_pages;
} pg_data_t;

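node_zones is an array holding the data structures of the zones in the node.

node_zonelists specifies the fallback nodes and their zones, so that memory can be allocated from a fallback node when the current node has no free space.

nr_zones holds the number of different zones in the node.

node_mem_map points to an array of page instances that describes all physical pages of the node, covering the pages of every zone in the node.

bdata points to an instance of the boot memory allocator's data structure. The kernel needs memory during boot, before the memory management subsystem has been initialized, and uses the boot memory allocator for that.

node_start_pfn is the logical number of the first page frame of the NUMA node. Page frames are numbered consecutively across all nodes, so every page frame number is globally unique, not just unique within its node. On UMA systems this value is always 0.

node_present_pages is the number of page frames actually present in the node.

node_spanned_pages is the length of the node in page frames, including holes.

node_id is the global node ID.

pgdat_next, in older kernels, linked a node to the next one: all nodes formed a singly linked list terminated by a NULL pointer. The 3.18.3 definition shown above no longer has this member.

kswapd_wait is the wait queue of the swap daemon, used when page frames are swapped out of the node. kswapd points to the task_struct of the swap daemon responsible for the node. kswapd_max_order is used by the page-swapping subsystem to define the size of the area to be freed.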

Node state management

include/linux/nodemask.h:
/*
 * Bitmasks that are kept for all the nodes.
 */
enum node_states {
        N_POSSIBLE,             /* The node could become online at some point */
        N_ONLINE,               /* The node is online */
        N_NORMAL_MEMORY,        /* The node has regular memory */
#ifdef CONFIG_HIGHMEM
        N_HIGH_MEMORY,          /* The node has regular or high memory */
#else
        N_HIGH_MEMORY = N_NORMAL_MEMORY,
#endif
#ifdef CONFIG_MOVABLE_NODE
        N_MEMORY,               /* The node has memory(regular, high, movable) */
#else
        N_MEMORY = N_HIGH_MEMORY,
#endif
        N_CPU,          /* The node has one or more cpus */
        NR_NODE_STATES
};

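The states N_POSSIBLE, N_ONLINE, and N_CPU are used for CPU and memory hotplug. The flags essential to memory management are N_HIGH_MEMORY and N_NORMAL_MEMORY. N_HIGH_MEMORY is set if the node has normal or high memory; N_NORMAL_MEMORY is set only if the node has regular (non-high) memory.

static inline void node_set_state(int node, enum node_states state); sets the bit for a given node in the bitmask of the given state.

static inline void node_clear_state(int node, enum node_states state); clears the bit for a given node in the bitmask of the given state.

#define for_each_node_state(__node, __state) for_each_node_mask((__node), node_states[__state]) iterates over all nodes that are in a particular state.

#define for_each_online_node(node) for_each_node_state(node, N_ONLINE) iterates over all online nodes.

If the kernel is built with support for only a single node, these operations are no-ops.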

(2) Zones

The kernel uses struct zone to describe a zone. Its definition is as follows:

enum zone_watermarks {
        WMARK_MIN,
        WMARK_LOW,
        WMARK_HIGH,
        NR_WMARK
};

#define min_wmark_pages(z) (z->watermark[WMARK_MIN])

#define low_wmark_pages(z) (z->watermark[WMARK_LOW])
#define high_wmark_pages(z) (z->watermark[WMARK_HIGH])

struct zone {
        unsigned long watermark[NR_WMARK];
        long lowmem_reserve[MAX_NR_ZONES];
        int node;
        unsigned int inactive_ratio;
        struct pglist_data      *zone_pgdat;
        struct per_cpu_pageset __percpu *pageset;
        unsigned long           dirty_balance_reserve;
        unsigned long           *pageblock_flags;
        unsigned long           min_unmapped_pages;
        unsigned long           min_slab_pages;
        unsigned long           zone_start_pfn;
        unsigned long           managed_pages;
        unsigned long           spanned_pages;
        unsigned long           present_pages;
        const char              *name;
        int                     nr_migrate_reserve_block;
        unsigned long           nr_isolate_pageblock;
        seqlock_t               span_seqlock;
        wait_queue_head_t       *wait_table;
        unsigned long           wait_table_hash_nr_entries;
        unsigned long           wait_table_bits;
        ZONE_PADDING(_pad1_)
        spinlock_t              lock;
        struct free_area        free_area[MAX_ORDER];
        unsigned long           flags;
        ZONE_PADDING(_pad2_)
        spinlock_t              lru_lock;
        struct lruvec           lruvec;
        atomic_long_t           inactive_age;
        unsigned long           percpu_drift_mark;
        unsigned long           compact_cached_free_pfn;
        unsigned long           compact_cached_migrate_pfn[2];
        unsigned int            compact_considered;
        unsigned int            compact_defer_shift;
        int                     compact_order_failed;
        bool                    compact_blockskip_flush;
        ZONE_PADDING(_pad3_)
        atomic_long_t           vm_stat[NR_VM_ZONE_STAT_ITEMS];
} ____cacheline_internodealigned_in_smp;

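The structure is split into several parts by ZONE_PADDING because struct zone is accessed very frequently. On multiprocessor systems, different CPUs commonly try to access members of the structure at the same time, so locks are used to keep them from interfering with one another and to avoid errors and inconsistencies. Because the kernel accesses the structure so often, the two spinlocks zone->lock and zone->lru_lock are acquired very frequently.

Data is processed faster when it sits in the CPU cache. The cache is divided into lines, and each line covers a different region of memory. The kernel uses the ZONE_PADDING macro to insert padding fields into the structure so that each spinlock lies in its own cache line. The ____cacheline_internodealigned_in_smp attribute is used to obtain optimal cache alignment.

The last two parts of the structure are also separated from each other by padding. The main purpose is to keep the data of each part within one cache line for fast access.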

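Main members of the first part:

watermark[NR_WMARK] holds the watermarks used when pages are swapped out. If memory is short, the kernel can write pages to disk, and these three values influence the behavior of the swap daemon.

If more pages are free than watermark[WMARK_HIGH], the state of the zone is ideal.

If the number of free pages falls below watermark[WMARK_LOW], the kernel starts swapping pages out to disk.

If the number of free pages falls below watermark[WMARK_MIN], the pressure on page reclaim increases, because the zone urgently needs free pages.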

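lowmem_reserve[MAX_NR_ZONES] specifies, for each zone type, a number of pages reserved for critical allocations that must not fail under any circumstances.

pageset implements the per-CPU lists of hot and cold page frames (an array in older kernels, a per-CPU pointer in 3.18.3). The kernel uses these lists to keep "fresh" pages ready to satisfy allocations. Hot and cold page frames differ in their cache state: page frames that are most likely still in the cache, and can therefore be accessed quickly, are called hot; page frames that are not cached are called cold.

free_area[MAX_ORDER] is an array of data structures of the same name and implements the buddy system. Each array element (#define MAX_ORDER 11) stands for contiguous memory areas of one fixed size. free_area is the starting point for managing the free pages contained in each area.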

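Main members of the second part:

The members of the second part catalog the pages in use in the zone according to their activity. The kernel regards a page as active if it is accessed frequently. The distinction matters when pages have to be swapped out: if possible, frequently used pages should be left alone, while superfluous inactive pages can be swapped out.

flags describes the current state of the zone. The following flags are allowed: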

/* zone flags, see below */

unsigned long flags;

enum zone_flags {
        ZONE_RECLAIM_LOCKED,  // prevents concurrent reclaim
        ZONE_OOM_LOCKED,  // zone is in OOM killer zonelist
        ZONE_CONGESTED,  // zone has many dirty pages backed by a congested BDI 
        ZONE_DIRTY,       //reclaim scanning has recently found  many dirty file pages at the tail  of the LRU.
        ZONE_WRITEBACK,  //reclaim scanning has recently found many pages under writeback
        ZONE_FAIR_DEPLETED,//fair zone policy batch depleted 
};

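vm_stat[NR_VM_ZONE_STAT_ITEMS] keeps a large amount of statistical information about the zone.

wait_queue_head_t *wait_table;

unsigned long wait_table_hash_nr_entries;

unsigned long wait_table_bits;

These three members implement a wait queue for processes waiting for a page to become available. Processes queue up waiting for some condition; when the condition becomes true, the kernel notifies them so that they can resume work.

struct pglist_data *zone_pgdat; links the zone to its parent node: zone_pgdat points to the corresponding pglist_data instance.

unsigned long zone_start_pfn; is the index of the first page frame of the zone: zone_start_pfn == zone_start_paddr >> PAGE_SHIFT.

const char *name; is a string holding the conventional name of the zone. The book lists three choices: Normal, DMA, and HighMem; the zone_type enumeration shown earlier also covers DMA32 and Movable.

unsigned long spanned_pages; is the total number of pages spanned by the zone. Not all of them need be usable, because the zone may contain memory holes.

unsigned long present_pages; is the number of pages actually present, usually the same as spanned_pages.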

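3. Calculating the zone watermarks

Before computing the watermarks, the kernel first determines how much memory must be kept free for critical allocations. This value grows nonlinearly with the amount of available memory and is stored in the global variable min_free_kbytes.

User space can read and change this minimum through the file /proc/sys/vm/min_free_kbytes. Typical values of min_free_kbytes for various main memory sizes are: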

Main memory size	min_free_kbytes
16MB	512KB
32MB	724KB
256MB	2MB
512MB	2896KB
1024MB	4MB
2048MB	5792KB
4096MB	8MB

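The watermarks in the structure are filled in by init_per_zone_pages_min (named init_per_zone_wmark_min in more recent kernels such as 3.18). The kernel calls this function during boot; it does not need to be called explicitly.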


__setup_per_zone_wmarks sets watermark[WMARK_MIN], watermark[WMARK_LOW], and watermark[WMARK_HIGH] of struct zone:

static void __setup_per_zone_wmarks(void)//page_alloc.c 
{
        unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
        unsigned long lowmem_pages = 0;
        struct zone *zone;
        unsigned long flags;
 
        /* Calculate total number of !ZONE_HIGHMEM pages */
        for_each_zone(zone) {
                if (!is_highmem(zone))
                        lowmem_pages += zone->managed_pages;
        }
 
        for_each_zone(zone) {
                u64 tmp;
 
                spin_lock_irqsave(&zone->lock, flags);
                tmp = (u64)pages_min * zone->managed_pages;
                do_div(tmp, lowmem_pages);
                if (is_highmem(zone)) {
                        unsigned long min_pages;
                        min_pages = zone->managed_pages / 1024;
                        min_pages = clamp(min_pages, SWAP_CLUSTER_MAX, 128UL);
                        zone->watermark[WMARK_MIN] = min_pages;
                } else {
                        zone->watermark[WMARK_MIN] = tmp;
                }
                zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + (tmp >> 2);
                zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + (tmp >> 1);
 
                __mod_zone_page_state(zone, NR_ALLOC_BATCH,
                        high_wmark_pages(zone) - low_wmark_pages(zone) -
                        atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
 
                setup_zone_migrate_reserve(zone);
                spin_unlock_irqrestore(&zone->lock, flags);
        }
 
        /* update totalreserve_pages */
        calculate_totalreserve_pages();
}

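SWAP_CLUSTER_MAX, the lower bound used for high-memory zones, is an important value for the whole page reclaim subsystem. The code of that subsystem often processes pages in batches, and SWAP_CLUSTER_MAX defines the batch size.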

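lowmem_reserve[MAX_NR_ZONES] is computed by setup_per_zone_lowmem_reserve(). The kernel iterates over all nodes in the system and computes the minimum reserve for each zone of each node. The algorithm divides a number of page frames (the managed pages of the higher zones accumulated so far) by sysctl_lowmem_reserve_ratio[idx]. The default divisor is 256 for low-memory zones and 32 for high-memory zones.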

/*
 * setup_per_zone_lowmem_reserve - called whenever
 *      sysctl_lower_zone_reserve_ratio changes.  Ensures that each zone
 *      has a correct pages reserved value, so an adequate number of
 *      pages are left in the zone after a successful __alloc_pages().
 */
static void setup_per_zone_lowmem_reserve(void)//page_alloc.c 
{
        struct pglist_data *pgdat;
        enum zone_type j, idx;
 
        for_each_online_pgdat(pgdat) {
                for (j = 0; j < MAX_NR_ZONES; j++) {
                        struct zone *zone = pgdat->node_zones + j;
                        unsigned long managed_pages = zone->managed_pages;
 
                        zone->lowmem_reserve[j] = 0;
 
                        idx = j;
                        while (idx) {
                                struct zone *lower_zone;
 
                                idx--;
 
                                if (sysctl_lowmem_reserve_ratio[idx] < 1)
                                        sysctl_lowmem_reserve_ratio[idx] = 1;
 
                                lower_zone = pgdat->node_zones + idx;
                                lower_zone->lowmem_reserve[j] = managed_pages /
                                        sysctl_lowmem_reserve_ratio[idx];
                                managed_pages += lower_zone->managed_pages;
                        }
                }
        }
 
        /* update totalreserve_pages */
        calculate_totalreserve_pages();
}

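4. Hot and cold pages

The pageset member of struct zone implements the hot-n-cold allocator. A page is hot if it has been loaded into a CPU cache, so its data can be accessed faster than data in main memory. A page is cold if it is not in the cache. On multiprocessor systems each CPU has one or more caches, so the lists must be managed separately for each CPU.

A zone may belong to a particular NUMA node and thus be associated with a particular CPU, yet the caches of other CPUs can still contain pages from that zone. Every processor can access all pages in the system, although at different speeds. The zone-specific data structure must therefore account not only for the CPUs of its own NUMA node but also for all other CPUs in the system.

In older kernels pageset was an array; in the 3.18.3 kernel it is a per-CPU pointer: struct per_cpu_pageset __percpu *pageset. Either way, there is exactly one element on uniprocessor systems. In kernels built for SMP the number of elements may be, for example, between 2 and 32; it reflects the maximum number of CPUs the kernel supports, not the number of CPUs actually present.

The element type is:

include/linux/mmzone.h: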

struct per_cpu_pageset {
        struct per_cpu_pages pcp;
#ifdef CONFIG_NUMA
        s8 expire;
#endif
#ifdef CONFIG_SMP
        s8 stat_threshold;
        s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
#endif
};

This structure consists mainly of a struct per_cpu_pages:

struct per_cpu_pages {
        int count;              /* number of pages in the list */
        int high;               /* high watermark, emptying needed */
        int batch;              /* chunk size for buddy add/remove */

        /* Lists of pages, one per migrate type stored on the pcp-lists */
        struct list_head lists[MIGRATE_PCPTYPES];
};

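count records the number of pages on the lists, and high is a watermark: if count exceeds high, the lists hold too many pages. There is no explicit watermark for the low end: when a list is empty, it is refilled.

lists is an array of doubly linked lists, one per migrate type, that hold the hot and cold pages of the current CPU. They can be manipulated with the kernel's standard list functions.

Where possible, the per-CPU lists are not filled one page at a time but in chunks of several pages; batch is the guideline for the number of pages added in one go.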

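5. Page frames

A page frame is the smallest unit of system memory, and the kernel creates one instance of struct page for every page in memory. struct page must therefore be kept as small as possible, because memory is broken down into a huge number of pages: even with only 384MB of main memory and a page size of 4KB there are already about 100,000 pages (98,304 to be exact). In typical systems the number of pages is enormous, so even a small change to struct page can cause a sharp increase in the physical memory needed to hold all the page instances.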

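The wide use of pages makes it harder to keep the structure small. Many parts of memory management use pages, for a variety of purposes. One part of the kernel may depend entirely on a piece of information in struct page that is completely useless to another part. This is why struct page makes use of unions.

The definition of struct page:

include/linux/mm_types.h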
/*
 * Each physical page in the system has a struct page associated with
 * it to keep track of whatever it is we are using the page for at the
 * moment. Note that we have no way to track which tasks are using
 * a page, though if it is a pagecache page, rmap structures can tell us
 * who is mapping it.
 *
 * The objects in struct page are organized in double word blocks in
 * order to allows us to use atomic double word operations on portions
 * of struct page. That is currently only used by slub but the arrangement
 * allows the use of atomic double word operations on the flags/mapping
 * and lru list pointers also.
 */
struct page {
        /* First double word block */
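        unsigned long flags;          // atomic flags, some possibly updated asynchronously
 
        union {
                struct address_space *mapping;  // If the low bit is clear, points to the inode
                                              // address_space, or is NULL. If the page is mapped
                                              // as anonymous memory, the low bit is set and the
                                              // pointer refers to an anon_vma object;
                                              // see PAGE_MAPPING_ANON.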
                void *s_mem;                    /* slab first object */
        };
 
        /* Second double word */
        struct {
                union {
                        pgoff_t index;        // offset within the mapping
                        void *freelist;        // SLUB: freelist, requires the slab lock
                        bool pfmemalloc;     
                };
 
                union {
#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) &&  defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
                        /* Used for cmpxchg_double in slub */
                        unsigned long counters;
#else
                        unsigned counters;
#endif
 
                        struct {
 
                                union {
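                                        atomic_t _mapcount; // count of page table entries that map this page;
                                                        // shows whether the page is mapped and
                                                        // limits reverse-map searches
 
                                        struct { /* SLUB */ // used by the slab allocator
                                                unsigned inuse:16;	// number of objects in use
                                                unsigned objects:15;
                                                unsigned frozen:1;
                                        };
                                        int units;      /* SLOB */
                                };
                                atomic_t _count;                // usage count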
                        };
                        unsigned int active;    /* SLAB */
                };
        };
 
        /* Third double word block */
        union {
                struct list_head lru;   // pageout list, e.g. active_list protected by zone->lru_lock
                struct {                
                        struct page *next; 
#ifdef CONFIG_64BIT
                        int pages;      /* Nr of partial slabs left */
                        int pobjects;   /* Approximate # of objects */
#else
                        short int pages;
                        short int pobjects;
#endif
                };
 
                struct slab *slab_page; /* slab fields */
                struct rcu_head rcu_head;       /* Used by SLAB
                                                 * when destroying via RCU
                                                 */
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && USE_SPLIT_PMD_PTLOCKS
                pgtable_t pmd_huge_pte; /* protected by page->ptl */
#endif
        };
 
        /* Remainder is not double word aligned */
        union {
                unsigned long private; // mapping-private opaque data:
                                      // usually used for buffer_heads if PagePrivate is set;
                                      // used for swp_entry_t if PageSwapCache is set;
                                      // holds the order in the buddy system if PG_buddy is set
#if USE_SPLIT_PTE_PTLOCKS
#if ALLOC_SPLIT_PTLOCKS
                spinlock_t *ptl;
#else
                spinlock_t ptl;
#endif
#endif
                struct kmem_cache *slab_cache;  // slab allocators: pointer to the slab cache
                struct page *first_page;         // tail pages of a compound page: pointer to the head page
        };
#if defined(WANT_PAGE_VIRTUAL)
        void *virtual;                // kernel virtual address (NULL if not mapped, i.e. highmem)
 
#endif /* WANT_PAGE_VIRTUAL */
#ifdef CONFIG_WANT_PAGE_DEBUG_FLAGS
        unsigned long debug_flags;      /* Use atomic bitops on this */
#endif
 
#ifdef CONFIG_KMEMCHECK
        void *shadow;
#endif
 
#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
        int _last_cpupid;
#endif
};

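The definition is complicated for a reason. Every physical page frame needs one of these structures, so there are very many of them and each must stay small. At the same time the structure serves many purposes, and each purpose brings its own requirements and therefore its own members. The result is this convoluted definition. Only the most important parts are discussed here.

The layout of the structure is architecture independent: it does not depend on the CPU type, and every page frame is described by it. Besides the slub-related members (slab, freelist, and inuse), the page structure contains a number of other members. What follows is only an overview; they are covered in more detail later.

flags stores architecture-independent flags that describe the attributes of the page.

_count is a usage counter: the number of references to the page within the kernel. When it drops to 0, the kernel knows that the page instance is currently unused and can be removed. While it is greater than 0, the instance is never removed from memory.

_mapcount is the number of page table entries that point to the page.

lru is a list head used to keep the page on various lists, so that pages can be grouped into categories, the most important being active and inactive pages.

The kernel can combine several adjacent pages into a larger compound page. The first page of the group is called the head page, and all the others are called tail pages. In the page instances of all tail pages, first_page points to the head page.

mapping specifies the address space the page frame belongs to, and index is the offset of the page frame within the mapping. Address space is a very general concept. For example, when a file is read into memory, an address space is used to associate the contents of the file with the memory areas into which the data is loaded. mapping can hold more than a plain pointer: it also carries extra information that tells whether the page belongs to an anonymous memory area that is not associated with an address space.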

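private is a pointer to "private" data that virtual memory management ignores. Depending on how the page is used, the pointer can be employed in different ways. In most cases it associates the page with data buffers.

virtual is used for pages in the high-memory area, i.e. pages that cannot be mapped directly into kernel memory. It stores the virtual address of such a page.

The attributes of a page are described by a set of page flags stored as individual bits in the flags member of struct page. The flags are independent of the architecture. The individual flags are defined in include/linux/page-flags.h, which also provides macros for setting, clearing, and testing them.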

/*
 * Various page->flags bits:
 *
 * PG_reserved is set for special pages, which can never be swapped out. Some
 * of them might not even exist (eg empty_bad_page)...
 *
 * The PG_private bitflag is set on pagecache pages if they contain filesystem
 * specific data (which is normally at page->private). It can be used by
 * private allocations for its own usage.
 *
 * During initiation of disk I/O, PG_locked is set. This bit is set before I/O
 * and cleared when writeback _starts_ or when read _completes_. PG_writeback
 * is set before writeback starts and cleared when it finishes.
 *
 * PG_locked also pins a page in pagecache, and blocks truncation of the file
 * while it is held.
 *
 * page_waitqueue(page) is a wait queue of all tasks waiting for the page
 * to become unlocked.
 *
 * PG_uptodate tells whether the page's contents is valid.  When a read
 * completes, the page becomes uptodate, unless a disk I/O error happened.
 *
 * PG_referenced, PG_reclaim are used for page reclaim for anonymous and
 * file-backed pagecache (see mm/vmscan.c).
 *
 * PG_error is set to indicate that an I/O error occurred on this page.
 *
 * PG_arch_1 is an architecture specific page state bit.  The generic code
 * guarantees that this bit is cleared for a page when it first is entered into
 * the page cache.
 *
 * PG_highmem pages are not permanently mapped into the kernel virtual address
 * space, they need to be kmapped separately for doing IO on the pages.  The
 * struct page (these bits with information) are always mapped into kernel
 * address space...
 *
 * PG_hwpoison indicates that a page got corrupted in hardware and contains
 * data with incorrect ECC bits that triggered a machine check. Accessing is
 * not safe since it may cause another machine check. Don't touch!
 */
 
/*
 * Don't use the *_dontuse flags.  Use the macros.  Otherwise you'll break
 * locked- and dirty-page accounting.
 *
 * The page flags field is split into two parts, the main flags area
 * which extends from the low bits upwards, and the fields area which
 * extends from the high bits downwards.
 *
 *  | FIELD | ... | FLAGS |
 *  N-1           ^       0
 *               (NR_PAGEFLAGS)
 *
 * The fields area is reserved for fields mapping zone, node (for NUMA) and
 * SPARSEMEM section (for variants of SPARSEMEM that require section ids like
 * SPARSEMEM_EXTREME with !SPARSEMEM_VMEMMAP).
 */
enum pageflags {
        PG_locked,              /* Page is locked. Don't touch. */
        PG_error,
        PG_referenced,
        PG_uptodate,
        PG_dirty,
        PG_lru,
        PG_active,
        PG_slab,
        PG_owner_priv_1,        /* Owner use. If pagecache, fs may use*/
        PG_arch_1,
        PG_reserved,
        PG_private,             /* If pagecache, has fs-private data */
        PG_private_2,           /* If pagecache, has fs aux data */
        PG_writeback,           /* Page is under writeback */
#ifdef CONFIG_PAGEFLAGS_EXTENDED
        PG_head,                /* A head page */
        PG_tail,                /* A tail page */
#else
        PG_compound,            /* A compound page */
#endif
        PG_swapcache,           /* Swap page: swp_entry_t in private */
        PG_mappedtodisk,        /* Has blocks allocated on-disk */
        PG_reclaim,             /* To be reclaimed asap */
        PG_swapbacked,          /* Page is backed by RAM/swap */
        PG_unevictable,         /* Page is "unevictable"  */
#ifdef CONFIG_MMU
        PG_mlocked,             /* Page is vma mlocked */
#endif
#ifdef CONFIG_ARCH_USES_PG_UNCACHED
        PG_uncached,            /* Page has been mapped as uncached */
#endif
#ifdef CONFIG_MEMORY_FAILURE
        PG_hwpoison,            /* hardware poisoned page. Don't touch */
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
        PG_compound_lock,
#endif
        __NR_PAGEFLAGS,
 
        /* Filesystems */
        PG_checked = PG_owner_priv_1,
 
        /* Two page bits are conscripted by FS-Cache to maintain local caching
         * state.  These bits are set on pages belonging to the netfs's inodes
         * when those inodes are being locally cached.
         */
        PG_fscache = PG_private_2,      /* page backed by cache */
 
        /* XEN */
        PG_pinned = PG_owner_priv_1,
        PG_savepinned = PG_dirty,
 
        /* SLOB */
        PG_slob_free = PG_private,
};

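The kernel defines a set of standard macros to test whether a particular bit is set for a page, or to manipulate a bit. Their names follow a pattern:

PageXXX(page) tests whether the PG_XXX bit is set for the page. For example, PageDirty tests the PG_dirty bit and PageActive tests the PG_active bit.

SetPageXXX sets the bit if it is not already set.

ClearPageXXX unconditionally clears the bit.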

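These operations are implemented atomically. In many situations the kernel has to wait for the state of a page to change before it can resume work. Two helper functions are provided for this: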

/* 
 * Wait for a page to be unlocked.
 *
 * This must be called with the caller "holding" the page,
 * ie with increased "page->count" so that the page won't
 * go away during the wait..
 */
static inline void wait_on_page_locked(struct page *page)
{
        if (PageLocked(page))
                wait_on_page_bit(page, PG_locked);
}
 
/* 
 * Wait for a page to complete writeback
 */
static inline void wait_on_page_writeback(struct page *page)
{
        if (PageWriteback(page))
                wait_on_page_bit(page, PG_writeback);
}

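Suppose one part of the kernel has to wait until a locked page is unlocked. wait_on_page_locked provides this. If it is called while the page is locked, the calling process goes to sleep; once the page has been unlocked, the sleeper is woken up automatically and continues its work.

wait_on_page_writeback waits until all pending writeback operations on the page have finished, i.e. until the data in the page has been synchronized to the block device.

Conclusion: This article described the data structures at the heart of memory management: the node (pg_data_t), the zone (struct zone), and the page frame, i.e. the physical page (struct page), together with the basic concepts behind each structure.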
