memcached source analysis: slab automove and slab rebalance

Date: June 20, 2015

Please credit the source when reposting: http://blog.csdn.net/luotuo44/article/details/43015129

Motivation:

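Consider this scenario: at first, for business reasons, the application stores a large amount of 1KB data in memcached, so the memcached server process holds many 1KB items. Later the business changes: it now needs to store a large amount of 10KB data and rarely touches the 1KB data. As more data arrives, memory runs short. The 10KB items are accessed frequently, and because memory is insufficient, LRU has to evict some of them.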

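In this scenario the many 1KB items are plainly wasted memory. Because they are rarely accessed, they keep occupying the hash table and the LRU queue even after they expire. The LRU queue is not a big concern, since items of different sizes use different LRU queues. For the hash table, however, a large number of zombie items raises the chance of hash collisions and also wastes time when the hash table is migrated. Can these items be removed? The LRU crawler together with the lru_crawler command can forcibly remove them. But once they are removed, the memory they occupied goes back to the 1KB slab allocator, and the 1KB slab allocator never hands memory to 10KB items. So the effort still falls short.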

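Is there another way? Yes. The slab automove and slab rebalance features of memcached exist for exactly this purpose. They are disabled by default, so memcached must be started with -o slab_reassign to use them. After that, a client can send the command slabs reassign <source class> <dest class> to manually move a memory page from the source class to the dest class. The rest of this article calls this operation page reassignment. The command slabs automove, in turn, lets memcached detect by itself whether page reassignment is needed and carry it out automatically, with no manual intervention.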

Starting memcached with -o slab_reassign sets settings.slab_reassign to true (its default is false). Recall the page size discussed in the earlier article "The slab memory allocator". In do_slabs_newslab, the size of a memory page depends on whether settings.slab_reassign is true.

static int do_slabs_newslab(const unsigned int id) {
    slabclass_t *p = &slabclass[id];
    //settings.slab_reassign defaults to false
    int len = settings.slab_reassign ? settings.item_size_max
        : p->size * p->perslab;

    //len is the size of one memory page
	...
}

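When settings.slab_reassign is true, that is, when rebalance is enabled, the memory pages of every slabclass_t in the slabclass array have the same size, settings.item_size_max (1MB by default). This makes it much easier to take a page from one slabclass_t and give it to another. Otherwise a page that slabclass[i] took from slabclass[j] would split into n items, a page taken from slabclass[k] would split into m items, and one of its own pages would hold s items, which would be quite messy. With a uniform page size, every page a given slab class receives splits into the same number of items, no matter which class it came from.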
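To make the effect of a uniform page size concrete, here is a minimal C sketch. page_length mirrors the ternary expression in do_slabs_newslab above; chunks_in_page is a hypothetical helper (not a memcached function) that shows how many items one page yields for a given item size.

```c
#include <stddef.h>

/* Size of one memory page, following the rule in do_slabs_newslab:
 * with slab_reassign enabled every page is item_size_max bytes,
 * otherwise it is exactly chunk_size * perslab bytes. */
static size_t page_length(int slab_reassign, size_t item_size_max,
                          size_t chunk_size, size_t perslab) {
    return slab_reassign ? item_size_max : chunk_size * perslab;
}

/* Hypothetical helper: how many items of chunk_size bytes fit in a page.
 * Any remainder at the end of the page is simply left unused. */
static size_t chunks_in_page(size_t page_len, size_t chunk_size) {
    return chunk_size ? page_len / chunk_size : 0;
}
```

With a 1MB page, a class whose items are 1024 bytes always gets 1024 items out of a page, whichever class that page was taken from.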

Starting and stopping rebalance:

main calls start_slab_maintenance_thread to start the rebalance thread and the automove thread. main makes this call only when settings.slab_reassign is true.

//slabs.c
static pthread_cond_t maintenance_cond = PTHREAD_COND_INITIALIZER;
static pthread_cond_t slab_rebalance_cond = PTHREAD_COND_INITIALIZER;
static volatile int do_run_slab_thread = 1;
static volatile int do_run_slab_rebalance_thread = 1;

#define DEFAULT_SLAB_BULK_CHECK 1
int slab_bulk_check = DEFAULT_SLAB_BULK_CHECK;

static pthread_mutex_t slabs_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t slabs_rebalance_lock = PTHREAD_MUTEX_INITIALIZER;

static pthread_t maintenance_tid;
static pthread_t rebalance_tid;



//Called by main. Not called when settings.slab_reassign is false (the default)
int start_slab_maintenance_thread(void) {
    int ret;
    slab_rebalance_signal = 0;
    slab_rebal.slab_start = NULL;
    char *env = getenv("MEMCACHED_SLAB_BULK_CHECK");
    if (env != NULL) {
        slab_bulk_check = atoi(env);
        if (slab_bulk_check == 0) {
            slab_bulk_check = DEFAULT_SLAB_BULK_CHECK;
        }
    }

    if (pthread_cond_init(&slab_rebalance_cond, NULL) != 0) {
        fprintf(stderr, "Can't intiialize rebalance condition\n");
        return -1;
    }
    pthread_mutex_init(&slabs_rebalance_lock, NULL);

    if ((ret = pthread_create(&maintenance_tid, NULL,
                              slab_maintenance_thread, NULL)) != 0) {
        fprintf(stderr, "Can't create slab maint thread: %s\n", strerror(ret));
        return -1;
    }
    if ((ret = pthread_create(&rebalance_tid, NULL,
                              slab_rebalance_thread, NULL)) != 0) {
        fprintf(stderr, "Can't create rebal thread: %s\n", strerror(ret));
        return -1;
    }
    return 0;
}

void stop_slab_maintenance_thread(void) {
    mutex_lock(&cache_lock);
    do_run_slab_thread = 0;
    do_run_slab_rebalance_thread = 0;
    pthread_cond_signal(&maintenance_cond);
    pthread_mutex_unlock(&cache_lock);

    /* Wait for the maintenance thread to stop */
    pthread_join(maintenance_tid, NULL);
    pthread_join(rebalance_tid, NULL);
}

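Note that start_slab_maintenance_thread starts two threads: the rebalance thread and the automove thread. The automove thread detects whether page reassignment is needed. When it is, the automove thread asks the rebalance thread to perform it.

Automatic detection is off by default, even when memcached is started with -o slab_reassign. It is controlled by the global variable settings.slab_automove (default 0, meaning off). To turn it on, add the slab_automove option at startup and set its value to 1. For example, $memcached -o slab_reassign,slab_automove=1 enables automatic detection. It can also be enabled after startup with the client command slabs automove <0|1>, where 0 turns automove off and 1 turns it on. This client command simply sets settings.slab_automove and does nothing else. The do_item_alloc code shown later also checks for a value of 2, an aggressive mode discussed there.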

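The automove thread:

Item statistics recorder:

Once started, the rebalance thread goes to sleep on a condition variable and waits for someone to hand it a page reassignment task. So let's look at the automove thread first.

To detect anything automatically, the automove thread needs live data to analyze, from which it concludes which slabclass_t needs more memory and which does not. It gets this data from the global variable itemstats, which collects various statistics about items. Here is itemstats and its type definition.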

//items.c
typedef struct {
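    uint64_t evicted;//number of items evicted by LRU
    //an item is still evicted even if its exptime is 0 (never expires)
    uint64_t evicted_nonzero;//number of evicted items whose exptime is not 0

    //At the most recent eviction, how long ago the evicted item was last accessed
    //itemstats[id].evicted_time = current_time - search->time;
    rel_time_t evicted_time;


    uint64_t reclaimed;//number of expired items found and reused while allocating an item
    uint64_t outofmemory;//number of failed attempts to allocate memory for an item
    uint64_t tailrepairs;//number of items that needed repair (normally 0 unless a worker thread misbehaves)

    //number of items that expired without ever being fetched
    uint64_t expired_unfetched;
    //number of items evicted by LRU without ever being fetched
    uint64_t evicted_unfetched;

    uint64_t crawler_reclaimed;//number of expired items reclaimed by the LRU crawler

    //number of items found referenced by another worker thread while searching the LRU queue during allocation
    uint64_t lrutail_reflocked;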
} itemstats_t;

#define POWER_LARGEST  200
#define LARGEST_ID POWER_LARGEST
static itemstats_t itemstats[LARGEST_ID];

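Note that the code above lives in items.c and that the global variable itemstats is static. itemstats is an array that corresponds one-to-one with the slabclass array: each element collects statistics for the matching slabclass element. Although itemstats_t has many members and can record a lot, the automove thread uses only the first one, evicted. It needs to know how often items of each size are evicted in order to judge which class is short of memory and which has memory to spare.

itemstats is updated in many functions across items.c (so that it can collect all these statistics), so the full collection code is not shown here. Since evicted is the important member and is updated in only one function, here is the code that updates it.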

item *do_item_alloc(char *key, const size_t nkey, const int flags,
                    const rel_time_t exptime, const int nbytes,
                    const uint32_t cur_hv) {
    item *it = NULL;

    int tries = 5;
    item *search;
    item *next_it;
    rel_time_t oldest_live = settings.oldest_live;

    search = tails[id];
    for (; tries > 0 && search != NULL; tries--, search=next_it) {
        /* we might relink search mid-loop, so search->prev isn't reliable */
        next_it = search->prev;

		...
		
        if ((search->exptime != 0 && search->exptime < current_time)
            || (search->time <= oldest_live && oldest_live <= current_time)) {
			...	
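        } else if ((it = slabs_alloc(ntotal, id)) == NULL) {//memory allocation failed
            //No expired item was found and allocation failed as well, so the
            //only option is to evict an item via LRU (even though it has not expired)

            if (settings.evict_to_free == 0) {//LRU eviction is disabled
                //all we can do is return an error to the client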
                itemstats[id].outofmemory++;
            } else {
                itemstats[id].evicted++;//count the evicted item
                itemstats[id].evicted_time = current_time - search->time;
                //an item is still evicted even if its exptime is 0 (never expires)
                if (search->exptime != 0)
                    itemstats[id].evicted_nonzero++;
                if ((search->it_flags & ITEM_FETCHED) == 0) {
                    itemstats[id].evicted_unfetched++;
                }
                it = search;

                //As soon as an item is evicted, trigger page reassignment.
                //This fires too often and is not recommended
                if (settings.slab_automove == 2)
                    slabs_reassign(-1, id);
            }
        }

        break;
    }

	...
    return it;
}

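As the code shows, every LRU eviction is recorded. At the end you can also see that when settings.slab_automove equals 2, slabs_reassign is called as soon as an item is evicted. slabs_reassign is the function that handles page reassignment. Reassigning a page on every eviction is clearly too frequent, so this mode is not recommended.

Identifying the poor and the rich slab classes:

Now back to slab_maintenance_thread, the thread function of the automove thread.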

static void *slab_maintenance_thread(void *arg) {
    int src, dest;

    while (do_run_slab_thread) {
        if (settings.slab_automove == 1) {//automove is enabled
            if (slab_automove_decision(&src, &dest) == 1) {
                /* Blind to the return codes. It will retry on its own */
                slabs_reassign(src, dest);
            }
            sleep(1);
        } else {//wait for the user to enable automove
            /* Don't wake as often if we're not enabled.
             * This is lazier than setting up a condition right now. */
            sleep(5);
        }
    }
    return NULL;
}

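As shown, when settings.slab_automove equals 1 the thread calls slab_automove_decision to decide whether a page should be reassigned. A return value of 1 means yes, and slabs_reassign is then called to handle it. Now let's see how the automove thread makes that decision.

//items.c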
void item_stats_evictions(uint64_t *evicted) {
    int i;
    mutex_lock(&cache_lock);
    for (i = 0; i < LARGEST_ID; i++) {
        evicted[i] = itemstats[i].evicted;
    }
    mutex_unlock(&cache_lock);
}


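//slabs.c
//Picks the class that evicts the most and a class that evicts nothing.
//Returns 1 only when both have been found, otherwise 0. src receives the
//id of the class with no evictions, dst the id of the class with the most evictions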
static int slab_automove_decision(int *src, int *dst) {
    static uint64_t evicted_old[POWER_LARGEST];
    static unsigned int slab_zeroes[POWER_LARGEST];
    static unsigned int slab_winner = 0;
    static unsigned int slab_wins   = 0;
    uint64_t evicted_new[POWER_LARGEST];
    uint64_t evicted_diff = 0;
    uint64_t evicted_max  = 0;
    unsigned int highest_slab = 0;
    unsigned int total_pages[POWER_LARGEST];
    int i;
    int source = 0;
    int dest = 0;
    static rel_time_t next_run;

    /* Run less frequently than the slabmove tester. */
    //This function must not run too often: at most once every 10 seconds
    if (current_time >= next_run) {
        next_run = current_time + 10;
    } else {
        return 0;
    }

    //fetch the eviction count of every slab class
    item_stats_evictions(evicted_new);
    pthread_mutex_lock(&cache_lock);
    for (i = POWER_SMALLEST; i < power_largest; i++) {
        total_pages[i] = slabclass[i].slabs;
    }
    pthread_mutex_unlock(&cache_lock);

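    //This function is called repeatedly, so it can count consecutive rounds.

    /* Find a candidate source; something with zero evicts 3+ times */
    //evicted_old holds each slab class's eviction count from the previous round
    //evicted_new holds each slab class's current eviction count
    //evicted_diff shows how heavily an LRU queue is being evicted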
    for (i = POWER_SMALLEST; i < power_largest; i++) {
        evicted_diff = evicted_new[i] - evicted_old[i];
        if (evicted_diff == 0 && total_pages[i] > 2) {
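            //evicted_diff == 0 means no item of this slab class was evicted,
            //and the class owns more than two pages.
            slab_zeroes[i]++;//increase the counter
            //three rounds in a row without an eviction: plenty of spare room,
            //so this class is chosen as the source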
            if (source == 0 && slab_zeroes[i] >= 3)
                source = i;
        } else {
            slab_zeroes[i] = 0;//reset the counter
            if (evicted_diff > evicted_max) {
                evicted_max = evicted_diff;
                highest_slab = i;
            }
        }
        evicted_old[i] = evicted_new[i];
    }

    /* Pick a valid destination */
    //pick the slab class that has evicted the most items three rounds in a row
    if (slab_winner != 0 && slab_winner == highest_slab) {
        slab_wins++;
        if (slab_wins >= 3)//top evictor three rounds in a row
            dest = slab_winner;
    } else {
        slab_wins = 1;//reset the counter (to 1, since this round counts)
        slab_winner = highest_slab;//this round's top evictor
    }

    if (source && dest) {
        *src = source;
        *dst = dest;
        return 1;
    }
    return 0;
}

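As the code shows, the decision is fairly simple. Two slab classes are picked from the slabclass array: one that has had no evictions for three consecutive rounds, and one that has had the most evictions for three consecutive rounds. If both are found, the function returns 1 and the automove thread calls slabs_reassign.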
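The decision rule can be exercised in isolation. The sketch below is a simplified re-implementation for illustration only, not memcached's code: NUM_CLASSES, automove_state, automove_init and automove_round are made-up names, the static variables of slab_automove_decision are moved into a struct, and the 10-second rate limit and the locking are left out.

```c
#include <stdint.h>
#include <string.h>

#define NUM_CLASSES 8            /* hypothetical, kept small for illustration */

typedef struct {
    uint64_t evicted_old[NUM_CLASSES]; /* eviction counters from the previous round */
    unsigned zero_streak[NUM_CLASSES]; /* consecutive rounds without evictions */
    unsigned winner;                   /* class that evicted most in the last round */
    unsigned wins;                     /* consecutive rounds `winner` stayed on top */
} automove_state;

static void automove_init(automove_state *st) {
    memset(st, 0, sizeof(*st));
}

/* One decision round. Class 0 is treated as unused, as in the code above.
 * Returns 1 and fills *src / *dst only when both candidates exist. */
static int automove_round(automove_state *st,
                          const uint64_t evicted_now[NUM_CLASSES],
                          const unsigned pages[NUM_CLASSES],
                          int *src, int *dst) {
    uint64_t max_diff = 0;
    unsigned highest = 0;
    int source = 0, dest = 0;

    for (int i = 1; i < NUM_CLASSES; i++) {
        uint64_t diff = evicted_now[i] - st->evicted_old[i];
        if (diff == 0 && pages[i] > 2) {
            /* idle class with pages to spare: candidate source */
            st->zero_streak[i]++;
            if (source == 0 && st->zero_streak[i] >= 3)
                source = i;
        } else {
            st->zero_streak[i] = 0;
            if (diff > max_diff) {
                max_diff = diff;
                highest = (unsigned)i;
            }
        }
        st->evicted_old[i] = evicted_now[i];
    }

    if (st->winner != 0 && st->winner == highest) {
        st->wins++;
        if (st->wins >= 3)
            dest = (int)st->winner;
    } else {
        st->wins = 1;
        st->winner = highest;
    }

    if (source && dest) {
        *src = source;
        *dst = dest;
        return 1;
    }
    return 0;
}
```

Feeding it three rounds in which class 1 (five pages) never evicts and class 2 evicts every round yields a decision only on the third round, with class 1 as the source and class 2 as the destination.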

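Issuing a rebalance task:

Before looking at slabs_reassign, recall the slabs reassign command. Everything so far was about automatically detecting whether a page needs to be reassigned, so it is easy to forget that a command for requesting it manually also exists. When a client sends slabs reassign, the worker thread that receives it calls slabs_reassign with the command's arguments. At this point the automatic and manual paths converge.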

enum reassign_result_type {
    REASSIGN_OK=0, REASSIGN_RUNNING, REASSIGN_BADCLASS, REASSIGN_NOSPARE,
    REASSIGN_SRC_DST_SAME
};


enum reassign_result_type slabs_reassign(int src, int dst) {
    enum reassign_result_type ret;
    if (pthread_mutex_trylock(&slabs_rebalance_lock) != 0) {
        return REASSIGN_RUNNING;
    }
    ret = do_slabs_reassign(src, dst);
    pthread_mutex_unlock(&slabs_rebalance_lock);
    return ret;
}


static enum reassign_result_type do_slabs_reassign(int src, int dst) {
    if (slab_rebalance_signal != 0)
        return REASSIGN_RUNNING;

    if (src == dst)//source and destination must differ
        return REASSIGN_SRC_DST_SAME;

    /* Special indicator to choose ourselves. */
    if (src == -1) {//the caller asks us to pick any suitable source slab class
        //Pick a slab class with more than one page that is not dst.
        //If no such slab class exists, -1 is returned
        src = slabs_reassign_pick_any(dst);
        /* TODO: If we end up back at -1, return a new error type */
    }

    if (src < POWER_SMALLEST || src > power_largest ||
        dst < POWER_SMALLEST || dst > power_largest)
        return REASSIGN_BADCLASS;

    //a source slab class with zero or one page cannot give a page away
    if (slabclass[src].slabs < 2)
        return REASSIGN_NOSPARE;

    //the global variable slab_rebal
    slab_rebal.s_clsid = src;//remember the source slab class
    slab_rebal.d_clsid = dst;//remember the destination slab class

    slab_rebalance_signal = 1;
    //Wake up the thread running slab_rebalance_thread.
    //slabs_rebalance_lock is already held here, taken in slabs_reassign
    pthread_cond_signal(&slab_rebalance_cond);

    return REASSIGN_OK;
}


//Pick a slab class with more than one page that is not dst.
//If no such slab class exists, return -1
static int slabs_reassign_pick_any(int dst) {
    static int cur = POWER_SMALLEST - 1;
    int tries = power_largest - POWER_SMALLEST + 1;
    for (; tries > 0; tries--) {
        cur++;
        if (cur > power_largest)
            cur = POWER_SMALLEST;
        if (cur == dst)
            continue;
        if (slabclass[cur].slabs > 1) {
            return cur;
        }
    }
    return -1;
}

do_slabs_reassign stores the source and destination slab classes in the global variable slab_rebal and finally calls pthread_cond_signal to wake the rebalance thread.
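The hand-off described here is the classic "sleep on a condition variable until a task is posted" pattern. Below is a minimal sketch of that pattern with POSIX threads. All names (task_lock, task_cond, post_task, request_stop, sleeper_thread and so on) are hypothetical and only mirror the roles of slabs_rebalance_lock, slab_rebalance_cond and slab_rebalance_signal. Unlike slabs_reassign, this sketch takes the lock with a blocking pthread_mutex_lock rather than pthread_mutex_trylock, to keep the example deterministic.

```c
#include <pthread.h>

static pthread_mutex_t task_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  task_cond = PTHREAD_COND_INITIALIZER;
static int task_pending   = 0;   /* plays the role of slab_rebalance_signal */
static int stop_requested = 0;
static int task_src = 0, task_dst = 0;
static int last_src = 0, last_dst = 0;
static int tasks_done = 0;

/* Sleeps until a task is posted, handles it, then sleeps again. */
static void *sleeper_thread(void *arg) {
    (void)arg;
    pthread_mutex_lock(&task_lock);
    for (;;) {
        if (task_pending) {
            /* the real work would happen here */
            last_src = task_src;
            last_dst = task_dst;
            tasks_done++;
            task_pending = 0;
        } else if (stop_requested) {
            break;
        } else {
            /* releases task_lock while asleep, re-acquires it on wake-up */
            pthread_cond_wait(&task_cond, &task_lock);
        }
    }
    pthread_mutex_unlock(&task_lock);
    return NULL;
}

/* Posts one task. Returns 0 if accepted, -1 if one is already pending. */
static int post_task(int src, int dst) {
    int ret = -1;
    pthread_mutex_lock(&task_lock);
    if (!task_pending) {
        task_src = src;
        task_dst = dst;
        task_pending = 1;
        pthread_cond_signal(&task_cond);
        ret = 0;
    }
    pthread_mutex_unlock(&task_lock);
    return ret;
}

/* Asks the sleeper to exit once no task is pending. */
static void request_stop(void) {
    pthread_mutex_lock(&task_lock);
    stop_requested = 1;
    pthread_cond_signal(&task_cond);
    pthread_mutex_unlock(&task_lock);
}
```

Because the flag is always checked while the mutex is held, a task posted before the thread reaches pthread_cond_wait is not lost: the thread sees task_pending set and never goes to sleep.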

The rebalance thread:

The automove thread now leaves the stage, and the rebalance thread wakes up and takes over. Here is its thread function, slab_rebalance_thread. Note that slab_rebalance_signal is 0 initially and is set to 1 when a page needs to be reassigned.

static void *slab_rebalance_thread(void *arg) {
    int was_busy = 0;
    /* So we first pass into cond_wait with the mutex held */
    mutex_lock(&slabs_rebalance_lock);

    while (do_run_slab_rebalance_thread) {
        if (slab_rebalance_signal == 1) {
            //Record which page is to be moved, set slab_rebalance_signal to 2
            //and slab_rebal.done to 0 (not finished)
            if (slab_rebalance_start() < 0) {//failed
                /* Handle errors with more specifity as required. */
                slab_rebalance_signal = 0;
            }

            was_busy = 0;
        } else if (slab_rebalance_signal && slab_rebal.slab_start != NULL) {
            was_busy = slab_rebalance_move();//clear items off the page being moved
        }

        if (slab_rebal.done) {//every item on the page is cleared; finish the reassignment
            slab_rebalance_finish();
        } else if (was_busy) {//a worker thread is using an item on the page
            /* Stuck waiting for some items to unlock, so slow down a bit
             * to give them a chance to free up */
            usleep(50);//sleep briefly so worker threads can release the item, then retry
        }

        if (slab_rebalance_signal == 0) {//the thread initially sleeps here
            /* always hold this lock while we're running */
            pthread_cond_wait(&slab_rebalance_cond, &slabs_rebalance_lock);
        }
    }
    return NULL;
}

Locking the memory page:

slab_rebalance_start marks the source slab class so that a worker thread that accesses it can tell a page reassignment is in progress.

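//memcached.h
struct slab_rebalance {
    //Describes the page being moved. slab_start points to the start of the
    //page, slab_end to its end, and slab_pos to the item currently being processed
    void *slab_start;
    void *slab_end;
    void *slab_pos;
    int s_clsid; //index of the source slab class
    int d_clsid; //index of the destination slab class
    int busy_items; //number of items found busy (locked or referenced by a worker thread)
    uint8_t done;//whether every item on the page has been handled
};
//memcached.c
struct slab_rebalance slab_rebal;

//slabs.c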
static int slab_rebalance_start(void) {
    slabclass_t *s_cls;
    int no_go = 0;

    pthread_mutex_lock(&cache_lock);
    pthread_mutex_lock(&slabs_lock);

    if (slab_rebal.s_clsid < POWER_SMALLEST ||
        slab_rebal.s_clsid > power_largest  ||
        slab_rebal.d_clsid < POWER_SMALLEST ||
        slab_rebal.d_clsid > power_largest  ||
        slab_rebal.s_clsid == slab_rebal.d_clsid)//invalid class index
        no_go = -2;

    s_cls = &slabclass[slab_rebal.s_clsid];

    //If even adding a page-list entry for the destination slab class
    //fails, there is no way to give it another page
    if (!grow_slab_list(slab_rebal.d_clsid)) {
        no_go = -1;
    }

    if (s_cls->slabs < 2)//the source slab class has too few pages to give one away
        no_go = -3;

    if (no_go != 0) {
        pthread_mutex_unlock(&slabs_lock);
        pthread_mutex_unlock(&cache_lock);
        return no_go; /* Should use a wrapper function... */
    }

    //Marks which page of the source slab class goes to the destination.
    //It is always the first page
    s_cls->killing = 1;

    //Describes the page being moved. slab_start points to the start of the
    //page, slab_end to its end, and slab_pos to the item currently being processed
    slab_rebal.slab_start = s_cls->slab_list[s_cls->killing - 1];
    slab_rebal.slab_end   = (char *)slab_rebal.slab_start +
        (s_cls->size * s_cls->perslab);
    slab_rebal.slab_pos   = slab_rebal.slab_start;
    slab_rebal.done       = 0;

    /* Also tells do_item_get to search for items in this slab */
    slab_rebalance_signal = 2;//tells the rebalance thread to start clearing the page
  

    pthread_mutex_unlock(&slabs_lock);
    pthread_mutex_unlock(&cache_lock);

    return 0;
}

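slab_rebalance_start marks one page of a slab class as about to be moved, and from then on worker threads must not use the items on that page. Let's see what happens when a worker thread tries to access one of them.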

item *do_item_get(const char *key, const size_t nkey, const uint32_t hv) {
    item *it = assoc_find(key, nkey, hv);//assoc_find takes no lock internally

    if (it != NULL) {//found; the item's reference count is at least 1 here
        refcount_incr(&it->refcount);//atomically increment by one
        /* Optimization for slab reassignment. prevents popular items from
         * jamming in busy wait. Can only do this here to satisfy lock order
         * of item_lock, cache_lock, slabs_lock. */
        if (slab_rebalance_signal &&
            ((void *)it >= slab_rebal.slab_start && (void *)it < slab_rebal.slab_end)) {
            //This item lives in the page being moved, so it must not be returned.
            //The worker thread unlinks it from the hash table and the LRU queue so
            //that no other worker thread finds this unusable item later
            do_item_unlink_nolock(it, hv);
            do_item_remove(it);
            it = NULL;
        }
    }

 	...
    return it;
}
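The range test inside do_item_get is a half-open interval check on the item's address. As a small stand-alone sketch (in_moving_page is a hypothetical helper name, not a memcached function):

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when `item` lies inside [page_start, page_end), else 0.
 * Addresses are compared as integers so that pointers into unrelated
 * objects can be compared without undefined behaviour. */
static int in_moving_page(const void *item,
                          const void *page_start, const void *page_end) {
    uintptr_t p = (uintptr_t)item;
    return p >= (uintptr_t)page_start && p < (uintptr_t)page_end;
}
```

When no page is being moved, both bounds are NULL (slab_rebalance_finish resets them), so on typical platforms where NULL converts to 0 the check fails for every item. The real code additionally tests slab_rebalance_signal first.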

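Moving (returning) items:

Back to the rebalance thread. One page of the source slab class has been marked. After marking it, the rebalance thread calls slab_rebalance_move to clear that page. The page still holds items, so what happens to them during the reassignment? memcached handles this bluntly: it deletes them. If a worker thread is still using an item, the rebalance thread waits for it. If no worker thread references the item, it is deleted outright, even if it has not expired.

A page can hold many items, so memcached processes them in installments, handling only a few items per call (one by default). slab_rebalance_move is therefore called repeatedly from slab_rebalance_thread until every item has been handled.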
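The installment scheme, including the "start again from the top if anything was busy" rule, can be simulated without any of memcached's data structures. In the sketch below everything is hypothetical (chunk_state, page_walk, page_walk_step): each chunk of the page is reduced to a three-valued state, and bulk plays the role of slab_bulk_check.

```c
#include <stddef.h>

enum chunk_state {
    CHUNK_CLEARED,   /* already handled (slabs_clsid == 255 in memcached) */
    CHUNK_IDLE,      /* can be cleared right now */
    CHUNK_BUSY       /* locked or referenced by a worker, try again later */
};

typedef struct {
    enum chunk_state *chunks;  /* one entry per item slot on the page */
    size_t nchunks;
    size_t pos;                /* next slot to look at (slab_pos) */
    int busy_items;            /* busy slots seen in the current pass */
    int done;                  /* set once a whole pass saw nothing busy */
} page_walk;

/* Handles at most `bulk` slots and returns how many of them were busy. */
static int page_walk_step(page_walk *w, int bulk) {
    int was_busy = 0;

    if (w->done)
        return 0;
    if (w->nchunks == 0) {
        w->done = 1;
        return 0;
    }

    for (int x = 0; x < bulk; x++) {
        enum chunk_state *c = &w->chunks[w->pos];
        if (*c == CHUNK_IDLE) {
            *c = CHUNK_CLEARED;
        } else if (*c == CHUNK_BUSY) {
            w->busy_items++;
            was_busy++;
        }
        w->pos++;
        if (w->pos >= w->nchunks)
            break;              /* reached the end of the page */
    }

    if (w->pos >= w->nchunks) {
        if (w->busy_items) {    /* something was skipped: rescan the page */
            w->pos = 0;
            w->busy_items = 0;
        } else {
            w->done = 1;
        }
    }
    return was_busy;
}
```

A caller would loop on page_walk_step until done is set, sleeping briefly whenever the return value is non-zero, which is what slab_rebalance_thread does with usleep(50).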

/* refcount == 0 is safe since nobody can incr while cache_lock is held.
 * refcount != 0 is impossible since flags/etc can be modified in other
 * threads. instead, note we found a busy one and bail. logic in do_item_get
 * will prevent busy items from continuing to be busy
 */
static int slab_rebalance_move(void) {
    slabclass_t *s_cls;
    int x;
    int was_busy = 0;
    int refcount = 0;
    enum move_status status = MOVE_PASS;

    pthread_mutex_lock(&cache_lock);
    pthread_mutex_lock(&slabs_lock);

    s_cls = &slabclass[slab_rebal.s_clsid];

    //slab_bulk_check is set from an environment variable in start_slab_maintenance_thread
    //and defaults to 1. The items on a page are processed in installments here as well
    for (x = 0; x < slab_bulk_check; x++) {
        item *it = slab_rebal.slab_pos;
        status = MOVE_PASS;
        if (it->slabs_clsid != 255) {
            void *hold_lock = NULL;
            uint32_t hv = hash(ITEM_key(it), it->nkey);
            if ((hold_lock = item_trylock(hv)) == NULL) {
                status = MOVE_LOCKED;
            } else {
                refcount = refcount_incr(&it->refcount);
                if (refcount == 1) { /* item is unlinked, unused */
                    //If it_flags & ITEM_SLABBED is set, the item was never
                    //handed out. Otherwise it was handed out and is now being
                    //returned. See the check in do_item_get that tests
                    //slab_rebalance_signal.
                    if (it->it_flags & ITEM_SLABBED) {//never handed out
                        /* remove from slab freelist */
                        if (s_cls->slots == it) {
                            s_cls->slots = it->next;
                        }
                        if (it->next) it->next->prev = it->prev;
                        if (it->prev) it->prev->next = it->next;
                        s_cls->sl_curr--;
                        status = MOVE_DONE;//this item is handled
                    } else {//a worker thread is still returning this item
                        status = MOVE_BUSY;
                    }
                } else if (refcount == 2) { /* item is linked but not busy */
                    //no worker thread references this item
                    if ((it->it_flags & ITEM_LINKED) != 0) {
                        //unlink the item from the hash table and the LRU queue
                        do_item_unlink_nolock(it, hv);
                        status = MOVE_DONE;
                    } else {
                        /* refcount == 1 + !ITEM_LINKED means the item is being
                         * uploaded to, or was just unlinked but hasn't been freed
                         * yet. Let it bleed off on its own and try again later */
                        status = MOVE_BUSY;
                    }
                } else {//a worker thread is referencing this item
                    status = MOVE_BUSY;
                }
                item_trylock_unlock(hold_lock);
            }
        }

        switch (status) {
            case MOVE_DONE:
                it->refcount = 0;//reset the reference count
                it->it_flags = 0;//clear all flags
                it->slabs_clsid = 255;
                break;
            case MOVE_BUSY:
                refcount_decr(&it->refcount); //note: no break here, falls through
            case MOVE_LOCKED:
                slab_rebal.busy_items++;
                was_busy++;//count items that could not be handled right away
                break;
            case MOVE_PASS:
                break;
        }

        //advance to the next item on the page
        slab_rebal.slab_pos = (char *)slab_rebal.slab_pos + s_cls->size;
        if (slab_rebal.slab_pos >= slab_rebal.slab_end)//reached the end of the page
            break;
    }

    //every item on the page has been visited
    if (slab_rebal.slab_pos >= slab_rebal.slab_end) {
        /* Some items were busy, start again from the top */
        //some items were skipped (a worker thread was referencing them)
        if (slab_rebal.busy_items) {//so the page must be scanned again from the start
            slab_rebal.slab_pos = slab_rebal.slab_start;
            slab_rebal.busy_items = 0;
        } else {
            slab_rebal.done++;//every item on the page has been handled
        }
    }

    pthread_mutex_unlock(&slabs_lock);
    pthread_mutex_unlock(&cache_lock);

    return was_busy;//number of busy items seen in this call
}

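Robbing the rich to help the poor:

In the code above, was_busy tells whether a worker thread is referencing an item on the page. The name slab_rebalance_move is actually a poor choice, because the function does not move anything: it removes the page's items from the hash table and the LRU queue. Once every item on the page has been handled, slab_rebal.done++ marks the work as complete. In slab_rebalance_thread, a true slab_rebal.done leads to a call to slab_rebalance_finish, which performs the actual page transfer and moves the page from one slab class to another.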

static void slab_rebalance_finish(void) {
    slabclass_t *s_cls;
    slabclass_t *d_cls;

    pthread_mutex_lock(&cache_lock);
    pthread_mutex_lock(&slabs_lock);

    s_cls = &slabclass[slab_rebal.s_clsid];
    d_cls   = &slabclass[slab_rebal.d_clsid];

    /* At this point the stolen slab is completely clear */
    //overwrite the victim's slot with the last page pointer, then shrink the list
    s_cls->slab_list[s_cls->killing - 1] =
        s_cls->slab_list[s_cls->slabs - 1];
    s_cls->slabs--;//the source slab class now has one page fewer
    s_cls->killing = 0;

    //zero every byte of the page; this step matters too
    memset(slab_rebal.slab_start, 0, (size_t)settings.item_size_max);

    //Give the page that slab_rebal.slab_start points to, taken from the
    //source slab class, to the destination slab class.
    d_cls->slab_list[d_cls->slabs++] = slab_rebal.slab_start;
    //Split the page by the destination slab class's item size and add
    //the pieces to the destination slab class's free item list
    split_slab_page_into_freelist(slab_rebal.slab_start,
        slab_rebal.d_clsid);

    //reset the rebalance state
    slab_rebal.done       = 0;
    slab_rebal.s_clsid    = 0;
    slab_rebal.d_clsid    = 0;
    slab_rebal.slab_start = NULL;
    slab_rebal.slab_end   = NULL;
    slab_rebal.slab_pos   = NULL;

    slab_rebalance_signal = 0;//work finished; the rebalance thread goes back to sleep

    pthread_mutex_unlock(&slabs_lock);
    pthread_mutex_unlock(&cache_lock);

}
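The pointer bookkeeping in slab_rebalance_finish is a "swap with last, then shrink" removal on the source list followed by an append on the destination list. Here is a minimal sketch of just that step. page_list and transfer_page are hypothetical names, and the capacity check stands in for the room that grow_slab_list reserved earlier in slab_rebalance_start.

```c
#include <stddef.h>

typedef struct {
    void **pages;      /* array of page pointers (slab_list) */
    size_t count;      /* pages currently owned (slabs) */
    size_t capacity;   /* slots available in `pages` */
} page_list;

/* Moves src->pages[victim] to the end of dst.
 * Returns 0 on success, -1 if victim is out of range or dst is full. */
static int transfer_page(page_list *src, size_t victim, page_list *dst) {
    void *page;

    if (victim >= src->count || dst->count >= dst->capacity)
        return -1;

    page = src->pages[victim];
    /* fill the hole with the last page, then drop the last slot */
    src->pages[victim] = src->pages[src->count - 1];
    src->count--;

    dst->pages[dst->count++] = page;
    return 0;
}
```

Note that the order of pages in the source list is not preserved; only the set of pages matters, because every page in a class is equivalent.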

