HBase Compaction Algorithms: ExploringCompactionPolicy
As of version 0.98, the default compaction policy is ExploringCompactionPolicy; it replaced RatioBasedCompactionPolicy as the default.
ExploringCompactionPolicy extends RatioBasedCompactionPolicy and overrides the applyCompactionPolicy method, which implements the file-selection strategy for minor compactions.
The body of applyCompactionPolicy:
    public List<StoreFile> applyCompactionPolicy(final List<StoreFile> candidates,
        boolean mightBeStuck, boolean mayUseOffPeak, int minFiles, int maxFiles) {
      // This ratio is used by the checks below. An off-peak ratio (default: 5.0)
      // can be configured so that more data is compacted during off-peak hours.
      final double currentRatio = mayUseOffPeak
          ? comConf.getCompactionRatioOffPeak() : comConf.getCompactionRatio();

      // Start off choosing nothing.
      List<StoreFile> bestSelection = new ArrayList<StoreFile>(0);
      List<StoreFile> smallest = mightBeStuck ? new ArrayList<StoreFile>(0) : null;
      long bestSize = 0;
      long smallestSize = Long.MAX_VALUE;

      int opts = 0, optsInRatio = 0, bestStart = -1; // for debug logging
      // Consider every starting place.
      for (int start = 0; start < candidates.size(); start++) {
        // Consider every different sub list permutation in between start and end with min files.
        for (int currentEnd = start + minFiles - 1;
            currentEnd < candidates.size(); currentEnd++) {
          List<StoreFile> potentialMatchFiles = candidates.subList(start, currentEnd + 1);

          // Sanity checks
          if (potentialMatchFiles.size() < minFiles) {
            continue;
          }
          if (potentialMatchFiles.size() > maxFiles) {
            continue;
          }

          // Compute the total size of files that will
          // have to be read if this set of files is compacted.
          long size = getTotalStoreSize(potentialMatchFiles);

          // Store the smallest set of files. This stored set of files will be used
          // if it looks like the algorithm is stuck.
          if (mightBeStuck && size < smallestSize) {
            smallest = potentialMatchFiles;
            smallestSize = size;
          }

          if (size > comConf.getMaxCompactSize()) {
            continue;
          }

          ++opts;
          if (size >= comConf.getMinCompactSize()
              && !filesInRatio(potentialMatchFiles, currentRatio)) {
            continue;
          }

          ++optsInRatio;
          if (isBetterSelection(bestSelection, bestSize, potentialMatchFiles, size, mightBeStuck)) {
            bestSelection = potentialMatchFiles;
            bestSize = size;
            bestStart = start;
          }
        }
      }
      if (bestSelection.size() == 0 && mightBeStuck) {
        LOG.debug("Exploring compaction algorithm has selected " + smallest.size()
            + " files of size " + smallestSize + " because the store might be stuck");
        return new ArrayList<StoreFile>(smallest);
      }
      LOG.debug("Exploring compaction algorithm has selected " + bestSelection.size()
          + " files of size " + bestSize + " starting at candidate #" + bestStart
          + " after considering " + opts + " permutations with " + optsInRatio + " in ratio");
      return new ArrayList<StoreFile>(bestSelection);
    }
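From the code, the main algorithm is as follows:
- Iterate over the files from first to last and evaluate every combination (contiguous sublist) that meets the conditions below
- The number of files in a combination must be >= minFiles (default: 3)
- The number of files in a combination must be <= maxFiles (default: 10)
- Compute the total size of the combination; it must be <= maxCompactSize (configured through hbase.hstore.compaction.max.size; the default is Long.MAX_VALUE, which effectively disables the limit. According to the official documentation, this value is worth changing only if compactions seem to happen too often without much benefit)
- If the total size is < minCompactSize, the combination qualifies outright; if it is >= minCompactSize, it must also pass the filesInRatio check
- The filesInRatio check: FileSize(i) <= (Sum(0, N, FileSize(_)) - FileSize(i)) * Ratio. In other words, every single file in the combination must satisfy singleFileSize <= (totalFileSize - singleFileSize) * currentRatio. The purpose is to avoid overly large compactions: the selection should not contain one very large file, and small files of similar size should be merged first whenever possible. The code is as follows: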
    private boolean filesInRatio(final List<StoreFile> files, final double currentRatio) {
      if (files.size() < 2) {
        return true;
      }

      long totalFileSize = getTotalStoreSize(files);

      for (StoreFile file : files) {
        long singleFileSize = file.getReader().length();
        long sumAllOtherFileSizes = totalFileSize - singleFileSize;

        if (singleFileSize > sumAllOtherFileSizes * currentRatio) {
          return false;
        }
      }
      return true;
    }
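- Find the best selection: prefer the combination with more files, and when two combinations have the same number of files, prefer the one with the smaller total size. The goal is to merge as many files as possible while generating as little IO as possible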
    private boolean isBetterSelection(List<StoreFile> bestSelection, long bestSize,
        List<StoreFile> selection, long size, boolean mightBeStuck) {
      if (mightBeStuck && bestSize > 0 && size > 0) {
        // Keep the selection that removes most files for least size. That penalizes adding
        // large files to compaction, but not small files, so we don't become totally inefficient
        // (might want to tweak that in future). Also, given the current order of looking at
        // permutations, prefer earlier files and smaller selection if the difference is small.
        final double REPLACE_IF_BETTER_BY = 1.05;
        double thresholdQuality = ((double)bestSelection.size() / bestSize) * REPLACE_IF_BETTER_BY;
        return thresholdQuality < ((double)selection.size() / size);
      }
      // Keep if this gets rid of more files. Or the same number of files for less io.
      return selection.size() > bestSelection.size()
          || (selection.size() == bestSelection.size() && size < bestSize);
    }
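To make the filesInRatio rule from the list above concrete, here is a minimal sketch that applies the same inequality to plain file sizes instead of StoreFile objects. The class name RatioCheckSketch and the sample sizes are hypothetical; only the inequality itself comes from the code shown earlier.

```java
// Hypothetical stand-alone sketch: file sizes are plain longs, not StoreFile objects.
final class RatioCheckSketch {

    // Same inequality as filesInRatio: every file must satisfy
    // size <= (total - size) * ratio.
    static boolean filesInRatio(long[] sizes, double ratio) {
        if (sizes.length < 2) {
            return true;
        }
        long total = 0;
        for (long s : sizes) {
            total += s;
        }
        for (long s : sizes) {
            if (s > (total - s) * ratio) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long[] similar = {100, 50, 40}; // largest file: 100 <= (190 - 100) * 1.2 = 108
        long[] skewed = {200, 50, 40};  // largest file: 200 >  (290 - 200) * 1.2 = 108
        System.out.println(filesInRatio(similar, 1.2)); // true
        System.out.println(filesInRatio(skewed, 1.2));  // false
        System.out.println(filesInRatio(skewed, 5.0));  // true: 200 <= 90 * 5.0 = 450
    }
}
```

The last call shows why a larger ratio lets bigger files into a selection: the same skewed set is rejected at 1.2 but accepted at 5.0.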
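That concludes the main algorithm. Below are some further details and optimizations:
The ratio used in step 6 (the filesInRatio check) defaults to 1.2, but a different value applies when the off-peak optimization is enabled: the off-peak ratio defaults to 5.0. The purpose of this optimization is to merge more data while business traffic is low. At present the off-peak window can only be specified as a range of hours within a day, which is not very flexible.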
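The ratio switch can be sketched as follows. This is a minimal illustration, not HBase's implementation: the class OffPeakRatioSketch, its method names, and the window semantics (start hour inclusive, end hour exclusive, allowed to wrap past midnight, empty when both hours are equal) are assumptions made for the example. Only the two default ratios come from the text above.

```java
// Hypothetical sketch of choosing the compaction ratio by hour of day.
final class OffPeakRatioSketch {
    static final double PEAK_RATIO = 1.2;     // default ratio quoted in the article
    static final double OFF_PEAK_RATIO = 5.0; // default off-peak ratio quoted in the article

    // Assumed semantics: [startHour, endHour), wrapping past midnight when start > end.
    static boolean isOffPeakHour(int hour, int startHour, int endHour) {
        if (startHour == endHour) {
            return false; // assumption: an empty window disables the optimization
        }
        if (startHour < endHour) {
            return startHour <= hour && hour < endHour;
        }
        return hour >= startHour || hour < endHour;
    }

    static double currentRatio(int hour, int startHour, int endHour) {
        return isOffPeakHour(hour, startHour, endHour) ? OFF_PEAK_RATIO : PEAK_RATIO;
    }

    public static void main(String[] args) {
        // Off-peak window from 22:00 to 06:00.
        System.out.println(currentRatio(23, 22, 6)); // 5.0
        System.out.println(currentRatio(3, 22, 6));  // 5.0
        System.out.println(currentRatio(12, 22, 6)); // 1.2
    }
}
```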
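The mightBeStuck parameter indicates whether compaction might get stuck. It is true when: number of candidate files - number of files currently being compacted + futureFiles (0 by default, 1 when some files are being compacted) >= hbase.hstore.blockingStoreFiles (default 10; this setting is also used by flush, which will be covered in a later analysis of flush). When it is true:
- The selection algorithm also looks for a smallest solution. Before step 4 above (the maxCompactSize check), it records the combination with the smallest total size
- In isBetterSelection, the rule changes to (bestSelection.size() / bestSize) * 1.05 < selection.size() / size, so a suitable selection is chosen by the ratio of file count to total size
- When returning the result, if no suitable best selection was found, the smallest solution is returned instead.
In effect, the mightBeStuck optimization guarantees that a smallest solution can still be picked for compaction when there are already many files, rather than letting the file count keep growing until a suitable combination appears.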
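The two pieces of the mightBeStuck logic described above can be illustrated with plain numbers. This is a minimal sketch under stated assumptions: the class StuckSelectionSketch and its method names are hypothetical, file selections are reduced to a (count, total size) pair, and both sizes are assumed to be positive (the real isBetterSelection falls back to the normal rule otherwise).

```java
// Hypothetical sketch of the mightBeStuck condition and the stuck-mode comparison.
final class StuckSelectionSketch {

    // candidates - compacting + futureFiles >= blockingStoreFiles
    static boolean mightBeStuck(int candidateFiles, int compactingFiles, int blockingStoreFiles) {
        int futureFiles = compactingFiles == 0 ? 0 : 1;
        return candidateFiles - compactingFiles + futureFiles >= blockingStoreFiles;
    }

    // Stuck-mode rule: replace the current best only if the new selection removes
    // at least 5% more files per byte. Assumes bestSize > 0 and size > 0.
    static boolean isBetterWhenStuck(int bestCount, long bestSize, int count, long size) {
        final double replaceIfBetterBy = 1.05;
        double thresholdQuality = ((double) bestCount / bestSize) * replaceIfBetterBy;
        return thresholdQuality < ((double) count / size);
    }

    public static void main(String[] args) {
        System.out.println(mightBeStuck(12, 3, 10)); // true: 12 - 3 + 1 = 10 >= 10
        System.out.println(mightBeStuck(9, 0, 10));  // false: 9 < 10

        // Current best: 3 files, 300 bytes, so the threshold is 0.01 * 1.05 = 0.0105.
        System.out.println(isBetterWhenStuck(3, 300, 4, 350)); // true: 4/350 is about 0.0114
        System.out.println(isBetterWhenStuck(3, 300, 4, 390)); // false: 4/390 is about 0.0103
        System.out.println(isBetterWhenStuck(3, 300, 3, 290)); // false: less than 5% better
    }
}
```

The 1.05 factor means a marginally better candidate does not displace the current best, which, given the iteration order, favors earlier files and smaller selections.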
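How does this algorithm differ from RatioBasedCompactionPolicy? Simply put, RatioBasedCompactionPolicy walks the StoreFile list from start to end and, as soon as it meets a sequence that satisfies the ratio condition, selects it for compaction. ExploringCompactionPolicy also walks the list from start to end, but it keeps track of the best selection seen so far and in the end chooses the globally best list.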
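The exploring search in its normal (not stuck) mode can be condensed into a short stand-alone sketch. It is an illustration under assumptions rather than the HBase code: the class ExploringSearchSketch is hypothetical, files are represented only by their sizes, the smallest-solution fallback is left out, and the result is returned as an index range.

```java
// Hypothetical sketch of the exhaustive window search over file sizes.
final class ExploringSearchSketch {

    // Ratio check on the window sizes[start..end] (inclusive) with the given total.
    static boolean inRatio(long[] sizes, int start, int end, long total, double ratio) {
        if (end - start + 1 < 2) {
            return true;
        }
        for (int i = start; i <= end; i++) {
            if (sizes[i] > (total - sizes[i]) * ratio) {
                return false;
            }
        }
        return true;
    }

    // Returns {startIndex, endIndexExclusive} of the best window, or {-1, -1} if none qualifies.
    static int[] select(long[] sizes, int minFiles, int maxFiles,
                        long minCompactSize, long maxCompactSize, double ratio) {
        int bestStart = -1;
        int bestCount = 0;
        long bestSize = 0;
        for (int start = 0; start < sizes.length; start++) {
            long size = 0;
            for (int end = start; end < sizes.length; end++) {
                size += sizes[end];
                int count = end - start + 1;
                if (count < minFiles || count > maxFiles) {
                    continue;
                }
                if (size > maxCompactSize) {
                    continue;
                }
                if (size >= minCompactSize && !inRatio(sizes, start, end, size, ratio)) {
                    continue;
                }
                // More files wins; on a tie, the smaller total size wins.
                if (count > bestCount || (count == bestCount && size < bestSize)) {
                    bestStart = start;
                    bestCount = count;
                    bestSize = size;
                }
            }
        }
        return bestStart < 0
            ? new int[] {-1, -1}
            : new int[] {bestStart, bestStart + bestCount};
    }

    public static void main(String[] args) {
        long[] sizes = {1000, 100, 50, 40, 30};
        int[] range = select(sizes, 3, 10, 0L, Long.MAX_VALUE, 1.2);
        // The 1000-byte file fails the ratio check in every window that contains it,
        // so the four small files at indexes 1..4 are chosen.
        System.out.println(java.util.Arrays.toString(range)); // [1, 5]
    }
}
```

Because every window is examined before a choice is made, the large first file does not block the selection; the search simply skips windows containing it and keeps the best of the rest.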