jQuery 1.11 Source Code Analysis (2): Regular Expressions in the Sizzle Source

Date: 2015-06-09

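Having read the previous post, you should have a rough picture of Sizzle, so we can now start digging into its source in earnest. Starting with the matchers would be too steep, so as an appetizer this post goes through what each regular expression in Sizzle does. (I had planned to cover initialization as well, but that would make the post too long, so it is left for the next one.)

A friendly reminder: please learn regular expressions before reading this post. You should at least be familiar with capture groups and the JavaScript regex APIs (exec, match, test, and building a RegExp from a string).

The source opens with a long run of variable and function declarations.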

var Sizzle =
/*!
 * Sizzle CSS Selector Engine v1.10.16
 * http://sizzlejs.com/
 *
 * Copyright 2013 jQuery Foundation, Inc. and other contributors
 * Released under the MIT license
 * http://jquery.org/license
 *
 * Date: 2014-01-13
 */
(function( window ) {
var i,
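    support,      // later initialized as an object that records feature-support results
    Expr,
    getText,
    isXML,        // checks whether a node belongs to an XML document
    compile,
    outermostContext,      // the outermost context, i.e. the widest search scope
    sortInput,
    hasDuplicate,  // set when the two elements just compared are the same element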

    // Local document vars
    setDocument,
    document,
    docElem,
    documentIsHTML,
    rbuggyQSA,
    rbuggyMatches,
    matches,
    contains,

    // Instance-specific data
    // used to tag special functions
    expando = "sizzle" + -(new Date()),
    // the preferred document node
    preferredDoc = window.document,
    dirruns = 0,
    done = 0,
    // these caches are actually functions:
    // store with classCache(key, value)
    // read with classCache[key + " "]
    classCache = createCache(),
    tokenCache = createCache(),
    compilerCache = createCache(),
    sortOrder = function( a, b ) {
        if ( a === b ) {
            hasDuplicate = true;
        }
        return 0;
    },

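Let's pause here. The declarations above create three caches, all through the utility function createCache. Put simply, this function creates a cache. With an object-oriented mindset we would usually declare an object and give it set, get, and delete methods, but Sizzle does not do it that way. Let's look at the source first.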

/**
 * Create key-value caches of limited size
 * @returns {Function(string, Object)} Returns the Object data after storing it on itself with
 * property name the (space-suffixed) string and (if the cache is larger than Expr.cacheLength)
 * deleting the oldest entry
 */
function createCache() {
    // keys already stored, in insertion order; kept private by the closure
    var keys = [];
    // the cache function itself doubles as the object that holds the data
    function cache( key, value ) {
        // Use (key + " ") to avoid collision with native prototype properties (see Issue #157)
        // once the number of keys exceeds the limit, the oldest entry is evicted
        if ( keys.push( key + " " ) > Expr.cacheLength ) {
            // Only keep the most recent entries
            delete cache[ keys.shift() ];
        }
        // return the value that was just stored
        return (cache[ key + " " ] = value);
    }
    return cache;
}

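createCache returns a function, and a function, like any other object, can hold key-value data. (Most of jQuery's utility functions are stored on the jQuery factory function in the same way, which saves a separate namespace.) Calling the returned function stores a value (set), and reading cache[key + " "] retrieves it (get). The space appended to the key keeps the entry from colliding with native properties of the function.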
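To make the set/get behavior concrete, here is a minimal sketch that reuses createCache as quoted above. The `Expr` object is a hypothetical stand-in for Sizzle's real one, and the limit of 2 is chosen only to make eviction visible; `demoCache` and the other variable names are likewise made up for this illustration.

```javascript
// Hypothetical stand-in for Sizzle's Expr; a limit of 2 is used only for this demo.
var Expr = { cacheLength: 2 };

function createCache() {
    var keys = [];
    function cache( key, value ) {
        if ( keys.push( key + " " ) > Expr.cacheLength ) {
            delete cache[ keys.shift() ];
        }
        return (cache[ key + " " ] = value);
    }
    return cache;
}

var demoCache = createCache();
demoCache( "a", 1 );
demoCache( "b", 2 );
demoCache( "c", 3 );                 // a third key exceeds the limit, so "a " is evicted

var evicted = demoCache[ "a " ];     // undefined
var kept = demoCache[ "c " ];        // 3

// "name" is a native property of every function; the space suffix avoids touching it.
demoCache( "name", "stored" );
var storedName = demoCache[ "name " ];   // "stored"
var nativeName = demoCache.name;         // still the function's own name, "cache"
```

Storing under "name " leaves the function's native `name` property untouched, which is the kind of collision the space suffix is meant to avoid.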

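Back to the declarations above, we continue with the remaining variables.

These variables hold references to native methods such as push. The methods are used many times, so caching the references helps performance.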

        
    // General-purpose constants
    // typeof undefined yields the string "undefined"; caching it allows comparisons of the form typeof x !== strundefined
    strundefined = typeof undefined,
    MAX_NEGATIVE = 1 << 31,

    // Instance methods
    hasOwn = ({}).hasOwnProperty,
    arr = [],
    pop = arr.pop,
    push_native = arr.push,
    push = arr.push,
    slice = arr.slice,
    // Use a stripped-down indexOf if we can't use a native one
    // check for the native API first
    indexOf = arr.indexOf || function( elem ) {
        var i = 0,
            len = this.length;
        for ( ; i < len; i++ ) {
            if ( this[i] === elem ) {
                return i;
            }
        }
        return -1;
    },
    // used when handling attribute selectors to recognize boolean attributes
    booleans = "checked|selected|async|autofocus|autoplay|controls|defer|disabled|hidden|ismap|loop|multiple|open|readonly|required|scoped",

    // Regular expressions

    // Whitespace characters http://www.w3.org/TR/css3-selectors/#whitespace
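    // the kinds of whitespace characters
    // one of the two backslashes in the string is the escape character, so \\x20 becomes \x20 in the regular expression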
    whitespace = "[\\x20\\t\\r\\n\\f]",

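Here we meet the second regular expression. (It is still a string at this point, but it will eventually be passed to new RegExp.)

It collects the whitespace characters allowed by the standard: space, tab, carriage return, line feed, and form feed. The backslashes are doubled because the backslash is the escape character inside a string literal, so a literal backslash has to be written as \\.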
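The effect of the doubled backslashes can be checked directly. The sketch below is only an illustration; `rwhitespaceOnly` and the other variable names are hypothetical and do not appear in Sizzle.

```javascript
var whitespace = "[\\x20\\t\\r\\n\\f]";

// After the string literal is parsed, each \\ is a single backslash,
// so the text handed to RegExp is [\x20\t\r\n\f], 14 characters long.
var patternLength = whitespace.length;

// Hypothetical regex built from the fragment, the way Sizzle builds its own.
var rwhitespaceOnly = new RegExp( "^" + whitespace + "+$" );

var matchesBlank = rwhitespaceOnly.test( " \t\r\n\f" );   // true
var matchesLetter = rwhitespaceOnly.test( " a " );        // false

// With single backslashes the string literal resolves the escapes itself:
// this string already contains a real space and a real tab, 4 characters in total.
var resolvedEarly = "[\x20\t]";
var resolvedLength = resolvedEarly.length;
```

For these simple escapes both spellings would still produce a working character class, but the doubled form keeps the escape sequences intact in the pattern text, which matters once the fragment contains sequences such as \\\\. that must reach the regex engine as written.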

Let's move on to the next regular expressions.

        
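    // http://www.w3.org/TR/css3-syntax/#characters
    // \\\\. in the string becomes \\. in the regular expression: a backslash followed by any character, i.e. a CSS escape
    // three alternatives per character: \\. , [\w-] , or any character above \xa0; see the link above for why these three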
    characterEncoding = "(?:\\\\.|[\\w-]|[^\\x00-\\xa0])+",

    // Loosely modeled on CSS identifier characters
    // An unquoted value should be a CSS identifier http://www.w3.org/TR/css3-selectors/#attribute-selectors
    // Proper syntax: http://www.w3.org/TR/CSS21/syndata.html#value-def-identifier
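    // the replace turns [\\w-] into [\\w#-], so # is also allowed here; I have not read the standard closely enough to say why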
    identifier = characterEncoding.replace( "w", "w#" ),

    // Acceptable operators http://www.w3.org/TR/selectors/#attribute-selectors
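    // \[ [\x20\t\r\n\f]* ((?:\\.|[\w-]|[^\x00-\xa0])+) [\x20\t\r\n\f]* (?:([*^$|!~]?=) [\x20\t\r\n\f]* (?:(['"]) ((?:\\.|[^\\])*?) \3 | (identifier) |)|) [\x20\t\r\n\f]* \]
    // \3 refers back to (['\"])
    // building the pattern by concatenation handles the many whitespace runs well and keeps it readable
    // capture groups:
    // $1: attrName, $2: ([*^$|!~]?=), $3: (['\"]), $4: ((?:\\\\.|[^\\\\])*?), $5: (" + identifier + ")
    // $1 captures the attribute name, $2 the operator (= or a variant such as ^=), $3 the opening quote
    // $4 captures a quoted value: escaped characters (\\.) or non-backslash characters, matched lazily up to the closing quote \3
    // $5 captures an unquoted value, which must be an identifier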
    attributes = "\\[" + whitespace + "*(" + characterEncoding + ")" + whitespace +
        "*(?:([*^$|!~]?=)" + whitespace + "*(?:(['\"])((?:\\\\.|[^\\\\])*?)\\3|(" + identifier + ")|)|)" + whitespace + "*\\]",

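The first two expressions above need no further comment. The third one starts and ends by matching the '[' and ']' that delimit an attribute selector, and its capture groups are, in order: attrName, operator, quote, quoted attrValue, unquoted attrValue.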
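The capture groups are easier to see on real input. The following sketch rebuilds the three strings as quoted above and runs exec on a quoted value, an unquoted value, and a bare attribute name. The name `rattr` is hypothetical; Sizzle stores the equivalent expression as matchExpr["ATTR"].

```javascript
var whitespace = "[\\x20\\t\\r\\n\\f]",
    characterEncoding = "(?:\\\\.|[\\w-]|[^\\x00-\\xa0])+",
    identifier = characterEncoding.replace( "w", "w#" ),
    attributes = "\\[" + whitespace + "*(" + characterEncoding + ")" + whitespace +
        "*(?:([*^$|!~]?=)" + whitespace + "*(?:(['\"])((?:\\\\.|[^\\\\])*?)\\3|(" + identifier + ")|)|)" + whitespace + "*\\]",
    // hypothetical name for this demo
    rattr = new RegExp( "^" + attributes );

// Quoted value: groups 1 to 4 are filled, group 5 is not.
var quoted = rattr.exec( "[data-role ^= \"nav bar\"]" );
// quoted[1] is "data-role", quoted[2] is "^=", quoted[3] is the double quote, quoted[4] is "nav bar"

// Unquoted value: the identifier alternative fills group 5 instead of groups 3 and 4.
var unquoted = rattr.exec( "[type=text]" );
// unquoted[1] is "type", unquoted[2] is "=", unquoted[5] is "text"

// Bare attribute name: only group 1 participates.
var bare = rattr.exec( "[disabled]" );
// bare[1] is "disabled", bare[2] is undefined
```

Groups that do not take part in a match come back as undefined, which is how later code can tell a quoted value from an unquoted one.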

Now for the next regular expression.

 
    // Prefer arguments quoted,
    //   then not containing pseudos/brackets,
    //   then attribute selectors/non-parenthetical expressions,
    //   then anything else
    // These preferences are here to reduce the number of selectors
    //   needing tokenize in the PSEUDO preFilter
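    // $1: pseudoName, $2: the whole argument, i.e. ((['\"])((?:\\\\.|[^\\\\])*?)\\3|((?:\\\\.|[^\\\\()[\\]]|" + attributes.replace( 3, 8 ) + ")*)|.*)
    // $3: (['\"]), $4: ((?:\\\\.|[^\\\\])*?), $5: ((?:\\\\.|[^\\\\()[\\]]|" + attributes.replace( 3, 8 ) + ")*)
    // $1 captures the name of the pseudo-class or pseudo-element; $2 captures the argument, tried first as a quoted string, then as text without parentheses or brackets (attribute selectors allowed), then as anything else
    // $3 captures the quote; $4 and $5 each capture part of $2
    // attributes.replace( 3, 8 ) rewrites the backreference \\3 as \\8, because the quote group of attributes becomes the 8th group once it is embedded here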
    pseudos = ":(" + characterEncoding + ")(?:\\(((['\"])((?:\\\\.|[^\\\\])*?)\\3|((?:\\\\.|[^\\\\()[\\]]|" + attributes.replace( 3, 8 ) + ")*)|.*)\\)|)",

    // Leading and non-escaped trailing whitespace, capturing some non-whitespace characters preceding the latter
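    // $1: ((?:^|[^\\\\])(?:\\\\.)*)
    // strips leading and trailing whitespace from the selector so it cannot interfere with the later matching of whitespace as a combinator
    rtrim = new RegExp( "^" + whitespace + "+|((?:^|[^\\\\])(?:\\\\.)*)" + whitespace + "+$", "g" ),
    // used to consume the comma that separates selector groups in a CSS rule
    rcomma = new RegExp( "^" + whitespace + "*," + whitespace + "*" ),
    // at first it was unclear what the whitespace alternative inside the group was for;
    // the answer is that whitespace is itself a combinator, the one that expresses the ancestor-descendant relationship
    // $1: ([>+~]|whitespace) captures one of the four combinators: '>', '+', '~', whitespace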
    rcombinators = new RegExp( "^" + whitespace + "*([>+~]|" + whitespace + ")" + whitespace + "*" ),
    // $1: ([^\\]'\"]*?)
    // captures an unquoted attribute value; as far as I recall, it is used in matchesSelector to add the missing quotes
    rattributeQuotes = new RegExp( "=" + whitespace + "*([^\\]'\"]*?)" + whitespace + "*\\]", "g" ),

    rpseudo = new RegExp( pseudos ),
    ridentifier = new RegExp( "^" + identifier + "$" ),

    // these are the expressions finally used for matching, typically as matchExpr[tokens[i].type].test(...)
    matchExpr = {
        "ID": new RegExp( "^#(" + characterEncoding + ")" ),
        "CLASS": new RegExp( "^\\.(" + characterEncoding + ")" ),
        "TAG": new RegExp( "^(" + characterEncoding.replace( "w", "w*" ) + ")" ),
        "ATTR": new RegExp( "^" + attributes ),
        "PSEUDO": new RegExp( "^" + pseudos ),
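        // $1: (only|first|last|nth|nth-last), $2: (child|of-type),
        // $3: (even|odd|(([+-]|)(\\d*)n|)" + whitespace + "*(?:([+-]|)" + whitespace + "*(\\d+)|))
        // $4: (([+-]|)(\\d*)n|), $5: ([+-]|), $6: (\\d*), $7: ([+-]|), $8: (\\d+)
        // used later to recognize child pseudo-classes such as :nth-child
        // $3 matches even, odd, or an an+b style expression: ([+-] or nothing)(\d*)n or nothing, optional whitespace, then optionally ([+-] or nothing) + whitespace + (one or more digits)
        // the reason for this shape is the child pseudo-class syntax that Sizzle supports
        // what each group means is covered later; read it together with the concrete usage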
        "CHILD": new RegExp( "^:(only|first|last|nth|nth-last)-(child|of-type)(?:\\(" + whitespace +
            "*(even|odd|(([+-]|)(\\d*)n|)" + whitespace + "*(?:([+-]|)" + whitespace +
            "*(\\d+)|))" + whitespace + "*\\)|)", "i" ),
        "bool": new RegExp( "^(?:" + booleans + ")$", "i" ),
        // For use in libraries implementing .is()
        // We use this for POS matching in `select`
        // $1 is (even|odd|eq|gt|lt|nth|first|last), $2 is ((?:-\\d)?\\d*)
        // a selector that matches this needs a context to be evaluated, unlike id, tag, and class selectors, which can be looked up directly
        "needsContext": new RegExp( "^" + whitespace + "*[>+~]|:(even|odd|eq|gt|lt|nth|first|last)(?:\\(" +whitespace + "*((?:-\\d)?\\d*)" + whitespace + "*\\)|)(?=[^-]|$)", "i" )
    },

    rinputs = /^(?:input|select|textarea|button)$/i,
    // h1, h2, h3, ...
    rheader = /^h\d$/i,
    // used to detect whether an API is native
    // e.g. document.createElement.toString() returns "function createElement() { [native code] }"
    rnative = /^[^{]+\{\s*\[native \w/,

    // Easily-parseable/retrievable ID or TAG or CLASS selectors
    // ID, TAG, and CLASS selectors are easy to recognize and fetch, and mostly have native APIs, so they are singled out
    rquickExpr = /^(?:#([\w-]+)|(\w+)|\.([\w-]+))$/,

    rsibling = /[+~]/,
    rescape = /'|\\/g,

    // CSS escapes http://www.w3.org/TR/CSS21/syndata.html#escaped-characters
    //$1:([\\da-f]{1,6}" + whitespace + "?|(" + whitespace + ")|.),$2:(" + whitespace + ")
    runescape = new RegExp( "\\\\([\\da-f]{1,6}" + whitespace + "?|(" + whitespace + ")|.)", "ig" ),
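    // funescape is the replacement callback used with runescape: it turns a CSS escape back into the character it stands for
    // jQuery even takes the encoding into account: JavaScript strings are UTF-16 (http://zh.wikipedia.org/wiki/UTF-16),
    // so a code point beyond the BMP (above 0xFFFF) has to be written as a surrogate pair, two code units that are each below 0x10000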
    funescape = function( _, escaped, escapedWhitespace ) {
        var high = "0x" + escaped - 0x10000;
        // NaN means non-codepoint
        // Support: Firefox
        // Workaround erroneous numeric interpretation of +"0x"
        // high !== high is true only when high is NaN, because NaN !== NaN
        // when high is a number and escapedWhitespace is undefined, the sign of high decides between the two branches below
        return high !== high || escapedWhitespace ?
            escaped :
            high < 0 ?
                // BMP codepoint
                String.fromCharCode( high + 0x10000 ) :
                // Supplemental Plane codepoint (surrogate pair)
                String.fromCharCode( high >> 10 | 0xD800, high & 0x3FF | 0xDC00 );
    };
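To see what runescape and funescape do together, the sketch below copies both from the listing above and applies them through String.prototype.replace. The wrapper `unescapeSelector` and the sample inputs are hypothetical and exist only for this illustration.

```javascript
var whitespace = "[\\x20\\t\\r\\n\\f]",
    runescape = new RegExp( "\\\\([\\da-f]{1,6}" + whitespace + "?|(" + whitespace + ")|.)", "ig" ),
    funescape = function( _, escaped, escapedWhitespace ) {
        var high = "0x" + escaped - 0x10000;
        return high !== high || escapedWhitespace ?
            escaped :
            high < 0 ?
                String.fromCharCode( high + 0x10000 ) :
                String.fromCharCode( high >> 10 | 0xD800, high & 0x3FF | 0xDC00 );
    };

// Hypothetical helper, used only in this sketch.
function unescapeSelector( text ) {
    return text.replace( runescape, funescape );
}

// Hex escape for "A" (0x41). The space ends the escape, so the hex digits "b" and "c"
// that follow are not swallowed into the code point.
var bmp = unescapeSelector( "\\41 bc" );        // "Abc"

// Not a hex escape: high is NaN, so the escaped character is returned unchanged.
var literal = unescapeSelector( "a\\.b" );      // "a.b"

// U+1F600 lies outside the BMP, so it becomes the surrogate pair D83D DE00.
var astral = unescapeSelector( "\\1F600" );
```

Subtracting 0x10000 up front lets one value serve two purposes: its sign separates BMP code points from supplementary ones, and for the latter it is already the 20-bit offset that gets split into the high and low surrogates.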