Notes from reading the aggregation chapter of the MongoDB documentation
Aggregation comes in three different forms:
1. the aggregation pipeline
2. single-purpose aggregation operations (count, group, distinct)
3. map-reduce
Pipeline expressions
Pipeline expressions can only operate on the current document in the pipeline; they cannot refer to other documents.
In general, expressions are stateless and are evaluated during aggregation, with one exception: accumulator expressions.
Accumulators, which can only be used in the $group pipeline stage, maintain state such as totals, maximums, minimums, and related data.
Optimizations:
The $match and $sort pipeline stages can take advantage of indexes when they appear at the beginning of the pipeline, avoiding a scan of every document in the collection.
The $geoNear pipeline stage, introduced in 2.4, can take advantage of a geospatial index; when used, it must be placed at the beginning of the pipeline.
Even when the pipeline uses indexes, the aggregation still has to access the actual documents; indexes cannot fully cover an aggregation pipeline.
Early Filtering
If an aggregation operation needs only a subset of the data in a collection, use the $match, $limit, and $skip stages to restrict the number of documents entering the pipeline. Placed at the start of the pipeline, a $match stage scans with a suitable index and lets only the matching documents enter the pipeline.
A $match at the start of the pipeline immediately followed by a $sort is logically equivalent to a simple query with a sort, and can use indexes. Whenever possible, place $match stages at the start of the pipeline.
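The point about early filtering can be illustrated outside the database with a minimal sketch in plain JavaScript (the sample documents and field values here are invented): filtering before sorting yields the same result as sorting before filtering, while the sort handles fewer documents.

```javascript
// Invented in-memory stand-ins for a $match + $sort pipeline.
const docs = [
  { name: "a", status: "A", age: 40 },
  { name: "b", status: "B", age: 35 },
  { name: "c", status: "A", age: 28 },
];

// $sort then $match: every document goes through the sort.
const sortThenMatch = [...docs]
  .sort((x, y) => y.age - x.age)
  .filter((d) => d.status === "A");

// $match then $sort: only matching documents reach the sort.
const matchThenSort = docs
  .filter((d) => d.status === "A")
  .sort((x, y) => y.age - x.age);
```

Both orderings produce the same output, which is why the optimizer is free to move $match ahead of $sort.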
Additional features:
The aggregation pipeline has an internal optimization phase to improve aggregation performance.
The aggregation pipeline supports running on sharded collections.
MongoDB map-reduce operations
The map function processes each input document and emits key-value pairs.
Output limit: each result must fit within the BSON document size limit, currently 16 MB.
Map-reduce supports sharded collections for both input and output.
Built-in optimizations:
1. Projection optimization: the aggregation pipeline can determine which fields it needs, so the pipeline uses only those fields.
Pipeline sequence optimization:
A pipeline written as $sort => $match is optimized into $match => $sort:
{ $sort: { age : -1 } },
{ $match: { status: 'A' } }
becomes:
{ $match: { status: 'A' } },
{ $sort: { age : -1 } }
$skip + $limit sequence optimization:
{ $skip: 10 },
{ $limit: 5 }
becomes:
{ $limit: 15 },
{ $skip: 10 }
The $limit moves before the $skip, and its value increases by the $skip amount.
$redact + $match sequence optimization: when a $match immediately follows a $redact, the part of the $match condition that does not depend on the $redact output can be moved ahead of it as a new $match stage:
{ $redact: { $cond: { if: { $eq: [ "$level", 5 ] }, then: "$$PRUNE", else: "$$DESCEND" } } },
{ $match: { year: 2014, category: { $ne: "Z" } } }
becomes:
{ $match: { year: 2014 } },
{ $redact: { $cond: { if: { $eq: [ "$level", 5 ] }, then: "$$PRUNE", else: "$$DESCEND" } } },
{ $match: { year: 2014, category: { $ne: "Z" } } }
Coalescence optimizations
$sort + $limit coalescence:
When a $sort is immediately followed by a $limit, the optimizer coalesces the $limit into the $sort, so the sort only has to track the top n results, saving memory.
The optimization still applies when allowDiskUse is true and the n items exceed the aggregation memory limit.
$limit + $limit coalescence:
Two consecutive $limit stages merge into one, taking the smaller of the two values.
$skip + $skip coalescence:
Two consecutive $skip stages merge into one whose value is the sum of the two.
$match + $match coalescence:
Two consecutive $match stages merge into one, with their conditions combined.
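These coalescence rules are simple enough to state as tiny JavaScript helpers (a sketch only; the function names are invented, and combining two $match conditions with $and is one way the merged filter can be expressed):

```javascript
// $limit + $limit: keep the smaller limit.
function coalesceLimits(a, b) { return Math.min(a, b); }

// $skip + $skip: skip the sum of the two.
function coalesceSkips(a, b) { return a + b; }

// $match + $match: one filter that requires both conditions.
function coalesceMatches(a, b) { return { $and: [a, b] }; }
```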
Example:
{ $sort: { age : -1 } },
{ $skip: 10 },
{ $limit: 5 }
After $skip + $limit reordering:
{ $sort: { age : -1 } },
{ $limit: 15 },
{ $skip: 10 }
after which $sort + $limit coalesce.
{ $limit: 100 },
{ $skip: 5 },
{ $limit: 10 },
{ $skip: 2 }
After $skip + $limit reordering:
{ $limit: 100 },
{ $limit: 15 },
{ $skip: 5 },
{ $skip: 2 }
After $limit + $limit and $skip + $skip coalescence:
{ $limit: 15 },
{ $skip: 7 }
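The reordering and coalescence steps above can be replayed with a toy optimizer over a list of { op, n } stages (a sketch only; it covers just $skip/$limit reordering and $limit/$skip coalescence, not the real optimizer):

```javascript
// Toy optimizer: rewrites a stage list until no rule applies.
function optimize(stages) {
  const out = [...stages];
  let changed = true;
  while (changed) {
    changed = false;
    for (let i = 0; i + 1 < out.length; i++) {
      if (out[i].op === "$skip" && out[i + 1].op === "$limit") {
        // $skip + $limit: move $limit first, increased by the skip amount.
        out.splice(i, 2,
          { op: "$limit", n: out[i + 1].n + out[i].n },
          { op: "$skip", n: out[i].n });
        changed = true;
      } else if (out[i].op === "$limit" && out[i + 1].op === "$limit") {
        // $limit + $limit: keep the smaller value.
        out.splice(i, 2, { op: "$limit", n: Math.min(out[i].n, out[i + 1].n) });
        changed = true;
      } else if (out[i].op === "$skip" && out[i + 1].op === "$skip") {
        // $skip + $skip: sum the two values.
        out.splice(i, 2, { op: "$skip", n: out[i].n + out[i + 1].n });
        changed = true;
      }
    }
  }
  return out;
}

// The worked example from the notes.
const optimized = optimize([
  { op: "$limit", n: 100 },
  { op: "$skip", n: 5 },
  { op: "$limit", n: 10 },
  { op: "$skip", n: 2 },
]);
```

Running it reproduces the final form { $limit: 15 }, { $skip: 7 }.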
Result Size Restrictions
To manage result sets that exceed the BSON document size limit, the aggregate command can return result sets of any size if the command returns a cursor or stores the results in a collection.
Changed in version 2.6: the aggregate command can return results as a cursor or store them in a collection, in which case they are not subject to the size limit. db.collection.aggregate() returns a cursor and can return result sets of any size.
Memory Restrictions
Changed in version 2.6.
Pipeline stages have a limit of 100 megabytes of RAM. If a stage exceeds this limit, MongoDB will produce an error.
To allow for the handling of large datasets, use the allowDiskUse option to enable aggregation pipeline stages to
write data to temporary files.
When an aggregation runs on a sharded collection, the pipeline is split into two parts. The first pipeline runs on each shard; or, when an early $match can exclude some shards through a predicate on the shard key, it runs only on the relevant shards.
The second pipeline runs on the primary shard. It merges the results of the first stage and runs the remaining stages on the merged results. The primary shard forwards the final result to mongos. Before 2.6, the second pipeline ran on mongos.
Map-reduce input/output on sharded collections
If the input collection is sharded, mongos automatically dispatches the map-reduce job to every shard in parallel; no extra handling is needed.
If the output collection is sharded:
If the out field for mapReduce has the sharded value, MongoDB shards the output collection using the _id field as the shard key.
- If the output collection does not exist, MongoDB creates and shards the collection on the _id field.
- For a new or an empty sharded collection, MongoDB uses the results of the first stage of the map-reduce operation to create the initial chunks distributed among the shards.
- mongos dispatches, in parallel, a map-reduce post-processing job to every shard that owns a chunk. During the post-processing, each shard will pull the results for its own chunks from the other shards, run the final reduce/finalize, and write locally to the output collection.
Return States with Populations above 10 Million
SELECT state, SUM(pop) AS totalPop
FROM zipcodes
GROUP BY state
HAVING totalPop >= (10*1000*1000)
db.zipcodes.aggregate( [
  { $group : { _id : "$state", totalPop : { $sum : "$pop" } } },
  { $match : { totalPop : { $gte : 10*1000*1000 } } }
] )
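What these two stages do can be mimicked in plain JavaScript over a small, made-up data set (the zipcode documents below are invented for illustration):

```javascript
// Invented sample documents, shaped like the zipcodes collection.
const zips = [
  { state: "CA", pop: 9 * 1000 * 1000 },
  { state: "CA", pop: 3 * 1000 * 1000 },
  { state: "WY", pop: 500 * 1000 },
];

// $group: { _id: "$state", totalPop: { $sum: "$pop" } }
const totals = {};
for (const z of zips) totals[z.state] = (totals[z.state] || 0) + z.pop;

// $match: { totalPop: { $gte: 10 * 1000 * 1000 } }
const big = Object.entries(totals)
  .filter(([, pop]) => pop >= 10 * 1000 * 1000)
  .map(([state, totalPop]) => ({ _id: state, totalPop }));
```

Only CA survives the $match here, with a summed population of 12 million.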
Return Average City Population by State
Group by state and city, then average the city populations per state.
db.zipcodes.aggregate( [
{ $group : { _id : { state : "$state", city : "$city" }, pop : { $sum : "$pop" } } },
{ $group : { _id : "$_id.state", avgCityPop : { $avg : "$pop" } } }
] )
// the equivalent SQL
select state, avg(sum_pop) from (select state, city, sum(pop) as sum_pop from zipcodes group by city, state) as temp
group by temp.state
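The two-stage grouping can be replayed in plain JavaScript (sample data invented): first sum pop per (state, city), then average those per-city sums per state.

```javascript
// Invented sample data: per-(state, city) population documents.
const zips = [
  { state: "NY", city: "Albany", pop: 50 },
  { state: "NY", city: "Albany", pop: 50 },   // same city: summed in stage one
  { state: "NY", city: "Buffalo", pop: 300 },
  { state: "TX", city: "Austin", pop: 200 },
];

// First $group: sum pop per (state, city).
const byCity = new Map();
for (const z of zips) {
  const key = z.state + "|" + z.city;
  byCity.set(key, (byCity.get(key) || 0) + z.pop);
}

// Second $group: average the per-city sums within each state.
const byState = {};
for (const [key, pop] of byCity) {
  const state = key.split("|")[0];
  (byState[state] = byState[state] || []).push(pop);
}
const avgCityPop = Object.fromEntries(
  Object.entries(byState).map(([s, pops]) =>
    [s, pops.reduce((a, b) => a + b, 0) / pops.length])
);
```

NY averages Albany (100) and Buffalo (300) to 200; note the duplicate Albany rows are summed before the average, just as the nested SQL query does.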
Return Largest and Smallest Cities by State
Return the biggest and smallest city (by population) in each state.
db.zipcodes.aggregate( { $group:
{ _id: { state: "$state", city: "$city" },
pop: { $sum: "$pop" } } },
{ $sort: { pop: 1 } },
{ $group:
{ _id : "$_id.state",
biggestCity: { $last: "$_id.city" },
biggestPop:
{ $last: "$pop" },
smallestCity: { $first: "$_id.city" },
smallestPop: { $first: "$pop" } } },
// the following $project is optional, and
// modifies the output format.
{ $project:
{ _id: 0,
state: "$_id",
biggestCity: { name: "$biggestCity", pop: "$biggestPop" },
smallestCity: { name: "$smallestCity", pop: "$smallestPop" } } } )
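The trick in this pipeline is that after an ascending $sort on pop, $first picks the smallest city and $last the biggest. A minimal JavaScript sketch of that idea (data invented, one state only):

```javascript
// Invented per-(state, city) totals, as if produced by the first $group.
const cities = [
  { state: "NY", city: "Albany", pop: 100 },
  { state: "NY", city: "Buffalo", pop: 300 },
  { state: "NY", city: "Yonkers", pop: 200 },
];

// $sort: { pop: 1 } — ascending, so the first element is the smallest.
cities.sort((a, b) => a.pop - b.pop);

const summary = {
  state: "NY",
  smallestCity: cities[0].city,                 // $first
  biggestCity: cities[cities.length - 1].city,  // $last
};
```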
Return the Five Most Common “Likes”
db.users.aggregate(
[
{ $unwind : "$likes" }, // $unwind separates each value in the likes array, creating a new version of the source document for every element. In short, it splits arrays.
{ $group : { _id : "$likes" , number : { $sum : 1 } } },
{ $sort : { number : -1 } },
{ $limit : 5 }
]
)
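A quick sketch of what $unwind does to each document, in plain JavaScript (the user documents are invented):

```javascript
// Invented user documents with a "likes" array.
const users = [
  { _id: 1, likes: ["tennis", "golf"] },
  { _id: 2, likes: ["golf"] },
];

// $unwind: emit one output document per element of the likes array,
// with the array field replaced by the single element.
const unwound = users.flatMap((u) =>
  u.likes.map((like) => ({ _id: u._id, likes: like }))
);
```

Two input documents with three array elements in total become three output documents, which the following $group then counts per value.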
A map-reduce example:
Sample document:
{
_id: ObjectId("50a8240b927d5d8b5891743c"),
cust_id: "abc123",
ord_date: new Date("Oct 04, 2012"),
status: 'A',
price: 25,
items: [ { sku: "mmm", qty: 5, price: 2.5 },
{ sku: "nnn", qty: 5, price: 2.5 } ]
}
Return the Total Price Per Customer
The map function:
var mapFunction1 = function() {
emit(this.cust_id, this.price);
};
The reduce function:
var reduceFunction1 = function(keyCustId, valuesPrices) {
return Array.sum(valuesPrices);
};
Putting them together:
db.orders.mapReduce(
mapFunction1,
reduceFunction1,
{ out: "map_reduce_example" } // write the output to the "map_reduce_example" collection
)
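The emit/reduce flow can be simulated in plain JavaScript (the order documents are invented; note that Array.sum in the reduce function above is a mongo-shell helper, so plain reduce is used here instead):

```javascript
// Invented order documents, matching the sample structure above.
const orders = [
  { cust_id: "abc123", price: 25 },
  { cust_id: "abc123", price: 10 },
  { cust_id: "xyz789", price: 5 },
];

// map phase: emit(cust_id, price); here the emits are collected per key.
const emits = {};
for (const o of orders) {
  (emits[o.cust_id] = emits[o.cust_id] || []).push(o.price);
}

// reduce phase: sum the emitted values for each key.
const totalPerCustomer = Object.fromEntries(
  Object.entries(emits).map(([k, vals]) =>
    [k, vals.reduce((a, b) => a + b, 0)])
);
```

Each key ends up with the total price across that customer's orders, which is what the map_reduce_example collection would hold.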
========
To be continued...
Document download link:
http://pan.baidu.com/s/1jiFOM