Friday 14 February 2014

Difference between the Hash and Sort grouping methods in the Aggregator stage

Grouping Methods


Hash (default)

1) Running calculations for every group are held in memory at the same time
2) Results are written out only after all input has been processed, so memory use grows with the number of distinct groups and can become large when the input volume is high
3) Input does not need to be sorted
4) Useful when the number of unique groups is small
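
The Hash behaviour described above can be illustrated with a minimal Python sketch. This is not DataStage code; the function name hash_aggregate and the (key, amount) row layout are assumptions made only for illustration, and a sum stands in for whatever calculation the stage performs.

```python
from collections import defaultdict


def hash_aggregate(rows):
    """Sum `amount` per `key`; rows may arrive in any order (hypothetical helper)."""
    totals = defaultdict(int)  # one entry per distinct group, all kept in memory
    for key, amount in rows:
        totals[key] += amount
    # Nothing can be emitted until the whole input has been read,
    # because any group may still receive more rows.
    return dict(totals)


rows = [("b", 2), ("a", 1), ("b", 3), ("a", 4)]
print(hash_aggregate(rows))
```

The dictionary grows with the number of distinct keys, which is why this approach suits inputs with few unique groups.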

Sort

1) Requires the input data to be sorted by the grouping keys
2) Only a single aggregation group is kept in memory at a time, so less memory is required
3) When a row belonging to a new group arrives, the completed current group is written out
4) Can handle an unlimited number of groups
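
The Sort behaviour described above can be illustrated with a minimal Python sketch. This is not DataStage code; the generator name sort_aggregate and the (key, amount) row layout are assumptions made only for illustration, and a sum stands in for whatever calculation the stage performs.

```python
def sort_aggregate(sorted_rows):
    """Yield (key, total) per group; input must already be sorted by key (hypothetical helper)."""
    current_key = None
    total = 0
    have_group = False
    for key, amount in sorted_rows:
        if have_group and key != current_key:
            # A new group has started, so the previous one is complete: write it out.
            yield current_key, total
            total = 0
        current_key = key
        total += amount
        have_group = True
    if have_group:
        # Flush the last group once the input is exhausted.
        yield current_key, total


rows = sorted([("b", 2), ("a", 1), ("b", 3), ("a", 4)])
for key, total in sort_aggregate(rows):
    print(key, total)
```

Only one running total is held at any moment, so memory use stays constant no matter how many groups the input contains. The cost is that the input has to be sorted first.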

Conclusion: When the input volume is high and not predictable, it is better to use the Sort method, because its memory use does not grow with the number of groups.
