
[Repost] Writing an Hadoop MapReduce Program in Python

JessYanCoding / 1147 reads

mapper.py


    #!/usr/bin/env python
    """A more advanced Mapper, using Python iterators and generators."""

    import sys


    def read_input(file):
        # lazily yield each input line as a list of words
        for line in file:
            # split the line into words
            yield line.split()


    def main(separator="\t"):
        # input comes from STDIN (standard input)
        data = read_input(sys.stdin)
        for words in data:
            # write the results to STDOUT (standard output);
            # what we output here will be the input for the
            # Reduce step, i.e. the input for reducer.py
            #
            # tab-delimited; the trivial word count is 1
            for word in words:
                # print(...) with one argument runs on Python 2 and 3
                print("%s%s%d" % (word, separator, 1))


    if __name__ == "__main__":
        main()

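To make the mapper's output format concrete, the following is a minimal sketch that runs the same generator logic on an in-memory sample instead of STDIN. The helper name `map_words` and the sample text are illustrative assumptions, not part of the original script.

```python
import io


def read_input(file):
    # same generator as in mapper.py: one list of words per line
    for line in file:
        yield line.split()


def map_words(file, separator="\t"):
    """Hypothetical helper: return the lines mapper.py would print."""
    return ["%s%s%d" % (word, separator, 1)
            for words in read_input(file)
            for word in words]


# made-up sample input standing in for sys.stdin
sample = io.StringIO(u"foo foo quux\nlabs foo\n")
mapped = map_words(sample)
for line in mapped:
    print(line)
```

Each word is emitted with a count of 1, in input order; no aggregation happens in the map step.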
reducer.py


    #!/usr/bin/env python
    """A more advanced Reducer, using Python iterators and generators."""

    from itertools import groupby
    from operator import itemgetter
    import sys


    def read_mapper_output(file, separator="\t"):
        # yield [word, count] pairs, splitting on the first separator only
        for line in file:
            yield line.rstrip().split(separator, 1)


    def main(separator="\t"):
        # input comes from STDIN (standard input)
        data = read_mapper_output(sys.stdin, separator=separator)
        # groupby groups multiple word-count pairs by word,
        # and creates an iterator that returns consecutive keys and their group:
        #   current_word - string containing a word (the key)
        #   group - iterator yielding all ["<current_word>", "<count>"] items
        for current_word, group in groupby(data, itemgetter(0)):
            try:
                total_count = sum(int(count) for _, count in group)
                # print(...) with one argument runs on Python 2 and 3
                print("%s%s%d" % (current_word, separator, total_count))
            except ValueError:
                # count was not a number, so silently discard this item
                pass


    if __name__ == "__main__":
        main()
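Because `groupby` only merges consecutive items with the same key, the reducer depends on its input arriving sorted by word, which Hadoop's sort phase provides between the map and reduce steps. The sketch below illustrates this on a small in-memory list; the helper name `reduce_counts` and the sample data are assumptions for illustration, and the built-in `sorted()` merely stands in for Hadoop's sorting.

```python
from itertools import groupby
from operator import itemgetter


def reduce_counts(lines, separator="\t"):
    """Hypothetical helper: sum counts per run of identical words."""
    pairs = (line.rstrip().split(separator, 1) for line in lines)
    totals = []
    for word, group in groupby(pairs, itemgetter(0)):
        totals.append((word, sum(int(count) for _, count in group)))
    return totals


# made-up mapper output, in the order the mapper emitted it
mapper_output = ["foo\t1", "foo\t1", "quux\t1", "labs\t1", "foo\t1"]

# unsorted input: "foo" forms two separate runs and is reported twice
unsorted_totals = reduce_counts(mapper_output)

# sorted() stands in for the sort phase (a simplification: it sorts
# whole lines rather than keys only)
sorted_totals = reduce_counts(sorted(mapper_output))

print(unsorted_totals)
print(sorted_totals)
```

When testing the two scripts locally without Hadoop, the mapper output likewise needs to be sorted before it is fed to the reducer.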

Source: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

