2010-07-02
Data-Intensive Text Processing with MapReduce
Authors: Jimmy Lin and Chris Dyer
Abstract: Our world is being revolutionized by data-driven methods: access to large amounts of data has generated new insights and opened exciting new opportunities in commerce, science, and computing applications. Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance. This book focuses on MapReduce algorithm design, with an emphasis on text processing algorithms common in natural language processing, information retrieval, and machine learning. We introduce the notion of MapReduce design patterns, which represent general reusable solutions to commonly occurring problems across a variety of problem domains. This book not only intends to help the reader "think in MapReduce", but also discusses limitations of the programming model as well.
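For readers new to the programming model the abstract describes, here is a minimal single-machine sketch of word counting in the map/shuffle/reduce style. It is not taken from the book and does not use Hadoop; the names `map_word_count`, `reduce_word_count`, and `run_mapreduce` are hypothetical, and the in-memory grouping only imitates what a real execution framework does across a cluster.

```python
from collections import defaultdict


def map_word_count(doc_id, text):
    """Mapper: emit (word, 1) for every token in one document."""
    for word in text.lower().split():
        yield word, 1


def reduce_word_count(word, counts):
    """Reducer: sum all partial counts collected for one word."""
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Toy in-memory driver (an assumption for illustration only).

    A real framework would distribute the map and reduce calls over
    many machines and handle scheduling and fault tolerance.
    """
    # Shuffle phase: group every intermediate value by its key.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)

    # Reduce phase: process keys in sorted order, one reducer call per key.
    output = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            output[out_key] = out_value
    return output


# Two tiny made-up documents, keyed by a hypothetical document id.
docs = [("d1", "a rose is a rose"), ("d2", "a daisy is not a rose")]
word_counts = run_mapreduce(docs, map_word_count, reduce_word_count)
print(word_counts)
```

The mapper and reducer never share state directly; everything passes through key-value pairs, which is what lets a framework run them in parallel on separate machines.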

1 Introduction
1.1 Computing in the Clouds
1.2 Big Ideas
1.3 Why Is This Different?
1.4 What This Book Is Not

2 MapReduce Basics
2.1 Functional Programming Roots
2.2 Mappers and Reducers
2.3 The Execution Framework
2.4 Partitioners and Combiners
2.5 The Distributed File System
2.6 Hadoop Cluster Architecture
2.7 Summary

3 MapReduce Algorithm Design
3.1 Local Aggregation
3.1.1 Combiners and In-Mapper Combining
3.1.2 Algorithmic Correctness with Local Aggregation
3.2 Pairs and Stripes
3.3 Computing Relative Frequencies
3.4 Secondary Sorting
3.5 Relational Joins
3.5.1 Reduce-Side Join
3.5.2 Map-Side Join
3.5.3 Memory-Backed Join
3.6 Summary

4 Inverted Indexing for Text Retrieval
4.1 Web Crawling
4.2 Inverted Indexes
4.3 Inverted Indexing: Baseline Implementation
4.4 Inverted Indexing: Revised Implementation
4.5 Index Compression
4.5.1 Byte-Aligned and Word-Aligned Codes
4.5.2 Bit-Aligned Codes
4.5.3 Postings Compression
4.6 What About Retrieval?
4.7 Summary and Additional Readings

5 Graph Algorithms
5.1 Graph Representations
5.2 Parallel Breadth-First Search
5.3 PageRank
5.4 Issues with Graph Processing
5.5 Summary and Additional Readings

6 EM Algorithms for Text Processing
6.1 Expectation Maximization
6.1.1 Maximum Likelihood Estimation
6.1.2 A Latent Variable Marble Game
6.1.3 MLE with Latent Variables
6.1.4 Expectation Maximization
6.1.5 An EM Example
6.2 Hidden Markov Models
6.2.1 Three Questions for Hidden Markov Models
6.2.2 The Forward Algorithm
6.2.3 The Viterbi Algorithm
6.2.4 Parameter Estimation for HMMs
6.2.5 Forward-Backward Training: Summary
6.3 EM in MapReduce
6.3.1 HMM Training in MapReduce
6.4 Case Study: Word Alignment for Statistical Machine Translation
6.4.1 Statistical Phrase-Based Translation
6.4.2 Brief Digression: Language Modeling with MapReduce
6.4.3 Word Alignment
6.4.4 Experiments
6.5 EM-Like Algorithms
6.5.1 Gradient-Based Optimization and Log-Linear Models
6.6 Summary and Additional Readings

7 Closing Remarks
7.1 Limitations of MapReduce
7.2 Alternative Computing Paradigms
7.3 MapReduce and Beyond
All replies
2010-7-2 20:11:45
That was fast!

2010-7-2 20:13:50
I saw that some friends here share an interest in Cloud Computing and Text Mining, so I am simply sharing this with everyone.

2010-7-3 09:26:40
Good book: data mining based on cloud computing.

2011-6-3 11:44:37
Good book, thanks to the original poster for sharing.

2019-7-28 15:55:13
thanks for sharing