MapReduce is a programming framework introduced by Google for parallel computation over very large data sets, up to several petabytes,[1] on computer clusters.
Overview
MapReduce is a framework for running certain classes of distributed tasks on a large number of computers (called "nodes") that together form a cluster.
A MapReduce job consists of two steps: Map and Reduce.
In the Map step, the input data is preprocessed. One of the computers (the master node) receives the input data for the task, splits it into parts, and passes them to other computers (worker nodes) for preprocessing. The step takes its name from the higher-order function of the same name.
In the Reduce step, the preprocessed data is folded (reduced). The master node collects the answers from the worker nodes and combines them into the result: the solution to the task as originally posed.
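The two steps can be illustrated with a minimal single-process sketch in Python. It is not how any real implementation is written: the word-count task and the names `map_step`, `reduce_step`, and `run_job` are assumptions chosen for illustration, and the "nodes" are ordinary function calls.

```python
from collections import defaultdict

def map_step(chunk):
    """Preprocess one part of the input: emit a (word, 1) pair per word."""
    return [(word.lower(), 1) for word in chunk.split()]

def reduce_step(key, values):
    """Fold all values collected for one key into a single result."""
    return key, sum(values)

def run_job(chunks):
    # The "master" hands each part of the input to a map call
    # (sequentially here; a real cluster would use worker nodes).
    intermediate = []
    for chunk in chunks:
        intermediate.extend(map_step(chunk))

    # Group the intermediate pairs by key before reducing.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce each group and assemble the final answer.
    return dict(reduce_step(key, values) for key, values in grouped.items())

# Hypothetical input, already split into three parts.
word_counts = run_job(["the quick brown fox", "the lazy dog", "The end"])
```

Here `word_counts` maps each distinct word to the number of times it occurs across all parts of the input.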
The advantage of MapReduce is that both preprocessing and reduction can be performed in a distributed fashion. Preprocessing operations are independent of one another and can run in parallel (although in practice parallelism is limited by the input data source and/or the number of processors available). Similarly, a set of worker nodes can perform the reduction; the only requirement is that all preprocessing results with a given key value are handled by a single worker node at the same time. Although this process can be less efficient than more sequential algorithms, MapReduce can be applied to data volumes large enough to require a large number of servers. For example, MapReduce can be used to sort a petabyte of data in only a few hours. Parallelism also offers some capacity to recover from partial server failures: if a worker node fails while performing a preprocessing or reduction operation, its work can be reassigned to another worker node (provided the input data for that operation is still available).
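The requirement that all results for one key reach a single worker can be sketched as follows. This is an illustration under stated assumptions rather than a description of any real implementation: a thread pool stands in for worker nodes (in CPython this shows the structure, not a real speed-up), and `NUM_REDUCERS`, `map_task`, `partition`, `reduce_task`, and `run_parallel` are hypothetical names.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_REDUCERS = 3  # hypothetical number of reduce workers

def map_task(chunk):
    """Independent preprocessing of one input part."""
    return [(word.lower(), 1) for word in chunk.split()]

def partition(key):
    """Deterministically assign a key to exactly one reduce worker."""
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS

def reduce_task(pairs):
    """Fold every pair in one bucket; each key appears in only one bucket."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def run_parallel(chunks):
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Map tasks do not depend on each other, so they can run concurrently.
        mapped = list(pool.map(map_task, chunks))

        # Route every intermediate pair to the bucket that owns its key.
        buckets = [[] for _ in range(NUM_REDUCERS)]
        for pairs in mapped:
            for key, value in pairs:
                buckets[partition(key)].append((key, value))

        # Buckets share no keys, so the reduce tasks are independent too.
        reduced = list(pool.map(reduce_task, buckets))

    result = {}
    for part in reduced:
        result.update(part)
    return result

parallel_counts = run_parallel(["a b a", "b c", "a c c"])
```

Because each task depends only on its own input, a failed map or reduce task could simply be submitted again to another worker, as long as its input is still available.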
The framework is largely based on the map and reduce functions widely used in functional programming,[2] although the actual semantics of the framework differ from those prototypes.[3]
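For comparison, the functional prototypes look like this in Python, using the built-in `map` and `functools.reduce`; the input list is an arbitrary example. Note one visible difference: here `reduce` folds a whole list into a single value, whereas in the framework the reduction is applied separately to the values collected for each key.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# map: apply a function to every element independently.
squares = list(map(lambda x: x * x, numbers))

# reduce: fold the whole list into one value, starting from 0.
total = reduce(lambda acc, x: acc + x, squares, 0)
```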
Available implementations
- Google implemented MapReduce in C++, with interfaces for Python and Java.
- Greenplum is a commercial implementation of MapReduce with support for Python, Perl, SQL, and other languages.[4]
- GridGain is a free, open-source implementation of MapReduce in Java.
- The Apache Hadoop project is a free, open-source implementation of MapReduce in Java.
- Phoenix is an implementation of MapReduce in C that uses shared memory.
- MapReduce has also been implemented for the Cell Broadband Engine in C.
- MapReduce has been implemented on NVIDIA graphics processors using CUDA.
- Qt Concurrent is a simplified version of the framework, implemented in C++, that distributes a task across several cores of a single computer.
- CouchDB uses MapReduce to define views over distributed documents.
- MongoDB also allows MapReduce to be used for parallel processing of queries across several servers.
- Skynet is an open-source implementation in Ruby.
- Disco is an implementation of MapReduce created by Nokia. Its core is written in Erlang, and applications for it can be written in Python.
- Hive is an open-source layer from Facebook that combines the MapReduce approach with data access through an SQL-like language.
- Qizmt is an open-source implementation of MapReduce from MySpace, written in C#.
Notes
- [1] "Google spotlights data center inner workings", Tech news blog, CNET News.com
- [2] "Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages." From "MapReduce: Simplified Data Processing on Large Clusters", by Jeffrey Dean and Sanjay Ghemawat; from Google Labs
- [3] "Google's MapReduce Programming Model, Revisited", paper by Ralf Lämmel; from Microsoft
- [4] "Parallel Programming in the Age of Big Data"
Links
- MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat (in English)
- MapReduce and parallel DBMS: friends or enemies?, citforum.ru
- IBM MapReduce Tools for Eclipse