A Deep Dive into MapReduce

1. Flowchart

Read the flowchart first, then follow the text below, which explains each stage in detail. I hope it helps.

1st stage, Input:

The input is two files, file1 and file2:

file1 contains this data:

file2 contains this data:

Function: cut the input into splits and convert each record into a key-value pair

Output: split1

split2
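Here is a minimal sketch of what the Input stage produces. It assumes plain-text input where each record becomes a (byte offset, line) pair, which is what Hadoop's default TextInputFormat does. The sample contents and the helper name `read_records` are hypothetical, and each small file is treated as exactly one split (real splits are cut by size).

```python
def read_records(data: bytes):
    """Yield (byte_offset, line) pairs, one per line of a split."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


# Hypothetical contents, for illustration only.
split1 = b"hadoop spark\nhive hadoop\n"
split2 = b"spark hive\n"

records1 = list(read_records(split1))
records2 = list(read_records(split2))
print(records1)  # [(0, 'hadoop spark'), (13, 'hive hadoop')]
print(records2)  # [(0, 'spark hive')]
```

The key is the position of the line in the file and the value is the line itself, so the Map stage usually ignores the key and works on the value.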

2nd stage, Map:

Function: start one MapTask per split, then call the map method on every record in that MapTask's split

MapTask1 :

MapTask2
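To make the Map stage concrete, here is a hedged sketch that assumes a word-count job, since the text does not say which job is being run. `map_fn` and `run_map_task` are hypothetical names; in Hadoop the counterpart is the `map` method of a `Mapper`, which is called once per input record.

```python
def map_fn(key, value):
    """Word-count style map: ignore the offset key, emit (word, 1) per word."""
    for word in value.split():
        yield word, 1


def run_map_task(records):
    """One MapTask: call map_fn on every record of its split."""
    output = []
    for key, value in records:
        output.extend(map_fn(key, value))
    return output


# Hypothetical records, given as (byte offset, line) pairs.
map_task1 = run_map_task([(0, "hadoop spark"), (13, "hive hadoop")])
map_task2 = run_map_task([(0, "spark hive")])
print(map_task1)  # [('hadoop', 1), ('spark', 1), ('hive', 1), ('hadoop', 1)]
print(map_task2)  # [('spark', 1), ('hive', 1)]
```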

== Shuffle: partition, sort, group ==

Map-side shuffle: processes the output of the Map stage

  ==Spill== [write the data held in memory out to a file on disk]

  • Each MapTask stores the results of its own processing in a circular in-memory buffer [100 MB by default].

  • When the buffer is 80% full, a spill is triggered, and the data being spilled is == partitioned and sorted ==

  • Partitioning: by default the partition is computed from the hash of the key; in effect, each record is tagged with the ReduceTask it belongs to

  • MapTask1 : filea

  • MapTask1 : fileb

  • MapTask2 : filea

  • MapTask2: fileb

  • Sorting: calls the configured comparator or the key's compareTo method; the implementation is quicksort

  • ⚠️ Note that this is not a global sort over the entire data set; the data is only == sorted within each partition ==

  • MapTask1 : filea

  • MapTask1 : fileb

  • MapTask2 : filea

  • MapTask2: fileb

  • Merge: each MapTask merges all the small spill files it has generated, so that each MapTask ends up with a single large file

  • The merge keeps the data sorted, using the same comparator or compareTo logic

  • MapTask1

  • MapTask2
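The map-side steps above (buffer, spill, partition, sort, merge) can be sketched in a few lines. This is a simplified model, not Hadoop's implementation: the buffer counts records instead of bytes, spill "files" are plain lists, and `zlib.crc32` stands in for the key's `hashCode()`. The partition formula itself mirrors Hadoop's default `HashPartitioner`, which computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`. All function and constant names are hypothetical.

```python
import heapq
import zlib

NUM_REDUCE_TASKS = 2
BUFFER_CAPACITY = 5   # records, standing in for the 100 MB buffer
SPILL_PERCENT = 80    # spill once the buffer is 80% full
SPILL_AT = BUFFER_CAPACITY * SPILL_PERCENT // 100


def partition(key, num_reduce_tasks=NUM_REDUCE_TASKS):
    """Hash partitioning; crc32 is a stand-in for the key's hashCode()."""
    return (zlib.crc32(key.encode("utf-8")) & 0x7FFFFFFF) % num_reduce_tasks


def sort_key(record):
    """Order by partition first, then by key inside the partition."""
    return record[0], record[1]


def map_side_shuffle(map_output):
    """Buffer map output, spill at the threshold, then merge all spills."""
    buffer, spills = [], []
    for key, value in map_output:
        buffer.append((partition(key), key, value))
        if len(buffer) >= SPILL_AT:
            spills.append(sorted(buffer, key=sort_key))
            buffer = []
    if buffer:  # final spill when the MapTask finishes
        spills.append(sorted(buffer, key=sort_key))
    # Every spill is already sorted, so a k-way merge preserves the order.
    merged = list(heapq.merge(*spills, key=sort_key))
    return spills, merged


# Hypothetical map output for one MapTask.
map_output = [("hadoop", 1), ("spark", 1), ("hive", 1), ("hadoop", 1),
              ("spark", 1), ("hive", 1)]
spills, merged = map_side_shuffle(map_output)
print([len(s) for s in spills])  # [4, 2]
for part, key, value in merged:
    print(part, key, value)
```

Because the sort key is (partition, key), the merged file is ordered inside each partition but not across the whole data set, which is exactly the point of the warning above.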

Reduce-side shuffle: feeds the map results into Reduce

== Fetch == Over HTTP, each ReduceTask goes to every MapTask and pulls the data that belongs to it

ReduceTask1 fetches the data in MapTask1 and MapTask2 that is tagged for reduce1

ReduceTask2 fetches the data in MapTask1 and MapTask2 that is tagged for reduce2
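A small sketch of the fetch step, with the HTTP transfer replaced by a plain in-memory lookup. The function name `fetch`, the sample records, and the hand-assigned partition numbers are all hypothetical.

```python
def fetch(reduce_id, map_outputs):
    """Collect, from every MapTask, the sorted run tagged for this ReduceTask."""
    runs = []
    for map_output in map_outputs:
        run = [(key, value) for part, key, value in map_output
               if part == reduce_id]
        runs.append(run)
    return runs


# Hypothetical merged map outputs as (partition, key, value) records.
map_task1 = [(0, "hadoop", 1), (0, "hadoop", 1), (1, "hive", 1), (1, "spark", 1)]
map_task2 = [(1, "hive", 1), (1, "spark", 1)]

runs_for_reduce0 = fetch(0, [map_task1, map_task2])
runs_for_reduce1 = fetch(1, [map_task1, map_task2])
print(runs_for_reduce0)  # [[('hadoop', 1), ('hadoop', 1)], []]
print(runs_for_reduce1)  # [[('hive', 1), ('spark', 1)], [('hive', 1), ('spark', 1)]]
```

Each ReduceTask ends up with one sorted run per MapTask, and a run may be empty when that MapTask produced nothing for the partition.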

== Merge == Merging also sorts the data: the implementation is a merge sort [done in memory, on an in-memory index]

  • The merge keeps the data sorted, using the same comparator or compareTo logic

  • reduceTask1

  • reduceTask2

  • == Group == Grouping: all the values that share the same key are placed into one iterator

  • reduceTask1

  • reduceTask2
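The reduce-side merge and group steps can be sketched with the Python standard library. This is only an in-memory illustration under the assumption that every fetched run is already sorted by key; the sample runs are hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Hypothetical sorted runs fetched by one ReduceTask, one run per MapTask.
runs = [
    [("hive", 1), ("spark", 1)],
    [("hive", 1), ("spark", 1), ("spark", 1)],
]

# Merge: a k-way merge of already-sorted runs, ordered by key.
merged = list(heapq.merge(*runs, key=itemgetter(0)))

# Group: consecutive records with the same key share one list of values.
grouped = [(key, [value for _, value in records])
           for key, records in groupby(merged, key=itemgetter(0))]
print(grouped)  # [('hive', [1, 1]), ('spark', [1, 1, 1])]
```

Grouping only works because the merge has already put equal keys next to each other, which is why sorting comes before grouping in the shuffle.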

Reduce stage:

  • Function: aggregate the results of the shuffle by calling the reduce method once for each group (one key together with all of its values)

    ReduceTask1

    ReduceTask2

    output

    part-r-00000

    part-r-00001
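Finally, a hedged sketch of the Reduce stage, again assuming a word-count job. `reduce_fn` and `run_reduce_task` are hypothetical names, and the grouped input is made up for illustration. The tab between key and value follows the default of Hadoop's text output format, and each ReduceTask writes its own part-r-NNNNN file.

```python
def reduce_fn(key, values):
    """Word-count style reduce: sum the values of one key."""
    yield key, sum(values)


def run_reduce_task(reduce_id, grouped):
    """Call reduce_fn once per key group and name the output file."""
    lines = []
    for key, values in grouped:
        for out_key, out_value in reduce_fn(key, values):
            lines.append(f"{out_key}\t{out_value}")
    return f"part-r-{reduce_id:05d}", lines


# Hypothetical grouped input for two ReduceTasks.
name0, lines0 = run_reduce_task(0, [("hadoop", [1, 1])])
name1, lines1 = run_reduce_task(1, [("hive", [1, 1]), ("spark", [1, 1, 1])])
print(name0, lines0)  # part-r-00000 ['hadoop\t2']
print(name1, lines1)  # part-r-00001 ['hive\t2', 'spark\t3']
```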

If any of the technical points above are wrong, please leave a comment and point them out. Thank you!