Mesos

It’s an alternative to YARN, but it takes a different approach. It was popularized by Twitter, isn’t directly associated with Hadoop, and covers a broader scope.

— Mesos vs. YARN

·         Mesos manages resources across the whole data center, not just a Hadoop cluster. It allocates CPU, memory, etc. for web servers and other services, not only for data processing.

·         With YARN, you give it a job, and it figures out how to process it. With Mesos, you ask for resources, and it replies back with the available resource offers; the framework then decides whether to accept or reject each offer (see the toy sketch after this list).

·         YARN is optimized for long, analytical jobs, while Mesos is more general purpose; it works for short-lived and long-lived jobs alike.
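To make the offer model concrete, here is a toy sketch in Python. This is illustrative only, not the real Mesos API; Offer, Job, and on_resource_offer are made-up names.

from dataclasses import dataclass

@dataclass
class Offer:
    cpus: int
    mem_mb: int

@dataclass
class Job:
    cpus_needed: int
    mem_mb_needed: int

def on_resource_offer(offer: Offer, job: Job) -> bool:
    # Mesos pushes resource offers to the framework; the framework
    # accepts only if the offered node satisfies the job's needs,
    # otherwise it declines and waits for a better offer.
    return offer.cpus >= job.cpus_needed and offer.mem_mb >= job.mem_mb_needed

# e.g. on_resource_offer(Offer(cpus=4, mem_mb=8192), Job(2, 4096)) -> True (accept)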

— When to use Mesos?

When you want to manage everything that runs on top of your cluster, not only Hadoop. The two can even coexist: YARN manages the nodes for Hadoop, while Mesos manages the nodes for web servers and everything else.

MapReduce

It’s the data processing layer of Hadoop. It consists of Map, which runs a function on nodes in parallel, and Reduce, which aggregates the results from all the nodes.

It distributes the processing by dividing the data among partitions (nodes). A mapper function is applied to each partition, and a reducer function aggregates the mappers’ results.

Let’s take an example. The task here is to count every word across a massive amount of text data. A MapReduce job is composed of three steps (a code sketch follows the list):

1.      Map. Given the text data, transform each input into a key-value pair. In this case, the key is the word, and the value is just 1 (to be aggregated later).

2.      Shuffle & Sort. MapReduce automatically groups identical keys and sorts them. This step happens right before every reduce function. So we end up with the word as the key, and an array of 1s as the value.

3.      Reduce. For every key (word), aggregate the values. Since the value is [1,1,1,…], summing it gives the word’s count. The reducer gets called once for each unique key.
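Here is a minimal sketch of this word count using the Python mrjob library (an assumption on our side; the movie-ratings snippets later in this post follow the same style):

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class WordCount(MRJob):
    # Map: emit (word, 1) for every word in the line.
    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    # Shuffle & sort happens automatically between map and reduce.

    # Reduce: called once per unique word, with an iterator of 1s.
    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    WordCount.run()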

— How does MapReduce distribute the processing?

Remember YARN? If so, this should feel familiar, since MapReduce runs on top of it.

MapReduce talks to YARN, saying, “Hey, I want to process this job”, and the same steps explained above kick off. The only difference here is that the containers (workers) are now running map/reduce tasks.

— Failures?
When a worker fails, the application master restarts it (preferably on a different machine). If the application master itself fails, YARN restarts it.

— How to write MapReduce jobs in code?

MapReduce jobs can be written in Java or Python; the Python snippets below are written in the style of the mrjob library (the original doesn’t name a library, but mrjob matches this mapper/reducer style). An example task: rank movies by their popularity and return them in sorted order.

from mrjob.job import MRJob

class MovieRatings(MRJob):  # class name is illustrative

    # Given lines of "userId,movieId,rating", transform each line
    # to (movieId => rating).
    def mapper(self, _, line):
        (userId, movieId, rating) = line.split(',')
        yield movieId, float(rating)

    # Shuffle and sort: identical movieIds are merged and sorted.
    # Remember that the same key can be spread across different
    # mappers on different nodes.

    # For every movieId => [7,5,6,4,6,2,...], aggregate.
    # The result: for every movieId, the average of its ratings.
    # But return the average as the key, and the movieId as the value.
    def reducer(self, movieId, ratings):
        # "ratings" is an iterator, not the actual array. We consume
        # the values one by one. Why? To avoid loading the whole
        # array into memory.
        total, count = 0.0, 0
        for rating in ratings:
            total += rating
            count += 1
        yield total / count, movieId

Now, we can add another step, with a second reducer, to sort the results by the average rating (the key). The input of this step is the output of the preceding one: (average rating → movieId).

Before each reducer, the “shuffle and sort” step kicks in even if the step has no mapper, grouping identical keys (the average ratings) together and sorting by them.

# Step-2 reducer: keys (average ratings) arrive already sorted
# thanks to shuffle and sort, so we just flip key and value back.
def reducer_sort_movies(self, avgRating, movies):
    for movie in movies:
        yield movie, avgRating
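For completeness, here is how the two steps could be chained, still assuming the mrjob library; MRStep is mrjob’s way of declaring multi-step jobs, and the method bodies are the ones shown above.

from mrjob.job import MRJob
from mrjob.step import MRStep

class MostPopularMovies(MRJob):
    def steps(self):
        return [
            # Step 1: map raw lines, reduce to (avgRating => movieId).
            MRStep(mapper=self.mapper, reducer=self.reducer),
            # Step 2: no mapper; shuffle & sort orders by avgRating,
            # then the reducer flips the pairs back.
            MRStep(reducer=self.reducer_sort_movies),
        ]

    # ... the mapper, reducer, and reducer_sort_movies methods
    # shown above go here ...

if __name__ == '__main__':
    MostPopularMovies.run()

You could then run it locally with something like: python most_popular_movies.py ratings.csv (file names are hypothetical; mrjob runs jobs locally by default).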

While MapReduce might solve some of our problems, it’s not always easy to force a task into map and reduce functions. That’s why other solutions like Spark, which offers SQL-like queries, are becoming more popular.
