Map Reduce using Python mincemeat I

NOTE: This post does not contain the answer to the assignment. It explains only the approach.

Web Intelligence and Big Data

I created this page for my own understanding of MapReduce. I am currently taking the Web Intelligence and Big Data course from Coursera.org. The course explores the topics I was curious about under the “Big Data” label. I had heard a lot about “Big Data”, but my understanding of it was quite limited: I approached it from a business perspective rather than a technical one. I knew the benefits of “Big Data” and why it has become so important in business, but I was missing the HOW part. This course explains the WHAT, WHY, and HOW at a high level.

As a week 3 homework assignment, I have to use mincemeat. Mincemeat is a Python implementation of the MapReduce distributed computing framework. I don’t have any Python programming experience, so I need to understand the basics of Python first.

Here are my steps.

1. Download and install ActivePython.

2. Make sure that the Python home directory is on the OS PATH.

3. Any computer with Python and mincemeat.py can be part of the cluster. mincemeat.py is very lightweight: the same file is imported by the server script and run directly as a worker.

4. Open two separate command prompt windows. One acts as the server, the other as the client (worker).

5. example.py is the script that runs on the server. It splits the sample text into words and passes them through map() and reduce() to print the word counts.

example.py:

#!/usr/bin/env python
import mincemeat

data = ["Humpty Dumpty sat on a wall",
        "Humpty Dumpty had a great fall",
        "All the King's horses and all the King's men",
        "Couldn't put Humpty together again",
        ]

def mapfn(k, v):
    for w in v.split():
        yield w, 1

def reducefn(k, vs):
    result = 0
    for v in vs:
        result += v
    return result

s = mincemeat.Server()

# The data source can be any dictionary-like object
s.datasource = dict(enumerate(data))
s.mapfn = mapfn
s.reducefn = reducefn

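# Wait for workers to connect, hand out the map and reduce tasks,
# and return the combined results once all tasks are done
results = s.run_server(password="changeme")
print results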

Execute this script in the server command prompt window:

c:\DevApps\python27>python example.py

Run mincemeat.py as a worker in the client command prompt window:

c:\DevApps\python27>python mincemeat.py -p changeme localhost

And the server will print out:

{'a': 2, 'on': 1, 'great': 1, 'Humpty': 3, 'again': 1, 'wall': 1, 'Dumpty': 2, 'men': 1, 'had': 1, 'all': 1, 'together': 1, "King's": 2, 'horses': 1, 'All': 1, 'and': 1, "Couldn't": 1, 'fall': 1, 'put': 1, 'the': 2, 'sat': 1}
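To check my understanding of what happens between map() and reduce(), here is a minimal sketch that runs the same two functions in a single process, without mincemeat and without any networking. The run_local helper is my own hypothetical name, not part of mincemeat; it only imitates the three phases (map, group by key, reduce) and is not how mincemeat is implemented internally.

```python
from collections import defaultdict

data = ["Humpty Dumpty sat on a wall",
        "Humpty Dumpty had a great fall",
        "All the King's horses and all the King's men",
        "Couldn't put Humpty together again"]

def mapfn(k, v):
    # Emit (word, 1) for every word in the line
    for w in v.split():
        yield w, 1

def reducefn(k, vs):
    # Add up all the 1s emitted for one word
    return sum(vs)

def run_local(datasource, mapfn, reducefn):
    # Hypothetical helper (not a mincemeat function):
    # map every record, group the emitted values by key, then reduce each group
    grouped = defaultdict(list)
    for key, value in datasource.items():
        for out_key, out_value in mapfn(key, value):
            grouped[out_key].append(out_value)
    return dict((k, reducefn(k, vs)) for k, vs in grouped.items())

results = run_local(dict(enumerate(data)), mapfn, reducefn)
print(results["Humpty"])  # 3, same as in the server output
```

The grouping step in the middle is the part the framework does for you: every value emitted under the same key ends up in the list that reducefn receives as vs.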

This gives me an idea of how to approach the homework assignment.

Task

I have a data set of about 6.5 MB: 249 files with records in the “paper-id:::author1::author2::…::authorN:::title” format. E.g. journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2. My task is to compute, for each author, how many times every term occurs across the titles of that author's papers.
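The record format can be taken apart with two splits: “:::” separates the three fields, and “::” separates the authors inside the middle field. A small sketch of that step, assuming one record per line; parse_record is a hypothetical helper name of mine, and it covers only the parsing, not the assignment itself.

```python
def parse_record(line):
    # Split into paper id, author field, and title (at most 3 parts)
    paper_id, author_field, title = line.strip().split(":::", 2)
    # Authors inside the middle field are separated by "::"
    authors = author_field.split("::")
    return paper_id, authors, title

line = ("journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::"
        "Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2.")

paper_id, authors, title = parse_record(line)
print(paper_id)
print(authors)
print(title)
```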

Approach

  • Modify the data entry section so that it reads the files from a directory.
  • Modify the map and reduce functions by choosing the right key and value.
  • The course demo takes a similar approach, so it shouldn’t be too hard to write the Python code and use mincemeat.py.
  • One more resource: Big Data Recipes.
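For the first bullet, the data entry section could be sketched as below. This is only an assumption about one way to do it: load_datasource is a hypothetical helper of mine, and it builds a dictionary that maps each file name to the file's full text, so that each file becomes one record for mapfn.

```python
import glob
import os

def load_datasource(directory):
    # Hypothetical helper: map each file name to the full text of the file
    datasource = {}
    for path in glob.glob(os.path.join(directory, "*")):
        if os.path.isfile(path):
            with open(path) as f:
                datasource[os.path.basename(path)] = f.read()
    return datasource
```

In example.py, the result of such a function would take the place of dict(enumerate(data)) in the s.datasource assignment, since the data source can be any dictionary-like object. mapfn would then receive a file name as k and the file's text as v.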

I have 2 more weeks to complete this assignment. My next post will include my solution and a summary of my understanding of MapReduce.

4 thoughts on “Map Reduce using Python mincemeat I”

  1. Giving your homework answers in public in a future blog post will be a breach of the Coursera honor code! The course will most likely be re-run, so you would be enabling future students to cheat.

  2. Thanks for the post. I was struggling to get started with setting up mincemeat. This was really helpful.
