April | 2013 | mjtoolbox

NOTE: This post does not contain the answer. It only explains the approach.

Web Intelligence and Big Data

I created this page for my own understanding of MapReduce technology. I am currently taking the Web Intelligence and Big Data course from Coursera.org. The course explores the topics around “Big Data” that I was curious about. I had heard a lot about “Big Data”, but my understanding of the topic was quite limited: I approached it from a business perspective rather than a technical one. I knew the benefits of “Big Data” and why it had become so important in business, but I was missing the HOW part. This course explains the WHAT, WHY, and HOW at a high level.

As a week 3 homework assignment, I have to use mincemeat. Mincemeat is a Python implementation of the MapReduce distributed computing framework. I don’t have any Python programming experience, so I need to understand the basics of Python first.

Here are my steps.

1. Download and install ActivePython.

2. Make sure that the Python home directory is on my OS PATH.

3. Any computer with Python and mincemeat.py can be part of the cluster. mincemeat.py is very lightweight, and that one script is all a machine needs to join.

4. Open two separate command prompt windows. One acts as the server, the other as the client (worker).

5. example.py is the script that runs on the server. The code splits the text data into words and passes them through map() and reduce() to print out the word count.

example.py:

#!/usr/bin/env python
import mincemeat

data = ["Humpty Dumpty sat on a wall",
        "Humpty Dumpty had a great fall",
        "All the King's horses and all the King's men",
        "Couldn't put Humpty together again",
        ]

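def mapfn(k, v):
    # k is the line index, v is one line of text; emit (word, 1) per word
    for w in v.split():
        yield w, 1

def reducefn(k, vs):
    # k is a word, vs is the list of 1s emitted for it; add them up
    result = 0
    for v in vs:
        result += v
    return result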

s = mincemeat.Server()

# The data source can be any dictionary-like object
s.datasource = dict(enumerate(data))
s.mapfn = mapfn
s.reducefn = reducefn

results = s.run_server(password="changeme")
print results

Execute this script in the server command prompt window:

c:\DevApps\python27>python example.py

Run mincemeat.py as a worker in the client command prompt window:

c:\DevApps\python27>python mincemeat.py -p changeme localhost

And the server will print out:

{'a': 2, 'on': 1, 'great': 1, 'Humpty': 3, 'again': 1, 'wall': 1, 'Dumpty': 2, 'men': 1, 'had': 1, 'all': 1, 'together': 1, "King's": 2, 'horses': 1, 'All': 1, 'and': 1, "Couldn't": 1, 'fall': 1, 'put': 1, 'the': 2, 'sat': 1}
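To check my understanding of what the server and the worker compute between them, here is a minimal single-process sketch that does not use mincemeat at all. The `run_local` helper is a hypothetical name of my own, not part of mincemeat, and it only imitates the map, group-by-key, and reduce steps in one process rather than distributing them.

```python
from collections import defaultdict

data = ["Humpty Dumpty sat on a wall",
        "Humpty Dumpty had a great fall",
        "All the King's horses and all the King's men",
        "Couldn't put Humpty together again",
        ]

def mapfn(k, v):
    # Emit (word, 1) for every word in one line of text
    for w in v.split():
        yield w, 1

def reducefn(k, vs):
    # Add up all the 1s collected for one word
    return sum(vs)

def run_local(datasource, mapfn, reducefn):
    # Hypothetical helper (not a mincemeat function):
    # map every record, group the emitted values by key, then reduce each group.
    grouped = defaultdict(list)
    for key, value in datasource.items():
        for out_key, out_value in mapfn(key, value):
            grouped[out_key].append(out_value)
    return dict((k, reducefn(k, vs)) for k, vs in grouped.items())

results = run_local(dict(enumerate(data)), mapfn, reducefn)
print(sorted(results.items()))
```

The counts should match the server output above; only the ordering of the dictionary may differ.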

This gives me an idea of how to approach the homework assignment.

Task

I have about 6.5 MB of data spread over 249 files, with records in the format “paper-id:::author1::author2::…. ::authorN:::title”. E.g. journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2. My task is to compute, for each author, how many times every term occurs across the titles of that author's papers.
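As a first step, a record in this format can be split into its three fields. This is only a sketch: `parse_record` is a hypothetical helper name, and it assumes every record has exactly one paper id, at least one author, and a title, separated as in the example above.

```python
def parse_record(line):
    # Fields are separated by ':::'; authors within the middle field by '::'.
    # maxsplit=2 keeps the title whole even if it happened to contain ':::'.
    paper_id, authors, title = line.strip().split(":::", 2)
    return paper_id, authors.split("::"), title

record = ("journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::"
          "Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2.")

paper_id, authors, title = parse_record(record)
print(paper_id)
print(authors)
print(title)
```

This stops at parsing. Deciding what to emit as key and value from these fields is the actual homework.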

Approach

Modify the data entry section to read the files from a directory.
Modify the map and reduce functions by defining suitable keys and values.
The course demo provides a similar approach, so it shouldn’t be too hard to program in Python and use mincemeat.py.
One more resource: Big Data Recipes.
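The first point of the approach could be sketched as follows. `load_datasource` is a hypothetical helper of my own, and the file name and record in the demo are made up so that the snippet runs on its own. It builds a dictionary from file name to file contents, which has the same shape as the `dict(enumerate(data))` used in example.py.

```python
import glob
import os
import shutil
import tempfile

def load_datasource(directory):
    # Hypothetical helper: map each file name to the full text of that file
    datasource = {}
    for path in sorted(glob.glob(os.path.join(directory, "*"))):
        if os.path.isfile(path):
            with open(path) as handle:
                datasource[os.path.basename(path)] = handle.read()
    return datasource

# Demo with a throwaway directory and one made-up file
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "c0001"), "w") as handle:
    handle.write("id1:::Author A:::Some Title\n")

datasource = load_datasource(demo_dir)
shutil.rmtree(demo_dir)
print(datasource)
```

With a dictionary like this assigned to `s.datasource`, the map function would be called with a file name as the key and that file's text as the value, so it would have to split the text into lines itself.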

I have two more weeks to complete this assignment. My next post will include my solution and a summary of my understanding of MapReduce.


Map Reduce using Python mincemeat I