NOTE: This post does not contain the answer to the assignment; it only explains my approach.
Web Intelligence and Big Data
I created this page for my own understanding of MapReduce technology. I am currently taking the Web Intelligence and Big Data course from Coursera.org. The course explores the topics I was curious about in “Big Data”. I had heard a lot about “Big Data”, but my understanding of the topic was quite limited: I approached it from a business perspective rather than a technical one. I knew the benefits of “Big Data” and why it has become so important in business, but I was missing the HOW part. This course explains the WHAT, WHY, and HOW at a high level.
For the week 3 homework assignment, I have to use mincemeat, a Python implementation of the MapReduce distributed computing framework. I don’t have any Python programming experience, so I need to learn the basics of Python first.
Here are my steps.
1. Download and install ActivePython (a Python distribution).
2. Make sure that the Python home directory is in the OS Path.
3. Any computer with Python and mincemeat.py can be part of the cluster. mincemeat.py is very lightweight: the same single file is imported by the server script and run directly as a worker.
4. Open two separate command prompt windows. One acts as the server, the other as the client.
5. example.py is the script that runs on the server. It splits the text data into words and passes them through map() and reduce() to print the word count.
example.py:
#!/usr/bin/env python
import mincemeat

data = ["Humpty Dumpty sat on a wall",
        "Humpty Dumpty had a great fall",
        "All the King's horses and all the King's men",
        "Couldn't put Humpty together again",
        ]

def mapfn(k, v):
    for w in v.split():
        yield w, 1

def reducefn(k, vs):
    result = 0
    for v in vs:
        result += v
    return result

s = mincemeat.Server()
# The data source can be any dictionary-like object
s.datasource = dict(enumerate(data))
s.mapfn = mapfn
s.reducefn = reducefn
results = s.run_server(password="changeme")
print results
Execute this script in the server command prompt window:
c:\DevApps\python27>python example.py
Run mincemeat.py as a worker in the client command prompt window:
c:\DevApps\python27>python mincemeat.py -p changeme localhost
And the server will print out:
{'a': 2, 'on': 1, 'great': 1, 'Humpty': 3, 'again': 1, 'wall': 1, 'Dumpty': 2, 'men': 1, 'had': 1, 'all': 1, 'together': 1, "King's": 2, 'horses': 1, 'All': 1, 'and': 1, "Couldn't": 1, 'fall': 1, 'put': 1, 'the': 2, 'sat': 1}
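To see what the server and worker do between them, the same word count can be run in a single process without mincemeat. This is only a minimal sketch for understanding: run_local is a hypothetical helper I made up, not part of mincemeat, and it ignores everything to do with networking and distributing work.

```python
from collections import defaultdict

data = ["Humpty Dumpty sat on a wall",
        "Humpty Dumpty had a great fall",
        "All the King's horses and all the King's men",
        "Couldn't put Humpty together again"]

def mapfn(k, v):
    # Emit (word, 1) for every word in one line of text
    for w in v.split():
        yield w, 1

def reducefn(k, vs):
    # Add up all the 1s collected for one word
    return sum(vs)

def run_local(datasource, mapfn, reducefn):
    # Hypothetical stand-in for the cluster: not a mincemeat function.
    # Map phase, then group ("shuffle") the emitted values by key.
    grouped = defaultdict(list)
    for k, v in datasource.items():
        for out_key, out_value in mapfn(k, v):
            grouped[out_key].append(out_value)
    # Reduce phase: one reducefn call per distinct key.
    return dict((key, reducefn(key, values)) for key, values in grouped.items())

results = run_local(dict(enumerate(data)), mapfn, reducefn)
print(results)
```

The map and reduce functions are the same shape as in example.py; only the part that hands out work is replaced by a plain loop.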
This gives me an idea of how to approach the HW assignment.
Task
I have about 6.5MB of data: 249 files of records in “paper-id:::author1::author2::…. ::authorN:::title” format. E.g. journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2. My task is to compute, for each author, how many times every term occurs across the titles of that author’s papers.
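Splitting one record into its parts could look roughly like this. It is a sketch based only on the format described above; parse_record is a name I made up, and I am assuming that ":::" separates the three fields and "::" separates the authors, with any trailing newline stripped.

```python
def parse_record(line):
    # Assumed layout: paper-id ::: authors ::: title
    # maxsplit=2 keeps the title whole even if it happens to contain ":::"
    paper_id, authors_field, title = line.strip().split(":::", 2)
    # Authors are separated by a double colon
    authors = authors_field.split("::")
    return paper_id, authors, title

sample = ("journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro"
          "::Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2.")
paper_id, authors, title = parse_record(sample)
print(paper_id)
print(authors)
print(title)
```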
Approach
- Modify the data entry section to read the files from a directory.
- Modify the map and reduce functions by defining suitable keys and values.
- The course demo takes a similar approach, so it shouldn’t be too hard to program this in Python with mincemeat.py.
- One more resource: Big Data Recipes.
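For the first bullet, the data entry part might be sketched as below. This is an assumption about one possible shape, not my final solution: load_datasource is a hypothetical helper, and it simply builds a dictionary of file name to file contents, since the comment in example.py says the data source can be any dictionary-like object.

```python
import glob
import os

def load_datasource(directory):
    # Hypothetical helper: maps each file name to that file's full text.
    datasource = {}
    for path in sorted(glob.glob(os.path.join(directory, "*"))):
        if os.path.isfile(path):
            with open(path) as f:
                datasource[os.path.basename(path)] = f.read()
    return datasource
```

With this in place, the line that sets s.datasource in example.py could be given the result of load_datasource with the data directory as its argument, and mapfn would then receive a file name as k and the whole file text as v.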
I have 2 more weeks to complete this assignment. My next post will include my solution and a summary of my understanding of MapReduce.