University: University of London (UOL) | Subject: DSM010 Big data analysis Course Work
Posted on: 1st Dec 2023

# DSM010 Big data analysis Course Work, UOL, Singapore: Find the descriptive statistics for temperature of each day of a given month for the year 2007

## Coursework Description

For this coursework, you will solve the given problems using the MapReduce computational model and Mahout on the Hadoop cluster. This coursework accounts for 30% of the total marks for the module.

Q1) Find the descriptive statistics for temperature of each day of a given month for the year 2007.

We use weather data from NCDC. The hourly weather data are available in the ‘Data Sets’ folder under the “Coursework submission” tab on the module VLE. We have chosen the hourly records for April, May, June and July 2007. Each file contains one month. You may select any one of the four months (files) for analysis.

The data come from many different weather stations, each identified by the value in the first column (wban). Using the hourly data across all of the weather stations, find:

1. The difference between the maximum and the minimum “Wind Speed” from all of the weather stations for each day in the month.
2. The daily minimum “Relative Humidity” from all of the weather stations.
3. The daily mean and variance of “Dew Point Temp” from all of the weather stations.
4. The correlation matrix that describes the monthly correlation among “Relative Humidity”, “Wind Speed” and “Dry Bulb Temp” from all of the weather stations.
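The first two tasks only need a running minimum and maximum per day, so one mapper/reducer pair can cover both. The following is a minimal sketch, not a reference solution. It assumes comma-separated records with the date written as digits (for example `20070401`). The column positions `DATE_COL`, `HUMIDITY_COL` and `WIND_COL` and all function names are hypothetical and must be checked against the header of the chosen file. `sorted` stands in for the shuffle-and-sort step that Hadoop performs between the mapper and the reducer.

```python
from itertools import groupby

# Hypothetical column positions: verify against the header row of your file.
DATE_COL, HUMIDITY_COL, WIND_COL = 1, 11, 12


def to_float(text):
    """Return a float, or None for blank or non-numeric fields."""
    try:
        return float(text)
    except ValueError:
        return None


def mapper(lines):
    """Emit (date, wind_speed, humidity) for every usable hourly record."""
    for line in lines:
        fields = [f.strip() for f in line.split(",")]
        if len(fields) <= max(DATE_COL, HUMIDITY_COL, WIND_COL):
            continue  # truncated or malformed line
        date = fields[DATE_COL]
        if not date.isdigit():
            continue  # assumption: only the header row has a non-numeric date
        yield date, to_float(fields[WIND_COL]), to_float(fields[HUMIDITY_COL])


def reducer(records):
    """Yield (date, wind-speed range, minimum humidity) for each day."""
    by_date = sorted(records, key=lambda r: r[0])  # simulates the shuffle
    for date, group in groupby(by_date, key=lambda r: r[0]):
        lo = hi = min_hum = None
        for _, wind, hum in group:
            if wind is not None:
                lo = wind if lo is None else min(lo, wind)
                hi = wind if hi is None else max(hi, wind)
            if hum is not None:
                min_hum = hum if min_hum is None else min(min_hum, hum)
        wind_range = None if lo is None else hi - lo
        yield date, wind_range, min_hum
```

Under Hadoop Streaming, the same logic would read lines from `sys.stdin` and print tab-separated key/value pairs instead of yielding tuples. Because minimum and maximum are associative, the reducer logic can also serve as a combiner.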

Do NOT use any package that computes the statistics for you; you MUST use the MapReduce framework. Write the pseudocode for the mapper and reducer functions for the four tasks above and implement them in Python. Note that when writing the mapper and reducer, it is helpful to consider the following formulae for variance and correlation:
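One common way to write these, assumed here rather than taken from the module materials, is in terms of running sums that each mapper can emit and a reducer can simply add up. With $n$ valid observations $x_i$ (and paired observations $y_i$), and assuming the population rather than the sample variance is wanted:

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\sigma_x^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2
$$

$$
r_{xy} = \frac{n\sum_{i} x_i y_i - \sum_{i} x_i \sum_{i} y_i}
{\sqrt{n\sum_{i} x_i^2 - \left(\sum_{i} x_i\right)^2}\;\sqrt{n\sum_{i} y_i^2 - \left(\sum_{i} y_i\right)^2}}
$$

Both expressions depend on the data only through $n$, $\sum x_i$, $\sum y_i$, $\sum x_i^2$, $\sum y_i^2$ and $\sum x_i y_i$, so no reducer ever needs to hold all of the values for a key in memory.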


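The sum-based forms of variance and correlation map directly onto a mapper that emits partial sums and a reducer that adds them. The sketch below illustrates that idea under stated assumptions: the function names and the single key `"month"` are hypothetical, the input is taken to be already-parsed numeric values with missing readings filtered out, and the population variance is used.

```python
import math


def map_moments(records):
    """records: (date, x) pairs. Emit (date, (count, sum, sum of squares))."""
    for date, x in records:
        yield date, (1, x, x * x)


def map_pair(records):
    """records: (x, y) pairs. Emit every pair's sums under one monthly key."""
    for x, y in records:
        yield "month", (1, x, y, x * x, y * y, x * y)


def reduce_sums(pairs):
    """Add the emitted tuples component-wise for each key."""
    totals = {}
    for key, values in pairs:
        if key in totals:
            totals[key] = tuple(a + b for a, b in zip(totals[key], values))
        else:
            totals[key] = tuple(values)
    return totals


def mean_variance(n, sx, sxx):
    """Mean and population variance from count, sum and sum of squares."""
    mean = sx / n
    return mean, sxx / n - mean * mean


def correlation(n, sx, sy, sxx, syy, sxy):
    """Pearson correlation from running sums; None if either input is constant."""
    var_x = n * sxx - sx * sx
    var_y = n * syy - sy * sy
    if var_x <= 0 or var_y <= 0:
        return None
    return (n * sxy - sx * sy) / (math.sqrt(var_x) * math.sqrt(var_y))
```

For example, `mean_variance(*reduce_sums(map_moments(records))[date])` gives the daily mean and variance for one date, and calling `correlation` once per pair of variables fills the off-diagonal entries of the 3×3 matrix (the diagonal is 1). Note that the sum-of-squares form can lose precision when the variance is small relative to the mean, so results are worth sanity-checking on a small sample.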