Scalable Analysis of Data from Proteomics Research

The University of Dundee’s Life Sciences department produces over 70 million rows of research data, as often as every two hours. Mass spectrometry experiments running 24 hours a day as part of ongoing proteomics research provide valuable information leading to the identification of new drugs for the treatment of disease. Providing the means to analyse these data with sufficient speed and accuracy is currently one of the most challenging problems in computational proteomics.

Recent advances in ‘big data’ research have offered new approaches to the analysis of complex data sets. This study investigates the potential benefits that technologies such as Node.js, GPU acceleration (CUDA), NoSQL database storage and distributed computation (MapReduce) could provide.

The most suitable solution, when applied directly to the problem faced by the University of Dundee, was found to be the distributed computation model MapReduce, chosen for its comprehensive tool set and the ease with which the problem could be expressed in its terms. MapReduce was not without downsides, however: it lacked accessible workflow management and ease of use. To counteract this, a web-based application was developed, providing a suite of tools to augment the ongoing testing, evaluation and refinement of the MapReduce program.
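To illustrate how a proteomics workload can be expressed in MapReduce terms, the sketch below simulates the map, shuffle and reduce phases in plain Python. The row layout, the protein accessions, and the 0.5 score cut-off are all hypothetical assumptions for illustration; the abstract does not describe the actual program or schema used at Dundee.

```python
from collections import defaultdict

# Hypothetical input: each row pairs a protein accession with a
# peptide-spectrum match score (layout assumed for illustration).
rows = [
    ("P12345", 0.97),
    ("P12345", 0.88),
    ("Q67890", 0.91),
    ("P12345", 0.42),
]

def map_phase(row):
    # Map: emit (protein, 1) for each match above an assumed 0.5 cut-off.
    protein, score = row
    if score >= 0.5:
        yield (protein, 1)

def reduce_phase(key, values):
    # Reduce: sum the counts emitted for one protein.
    return key, sum(values)

# Shuffle: group mapped pairs by key, as a MapReduce framework would
# before handing each group to a reducer.
groups = defaultdict(list)
for row in rows:
    for key, value in map_phase(row):
        groups[key].append(value)

results = dict(reduce_phase(k, v) for k, v in groups.items())
print(results)  # {'P12345': 2, 'Q67890': 1}
```

In a real deployment the map and reduce functions would run in parallel across a cluster (for example under Hadoop), which is what makes the model attractive for data arriving at this volume and rate.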

Contact info:
Michael Baird
Personal Site