Enabling BOINC in infrastructure as a service cloud system

Volunteer or crowd computing is becoming increasingly popular for solving complex research problems from an increasingly diverse range of areas. The majority of these projects have been built using the Berkeley Open Infrastructure for Network Computing (BOINC) platform, which provides a range of different services to manage all computation aspects of a project. The BOINC system is ideal in those cases where not only does the research community involved need low-cost access to massive computing resources but also where there is a significant public interest in the research being done. We discuss the way in which cloud services can help BOINC-based projects to deliver results in a fast, on-demand manner. This is difficult to achieve using volunteers, and at the same time, using scalable cloud resources for short on-demand projects can optimize the use of the available resources. We show how this design can be used as an efficient distributed computing platform within the cloud, and outline new approaches that could open up new possibilities in this field, using Climateprediction.net (http://www.climateprediction.net/) as a case study.


Introduction
Traditionally, climate models have been run using supercomputers because of their vast computational complexity and high cost. Since its early development, climate modelling has been an undertaking that has tested the limits of high-performance computing (HPC). This application of models to answer different types of questions has led to them being used in manners not originally foreseen. This is because, for some types of simulations, it can take several months to finish a modelling experiment given the scale of resources involved. One reason for treating climate modelling as a high-throughput computing (HTC) problem, as opposed to an HPC problem, is the application design model, where there is a number (not usually greater than 20) of uncoupled, long-running tasks, each corresponding to a single climate simulation and its results.
The aim of increasing the total number of members in an ensemble of climate simulations, together with the need to achieve increased computational power to better represent the physical and chemical processes being modelled, has been well understood for some decades in meteorological and climate research. Climate models make use of ensemble means to improve the accuracy of the results and quantify uncertainty, but the number of members in each ensemble tends to be small due to computational constraints. The overwhelming majority of research projects use ensembles that generally contain only a very small number of simulations, which has an obvious impact in terms of the statistical uncertainty of the results.
The Climateprediction.net project (CPDN) was created in 1999 (Allen, 1999; CPDN, 2015) as a distributed computing initiative to address the uncertainties described above. Its aim is to run thousands of different climate modelling simulations in order to research the uncertainties associated with some of the parameters. This is essential for understanding how small changes or variations in initial conditions can affect both the models themselves and the results of climate simulations. The project is currently run by the University of Oxford using volunteer computing via the BOINC (Berkeley Open Infrastructure for Network Computing) framework (BOINC, 2014; Anderson, 2004). In its early use of distributed computing, CPDN became a precursor of the many-task computing (MTC) paradigm (Raicu et al., 2008).
CPDN has been running for more than 10 years and faces a number of evolving challenges, such as an increasing and variable need for new computational and storage resources; the processing power and memory of current volunteers' computers that restricts the use of more complex models and higher resolution; and the need to manage costs and budgeting (this is of particular interest in researching on-demand projects requested by external research collaborators and stakeholders).
To address these issues, we have explored the combination of MTC/volunteer and cloud computing as a possible improvement of, or extension to, a real existing project. This kind of solution has previously been proposed for scientific purposes by Iosup et al. (2011) and is supported by initiatives such as Microsoft Azure for Research (Microsoft, 2014).

Background
It is not the aim of this paper to describe the internals of BOINC. For a better understanding of the problem that we are trying to solve, we recommend reviewing previous work on the subject, such as Ries et al. (2011).

Problem description
Here, we describe some of the problems that we intend to address, as well as proposed implementations of possible solutions.
- To run more complex and computationally more expensive versions of the model, resources greater than those that can be provided by volunteer computers may be needed. One solution is a re-engineering and deployment of the client side from a volunteer computing architecture to an infrastructure as a service (IaaS) based on cloud computing (e.g. Amazon Web Services, AWS).
- There is a growing need for an on-demand and more predictable return of simulation results. A good example of this is urgent simulations for critical events in real time (e.g. floods) where it is not possible to rely on volunteers; instead, a widely available and massively scalable system is preferable (like the one described here). The current architecture and infrastructure based on BOINC does not provide a solution that can be scaled up for this purpose. This is because the models are running over a heterogeneous and decentralized environment (on a variable number of volunteers' computers with varying configurations), where their behaviour cannot be clearly anticipated or measured, and any control over the available resources is severely limited.
- A rationalization of the costs (and the establishment of useful metrics) is required, not just for internal control but also to provide monetary quotations to project partners and funding bodies; this led us to develop a control plane together with a front end to display the statistics and metrics.
- Free software can be used in order to promote scientific reproducibility (Añel, 2011).
- Complete documentation of the process will allow knowledge to be transferred or migrated easily to other systems (Montes, 2014). Additional explanations can be found in the appendices (Appendices A, B, and C).
Furthermore, in this work, we wish to prove the feasibility of running complex applications in this environment. We use weather@home (Massey et al., 2015), a high-resolution regional climate model nested in a global climate model, as an example. The remainder of this paper is organized as follows. We firstly present benchmarks of the weather@home application run in AWS in Sect. 3.1, then describe the migration of the CPDN infrastructure to AWS in Sect. 3.2. We also describe our control plane in Sect. 3.3 to conduct the simulations and manage the cloud resources. Lastly, the results are discussed in the conclusion.
3 BOINC deployment to the cloud

3.1 Application benchmarks in Amazon Web Services (AWS)
The example presented here is running CPDN in AWS. AWS is the largest infrastructure as a service (IaaS) provider, is very well documented, and is the most suitable solution for the problem at present (with fewer limitations than other providers). The first step was to benchmark different AWS EC2 instance types to determine their performance running CPDN simulations. These tests were done with a range of instance types, but only choosing instance types that have hardware virtual machine (HVM) virtualization available. Elastic block store (EBS) gp2 storage was used for all instances for ease of comparison. These tests were carried out running multiple copies of a single work unit in parallel, with the number of simulations matching the number of vCPUs (hyperthreads) available to each instance type. For each instance type, at least four tests were run.
For benchmarking purposes, short 1-day climate simulations were run. The model used here is weather@home2, which consists of an atmosphere-only model (HadAM3P; Gordon et al., 2000) driving the regional version of the same model (HadRM3P; Pope et al., 2000). This version of the model uses the MOSES 2 land surface scheme. The region chosen is at 0.22° (≈ 25 km) resolution over Europe.
Figure 1a shows the average time to run all of the simulations on a particular instance, by instance type. We see a general trend of smaller instances performing better than larger instances. This is likely due to the hardware these instances are on being at a lower load. Running only a single simulation per instance resulted in similar run times for instances of the same category (e.g. c4), and they are not shown in the figure. However, we have verified that it is more cost effective to run the maximum number of simulations per instance than to run instances at a lower load.
Figure 1b shows the estimated cost of running a 1-year simulation on each instance type. The pricing here is based on the spot price (https://aws.amazon.com/ec2/spot/pricing/) in the cheapest availability zone in the us-east-1 region (based on AWS regions) as of June 2016. This shows that the current-generation compute-optimized instances (c4) accounted for three of the four most cost-effective choices, but other small instance types are amongst the cheapest. We emphasize that these results are very variable in time and between regions. In the us-west-1 and us-west-2 regions in AWS, the cheapest instance types were m4.large and m4.xlarge, respectively, due to the lower spot price for those particular instances in those regions.

CPDN infrastructure in AWS
Based on the previous tests, new infrastructure was designed on the cloud (Fig. 2). Several steps were required for its implementation, as described below.

Computing infrastructure
2. This was followed by instance post-installation configuration (contextualization); for example, in AWS this is achieved by creating a machine image (AMI) and adjusting it by selecting the appropriate options such as the kernel image (AKI).
3. Finally, an (optional) installation and configuration of the AWS EC2 command-line interface is performed. This can be useful to debug or troubleshoot issues with the infrastructure.

Storage infrastructure
Another problem that needs to be solved is the need for decentralized, low-latency and world-wide-accessible storage for the output data (each simulation (36 000 work units) generates ∼ 656 GB of results). A solution for this could be a distributed (accessed within different and synchronized worldwide endpoints) and scalable massive storage (Fig. 3). Here, we tested an architecture in which the clients send the results (tasks) to an Amazon Simple Storage Service (S3) bucket (storage endpoint). At the same time, CPDN can access these data over the internet to run postprocessing (e.g. a custom assimilator); this can be achieved using the AWS API. Given these values, every work unit returns a result of ∼ 0.018 GB with a price of ∼ USD 0.005414 each (AWS, 2016a), and ∼ USD 194.904 for the full simulation (both storage and data transfer).
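The arithmetic behind these storage figures can be checked with a few lines of Python. This is only a sketch that reuses the three values quoted above (36 000 work units, ∼ 656 GB, ∼ USD 0.005414 per work unit); the variable and function names are ours and are not part of the CPDN software.

```python
# Values quoted in the text; everything else is derived from them.
WORK_UNITS = 36000                # work units in a full simulation
TOTAL_OUTPUT_GB = 656.0           # approximate output of a full simulation
PRICE_PER_WORK_UNIT_USD = 0.005414  # storage plus data transfer, per work unit


def size_per_work_unit(total_gb, work_units):
    """Average size of the result returned by one work unit, in GB."""
    return total_gb / work_units


def cost_of_simulation(price_per_work_unit, work_units):
    """Storage and transfer cost of a full simulation, in USD."""
    return price_per_work_unit * work_units


size_gb = size_per_work_unit(TOTAL_OUTPUT_GB, WORK_UNITS)
cost_usd = cost_of_simulation(PRICE_PER_WORK_UNIT_USD, WORK_UNITS)

print("%.3f GB per work unit" % size_gb)
print("USD %.3f per full simulation" % cost_usd)
```

Keeping the per-work-unit price as the only pricing input makes it easy to redo the estimate when the provider's prices change.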

Project control plane
Having set up the computing and storage infrastructure, we still lack a control plane to provide a layer for abstraction and automation, and provide more consistency to the project. The aim of developing the central control system (Fig. 4) is to provide a cloud-agnostic, easy-to-use front end (Fig. 5) to manage the experiments with minimal knowledge of the underlying architecture and obtain a real-time overview of the current status (including the resources used and run completion data). Moreover, the central control system lends more consistency to the view of the project as an IaaS by providing a simple interface (both back end and front end).
The control plane is still in its early developmental stages (e.g. although it is cloud agnostic, so far only AWS has a connector and is supported), and further work will describe its improvements over time.
It consists of two main components: the back end provides the user with a RESTful API with basic functionalities related to simulation information and management, with the intention of providing (even more) agnostic access to the cloud; and the front end makes communicating with the API as intuitive and simple as possible.
The core component, the RESTful back end (using JavaScript Object Notation, JSON), provides simple access and wraps common actions: start simulation with n nodes, stop simulation, modify simulation parameters (n nodes), get simulation status, and get simulation metrics.
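To make the five wrapped actions concrete, the following sketch shows one way a JSON request could be dispatched to them. It is not the CPDN back end: the class, the `handle` function, the action names and the JSON field names are all hypothetical, the state is kept in memory, and no cloud API is called.

```python
import json


class SimulationController(object):
    """In-memory stand-in for the back end; it never contacts a cloud provider."""

    def __init__(self):
        self.nodes = 0
        self.running = False
        self.completed_tasks = 0
        self.cost_usd = 0.0

    def start(self, nodes):
        self.nodes, self.running = int(nodes), True
        return self.status()

    def stop(self):
        self.nodes, self.running = 0, False
        return self.status()

    def modify(self, nodes):
        if not self.running:
            raise ValueError("no simulation is running")
        self.nodes = int(nodes)
        return self.status()

    def status(self):
        return {"running": self.running, "nodes": self.nodes}

    def metrics(self):
        return {"active_instances": self.nodes,
                "completed_tasks": self.completed_tasks,
                "simulation_cost": self.cost_usd}


def handle(controller, request_json):
    """Dispatch one JSON request to the matching action and return a JSON reply."""
    request = json.loads(request_json)
    actions = {
        "start": lambda: controller.start(request["nodes"]),
        "stop": controller.stop,
        "modify": lambda: controller.modify(request["nodes"]),
        "status": controller.status,
        "metrics": controller.metrics,
    }
    action = request.get("action")
    if action not in actions:
        return json.dumps({"error": "unknown action"})
    try:
        return json.dumps(actions[action]())
    except (KeyError, TypeError, ValueError) as exc:
        return json.dumps({"error": str(exc)})


controller = SimulationController()
print(handle(controller, '{"action": "start", "nodes": 4}'))
print(handle(controller, '{"action": "status"}'))
```

In a real deployment each action would sit behind an HTTP route and the controller would call the provider connector; the dispatch table is kept separate here so that the cloud-specific part can be swapped without touching the interface.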

Conclusions
Several experiments (using all the defined infrastructure) were done by using standard work units developed by the climateprediction.net/weather@home project. We processed work units from two main experiments: the weather@home UK floods (Schaller et al., 2016) and the weather@home Australia/New Zealand project (Black et al., 2016), both with a horizontal resolution of 50 km.
It has been successfully demonstrated that it is possible to run simulations of a climatic model using infrastructure in the cloud; while this might not seem complex, to the best of our knowledge, it has never previously been tested. This efficient use of MTC resources for scientific computing has previously been used to facilitate real research in other areas (Añel et al., 2014; Schaller et al., 2014).
We have benchmarked a number of Amazon EC2 instance types running CPDN work units. Prices for spot instances vary significantly over time and between instance types, but we estimate a price as low as USD 1.50 to run a 1-year simulation based on the c4.large instance in the us-west-1 region in June 2016 (see Fig. 1). To optimize the costs of running simulations in this environment, it will be important to automatically re-evaluate the spot prices to choose the cheapest instance type at the time simulations are submitted. The better performance with smaller instances is due to the fact that vCPUs are hyperthreads, and in smaller instance types there is a greater chance that the CPU is running at a lower utilization and our instances can scavenge extra CPU cycles (Uhe et al., 2016).
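The re-evaluation step mentioned above could look like the sketch below, which ranks candidate instance types by cost per simulated year. The formula assumes that an instance runs one simulation per vCPU and is billed per hour; the spot prices and run times in `CANDIDATES` are hypothetical placeholders for illustration and are not measurements from this work.

```python
def cost_per_simulated_year(spot_price_per_hour, hours_per_simulated_year,
                            simulations_per_instance):
    """Cost (USD) of one simulated year on a fully loaded instance."""
    return (spot_price_per_hour * hours_per_simulated_year
            / simulations_per_instance)


def cheapest_instance(candidates):
    """Return (name, cost) for the candidate with the lowest cost per simulated year."""
    costs = dict((name, cost_per_simulated_year(**spec))
                 for name, spec in candidates.items())
    best = min(costs, key=costs.get)
    return best, costs[best]


# Hypothetical inputs: in practice the prices would be refreshed from the
# provider and the run times taken from benchmarks such as those in Fig. 1.
CANDIDATES = {
    "c4.large": {"spot_price_per_hour": 0.020,
                 "hours_per_simulated_year": 120.0,
                 "simulations_per_instance": 2},
    "m4.large": {"spot_price_per_hour": 0.025,
                 "hours_per_simulated_year": 130.0,
                 "simulations_per_instance": 2},
    "m4.xlarge": {"spot_price_per_hour": 0.060,
                  "hours_per_simulated_year": 140.0,
                  "simulations_per_instance": 4},
}

best_name, best_cost = cheapest_instance(CANDIDATES)
print("%s: USD %.2f per simulated year" % (best_name, best_cost))
```

Because spot prices differ between regions and change over time, the same ranking would need to be repeated per region immediately before each submission.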
It is interesting to note that cloud services enable us to complete a given number of tasks, in some cases, 5 times faster than using the regular volunteer computing infrastructure. However, the financial implications can only be justified for critical cases where stakeholders are able to justify them through a specific cost-benefit analysis. In any case, academic institutions and other types of organizations can benefit from waivers to reduce the fees (AWS, 2016b).
Regarding our usage and solution for storage, S3 was a good fit for this work (it comes out of the box with AWS and the pricing is convenient). However, we would not suggest it as suitable for long-term archival of this output; instead, we suggest making use of community repositories where such data are curated (it should be noted, though, that CPDN produces output in the community-standard NetCDF format). Also, we understand that even though the infrastructure described here covers a good number of use cases for different projects and experiments, other alternatives could be analysed:
- AWS Glacier is an interesting option to study if long-term storage is needed for data that do not require immediate access, at a lower cost (AWS, 2016c). In our case, a full simulation (36 000 work units) would have cost us USD 2.624 per month of storage.
- S3 file size is limited to 5 TB, which could be a problem for bigger projects, so options like a CephFS cluster on EC2 could be interesting (Zhao et al., 2015).
This research has also served as a basis for obtaining new research funding as part of climateprediction.net for state-of-the-art studies using cloud computing technologies. This project is based on demonstrated successes in the application of technologies and solutions of the type described here.
In summary, the high-level objectives achieved were the following: the client side was successfully migrated to the cloud (EC2); the upload server capability was configured to be redirected to AWS S3 buckets; different simulations were successfully run over the new infrastructure; a control plane (including a dashboard: front end and back end) was developed, deployed, and tested; and a comprehensive costing of the project and the simulations was obtained, together with metrics.
Future improvements should focus on providing more logic to the interaction with client status (such as through remote procedure calls, RPCs), allowing more metrics to be pulled from them, and creating new software as a service (a SaaS layer). From the infrastructure point of view, two main improvements are possible: first, a probe/dummy-automated execution will be needed to adjust the price to a real one before each simulation; second, full migration of the server side into the cloud, allowing the costs of data transfer and latency to be dramatically reduced.

Appendix A: Computing infrastructure design and implementation
The new computing infrastructure was built over virtualized instances (AWS EC2). Amazon also provides autoscaling groups that allow the user to define policies to dynamically add or remove instances triggered by a defined metric or alarm. As the aims of this work are to rationalize the use of the resources and to have full control over them (via the central system), including any type of load balancing or failover, this functionality is handled not on the cloud side but in the control system node that serves as back end for the dashboard.
After tasks have been set up on the server side and are ready to be sent to the clients (this can currently be checked at the public URL http://climateapps2.oerc.ox.ac.uk/cpdnboinc/server_status.html), the new workflow for a project/model execution is as follows:
1. The (project) administrator user configures and launches a new simulation via the dashboard.
2. The required number of instances are created based on a given template that contains a parametrized image of GNU/Linux with a configured BOINC client.
3. Every instance connects to the server and fetches two tasks (one per CPU, as the instances used have two CPUs).
4. When a task is processed, the data will be returned to the server, and also stored in shared storage so that they will be accessible to a given set of authorized users.
5. Once there are no more tasks available, the control node will shut down the instances.
It should be noted that, at any point, the administrator will be able to obtain real-time data about the execution (metrics, costs, etc.) as well as change the running parameters and apply them over the infrastructure.

A1 Template instance creation
In order to be able to create a homogeneous infrastructure, the first step is to create an (EC2) instance that can be used as a template for the other instances.
The high-level steps to follow to get a template instance (with the parameters defined in Table A1) are provided below.
www.geosci-model-dev.net/10/811/2017/ Geosci. Model Dev., 10, 811-826, 2017
Note that one should remember to create a new keypair (a public-private key pair used for password-less SSH access to the instances) and save it (it will be used for the central system), or use an existing one that is currently accessible. Because of the limited space in this article, long command lines have been broken with \; please take this into account when running any command described here.

A1.1 Installing and testing AWS and EC2 command-line interface
Prerequisites include wget, unzip, and Python 2.7.x. This step is optional, but it is highly recommended because it provides advanced control of the infrastructure through the shell. The following description applies to, and has been tested on, Ubuntu 14.04 (Canonical Ltd., 2014), but can be reproduced on any GNU/Linux system.
First, create an "Access Key" (and secret and/or password), via the AWS web interface in the "Security Credentials" section. With these data, the "AWS_ACCESS_KEY" and "AWS_SECRET_KEY" variables should be exported/updated; please keep in mind that this mechanism will also be used for the dashboard/metrics application.
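Since the same two variables are later read by the dashboard/metrics application, it can help to fail early when they are missing. The sketch below only checks the environment; the function name `load_credentials` is hypothetical and is not part of the AWS tools or of the CPDN software.

```python
import os

# Variable names taken from the text above.
REQUIRED_VARIABLES = ("AWS_ACCESS_KEY", "AWS_SECRET_KEY")


def load_credentials(env=None):
    """Return the two AWS variables from `env` (default: the process environment).

    Raises RuntimeError naming any variable that is unset or empty.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARIABLES if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return dict((name, env[name]) for name in REQUIRED_VARIABLES)


if __name__ == "__main__":
    try:
        load_credentials()
        print("AWS credentials found in the environment")
    except RuntimeError as error:
        print(error)
```

Passing a dictionary instead of relying on `os.environ` keeps the check easy to test without touching real credentials.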

A1.3 Simulation terminator
An essential piece of software, developed for this work, is the "simulation terminator", which decides whether a node should shut itself down in the event that no work units were processed for a given amount of time (by default 6 h, via Cron), or there were no jobs waiting on the server. This application will be provided upon request to the authors.
To install it (by default into /opt/climateprediction/), the following must be done:
$ sudo ./installClient
>> Simulation Client succesfully installed!
When an instance is powered off, it will be terminated (destroyed) by the Reaper service that runs in the central control system.
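The decision rule of the simulation terminator, as described in this section, can be summarized in a few lines. This is not the actual tool (which is available from the authors on request) but a sketch of the rule under our reading of the text; the function and parameter names are hypothetical, and how the real tool obtains the last processing time and the server queue length is not shown.

```python
from datetime import datetime, timedelta

# Default idle period quoted in the text.
IDLE_LIMIT = timedelta(hours=6)


def should_shut_down(last_processed, now, jobs_waiting, idle_limit=IDLE_LIMIT):
    """Return True if the node should power itself off.

    last_processed: datetime of the last processed work unit on this node.
    now:            current datetime (passed in so the rule is testable).
    jobs_waiting:   number of jobs waiting on the server.
    """
    idle_too_long = (now - last_processed) >= idle_limit
    no_jobs_left = jobs_waiting == 0
    return idle_too_long or no_jobs_left


if __name__ == "__main__":
    now = datetime(2016, 6, 1, 12, 0)
    print(should_shut_down(now - timedelta(hours=7), now, jobs_waiting=10))
    print(should_shut_down(now - timedelta(hours=1), now, jobs_waiting=10))
```

A periodic job (Cron in the text) would evaluate this rule and power the node off when it returns True, leaving the actual termination of the instance to the Reaper service.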

A2 Contextualization
Now that the template instance is ready (that is, all the parameters have been configured and the BOINC client is ready to start processing tasks), the next stage is to contextualize it. This means that an OS image will be created from it, which makes our infrastructure scalable by allowing new instances to be created from this new image. Unfortunately, this part is strongly tied to the cloud provider, and although it can be replicated on another system, for now it will only explicitly work in this way for AWS.

D. Montes et al.: Enabling BOINC in cloud services
- "Stop simulation" forces all the instances to terminate.
There are three default metrics (default time lapse: 6 h):
- Active instances are the number of active instances.
- Completed tasks are the number of work units successfully completed.
- Simulation cost is the accumulated cost for the simulation.
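As an illustration of how the three default metrics could be derived from per-instance records, consider the sketch below. The `Instance` record, its fields and the way cost is accumulated (hours running multiplied by an hourly price) are assumptions made for this example; the control system may compute them differently.

```python
from collections import namedtuple

# Hypothetical per-instance record kept by the control system.
Instance = namedtuple(
    "Instance", ["active", "completed_tasks", "hours_running", "hourly_price"])


def default_metrics(instances):
    """Compute the three default metrics from a list of Instance records."""
    return {
        "active_instances": sum(1 for i in instances if i.active),
        "completed_tasks": sum(i.completed_tasks for i in instances),
        "simulation_cost": sum(i.hours_running * i.hourly_price
                               for i in instances),
    }


# Made-up records, for illustration only.
sample = [
    Instance(active=True, completed_tasks=3, hours_running=2.0, hourly_price=0.5),
    Instance(active=False, completed_tasks=5, hours_running=4.0, hourly_price=0.25),
]
print(default_metrics(sample))
```

Terminated instances are kept in the list on purpose, since their past run time still contributes to the accumulated cost.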

C2 Installation and configuration
The applications are intended to run on any GNU/Linux system. The only requirements (apart from Python 2.7) are Flask and Boto, which can be easily installed on any GNU/Linux system:
$ pip install flask virtualenv \
  boto daemonize

C2.1 First configuration and run
For this step, the file controlSystem.tar.gz, which contains all the software and configurations for the central system, needs to be uncompressed into /opt/climateprediction/; then just
Now that the central system has been installed and configured, it will be listening and accepting connections on any network interface (0.0.0.0) on port 5000 over HTTP, so it can be accessed via a web browser. Firefox or Chromium is recommended because of JavaScript compatibility.

C3.1 Launch a new simulation
When starting a simulation, the number of instances will be 0. This can be changed by clicking "Edit Simulation", setting the number in the input box, and clicking on "Apply Changes". Within some minutes (defined in the configuration file, in the "pollingTime" variable), the system will start to deploy instances (workers).

C3.2 Modify a simulation
If the number of instances needs to be adjusted while a simulation is running, the procedure is the same as launching a new simulation ("Edit Simulation"). Please be aware that if the number of instances is reduced, unfinished work units will be lost (the scheduler will stop and terminate instances in FIFO order).
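The adjustment described above amounts to reconciling the running instances with the requested number at every polling interval. The sketch below shows that step under the assumption that FIFO means the oldest instances are terminated first; `reconcile` and `launch` are hypothetical names, and no cloud API is called.

```python
from collections import deque


def reconcile(running, desired, launch):
    """Bring the number of running instances to `desired`.

    running: deque of instance identifiers, oldest first (modified in place).
    desired: requested number of instances.
    launch:  callable that starts one instance and returns its identifier.
    Returns a (started, terminated) pair of identifier lists.
    """
    started, terminated = [], []
    while len(running) < desired:
        new_id = launch()
        running.append(new_id)
        started.append(new_id)
    while len(running) > desired:
        # FIFO: the instance that has been running longest goes first,
        # and any unfinished work unit on it is lost.
        terminated.append(running.popleft())
    return started, terminated


if __name__ == "__main__":
    fleet = deque(["i-1", "i-2", "i-3"])
    print(reconcile(fleet, 1, launch=lambda: "i-new"))
    print(list(fleet))
```

Because terminated instances lose their unfinished work units, it may be preferable to reduce the number of instances only when the remaining work is small.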

C3.3 End current simulation
To stop a simulation, click "Stop Simulation". This will reduce the number of instances to 0, copy the database as "SIMULATION-TIMESTAMP" for further analysis, and reset all the parameters and metrics.

Figure 1. (a) Work-unit run time and (b) cost per simulation year.
Central System Ready to Run! Type ./run.py to start.
# Start the service...
$ ./run.py
# ... and Central System starts
# resolving backend and frontend requests
Optionally, the configuration can be set manually by editing the file "Config.cfg" (parameters in < >):

Table A1. Parameters for the template instances.
The project executes both 32 and 64 bit binaries for the simulation, so once the template instance is running, the needed packages and dependencies need to be installed via