Biocep-R project
Copyright © 2007-2009 Karim Chine

Executive Summary
I - Open Science in the cloud, towards a universal platform for mathematical and statistical computing
      R, the open-source software environment for statistical computing and graphics, is becoming the lingua franca of data analysis. Repositories of contributed R packages covering life sciences, social sciences, finance, econometrics, chemometrics and other domains are growing at an exponential rate. Scilab, the open-source software package for numerical computation, is increasingly used for engineering and scientific applications. The ubiquitous Java technologies allow the building of highly effective platform-independent distributed systems and graphical user interfaces. Free virtualization technologies allow snapshots of operating systems, computing software stacks and data sets to be created, distributed and reused in any environment. Finally, Amazon EC2's simple web services interface lets anyone run computations on demand on Amazon's proven computing environment (public cloud), and the open-source Eucalyptus system makes it possible to mimic those web services on private infrastructures (private cloud). Biocep combines these ingredients and others into a universal open-source computing platform that dramatically enhances the accessibility of mathematical and statistical computing, creates an open environment for the production, sharing and reuse of all the artifacts of computing, and puts unprecedented analytical, numerical and processing capabilities in the hands of everyone (open science).
      With Biocep, R/Scilab computational engines are abstracted with URLs and can run at any location. They can be interactively controlled from the user's laptop, either programmatically, via an extensible, highly productive data analysis workbench, or from highly programmable spreadsheets. The computational engines can be used as clusters on Grids and Clouds to solve computationally intensive problems, to build scalable analytical web applications or to expose functions as web services or nodes for workflow workbenches. They can also be used to distribute numerical/statistical user interfaces created with drag-and-drop tools and can be accessed simultaneously by several users to work with data collaboratively.
    II - A Google Docs-like portal for data analysis: towards a user-friendly facade for the ubiquitous cloud
      The Biocep-R software platform makes it possible to use mainstream statistical/scientific computing environments such as R, Scilab, SciPy, Sage and Root as a service in the cloud. The full capabilities of these environments are exposed to the end user from within a simple browser. Users can issue commands, install and use new packages, generate and interact with graphics, upload and process files, download results, etc. using high-capacity virtual machines that they start and stop on demand. The full computational environment and the data can be snapshotted at any time, shared and reused. Spreadsheets running in the cloud and fully integrated with the functions and data of the computing environments can be mirrored to web browsers and to Excel. The platform takes the computing engine to the data and allows many collaborators to access and analyze that data together using collaborative consoles, editors, spreadsheets and annotatable graphics. The platform supports elastic distributed computing with any number of virtual machines to solve computationally heavy problems, and the deployment of highly scalable computational back ends for analytical applications and workflows. Finally, the platform makes it easy to drag and drop visual components to create user interfaces and dashboards that use advanced statistical/numerical models running on cloud machines. Those interfaces can be delivered to the end user with simple URLs. Elastic-R is a new portal built using the Biocep-R platform. It enables anyone to use AWS resources seamlessly, to work with R, Scilab, etc. within the browser and to collaborate, share and reuse data, functions, algorithms, user interfaces, and servers. It aims to become the "Google Docs" of data analysis.
    Biocep-R within the Technology Environment
    Open Platform Diagram
    Biocep-R Computational Open Platform Ecosystem
    Open Platform Diagram
    Distributed Computing in the Cloud
    Open Platform Diagram


    Project Description

      Biocep builds an e-platform for computing and data analysis on top of the highly popular statistical environment R.

      Let there be R..

      R is becoming the lingua franca of data analysis and statistical computing. It has a very powerful graphics system as well as cross-platform capabilities for packaging any computational code. Hundreds of R packages are available, their number is growing exponentially, and they implement the most up-to-date computational methods and reflect the state of the art of research in various fields. R packages are likely to become a reproducible research enabler because they enable functions and algorithms to be reused and shared. There is no obstacle to a large-scale deployment of R on public grids since it is GPL software. However, R is not multithreaded, it does not operate as a server, and it has only a low-level, non-object-oriented API. GUI development for R remains non-standardized. R's potential as a computational back-end engine for applications and service-oriented architectures has yet to be fully exploited. While its user base is growing at a high rate, this growth rate would be significantly higher in the presence of a rich, user-friendly workbench.

      Pluggability, reusability

      Biocep is a general unified open source Java solution for integrating and virtualizing the access to R engines/servers. It aims to become a federative user-friendly computational e-platform for research, finance and education. The Biocep virtual workbench provides a framework enabling the connection of all the elements of a computational environment:
      • 1. The computational resource (whether it is a local machine, a cluster, a grid or a cloud server) via a simple URL.
      • 2. The computational components via the import of R packages.
      • 3. The GUIs via the import of plugins from repositories or the design of new views with a drag-and-drop GUI editor.


      Most powerful cross-platform R workbench so far

      Several dockable built-in views allow users to work interactively with R engines running at any location. The views include a console, highly interactive remote graphic devices (with built-in zooming, scrolling, real coordinate tracking, etc.), PDF and SVG viewers, R data inspectors, linked plots and spreadsheets that are fully integrated with R functions and data.

      Cyber-collaboration? Easy as pie

      Biocep enables collaborative R sessions: multiple web users can connect simultaneously to an R server running anywhere and analyze data collaboratively via a set of broadcast views. For example, the console log is sent in real time to all users, chatting is enabled, and a graphic device is synchronously updated for all. Biocep includes an editable collaborative spreadsheet that keeps its data on the server, so the data size is not limited by the client machines. Distributed and linked statistical graphics based on a refactored iplots package (www.iplots.org) enable the collaborative highlighting and color brushing of various linked plots.
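      The broadcast idea described above can be illustrated with a small observer-style sketch. It is written in Python for brevity and is not Biocep code: the class name SharedConsole and its methods are hypothetical, and a real deployment would push lines over the network rather than call local functions.

```python
class SharedConsole:
    """Toy broadcast hub: every submitted line is pushed to all connected users.

    Hypothetical illustration only; it does not reproduce Biocep's classes.
    """

    def __init__(self):
        self._listeners = {}   # user name -> callback receiving each log line
        self.log = []          # full session log, kept on the "server" side

    def join(self, user, on_line):
        """Register a user; on_line(line) is called for every later log line."""
        self._listeners[user] = on_line

    def leave(self, user):
        """Stop sending log lines to this user."""
        self._listeners.pop(user, None)

    def submit(self, user, command):
        """Record a command and broadcast it to everyone, sender included."""
        line = f"[{user}] {command}"
        self.log.append(line)
        for callback in list(self._listeners.values()):
            callback(line)
        return line
```

      Each connected user sees the same log in the same order, which is the property the collaborative console relies on; the log itself stays on the server side.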

      How would you like your R?

      Biocep frameworks and tools make it possible to use R as a Java object-oriented toolkit or as an RMI server. All the standard R objects have been mapped to Java, and user-defined R classes can be mapped to Java on demand. Calls to R functions from Java, whether local or remote, handle both local and distributed R objects. An easy-to-use Web Services generator is provided to automatically expose R functions and packages as Web Services, which can be seamlessly integrated as nodes into workflows. These services can be stateless (an anonymous R worker performs the computation) or stateful (an R worker reserved and associated with a session ID is used and can be reused until the session is destroyed). Statefulness removes the overhead caused by transferring intermediate results between workflow nodes. A stateful R-SOAP API exposes the full capabilities of the platform and enables an efficient integration of R into data analysis pipelines using Perl, C, C++, C#, Java or R. R-SOAP clients are provided for each language.
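      The difference between stateless and stateful calls can be sketched as follows. This is a minimal Python model under stated assumptions, not the R-SOAP API: WorkerFrontEnd, load_data and mean_of are hypothetical names, and a plain dictionary stands in for an R worker's workspace.

```python
import uuid


class WorkerFrontEnd:
    """Toy model of stateless vs. stateful service calls (hypothetical names)."""

    def __init__(self):
        self._sessions = {}  # session id -> workspace reserved for that session

    def call_stateless(self, func, *args):
        # An anonymous worker: a fresh workspace that is discarded after the call.
        return func({}, *args)

    def open_session(self):
        # Reserve a workspace and hand back the session id that identifies it.
        session_id = uuid.uuid4().hex
        self._sessions[session_id] = {}
        return session_id

    def call_stateful(self, session_id, func, *args):
        # Reuse the reserved workspace, so earlier results are still there.
        return func(self._sessions[session_id], *args)

    def close_session(self, session_id):
        # Destroy the session; its workspace can no longer be used.
        del self._sessions[session_id]


def load_data(workspace, name, values):
    """Store a vector in the workspace and return its length."""
    workspace[name] = list(values)
    return len(workspace[name])


def mean_of(workspace, name):
    """Compute the mean of a vector already present in the workspace."""
    values = workspace[name]
    return sum(values) / len(values)
```

      With a session, the data is sent once by load_data and later nodes only refer to it by name; without one, each call starts from an empty workspace, so intermediate results would have to travel with every request.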

      Yes, it's already Cloud Age

      Biocep provides a remote resources pooling framework (RPF) allowing pools of R engines to be deployed on heterogeneous nodes. These engines are managed and used via a simple borrow/return API for multithreaded web applications and web services, for distributed and parallel computing, for dynamic content on-the-fly generation (analytic results, tables and graphics in various formats for thin web clients) and for R virtualization in a shared computational resources context. RPF enables transparent cloudbursting: Amazon EC2 virtual machines running R servers can be fired up or shut down to scale up or scale down according to the load in a highly scalable web applications deployment.
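      The borrow/return pattern mentioned above can be sketched with a thread-safe queue. This is an illustration in Python, not the RPF API: EnginePool and its method names are hypothetical (the return operation is called give_back because return is a Python keyword), and engine names stand in for real R engine handles.

```python
import queue
from contextlib import contextmanager


class EnginePool:
    """Toy borrow/return pool; hypothetical, not Biocep's RPF classes."""

    def __init__(self, engines):
        self._idle = queue.Queue()  # thread-safe FIFO of idle engines
        for engine in engines:
            self._idle.put(engine)

    def borrow(self, timeout=None):
        """Take an idle engine, waiting up to `timeout` seconds (None = forever)."""
        try:
            return self._idle.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError("no idle engine available") from None

    def give_back(self, engine):
        """Return a borrowed engine so that other threads can use it."""
        self._idle.put(engine)

    @contextmanager
    def borrowed(self, timeout=None):
        """Borrow an engine for the duration of a `with` block."""
        engine = self.borrow(timeout)
        try:
            yield engine
        finally:
            self.give_back(engine)
```

      In a multithreaded web application, each request handler would wrap its work in `with pool.borrowed() as engine:` so that the engine is returned even if the computation raises an exception.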

      Scripting? Easy, easy..

      Biocep has built-in Python and Groovy scripting facilities on both the server and the client sides. The bridging of R and the scripting interpreters is bi-directional: R objects can be exported to Python/Groovy and vice versa, and the scripts can seamlessly embed any R code. Scripting with R as a component becomes easier than ever using either the Biocep APIs or the workbench's views. User Java code can be dynamically loaded by R servers and used for scripting.

      To conclude..

      In summary, Biocep combines the capabilities of R and the flexibility of a Java-based distributed system to create a tool of considerable power and utility. A Biocep-based R virtualization infrastructure has been successfully deployed on the British National Grid Service, demonstrating its usability and usefulness for researchers. Biocep could become an essential building block of a new generation of distributed or web-based statistical software. The virtual workbench enhances the user experience and the productivity of anyone working with R. As Biocep is extensible, it enables the emergence of repositories of plugins. This interoperability, coupled with a large-scale deployment of virtualization infrastructures on various grids, democratizes R-based HPC and enables users to compute and visualize data from within their browsers with unprecedented flexibility and performance. The adoption of the new platform would be a step forward in the direction of interoperability, reusability and seamless integration of research resources (and therefore a reproducible research enabler). Finally, Biocep may work as an enabler of a new computing business model that would combine the utility computing model (resources) and the pay-per-use software model (components/GUIs).

    Citation

      Karim Chine, "Biocep, Towards a Federative, Collaborative, User-Centric,
      Grid-Enabled and Cloud-Ready Computational Open Platform,"
      escience, pp. 321-322, 2008 Fourth IEEE International Conference on eScience, 2008


    Author

      Karim Chine (CV)

    Talks, Tutorials, Conferences


    License

      This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

    Project Source Code

    • the public svn link (anonymous access) : svn://svn.r-forge.r-project.org/svnroot/biocep
    • Project summary page on R-Forge: here

    Project Documentation

    • Article: ''Scientific Computing Environments in the age of virtualization'' : pdf - doc
    • Article: ''Biocep, an e-Science Computational Platform for the Cloud'' : pdf (available also on the IEEE Computer Society Digital Library here )
    • Slides about biocep : pdf ppt
    • Bio-IT World slides : ppt
    • Getting Started / User Manual / Flash tutorials / How-tos and recipes / FAQ / Javadoc: not yet available
    • Biocep Wiki : here
    • Handbook of Cloud Computing (Springer), list of articles


    Project Deliverables & Howtos (Last update: June 1st, 15h25)

    For Windows :

    • R Workbench without R, without plugins and without extensions here (install R first, and make it accessible from your command line by adding its binary's location to your system PATH)
    • R Workbench without R, with plugins (EC2/S3 monitors + examples) and with extensions (OpenOffice-based file converter) here
    • R Workbench with R (2.8.0), with plugins (EC2/S3 monitors + examples) and with extensions (OpenOffice-based file converter) here
    • readme

    For all Operating Systems :

    • Prerequisites:

      • Java 5 or later (JRE) installed
        to run the workbench and connect to R servers on remote hosts

      • Java 5 or later (JDK) and R >= 2.5 installed and accessible from the command line
        to run R servers from the command line or via the workbench
        to generate Java mappings for S4 classes and R functions
        to generate Web Services exposing R functions
        to run the miniature R virtualization, the R-SOAP and the generated Web Services web applications

    • The Virtual R Workbench


      • As a Java Web Start application (recommended): here
        create and connect to an R server on your machine: choose "Create New R Server", choose "On My Machine", OK

      • As a desktop application:
        download biocep.jar: here
        from your command line: java -jar biocep.jar
        create and connect to an R server on your machine: choose "Create New R Server", choose "On My Machine", OK
      • As an applet: here
        create and connect to an R server on your machine: choose "Create New R Server", choose "On My Machine", OK

    • The Biocep Core

      • download it here

      • run an R server
        from your command line: rmiregistry & (on Windows: start rmiregistry)
        from your command line: java -Dname=toto -jar biocep-core.jar (replace toto with any other name, repeat the command with different server names to run several R servers)
        connect to the server : open the workbench, choose "Connect to R via RMI", choose "Use:" → "Rmi Registry", click on "Refresh", choose your R Server name (toto, ..), OK
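
        The "repeat the command with different server names" step can be scripted. The sketch below is in Python and only assembles the java command line given above; the helper names r_server_command and start_r_servers are hypothetical, and it assumes that rmiregistry is already running and that biocep-core.jar is in the current directory.

```python
import subprocess


def r_server_command(name, jar="biocep-core.jar"):
    """Build the argument list for: java -Dname=<name> -jar <jar>."""
    if not name:
        raise ValueError("server name must not be empty")
    return ["java", f"-Dname={name}", "-jar", jar]


def start_r_servers(names, jar="biocep-core.jar"):
    """Start one R server process per name (assumes rmiregistry is running)."""
    return [subprocess.Popen(r_server_command(name, jar)) for name in names]
```

        For example, start_r_servers(["toto", "titi"]) would launch two servers that then appear under those names after clicking "Refresh" in the workbench.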

      • Jython scripting with an R server
        example (connect to an existing R server)

      • Groovy scripting with an R server
        example (create an R server)

      • create and use an R server from your own Java web application: use the Biocep core for Tomcat available here

      • run the miniature R virtualization and the R-SOAP web applications (automatic download)
        java -Dport=8080 -cp biocep-core.jar HttpServer
        connect to the virtualization server : open the workbench, choose "Connect to R via Http", keep the default value for "Url", OK


    • The Miniature R Virtualization Web Application

      • download it here

        run it via the Biocep embedded Jetty server: java -Dport=8080 -cp biocep-core.jar HttpServer rvirtual.war (or deploy it to Tomcat)
        connect to the virtualization server :
        • open the R workbench
        • choose "Connect to R via Http"
        • keep the default value for "Url"
        • keep "Private R" checked
        • keep "Private R Name" empty or enter a name of your choice for your R Server if you would like to keep it alive after logging off (for future reconnections, reenter the same server name)
        • OK


        connect to the virtualization server from Java: example (requires biocep-core.jar)

    • The R-SOAP Web Application

      • download it here

        run it via the Biocep embedded Jetty server: java -Dport=8080 -cp biocep-core.jar HttpServer rvirtual.war rws.war (or deploy it to Tomcat)
        (needs the miniature R virtualization web application to run)
        get the R-SOAP WSDL: open the following URL: http://127.0.0.1:8080/rws/rGlobalEnvFunction?wsdl
        use the URL to generate a Web Service client for R-SOAP; to use R-SOAP from Java: example, R-SOAP Java client Eclipse project

    • The Biocep Tools

      • download it here

      • generate stateful and stateless Web Services for R functions
        download (save as) globals.r and rjmap.xml
        add your function definitions and their dependencies (library calls, etc.) to globals.r
        in globals.r, add TypeInfos to your functions
        in rjmap.xml, add "function" tags with your function names under "rj"/"publish"/"functions"
        java -Dfile=rjmap.xml -Dwarname=MyWebService -jar biocep-tools.jar
        run the generated web services web application (distrib/MyWebService.war) and the miniature virtualization web application:
        java -Dport=8080 -cp biocep-core.jar HttpServer rvirtual.war distrib/MyWebService.war (or deploy them to Tomcat)
        use the following URL (WSDL): http://127.0.0.1:8080/MyWebService/rGlobalEnvFunction?wsdl to generate a web service client that uses your published functions
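
        The two WSDL addresses quoted on this page (for rws and for MyWebService) follow the same pattern, which the Python sketch below makes explicit. The helpers wsdl_url and fetch_wsdl are hypothetical names; the assumption that any generated web application uses the same rGlobalEnvFunction path is inferred from those two examples only.

```python
from urllib.request import urlopen


def wsdl_url(webapp, host="127.0.0.1", port=8080):
    """Build the WSDL address of a deployed web application.

    `webapp` is the war name without extension, e.g. "rws" or "MyWebService".
    """
    return f"http://{host}:{port}/{webapp}/rGlobalEnvFunction?wsdl"


def fetch_wsdl(webapp, host="127.0.0.1", port=8080, timeout=10):
    """Download the WSDL document as bytes (the server must be running)."""
    with urlopen(wsdl_url(webapp, host, port), timeout=timeout) as response:
        return response.read()
```

        The returned address can be passed to the client generator of your SOAP toolkit of choice.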


    • Biocep Plugins

      • A simple example of a plugin created with the Netbeans GUI designer
        download the SimplePlugin.jar here
        run the workbench as a desktop application: java -jar biocep.jar
        create and connect to an R server on your machine : choose "Create New R Server", choose "On My Machine", OK
        go to the menu "Plugins" / "Open Plugin View From Jar File" → "Choose jar" → pick SimplePlugin.jar → OK
        in the new view, set a value for n and click "Submit"; the SVG panels are resizable

      • A plugin embedding xulrunner. Enables the use of Firefox (browser), Elasticfox (EC2 monitor) and S3fox (S3 monitor) as views of the workbench
        download the mozillabrowser.zip here
        run the workbench as a desktop application: java -jar biocep.jar
        create and connect to an R server on your machine : choose "Create New R Server", choose "On My Machine", OK
        go to the menu "Plugins" / "Install Plugin from Zip File" → choose the mozillabrowser.zip file → OK
        open Elasticfox (EC2 monitor) : go to the menu "Plugins" / mozillabrowser / Elasticfox - EC2 monitor
        open S3fox (S3 monitor) : go to the menu "Plugins" / mozillabrowser / S3fox - S3 monitor

      • A Netbeans project for creating plugins visually and distributing them via simple URLs
        download and unzip BiocepPluginsStudio.zip here
        Open the project with Netbeans and edit MyDashboard (source editing or visual editing)
        Press F11 to build the plugin
        1. Open the workbench, then open Plugins / ISMB / MyDashboard
        2. Use the first displayed URL to distribute an EC2-based version of your new view
        3. Use the second displayed URL to distribute a version of your new view that transparently creates an R engine on the user's machine and uses it



    Biocep-R on Amazon's Cloud

    • Getting Started with Amazon EC2

      • Sign up for Amazon EC2 here
      • Install Elasticfox, the Mozilla Firefox extension for interacting with Amazon EC2 from here
      • Learn how to use Elasticfox to connect to your EC2 account, browse available AMIs (Amazon Machine Images) and run AMIs from here
      • A few issues, such as the key conversion needed to ssh into the virtual machine instances, are covered in the EC2 getting started documentation here

      • The following Workbench plugin: mozillabrowser embeds Firefox, Elasticfox and S3fox as views of the Workbench
        Unzip the plugin under ~/RWorkbench/plugins (on Windows: %UserProfile%\RWorkbench\plugins)
        Start the workbench and choose the menu "Plugins" / "Mozilla browser" / "Elasticfox - EC2 Monitor" to run the standalone Elasticfox
    • Start the Biocep-R AMI ami-cd5fb9a4 : Ubuntu 9.04 Jaunty Jackalope / R version 2.9.0 / Scilab 5.1.0 / Java version 1.6.0

      • Find ami-cd5fb9a4 (select region "us-east-1", search with the AMI id or with the keyword "biocep"; the AMI manifest is: biocep-ubuntu904-r290-j160-sci510-cologne/biocepimage.manifest.xml)
      • Create a key pair if you don't already have one
      • Create a security group with one open port of your choice, {my_port}: add a permission for TCP/IP port {my_port} open to the network 0.0.0.0/0
      • Run ami-cd5fb9a4, choose your key pair and your security group, and insert the following into the user data field:
        start=true
        port={my_port}
        login={my_login}
        pwd={my_pwd}
        email={my_email}
        workers={nbr_workers}
      • When the AMI starts running, you receive an email with the URL to use to connect the Workbench to the AMI
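
      The user data block above is a plain list of key=value lines, so it can be generated and checked with a few lines of code. The Python sketch below is illustrative only: build_user_data and parse_user_data are hypothetical helpers, and the validation rules (port between 1 and 65535, at least one worker) are assumptions rather than documented constraints of the AMI.

```python
def build_user_data(port, login, pwd, email, workers):
    """Assemble the key=value lines expected in the EC2 user data field."""
    port, workers = int(port), int(workers)
    if not 1 <= port <= 65535:          # assumption: any valid TCP port
        raise ValueError("port must be between 1 and 65535")
    if workers < 1:                     # assumption: at least one worker
        raise ValueError("workers must be at least 1")
    fields = [
        ("start", "true"),
        ("port", port),
        ("login", login),
        ("pwd", pwd),
        ("email", email),
        ("workers", workers),
    ]
    return "\n".join(f"{key}={value}" for key, value in fields)


def parse_user_data(text):
    """Read key=value lines back into a dictionary of strings."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {line!r}")
        result[key.strip()] = value.strip()
    return result
```

      The port given here must be the same {my_port} that was opened in the security group; otherwise the Workbench cannot reach the instance.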






    Biocep-R on Virtual Appliances

    • Download and install the VMware player from here. On Mac, use VMware Fusion.
    • Download and unzip the VMware image (R+Scilab+Biocep) from here.
    • Double-click on Ubuntu-server-9.04-i386.vmx (file under the folder "ubuntu-r-scilab-biocep") to run the virtual machine. When asked whether you moved or copied the image, answer "I moved it".
    • The machine displays "Host IP :" followed by its IP address.


    They talk about Biocep

    • CRAN Task Views - High Performance Computing here
    • State-of-the-art in Parallel Computing with R - Technical Report Number 47, 2009 - Department of Statistics, University of Munich here
    • BusinessWeek Cloud Computing Ad Section here
    • Hans Gilde's weblog here
    • DecisionStats blog here
    • R & BioConductor Manual here
    • Bitlab Wiki here
    • Enabling reproducible research: licensing for scientific innovation here
    • Interview with the author (Decisionstats) here


    R Virtualization

    R Virtualization Diagram

    R Servers Pool - Deployment

    R Servers Pool - Deployment Diagram

    R Servers Pool - Architecture

    R Servers Pool - Architecture Diagram

    R Virtualization on an LSF Cluster

    R Virtualization on an LSF Cluster Diagram

    Biocep on the National Grid Service

    Biocep on the National Grid Service Diagram

    Scripting with R

    Scripting

    Web Services Generation

    Web Services Generation Diagram

    Workflows with Generated Stateful Web Services

    Workflows with Stateful Web Services Diagram

    Workbench Plugins

    Workbench Plugins Diagram

    Collaborative R

    Collaborative R Diagram

    Standard R Objects Mapping Class Diagram

    Scripting

    Generated Mapping for S4 ExpressionSet Class Diagram

    Scripting

    Acknowledgements

      ACS: Madi Nassiri
      Amazon: Simone Brunozzi, Deepak Singh
      AT&T Research Labs: Simon Urbanek
      ATUGE: Imen Essafi, Béchir Tourki, Ilyes Gouja, Hatem Hachicha, Amine Elleuch
      Banca d'Italia: Giuseppe Bruno
      Bio-IT World: Kevin Davies
      Cambridge Healthtech Institute: Cindy Crowninshield
      City University of New York: Mario Morales, Makram Talih
      Columbia University: Omar Besbes
      Dataspora: Michael E. Driscoll
      EBI: Alvis Brazma, Wolfgang Huber, Kimmo Kallio, Misha Kapushesky, Michael Kleen, Alberto Labarga, Philippe Rocca-Serra, Ugis Sarkans, Kirsten Williams, Eamonn Maguire
      EPFL: Darlene Goldstein
      ETH Zürich: Yohan Chalabi, Diethelm Würtz, Martin Mächler
      EVRI.com: Seth Falcon
      FHCRC: Martin Morgan, Nianhua Li
      FVG LLC: Lisa Wood
      Google: Olivier Bosquet
      Harvard Business School: Ousseynou Nakoulima
      Harvard University: Tim Clark, Sudeshna Das, Douglas Burke, Paolo Ciccarese
      IBM: Jean-Louis Bernaudin, Pascal Sempe, Loic Simon, Lea A Deleris, Alex Fleischer, Alain Chabrier
      Imperial College London: Asif Akram, Vasa Curcin, John Darlington, Brian Fuchs
      Indiana University: Michael Grobe
      INRIA: David Monteau
      JISC: David Flanders
      Johnson & Johnson - Janssen Pharmaceutica: Patrick Marichal
      Lancaster University: Robert Crouchley, Daniel Grose
      Leibniz Universität Hannover: Kornelius Rohmeier
      Limagrain: Zivan Karaman
      Mekentosj: Alexander Griekspoor
      Microsoft: Eric Le Marois, Tony Hey
      Mubadala: Ghazi Ben Amor
      Nature Publishing Group: Ian Mulvany, Steve Scott
      NCeSS: Peter Halfpenny, Rob Procter, Marzieh Asgari-Targhi, Alex Voss, YuWei Lin, Mercedes Argüello Casteleiro, Wei Jie, Meik Poschen, Katy Middlebrough, Pascal Ekin, June Finch, Farzana Latif, Elisa Pieri, Frank O'Donnell, Kenny Baird
      New York Java User Group: Frank D Greco
      OeRC: Dimitrina Spencer, Matteo Turilli, David Wallom, Steven Young
      OMII-UK: Neil Chue Hong, Steve Brewer
      OpenAnalytics: Tobias Verbeke
      Oracle: Dominique van Deth, Andrew Bond
      OSS Watch: Ross Gardler
      Platform Computing: Christopher Smith
      San Diego Supercomputer Center: Nancy R. Wilkins-Diehr
      Sanger Institute: Daniel Jeffares, Matt Wood, Phil Butcher
      Shell: Wayne.W.Jones, Nigel Smith
      Stanford University: John Chambers, Balasubramanian Narasimhan, Gunter Walther
      SYSTEM@TIC: Karim Azoum
      Technische Universität Dortmund: Uwe Ligges, Bernd Bischl
      The Generations Network: Jim Porzak
      Tunisian Ministry of Communication Technologies: Lamia Chaffai-Sghaier, Mohamed Saïd Ouerghi
      Tunisian Ecole Polytechnique: Riadh Robbana
      UC Berkeley: Noureddine El Karoui, Terry Speed
      UC Davis: Rudy Beran, Debashis Paul, Duncan Temple Lang
      UCLA: Ivo Dinov
      UCSF: Tena Sakai
      Université Catholique de Louvain: Christian Ritter
      University of Cambridge: Ian Roberts, Robert MacInnis, Peter Murray-Rust, Jim Downing
      University of Manchester: Carole Goble, Len Gill, Simon Peters, Richard D Pearson, Iain Buchan, John Ainsworth
      University of Plymouth: Paul Hewson
      University of Split: Ivica Puljak
      UTK: Ajay Ohri
      World Bank Group-IFC: Oualid Ammar
      Yahoo: Laurent Mirguet, Rob Weltman
      Charles Dallas, Romain François, Manfred Duchrow, Joerg Mueller, Slava Pestov, ..