Episode
Rc2: an Environment for Running R and Spark in Docker Containers
with Jim Harner
useR!2017: Rc\(^2\): an Environment for Running R and...
Keywords: R, Spark, Docker containers, Kubernetes, Cloud computing
Rc\(^2\) (R cloud computing) is a containerized environment for running R, Hadoop, and Spark with various persistent data stores, including PostgreSQL, HDFS, HBase, and Hive. At this time, the server side of Rc\(^2\) runs on Docker Community Edition, which can run on the same machine as the client, on a server, or in the cloud. Currently, Rc\(^2\) supports a macOS client; iOS and web clients are in active development.
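As one illustration of how an R session in the compute-engine container might talk to one of these persistent stores, here is a minimal sketch of a connection to the PostgreSQL container using DBI and RPostgres; the host name, database name, and credentials are assumptions for the example, not part of Rc\(^2\).

# Minimal sketch: connecting an R session to the PostgreSQL data store.
# The host name "rc2_db", database "rc2", and credentials are illustrative only.
library(DBI)

con <- dbConnect(
  RPostgres::Postgres(),
  host     = "rc2_db",                      # hypothetical name of the database container
  port     = 5432,
  dbname   = "rc2",
  user     = "rc2user",
  password = Sys.getenv("RC2_DB_PASSWORD")  # assume the password is injected via the environment
)

dbWriteTable(con, "mtcars", mtcars, temporary = TRUE)   # stage a sample table
dbGetQuery(con, "SELECT cyl, AVG(mpg) AS avg_mpg FROM mtcars GROUP BY cyl")
dbDisconnect(con)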
The clients are designed for small or large screens, with a left editor panel and a right console/output panel. The editor panel supports R scripts, R Markdown, and Sweave; bash, SQL, Python, and additional languages will be added. The right panel allows toggling among the console and graphical objects, as well as among generated help, HTML, and PDF files. A slide-out panel allows toggling among session files, R environments, and R packages. Extensive search capabilities are available in all panels.
The base server configuration has containers for an app server, a database server, and a compute engine. The app server communicates with the client. The compute engine is available with or without Hadoop/Spark. Additional containers can be added or removed from within Rc\(^2\) while it is running, or various prebuilt topologies can be launched from the Welcome window. Multiple sessions can be run concurrently in tabs; for example, a local session could be running alongside another session connected to a Spark cluster (see the sketch below).
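The abstract does not prescribe an API for the Spark-enabled compute engine; as a sketch under that caveat, sparklyr is one way an R session could attach to a local Spark instance or to a cluster (the master URL in the comment is a hypothetical standalone master, not an Rc\(^2\) endpoint).

# Sketch only: sparklyr standing in for the R-to-Spark connection.
library(sparklyr)
library(dplyr)

# A "local" session: Spark running inside the compute-engine container.
sc_local <- spark_connect(master = "local")

# A second session attached to a Spark cluster would differ only in the master URL,
# e.g. spark_connect(master = "spark://spark-master:7077")   # hypothetical host

# Copy a data frame to Spark and summarize it there.
mtcars_tbl <- copy_to(sc_local, mtcars, overwrite = TRUE)
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

spark_disconnect(sc_local)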
Although the Rc\(^2\) architecture supports physical servers and clusters, the direction of computing is toward virtualization. The Docker containers in Rc\(^2\) can be orchestrated by Kubernetes to build arbitrarily large virtual clusters for the compute engine (e.g., parallel R) and/or for Hadoop/Spark. The initial focus is on building a virtual cluster from Spark containers, orchestrated by Kubernetes and backed by a persistent data store, e.g., HDFS. The ultimate goal is to build data science workflows, e.g., ingesting streaming data into Kafka, moving it into a data store, and passing it to Spark Streaming.
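For the parallel-R side of the compute engine, a minimal sketch of the kind of embarrassingly parallel workload such a virtual cluster could serve, using base R's parallel package as a stand-in (the worker count is arbitrary; containerized workers orchestrated by Kubernetes would take the place of local processes):

# Sketch only: base R's parallel package as a stand-in for the compute-engine cluster.
library(parallel)

cl <- makeCluster(4)   # four local workers; containerized workers are analogous

# Embarrassingly parallel task: bootstrap the mean of a sample.
boot_mean <- function(i, x) mean(sample(x, length(x), replace = TRUE))

x <- rnorm(1e4)
means <- parSapply(cl, 1:1000, boot_mean, x = x)

quantile(means, c(0.025, 0.975))   # bootstrap interval for the mean
stopCluster(cl)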