controlR
controlR is a node module for running commands in an external R process.
License
controlR is copyright (c) 2016 Structured Data LLC and released under the MIT license.
See source files for license details.
Rationale
R is great for data processing, in particular because of the excellent libraries developed by the user community. Once you have developed your data model, you can run it through the R shell or GUI provided. Beyond that, it's pretty easy to embed R in a C/C++ application. You can build a full-featured desktop application or service this way.
Requiring either the R shell or a C++ host application can be a bit limiting, though. Modern applications tend to use a variety of platforms, languages, and environments to solve different problems. Javascript, and in particular javascript through Node, has become the de facto language for modern application development. It's not for everything; but it solves a lot of problems, provides a complex ecosystem of libraries and developers, and (via electron, among other platforms) supports desktop applications as well as services.
Javascript is not for everything -- that's why we want to use R in the first place. But node provides a great environment for building applications. controlR is built to be glue code that lets you build your application in javascript, your data model in R, and connect the two.
Moreover, because controlR runs R in an external process, you get benefits beyond those of embedding R in a C++ application -- for example, you can monitor execution and kill off runaway processes, or you can start multiple instances and run code in parallel.
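As a rough illustration of the "monitor and kill" point, the sketch below uses only Node's built-in child_process module. It is not controlR's API: the helper name runWithTimeout is hypothetical, and a second node process stands in for the R host.

```javascript
const { spawn } = require("child_process");

// Hypothetical helper (not part of controlR): start a child process and
// kill it if it is still running after `timeoutMs` milliseconds.
function runWithTimeout(command, args, timeoutMs) {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, { stdio: "ignore" });
    let timedOut = false;

    // Watchdog: fires only if the child has not exited in time.
    const timer = setTimeout(() => {
      timedOut = true;
      child.kill(); // sends SIGTERM by default
    }, timeoutMs);

    // Emitted if the process could not be spawned at all.
    child.on("error", (err) => {
      clearTimeout(timer);
      reject(err);
    });

    // Emitted when the child ends, whether on its own or because it was killed.
    child.on("exit", (code, signal) => {
      clearTimeout(timer);
      resolve({ code, signal, timedOut });
    });
  });
}

// Usage: a stand-in "runaway" process that never exits by itself.
// runWithTimeout(process.execPath, ["-e", "setInterval(() => {}, 1000)"], 500)
//   .then((result) => console.log(result.timedOut));
```

Because each child is an independent operating-system process, the same helper could be called several times to run work in parallel.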
The child process will run R's event loop, meaning it supports things like R's html help server. This is probably only useful for desktop applications.
One further note on rationale -- for our purposes we did not want to modify the R source code in any way. controlR binds against the R shared libraries (DLLs on Windows) at runtime. This way we can guarantee fidelity with the standard R interpreter, and building against updated versions of R is trivial.
What it is not
controlR is not designed for, nor is it suitable for, running a web service. It is designed to support a single client connection, and it places no limits on what running R code can do to the host system.
R has full access to the system (subject to the host process' permissions) and the interface imposes no security restrictions on top of this. Therefore exposing the R interface to outside users is a significant security risk, even if the host process is running with minimal permissions.
R processes maintain internal state. If multiple clients connect to the same running instance, each client will have access to and can modify that state. For some purposes this is immaterial; but for our purposes this is not desirable. Therefore there is a tight binding between a single client and a single R process.
On the other hand, there is no reason you can't run multiple R processes at the same time, and talk to them from a single client or from multiple logical clients.
Connection
controlR consists of a javascript module, for use with node; and a standalone executable, which acts as a host for R (via the shared library/dll). Communication between node and the R process runs over a domain socket (named pipe on Windows) or a TCP socket. All messages transferred between the processes are JSON formatted.
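To make the transport concrete, here is a minimal sketch of reading JSON messages from a stream socket. The text above only says that messages are JSON; the newline-delimited framing, the function name createMessageParser, and the socket path in the usage comment are assumptions for illustration, not controlR's actual wire format.

```javascript
const { StringDecoder } = require("string_decoder");

// Hypothetical framing: one JSON document per "\n"-terminated line.
// Socket data arrives in arbitrary chunks, so partial lines are buffered
// until the terminating newline shows up.
function createMessageParser(onMessage) {
  const decoder = new StringDecoder("utf8"); // handles split multi-byte characters
  let buffer = "";

  return function feed(chunk) {
    buffer += typeof chunk === "string" ? chunk : decoder.write(chunk);
    let newline;
    while ((newline = buffer.indexOf("\n")) >= 0) {
      const line = buffer.slice(0, newline).trim();
      buffer = buffer.slice(newline + 1);
      if (line.length > 0) onMessage(JSON.parse(line));
    }
  };
}

// Usage with Node's net module (the path is a placeholder):
// const net = require("net");
// const socket = net.connect("/tmp/example.sock");
// socket.on("data", createMessageParser((msg) => console.log(msg)));
```

The same parser works unchanged for a TCP socket, since both transports deliver a byte stream with no message boundaries of their own.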
Interface
controlR connects to and talks to a single R instance. Within the interface there are two separate "channels" for communication. In the API these are generally referred to as exec and internal.
The internal channel uses the embedded R interface. This is generally what you want if you want to execute some R code and get a result back. For most purposes this is sufficient to build an R application. The exec channel talks to R through R's REPL -- much as if you were using an R shell. Why do this at all? To support debugging and R's concept of a browser -- a window into executing code. Without modifying R code, this is the only way to support debugging. We can also use it to build our R shell (more on that later).
So the internal channel executes code and returns a result, as a javascript object (JSON on the wire). The exec channel executes code in a shell context, and (possibly) prints results to the output console. Anything R wishes to print to the output console is sent to the client as a javascript message (again, JSON on the wire), using node's EventEmitter interface.
R is single-threaded, and can run only one operation at a time. State is maintained by the module, which enforces linear execution. State may be polled via the busy() method, and the module broadcasts state change events.
Calls to exec or internal will fail if another call is in progress. The module provides queued_exec and queued_internal methods, which wait for a change in state and then execute.
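The busy/queued behaviour described above can be sketched with a small stand-alone class. This is an illustration of the pattern only, not controlR's implementation: the class SerialWorker, its method names, and the "state-change" event name are all hypothetical.

```javascript
const { EventEmitter } = require("events");

// Illustrative stand-in for a worker that can run one operation at a time.
class SerialWorker extends EventEmitter {
  constructor(run) {
    super();
    this._run = run;    // function(code) returning a result or a Promise
    this._busy = false;
    this._queue = [];
  }

  // Poll the current state.
  busy() {
    return this._busy;
  }

  // Direct call: rejects immediately if another call is in progress.
  exec(code) {
    if (this._busy) return Promise.reject(new Error("busy"));
    this._setBusy(true);
    return Promise.resolve()
      .then(() => this._run(code))
      .finally(() => this._setBusy(false));
  }

  // Queued call: waits until the worker is idle, then executes.
  // Queued calls run in the order they were submitted.
  queuedExec(code) {
    return new Promise((resolve, reject) => {
      this._queue.push({ code, resolve, reject });
      this._drain();
    });
  }

  // Update state, broadcast the change, and start the next queued call.
  _setBusy(value) {
    this._busy = value;
    this.emit("state-change", value);
    if (!value) this._drain();
  }

  _drain() {
    if (this._busy || this._queue.length === 0) return;
    const { code, resolve, reject } = this._queue.shift();
    this.exec(code).then(resolve, reject);
  }
}
```

A caller that cannot tolerate a failed call would use the queued variant; a caller that prefers to drop work while the worker is occupied can check busy() first or handle the rejection.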
See the js/ directory for API and event documentation.
Third-Party Dependencies
controlR depends on R and node. See build instructions. controlR further depends on libuv and nlohmann::json, used under license and included in source distributions. See the individual projects for license details.
Building
Although the node module itself is pure javascript, you need standard node build tools to build the child executable. You also need R installed. On Linux, R must be built with support for shared libraries.
To build, set an environment variable R_HOME pointing to the root of the R directory. Then use npm or node-gyp to build.
> export R_HOME=/path/to/R-3.2.3
> npm install controlr
Example
At runtime, you can either set an R_HOME environment variable or pass a value directly to the initialization method. Remember to escape backslashes in Windows paths.
const ControlR = require("controlr");
var controlr = new ControlR();
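The choice between the environment variable and an explicit path, and the backslash escaping mentioned above, can be sketched as follows. The helper resolveRHome and the option name rhome are assumptions for illustration; check the js/ directory for the actual initialization options. The Windows path is an example only.

```javascript
// Hypothetical helper: prefer an explicitly passed path and fall back to
// the R_HOME environment variable.
function resolveRHome(options = {}, env = process.env) {
  const home = options.rhome || env.R_HOME;
  if (!home) {
    throw new Error("R_HOME is not set and no path was passed");
  }
  return home;
}

// In a JavaScript string literal a backslash starts an escape sequence,
// so each backslash in a Windows path has to be doubled.
const windowsPath = "C:\\Program Files\\R\\R-3.2.3";
console.log(windowsPath); // prints C:\Program Files\R\R-3.2.3

// An explicit value wins over the environment.
console.log(resolveRHome({ rhome: windowsPath }, { R_HOME: "/path/to/R-3.2.3" }));
// prints C:\Program Files\R\R-3.2.3
```

Forward slashes are usually accepted in Windows paths as well, which avoids the escaping question altogether.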