In this section I briefly cover what the new RPlugin package for Weka >= 3.7.6 offers. This package can be installed via Weka's built-in package manager.
Here is a list of the functionality implemented:
- Execution of arbitrary R scripts in Weka's Knowledge Flow engine
- Transfer of datasets into and out of the R environment
- Transfer of textual results out of the R environment
- Graphics out of R in PNG format, produced via the JavaGD graphics device for R, for viewing inside of Weka and saving to files
- A perspective for the Knowledge Flow and a plugin tab for the Explorer that provides visualization of R graphics and an interactive R console
- A wrapper classifier that invokes learning and prediction of R machine learning schemes via the MLR (Machine Learning in R) library
The following screenshot shows the execution of two separate R scripts in Weka's Knowledge Flow environment. This is accomplished by a new RScriptExecutor step for the Knowledge Flow.
The upper part of the flow loads a dataset in Weka's ARFF format and passes it to an RScriptExecutor step that first pushes the data into R as a data frame, and then learns an rpart decision tree in R. The tree, in text form, is then sent to a TextViewer component. The lower part of the flow uses a second RScriptExecutor step to load the iris data (inside of the R environment) and then create a scatter plot matrix using the "pairs" function. It also exports the iris data from R into Weka's internal "Instances" format and sends this to a second TextViewer. The scatter plot matrix produced by the R script is exported as a PNG and sent to an ImageSaver step. The GUI dialog for the RScriptExecutor, showing the R script that produces these results, is shown at the bottom of the screenshot.
Any graphics produced by an RScriptExecutor step are also picked up by the "RConsole/visualize" perspective for the Knowledge Flow.
This perspective (which is also available in Weka's Explorer as a plugin tab) maintains a list of images produced by Knowledge Flow processes. It also provides an interactive R console where R commands can be typed and evaluated immediately.
In order to evaluate R machine learning models in the Weka framework, and to use Weka as a vehicle for operationalizing such models, it is necessary to go beyond just executing R scripts. The MLR wrapper classifier for Weka provides a bridge between the MLR library in R and Weka's "Classifier" API. It allows R models to be learned, evaluated and used for prediction inside of Weka's framework. It also allows the models learned in R to be persisted via serialization and encapsulated in the MLRClassifier for use at a later date. The following screenshots show the MLRClassifier at work in a Knowledge Flow process and in Weka's Explorer UI.
R integration in Pentaho's PDI data integration tool, for scoring/prediction using R models, is achieved with minimal effort using the existing WekaScoring plugin step for PDI. WekaScoring already handles scoring using pre-constructed Weka models (classifiers and clusterers) and PMML models. Since MLRClassifier is a Weka classifier, it can be consumed immediately by the step, so R models can be used for scoring inside of a PDI transformation.
It is also possible to execute R scripts and construct R predictive models from scratch as part of a PDI transformation using the existing Knowledge Flow plugin step for PDI. This allows, for example, R predictive models to be refreshed and R visualizations to be generated as part of an automated ETL process.
Weka's R integration uses the JRI library, which provides a JNI interface to the R native libraries. This, of course, requires that the user have R installed on their computer and that they have installed the rJava package (which includes JRI) from within the R environment. It also requires several environment variables to be set in order for the JRI native library and dependent R libraries to be found. The RPlugin package has instructions for easing this pain and a mechanism that attempts to find the JRI library in the most common installation locations under Windows and Mac OS X. Once JRI and R are available to the Java VM, Weka's RPlugin will install various R libraries (such as MLR) automatically.
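To make the idea of probing common installation locations concrete, here is a minimal sketch in plain Java. It is not the RPlugin's actual lookup code: the class name JriLocator, its methods, and the candidate directory layout (an rJava/jri folder under R_HOME's library directory, plus the entries of java.library.path) are assumptions made for illustration only.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Illustrative sketch only; not the RPlugin's actual lookup mechanism. */
final class JriLocator {
    private JriLocator() {}

    /** Returns the first candidate directory that contains a regular file named fileName. */
    static Optional<Path> findLibraryDir(String fileName, List<Path> candidateDirs) {
        for (Path dir : candidateDirs) {
            if (Files.isRegularFile(dir.resolve(fileName))) {
                return Optional.of(dir);
            }
        }
        return Optional.empty();
    }

    /** Builds candidate directories from the environment. The layout is an assumption. */
    static List<Path> candidatesFromEnvironment() {
        List<Path> dirs = new ArrayList<>();
        String rHome = System.getenv("R_HOME");
        if (rHome != null && !rHome.isEmpty()) {
            // Assumes rJava was installed into R's own library directory.
            dirs.add(Paths.get(rHome, "library", "rJava", "jri"));
        }
        // Also probe whatever the JVM was already told to search.
        String libPath = System.getProperty("java.library.path", "");
        for (String entry : libPath.split(File.pathSeparator)) {
            if (!entry.isEmpty()) {
                dirs.add(Paths.get(entry));
            }
        }
        return dirs;
    }

    public static void main(String[] args) {
        // Pass the platform's JRI file name (e.g. libjri.so on Linux, jri.dll on Windows).
        String fileName = args.length > 0 ? args[0] : "libjri.so";
        Optional<Path> found = findLibraryDir(fileName, candidatesFromEnvironment());
        System.out.println(found.map(p -> "JRI found in " + p)
                                .orElse("JRI not found in candidate directories"));
    }
}
```

A real implementation would also need to handle per-user R library locations and make the dependent R libraries visible, which is what the environment variables mentioned above are for.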
Class loaders + native libraries (combined with the single-threaded nature of the R environment) add up to quite a headache when considering things like plugin environments, application servers and the like. Weka's RPlugin can be used in such environments where it is loaded (perhaps multiple times) by plugin class loaders. To achieve native library visibility across child class loaders, and to maintain a single point of access to R by clients, the byte code for certain key classes (from JRI, REngine and Weka) is injected into the root class loader very early in the class loading process. Many thanks to the guys over at the snappy-java project for detailing this approach.
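The "single point of access" half of this can be pictured with a small, generic Java sketch that funnels every call through one worker thread. This is an assumption about how such a gateway might look, not the RPlugin's code; the class name SingleThreadGateway and its call method are hypothetical, and the byte code injection itself is not shown.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Illustrative sketch: serializes all work onto one thread, as a single-threaded engine requires. */
final class SingleThreadGateway {
    // One worker thread for the whole class; daemon so it does not keep the JVM alive.
    private static final ExecutorService WORKER = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "r-gateway");
        t.setDaemon(true);
        return t;
    });

    private SingleThreadGateway() {}

    /**
     * Runs the task on the single worker thread and waits for its result.
     * Must not be called from inside another task, or it would wait on itself.
     */
    static <T> T call(Callable<T> task) {
        try {
            return WORKER.submit(task).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    public static void main(String[] args) {
        String first = call(() -> Thread.currentThread().getName());
        String second = call(() -> Thread.currentThread().getName());
        // Both tasks ran on the same worker thread.
        System.out.println(first.equals(second));
    }
}
```

Note that a static field like WORKER exists once per class loader that loads the class. If several plugin class loaders each loaded their own copy, there would be several gateways, which is exactly the situation that placing the key classes in the root class loader avoids.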
PDI is a streaming environment and so is Weka (as far as prediction goes). R and MLR operate most efficiently in a batch fashion since the data frame is the structure that is used for both learning a model and making predictions. As the conversion and transfer of data from Weka or PDI into R is costly, the best performance is obtained by pushing over data in batches for prediction. Prediction using R models in Weka 3.7.6 is slow because each test instance has to be transferred into R as a separate data frame. The next release of Weka (3.7.7), due out in August, rectifies this with a new batch prediction interface. (Note that nightly snapshots of Weka already include this performance improvement.)
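The effect of batching on the number of transfers can be sketched in a few lines of plain Java. This is not Weka's batch prediction interface: BatchScorer is a hypothetical class, and the function passed to it stands in for "convert these rows to a data frame, push it to R, and get predictions back".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Illustrative sketch: buffers streamed rows and scores them in batches. */
final class BatchScorer {
    private final int batchSize;
    // Stand-in for one costly round trip: takes a batch of rows, returns one prediction per row.
    private final Function<List<double[]>, double[]> batchPredictor;
    private final List<double[]> buffer = new ArrayList<>();
    private final List<Double> predictions = new ArrayList<>();
    private int transfers = 0;

    BatchScorer(int batchSize, Function<List<double[]>, double[]> batchPredictor) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.batchPredictor = batchPredictor;
    }

    /** Accepts one streamed row; triggers a transfer only when the buffer is full. */
    void accept(double[] row) {
        buffer.add(row);
        if (buffer.size() == batchSize) {
            flush();
        }
    }

    /** Scores whatever is buffered; call once at end of stream for the final partial batch. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        double[] out = batchPredictor.apply(new ArrayList<>(buffer));
        transfers++;
        for (double p : out) {
            predictions.add(p);
        }
        buffer.clear();
    }

    List<Double> predictions() { return predictions; }

    int transfers() { return transfers; }

    public static void main(String[] args) {
        // Dummy predictor: doubles the first value of each row.
        Function<List<double[]>, double[]> dummy =
            rows -> rows.stream().mapToDouble(r -> r[0] * 2).toArray();
        BatchScorer perInstance = new BatchScorer(1, dummy);
        BatchScorer batched = new BatchScorer(50, dummy);
        for (int i = 0; i < 100; i++) {
            perInstance.accept(new double[] { i });
            batched.accept(new double[] { i });
        }
        perInstance.flush();
        batched.flush();
        System.out.println(perInstance.transfers() + " transfers vs " + batched.transfers());
    }
}
```

With a batch size of 1 this degenerates to the per-instance behaviour described for Weka 3.7.6, where every row pays the full conversion and transfer cost; larger batches amortize that cost over many rows at the price of some added latency in a streaming setting.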