How to use RevoScaleR in a Spark compute context
Important
This content is being retired and may not be updated in the future. The support for Machine Learning Server will end on July 1, 2022. For more information, see What's happening to Machine Learning Server?
This article provides a step-by-step introduction to using the RevoScaleR functions in Apache Spark running on a Hadoop cluster. You can use a small built-in sample dataset to complete the walkthrough, and then step through tasks again using a larger dataset.
- Download sample data
- Start Revo64
- Create a compute context for Spark
- Copy a data set into HDFS
- Create a data source
- Summarize your data
- Fit a linear model to the data
Fundamentals
In a Spark cluster, you typically connect to Machine Learning Server on the edge node for most of your work, writing and running script in a local compute context, using client tools and the RevoScaleR engine on that machine. Your script calls the RevoScaleR functions to execute scalable and high-performance data management, analysis, and visualization functions. As with any script using RevoScaleR, you can call base R functions, as well as other functions available in packages you have installed separately. The RevoScaleR engine is built on the R interpreter, extending but not breaking any core functions you are accustomed to using.
To load data and run analyses on worker nodes, you set the compute context in your script to RxSpark. In this context, RevoScaleR functions automatically distribute the workload across all the worker nodes, with no built-in requirement for managing jobs or the queue.
Internally, RevoScaleR in an RxSpark compute context uses the Hadoop infrastructure, including the Yarn scheduler, HDFS, and of course, Spark as the processing framework. Alternatively, if project requirements dictate manual overrides, RevoScaleR provides functions for manual data allocation and scheduled processing across selected nodes.
How RevoScaleR distributes jobs in Spark and Hadoop
In a Spark cluster, the RevoScaleR analysis functions go through the following steps:
- A master process is initiated to run the main thread of the algorithm.
- The master process initiates a Spark job to make a pass through the data.
- Spark worker produces an intermediate results object for each task processing a chunk of data. These are then combined and returned to master node.
- The master process examines the results. For iterative algorithms, it decides if another pass through the data is required. If so, it initiates another Spark job and repeats.
- When complete, the final results are computed and returned.
When running on Spark, the RevoScaleR analysis functions process data contained in the Hadoop Distributed File System (HDFS). HDFS data can also be accessed directly from RevoScaleR, without performing the computations within the Hadoop framework. An example can be found in Using a Local Compute Context with HDFS Data.
Note
For a comprehensive list of functions for the Spark or Hadoop compute contexts, see RevoScaleR Functions for Hadoop. For a list of which functions support distributed computing, see Running distributed analyses using RevoScaleR.
1 - Download sample data
Sample data is required if you intend to follow the steps. This walkthrough starts with the AirlineDemoSmall.csv file from the local RevoScaleR SampleData directory.
Optionally, you can graduate to a second series of tasks using a larger dataset. The Airline 2012 On-Time Data Set consists of 12 comma-separated files containing information on flight arrival and departure details for all commercial flights within the USA, for the year 2012. This is a big data set with over six million observations.
To download the larger dataset, go to https://packages.revolutionanalytics.com/datasets/.
2 - Start Revo64
Machine Learning Server for Hadoop runs on Linux. On the edge node, start an R session by typing Revo64
at the shell prompt: $ Revo64
3 - Spark compute context
The Spark compute context is established through RxSpark or rxSparkConnect() to create the Spark compute context, and uses rxSparkDisconnect() to return to a local compute context.
The rxSparkConnect() function maintains a persistent Spark session and results in more efficient execution of RevoScaleR functions for this context.
In the Revo64 command line, define a Spark compute context using default values:
myHadoopCluster <- ()
The default settings include a specification of /var/RevoShare/$USER as the shareDir and /user/RevoShare/$USER as the hdfsShareDir. These are the default locations for writing files to the native local file system and HDFS file system, respectively. These directories must be writable for your cluster jobs to succeed. You must either create these directories or specify suitable writable directories for these parameters. If you are working on a node of the cluster, the default specifications for the shared directories are:
myShareDir = paste( "/var/RevoShare", Sys.info()[["user"]],
sep="/" )
myHdfsShareDir = paste( "/user/RevoShare", Sys.info()[["user"]],
sep="/" )
Tip
You can have multiple compute context objects available for use, but you can only use one at a time. Specify the active compute context using the rxSetComputeContext function: rxSetComputeContext(myHadoopCluster2)
Defining a Compute Context on a High-Availability Cluster
On a Hadoop cluster configured for high-availability, you must specify the node providing the name service using the nameNode argument to RxSpark, and also specify the Hadoop port with the port argument:
myHadoopCluster <- RxSpark(nameNode = "my-name-service-server", port = 8020)
4 - Copy data to HDFS
Start with the built-in data set, AirlineDemoSmall.csv provided with RevoScaleR. You can verify that it is on your local system as follows:
file.exists(system.file("SampleData/AirlineDemoSmall.csv",
package="RevoScaleR"))
[1] TRUE
To use this file in our distributed computations, it must first be copied to Hadoop Distributed File System (HDFS). For our examples, we make extensive use of the HDFS shared directory, /share:
bigDataDirRoot <- "/share" # HDFS location of the example data
First, check to see what directories and files are already in your shared file directory. You can use the rxHadoopListFiles function, which will automatically check your active compute context for information:
rxHadoopListFiles(bigDataDirRoot)
If the AirlineDemoSmall subdirectory does not exist and you have write permission, you can use the following functions to copy the data there:
source <-system.file("SampleData/AirlineDemoSmall.csv",
package="RevoScaleR")
inputDir <- file.path(bigDataDirRoot,"AirlineDemoSmall")
rxHadoopMakeDir(inputDir)
rxHadoopCopyFromLocal(source, inputDir)
We can then verify that it exists as follows:
rxHadoopListFiles(inputDir)
5 - Create a data source
In a Spark compute context, you can create data sources using the following functions:
Function | Description |
---|---|
RxTextData |
A comma-delimited text data source. |
RxXdfData |
Data in the XDF data file format. In RevoScaleR, the XDF file format is modified for Hadoop to store data in a composite set of files rather than a single file. |
RxHiveData |
Generates a Hive Data Source object. |
RxParquetData |
Generates a Parquet Data Source object. |
RxOrcData |
Generates an Orc Data Source object. |
In this step, create an RxTextData object using the text file you just copied to HDFS. First, create a file system object that uses the default values:
hdfsFS <- RxHdfsFileSystem()
The input .csv file uses the letter M to represent missing values, rather than the default NA, so we specify this with the missingValueString argument. We will explicitly set the factor levels for DayOfWeek in the desired order using the colInfo argument:
colInfo <- list(DayOfWeek = list(type = "factor",
levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday")))
airDS <- RxTextData(file = inputDir, missingValueString = "M",
colInfo = colInfo, fileSystem = hdfsFS)
6 - Summarize data
Use the rxSummary function to obtain descriptive statistics for your data. The rxSummary function takes a formula as its first argument, and the name of the data set as the second.
adsSummary <- rxSummary(~ArrDelay+CRSDepTime+DayOfWeek,
data = airDS)
adsSummary
Summary statistics are computed on the variables in the formula, removing missing values for all rows in the included variables:
Call:
rxSummary(formula = ~ArrDelay + CRSDepTime + DayOfWeek, data = airDS)
Summary Statistics Results for: ~ArrDelay + CRSDepTime + DayOfWeek
Data: airDS (RxTextData Data Source)
File name: /share/AirlineDemoSmall
Number of valid observations: 6e+05
Name Mean StdDev Min Max ValidObs MissingObs
ArrDelay 11.31794 40.688536 -86.000000 1490.00000 582628 17372
CRSDepTime 13.48227 4.697566 0.016667 23.98333 600000 0
Category Counts for DayOfWeek
Number of categories: 7
Number of valid observations: 6e+05
Number of missing observations: 0
DayOfWeek Counts
Monday 97975
Tuesday 77725
Wednesday 78875
Thursday 81304
Friday 82987
Saturday 86159
Sunday 94975
Notice that the summary information shows cell counts for categorical variables, and appropriately does not provide summary statistics such as Mean and StdDev. Also notice that the Call: line shows the actual call you entered or the call provided by summary, so appear differently in different circumstances.
You can also compute summary information by one or more categories by using interactions of a numeric variable with a factor variable. For example, to compute summary statistics on Arrival Delay by Day of Week:
rxSummary(~ArrDelay:DayOfWeek, data = airDS)
The output shows the summary statistics for ArrDelay for each day of the week:
Call:
rxSummary(formula = ~ArrDelay:DayOfWeek, data = airDS)
Summary Statistics Results for: ~ArrDelay:DayOfWeek
Data: airDS (RxTextData Data Source)
File name: /share/AirlineDemoSmall
Number of valid observations: 6e+05
Name Mean StdDev Min Max ValidObs MissingObs
ArrDelay:DayOfWeek 11.31794 40.68854 -86 1490 582628 17372
Statistics by category (7 categories):
Category DayOfWeek Means StdDev Min Max ValidObs
ArrDelay for DayOfWeek=Monday Monday 12.025604 40.02463 -76 1017 95298
ArrDelay for DayOfWeek=Tuesday Tuesday 11.293808 43.66269 -70 1143 74011
ArrDelay for DayOfWeek=Wednesday Wednesday 10.156539 39.58803 -81 1166 76786
ArrDelay for DayOfWeek=Thursday Thursday 8.658007 36.74724 -58 1053 79145
ArrDelay for DayOfWeek=Friday Friday 14.804335 41.79260 -78 1490 80142
ArrDelay for DayOfWeek=Saturday Saturday 11.875326 45.24540 -73 1370 83851
ArrDelay for DayOfWeek=Sunday Sunday 10.331806 37.33348 -86 1202 93395
7 - Fit a linear model
Use the rxLinMod function to fit a linear model using your airDS data source. Use a single dependent variable, the factor DayOfWeek:
arrDelayLm1 <- rxLinMod(ArrDelay ~ DayOfWeek, data = airDS)
summary(arrDelayLm1)
The resulting output is:
Call:
rxLinMod(formula = ArrDelay ~ DayOfWeek, data = airDS)
Linear Regression Results for: ArrDelay ~ DayOfWeek
Data: airDS (RxTextData Data Source)
File name: /share/AirlineDemoSmall
Dependent variable(s): ArrDelay
Total independent variables: 8 (Including number dropped: 1)
Number of valid observations: 582628
Number of missing observations: 17372
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.3318 0.1330 77.673 2.22e-16 ***
DayOfWeek=Monday 1.6938 0.1872 9.049 2.22e-16 ***
DayOfWeek=Tuesday 0.9620 0.2001 4.809 1.52e-06 ***
DayOfWeek=Wednesday -0.1753 0.1980 -0.885 0.376
DayOfWeek=Thursday -1.6738 0.1964 -8.522 2.22e-16 ***
DayOfWeek=Friday 4.4725 0.1957 22.850 2.22e-16 ***
DayOfWeek=Saturday 1.5435 0.1934 7.981 2.22e-16 ***
DayOfWeek=Sunday Dropped Dropped Dropped Dropped
---
Signif. codes: 0 `***` 0.001 `**` 0.01 `*` 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 40.65 on 582621 degrees of freedom
Multiple R-squared: 0.001869
Adjusted R-squared: 0.001858
F-statistic: 181.8 on 6 and 582621 DF, p-value: < 2.2e-16
More Spark scenarios
Use Machine Learning Server as a Hadoop Client
A common practice is to do exploratory analysis on your laptop, then deploy the same analytics code on a Hadoop cluster. The underlying RevoScaleR engine in Machine Learning Server handles the distribution of the computations across cores and nodes automatically.
If you are running Machine Learning Server from Linux or from a Windows computer equipped with PuTTY and/or both the Cygwin shell and Cygwin OpenSSH packages, you can create a compute context that runs RevoScaleR functions from your local client in a distributed fashion on your Hadoop cluster. You use RxSpark to create the compute context, but use additional arguments to specify your user name, the file-sharing directory where you have read and write access, the publicly facing host name, or IP address of your Hadoop cluster’s name node or an edge node that run the master processes, and any additional switches to pass to the ssh command (such as the -i flag if you are using a pem or ppk file for authentication, or -p to specify a non-standard ssh port number). For example:
mySshUsername <- "user1"
#public facing cluster IP address
mySshHostname <- "12.345.678.90"
mySshSwitches <- "-i /home/yourName/user1.pem" #See NOTE below
myShareDir <- paste("/var/RevoShare", mySshUsername, sep ="/")
myHdfsShareDir <- paste("/user/RevoShare",mySshUsername, sep="/")
myHadoopCluster <- RxSpark(
hdfsShareDir = myHdfsShareDir,
shareDir = myShareDir,
sshUsername = mySshUsername,
sshHostname = mySshHostname,
sshSwitches = mySshSwitches)
Note
if you are using a pem or ppk file for authentication the permissions of the file must be modified to ensure that only the owner has full read and write access (that is, chmod go-rwx user1.pem).
If you are using PuTTY, you may incorporate the publicly facing host name and any authentication requirements into a PuTTY saved session, and use the name of that saved session as the sshHostname. For example:
mySshUsername <- "user1"
#name of PuTTY saved session
mySshHostname <- "myCluster"
myShareDir <- paste("/var/RevoShare", mySshUsername, sep ="/")
myHdfsShareDir <- paste("/user/RevoShare",mySshUsername, sep="/")
myHadoopCluster <- RxSpark(
hdfsShareDir = myHdfsShareDir,
shareDir = myShareDir,
sshUsername = mySshUsername,
sshHostname = mySshHostname)
The above assumes that the directory containing the ssh and scp commands (Linux/Cygwin) or plink and pscp commands (PuTTY) is in your path (or that the Cygwin installer has stored its directory in the Windows registry). If not, you can specify the location of these files using the sshClientDir argument:
myHadoopCluster <- RxSpark(
hdfsShareDir = myHdfsShareDir,
shareDir = myShareDir,
sshUsername = mySshUsername,
sshHostname = mySshHostname,
sshClientDir = "C:\\Program Files (x86)\\PuTTY")
In some cases, you may find that environment variables needed by Hadoop are not set in the remote sessions run on the sshHostname computer. This may be because a different profile or startup script is being read on ssh login. You can specify the appropriate profile script by specifying the sshProfileScript argument to RxSpark; this should be an absolute path:
myHadoopCluster <- RxSpark(
hdfsShareDir = myHdfsShareDir,
shareDir = myShareDir,
sshUsername = mySshUsername,
sshHostname = mySshHostname,
sshProfileScript = "/home/user1/.bash_profile",
sshSwitches = mySshSwitches)
Now that you have defined your compute context, make it the active compute context using the rxSetComputeContext function:
rxSetComputeContext(myHadoopCluster)
Use RxSpark Compute Context in Persistent Mode
By default, with RxSpark Compute Context, a new Spark application is launched when a job starts and is terminated when the job completes. To avoid the overhead of Spark initialization on launching time, the persistentRun mode (experimental) is introduced. If you set persistentRun as “TRUE”, the Spark application (and associated processes) persists across jobs until idleTimeout or the rxStopEngine is called explicitly:
myHadoopCluster <- RxSpark(persistentRun = TRUE, idleTimeout = 600)
The idleTimeout is the number of seconds of being idle before the system kills the Spark process.
If persistentRun mode is enabled, then the RxSpark compute context cannot be a Non-Waiting Compute Context. See Non-Waiting Compute Context.
Use a local compute context
At times, it may be more efficient to perform smaller computations on the local node rather than using Spark. You can easily do this, accessing the same data from the HDFS file system. When working with the local compute context, you need to specify the name of a specific data file. The same basic code is then used to run the analysis; simply change the compute context to a local context. (Note that this will not work if you are accessing the Hadoop cluster via a client.)
rxSetComputeContext("local")
inputFile <-file.path(bigDataDirRoot,"AirlineDemoSmall/AirlineDemoSmall.csv")
airDSLocal <- RxTextData(file = inputFile,
missingValueString = "M",
colInfo = colInfo, fileSystem = hdfsFS)
adsSummary <- rxSummary(~ArrDelay+CRSDepTime+DayOfWeek,
data = airDSLocal)
adsSummary
The results are the same as doing the computations across the nodes with the RxSpark compute context:
Call:
rxSummary(formula = ~ArrDelay + CRSDepTime + DayOfWeek, data = airDS)
Summary Statistics Results for: ~ArrDelay + CRSDepTime + DayOfWeek
Data: airDS (RxTextData Data Source)
File name: /share/AirlineDemoSmall/AirlineDemoSmall.csv
Number of valid observations: 6e+05
Name Mean StdDev Min Max ValidObs MissingObs
ArrDelay 11.31794 40.688536 -86.000000 1490.00000 582628 17372
CRSDepTime 13.48227 4.697566 0.016667 23.98333 600000 0
Category Counts for DayOfWeek
Number of categories: 7
Number of valid observations: 6e+05
Number of missing observations: 0
DayOfWeek Counts
Monday 97975
Tuesday 77725
Wednesday 78875
Thursday 81304
Friday 82987
Saturday 86159
Sunday 94975
Now set the compute context back to the Hadoop cluster for further analyses:
rxSetComputeContext(myHadoopCluster)
Create a Non-Waiting Compute Context
When running all but the shortest analyses in Hadoop, it can be convenient to let Hadoop do its processing while returning control of your R session to you immediately. You can do this by specifying wait = FALSE in your compute context definition. By using our existing compute context as the first argument, all of the other settings will be carried over to the new compute context:
myHadoopNoWaitCluster <- RxSpark(myHadoopCluster, wait = FALSE)
rxSetComputeContext(myHadoopNoWaitCluster)
Once you have set your compute context to non-waiting, distributed RevoScaleR functions return relatively quickly with a jobInfo object, which you can use to track the progress of your job, and, when the job is complete, obtain the results of the job. For example, we can rerun our linear model in the non-waiting case as follows:
job1 <- rxLinMod(ArrDelay ~ DayOfWeek, data = airDS)
rxGetJobStatus(job1)
Right after submission, the job status will typically return "running". When the job status returns "finished", you can obtain the results using rxGetJobResults as follows:
arrDelayLm1 <- rxGetJobResults(job1)
summary(arrDelayLm1)
You should always assign the jobInfo object so that you can easily track your work, but if you forget, the most recent jobInfo object is saved in the global environment as the object rxgLastPendingJob. (By default, after you’ve retrieved your job results, the results are removed from the cluster. To have your job results remain, set the autoCleanup argument to FALSE in RxSpark.)
If after submitting a job in a non-waiting compute context, you decide you don’t want to complete the job, you can cancel the job using the rxCancelJob function:
job2 <- rxLinMod(ArrDelay ~ DayOfWeek, data = airDS)
rxCancelJob(job2)
The rxCancelJob function returns TRUE if the job is successfully canceled, FALSE otherwise.
For the remainder of this tutorial, we return to a waiting compute context:
rxSetComputeContext(myHadoopCluster)
Analyze a Large Data Set
Set up the sample airline dataset
We will now move to examples using a much larger version of the airline data set. The airline on-time data has been gathered by the Department of Transportation since 1987. The data through 2008 was used in the American Statistical Association Data Expo in 2009 (http://stat-computing.org/dataexpo/2009/). ASA describes the data set as follows:
The data consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, and takes up 1.6 gigabytes of space compressed and 12 gigabytes when uncompressed.
The airline on-time data set for 2012, consisting of 12 separate CSV files, is available online. We assume you have uploaded them to the /tmp directory on your edge node or a name node (although any directory visible as a native file system directory from the name node will work.)
As before, our first step is to copy the data into HDFS. We specify the location of your Hadoop-stored data for the airDataDir variable:
airDataDir <- file.path(bigDataDirRoot,"/airOnTime12/CSV")
rxHadoopMakeDir(airDataDir)
rxHadoopCopyFromLocal("/tmp/airOT2012*.csv", airDataDir)
The original CSV files have rather unwieldy variable names, so we supply a colInfo list to make them more manageable (we won’t use all of these variables in this manual, but you use the data sources created in this manual as you continue to explore distributed computing in the RevoScaleR Distributed Computing Guide:
airlineColInfo <- list(
MONTH = list(newName = "Month", type = "integer"),
DAY_OF_WEEK = list(newName = "DayOfWeek", type = "factor",
levels = as.character(1:7),
newLevels = c("Mon", "Tues", "Wed", "Thur", "Fri", "Sat",
"Sun")),
UNIQUE_CARRIER = list(newName = "UniqueCarrier", type =
"factor"),
ORIGIN = list(newName = "Origin", type = "factor"),
DEST = list(newName = "Dest", type = "factor"),
CRS_DEP_TIME = list(newName = "CRSDepTime", type = "integer"),
DEP_TIME = list(newName = "DepTime", type = "integer"),
DEP_DELAY = list(newName = "DepDelay", type = "integer"),
DEP_DELAY_NEW = list(newName = "DepDelayMinutes", type =
"integer"),
DEP_DEL15 = list(newName = "DepDel15", type = "logical"),
DEP_DELAY_GROUP = list(newName = "DepDelayGroups", type =
"factor",
levels = as.character(-2:12),
newLevels = c("< -15", "-15 to -1","0 to 14", "15 to 29",
"30 to 44", "45 to 59", "60 to 74",
"75 to 89", "90 to 104", "105 to 119",
"120 to 134", "135 to 149", "150 to 164",
"165 to 179", ">= 180")),
ARR_DELAY = list(newName = "ArrDelay", type = "integer"),
ARR_DELAY_NEW = list(newName = "ArrDelayMinutes", type =
"integer"),
ARR_DEL15 = list(newName = "ArrDel15", type = "logical"),
AIR_TIME = list(newName = "AirTime", type = "integer"),
DISTANCE = list(newName = "Distance", type = "integer"),
DISTANCE_GROUP = list(newName = "DistanceGroup", type =
"factor",
levels = as.character(1:11),
newLevels = c("< 250", "250-499", "500-749", "750-999",
"1000-1249", "1250-1499", "1500-1749", "1750-1999",
"2000-2249", "2250-2499", ">= 2500")))
varNames <- names(airlineColInfo)
We create a data source using these definitions as follows:
hdfsFS <- RxHdfsFileSystem()
bigAirDS <- RxTextData( airDataDir,
colInfo = airlineColInfo,
varsToKeep = varNames,
fileSystem = hdfsFS )
Estimate a linear model with a big data set
We fit our first model to the large airline data much as we created the linear model for the AirlineDemoSmall data, and we time it to see how long it takes to fit the model on this large data set:
system.time(
delayArr <- rxLinMod(ArrDelay ~ DayOfWeek, data = bigAirDS,
cube = TRUE)
)
To see a summary of your results:
print(
summary(delayArr)
)
You should see the following results:
Call:
rxLinMod(formula = ArrDelay ~ DayOfWeek, data = bigAirDS, cube = TRUE)
Cube Linear Regression Results for: ArrDelay ~ DayOfWeek
Data: bigAirDS (RxTextData Data Source)
File name: /share/airOnTime12/CSV
Dependent variable(s): ArrDelay
Total independent variables: 7
Number of valid observations: 6005381
Number of missing observations: 91381
Coefficients:
Estimate Std. Error t value Pr(>|t|) | Counts
DayOfWeek=Mon 3.54210 0.03736 94.80 2.22e-16 *** | 901592
DayOfWeek=Tues 1.80696 0.03835 47.12 2.22e-16 *** | 855805
DayOfWeek=Wed 2.19424 0.03807 57.64 2.22e-16 *** | 868505
DayOfWeek=Thur 4.65502 0.03757 123.90 2.22e-16 *** | 891674
DayOfWeek=Fri 5.64402 0.03747 150.62 2.22e-16 *** | 896495
DayOfWeek=Sat 0.91008 0.04144 21.96 2.22e-16 *** | 732944
DayOfWeek=Sun 2.82780 0.03829 73.84 2.22e-16 *** | 858366
---
Signif. codes: 0 `***` 0.001 `**` 0.01 `*` 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 35.48 on 6005374 degrees of freedom
Multiple R-squared: 0.001827 (as if intercept included)
Adjusted R-squared: 0.001826
F-statistic: 1832 on 6 and 6005374 DF, p-value: < 2.2e-16
Condition number: 1
Notice that the results indicate we have processed all the data, six million observations, using all the .csv files in the specified directory. Notice also that because we specified cube = TRUE, we have an estimated coefficient for each day of the week (and not the intercept).
Import data as composite XDF files
As we have seen, you can analyze CSV files directly with RevoScaleR on Hadoop, but the analysis can be done more quickly if the data is stored in a more efficient format. The RevoScaleR .xdf format is extremely efficient, but is modified somewhat for HDFS so that individual files remain within a single HDFS block. (The HDFS block size varies from installation to installation but is typically either 64 MB or 128 MB.) When you use rxImport on Hadoop, you specify an RxTextData data source such as bigAirDS as the inData and an RxXdfData data source with fileSystem set to an HDFS file system as the outFile argument to create a set of composite .xdf files. The RxXdfData object can then be used as the data argument in subsequent RevoScaleR analyses.
For our airline data, we define our RxXdfData object as follows (the second “user” should be replaced by your actual user name on the Hadoop cluster):
bigAirXdfName <- "/user/RevoShare/user/AirlineOnTime2012"
airData <- RxXdfData( bigAirXdfName,
fileSystem = hdfsFS )
We set a block size of 250000 rows and specify that we read all the data by specifying
numRowsToRead = -1:
blockSize <- 250000
numRowsToRead = -1
We then import the data using rxImport:
rxImport(inData = bigAirDS,
outFile = airData,
rowsPerRead = blockSize,
overwrite = TRUE,
numRows = numRowsToRead )
Write to CSV in HDFS
If you converted your CSV to XDF to take advantage of the efficiency while running analyses, but now wish to convert your data back to CSV you can do so using rxDataStep.
To create a folder of CSV files, first create an RxTextData object using a directory name as the file argument; this represents the folder in which to create the CSV files. This directory is created when you run the rxDataStep. Then, point to this RxTextData object in the outFile argument of the rxDataStep. Each CSV created will be named based on the directory name and followed by a number.
Suppose we want to write out a folder of CSV in HDFS from our airData composite XDF after we performed the logistic regression and prediction, so that the new CSV files contain the predicted values and residuals. We can do this as follows:
airDataCsvDir <- file.path("user/RevoShare/user/airDataCsv")
airDataCsvDS <- RxTextData(airDataCsvDir,fileSystem=hdfsFS)
rxDataStep(inData=airData, outFile=airDataCsvDS)
You notice that the rxDataStep wrote out one CSV for every .xdfd file in the input composite XDF file. This is the default behavior for writing CSV from composite XDF to HDFS when the compute context is set to RxSpark.
Alternatively, you could switch your compute context back to “local” when you are done performing your analyses and take advantage of two arguments within RxTextData that give you slightly more control when writing out CSV files to HDFS: createFileSet and rowsPerOutFile. When createFileSet is set to TRUE a folder of CSV files is written to the directory you specify. When createFileSet is set to FALSE a single CSV file is written. The second argument, rowsPerOutFile, can be set to an integer to indicate how many rows to write to each CSV file when createFileSet is TRUE. Returning to the example above, suppose we wanted to write out a folder of CSVs but we wanted to write out the airData but into only six CSV files.
rxSetComputeContext("local")
airDataCsvRowsDir <-file.path("/user/RevoShare/MRS/airDataCsvRows")
rxHadoopMakeDir(airDataCsvRowsDir)
airDataCsvRowsDS <- RxTextData(airDataCsvRowsDir, fileSystem=hdfsFS, createFileSet=TRUE, rowsPerOutFile=1000000)
rxDataStep(inData=airData, outFile=airDataCsvRowsDS)
When using an RxSpark compute context, createFileSet defaults to TRUE and rowsPerOutFile has no effect. Thus if you wish to create a single CSV or customize the number of rows per file you must perform the rxDataStep in a local compute context (data can still be in HDFS).
Estimate a linear model using composite XDF files
Now we can re-estimate the same linear model, using the new, faster data source:
system.time(
delayArr <- rxLinMod(ArrDelay ~ DayOfWeek, data = airData,
cube = TRUE)
)
To see a summary of your results:
print(
summary(delayArr)
)
You should see the following results (which should match the results we found for the CSV files above):
Call:
rxLinMod(formula = ArrDelay ~ DayOfWeek, data = airData, cube = TRUE)
Cube Linear Regression Results for: ArrDelay ~ DayOfWeek
Data: airData (RxXdfData Data Source)
File name: /user/RevoShare/v7alpha/AirlineOnTime2012
Dependent variable(s): ArrDelay
Total independent variables: 7
Number of valid observations: 6005381
Number of missing observations: 91381
Coefficients:
Estimate Std. Error t value Pr(>|t|) | Counts
DayOfWeek=Mon 3.54210 0.03736 94.80 2.22e-16 *** | 901592
DayOfWeek=Tues 1.80696 0.03835 47.12 2.22e-16 *** | 855805
DayOfWeek=Wed 2.19424 0.03807 57.64 2.22e-16 *** | 868505
DayOfWeek=Thur 4.65502 0.03757 123.90 2.22e-16 *** | 891674
DayOfWeek=Fri 5.64402 0.03747 150.62 2.22e-16 *** | 896495
DayOfWeek=Sat 0.91008 0.04144 21.96 2.22e-16 *** | 732944
DayOfWeek=Sun 2.82780 0.03829 73.84 2.22e-16 *** | 858366
---
Signif. codes: 0 `***` 0.001 `**` 0.01 `*` 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 35.48 on 6005374 degrees of freedom
Multiple R-squared: 0.001827 (as if intercept included)
Adjusted R-squared: 0.001826
F-statistic: 1832 on 6 and 6005374 DF, p-value: < 2.2e-16
Condition number: 1
Handling larger linear models
The data set contains a variable UniqueCarrier, which contains airline codes for 15 carriers. Because the RevoScaleR Compute Engine handles factor variables so efficiently, we can do a linear regression looking at the Arrival Delay by Carrier. This takes a little longer than the previous analysis, because we are estimating 15 instead of 7 parameters.
delayCarrier <- rxLinMod(ArrDelay ~ UniqueCarrier,
data = airData, cube = TRUE)
Next, sort the coefficient vector so that we can see the airlines with the highest and lowest values.
dcCoef <- sort(coef(delayCarrier))
Next, see how the airlines rank in arrival delays:
print(
dcCoef
)
The result is:
UniqueCarrier=AS UniqueCarrier=US UniqueCarrier=DL UniqueCarrier=FL
-1.9022483 -1.2418484 -0.8013952 -0.2618747
UniqueCarrier=HA UniqueCarrier=YV UniqueCarrier=WN UniqueCarrier=MQ
0.3143564 1.3189086 2.8824285 2.8843612
UniqueCarrier=OO UniqueCarrier=VX UniqueCarrier=B6 UniqueCarrier=UA
3.9846305 4.0323447 4.8508215 5.0905867
UniqueCarrier=AA UniqueCarrier=EV UniqueCarrier=F9
6.3980937 7.2934991 7.8788931
Notice that Alaska Airlines comes in as the best in on-time arrivals, while Frontier is at the bottom of the list. We can see the difference by subtracting the coefficients:
print(
sprintf("Frontier's additional delay compared with Alaska: %f",
dcCoef["UniqueCarrier=F9"]-dcCoef["UniqueCarrier=AS"])
)
which results in:
[1] "Frontier's additional delay compared with Alaska: 9.781141"
A Large Data Logistic Regression
Our airline data includes a logical variable, ArrDel15, that tells whether a flight is 15 or more minutes late. We can use this as the response for a logistic regression to model the likelihood that a flight will be late given the day of the week:
logitObj <- rxLogit(ArrDel15 ~ DayOfWeek, data = airData)
logitObj
which results in:
Logistic Regression Results for: ArrDel15 ~ DayOfWeek
Data: airData (RxXdfData Data Source)
File name: AirOnTime2012.xdf
Dependent variable(s): ArrDel15
Total independent variables: 8 (Including number dropped: 1)
Number of valid observations: 6005381
Number of missing observations: 91381
Coefficients:
ArrDel15
(Intercept) -1.61122563
DayOfWeek=Mon 0.04600367
DayOfWeek=Tues -0.08431814
DayOfWeek=Wed -0.07319133
DayOfWeek=Thur 0.12585721
DayOfWeek=Fri 0.18987931
DayOfWeek=Sat -0.13498591
DayOfWeek=Sun Dropped
Prediction on Large Data
You can predict (or score) from a fitted model on Hadoop using rxPredict. In this example, we compute predicted values and their residuals for the logistic regression in the previous section. These new prediction variables are output to a new composite XDF in HDFS. (As in an earlier example, the second “user” in the airDataPredDir path should be changed to the actual user name of your Hadoop user.)
airDataPredDir <- "/user/RevoShare/user/airDataPred"
airDataPred <- RxXdfData(airDataPredDir, fileSystem=hdfsFS)
rxPredict(modelObject=logitObj, data=airData, outData=airDataPred,
computeResiduals=TRUE)
By putting in a call to rxGetVarInfo we see that two variables, ArrDel15_Pred and ArrDel15_Resid were written to the airDataPred composite XDF. If in addition to the prediction variables, we wanted to have the variables used in the model written to our outData we would need to add the writeModelVars=TRUE to our rxPredict call.
Alternatively, we can update the airData to include the predicted values and residuals by not specifying an outData argument, which is NULL by default. Since the airData composite XDF already exists, we would need to add overwrite=TRUE to the rxPredict call.
An RxTextData object can be used as the input data for rxPredict on Hadoop, but only XDF can be written to HDFS. When using a CSV file or directory of CSV files as the input data to rxPredict the outData argument must be set to an RxXdfData object.
Performing Data Operations on Large Data
To create or modify data in HDFS on Hadoop, we can use the rxDataStep function. Suppose we want to repeat the analyses with a “cleaner” version of the large airline dataset. To do this we keep only the flights having arrival delay information and flights that did not depart more than one hour early. We can put a call to rxDataStep to output a new composite XDF to HDFS. (As in an earlier example, the second “user” in the newAirDir path should be changed to the actual user name of your Hadoop user.)
newAirDir <- "/user/RevoShare/user/newAirData"
newAirXdf <- RxXdfData(newAirDir,fileSystem=hdfsFS)
rxDataStep(inData = airData, outFile = newAirXdf,
rowSelection = !is.na(ArrDelay) & (DepDelay > -60))
To modify an existing composite XDF using rxDataStep set the overwrite argument to TRUE and either omit the outFile argument or set it to the same data source specified for inData.
Parallel Partial Decision Forests
As mentioned earlier, both rxDForest and rxBTrees are available on Hadoop--these provide two different methods for fitting classification and regression decision forests. In the default case, both algorithms generate multiple stages in the Spark job, and thus can tend to incur significant overhead, particularly with smaller data sets. However, the scheduleOnce argument to both functions allows the computation to be performed via rxExec, which generates only a single stage in the Spark job, and thus incurs significantly less overhead. When using the scheduleOnce argument, you can specify the number of trees to be grown within each rxExec task using the forest function’s nTree argument together with rxExec’s rxElemArgs function, as in the following regression example using the built-in claims data:
file.name <- "claims.xdf"
sourceFile <- system.file(file.path("SampleData", file.name),
package="RevoScaleR")
inputDir <- "/share/claimsXdf"
rxHadoopMakeDir(inputDir)
rxHadoopCopyFromLocal(sourceFile, inputDir)
input <- RxXdfData(file.path(inputDir, file.name,
fsep="/"),fileSystem=hdfsFS)
partial.forests <-rxDForest(formula = age ~ car.age +
type + cost + number, data = input,
minSplit = 5, maxDepth = 2,
nTree = rxElemArg(rep(2,8)), seed = 0,
maxNumBins = 200, computeOobError = -1,
reportProgress = 2, verbose = 0,
scheduleOnce = TRUE))
Equivalently, you can use nTree together with rxExec’s timesToRun argument:
partial.forests <-rxDForest(formula = age ~ car.age +
type + cost + number, data = input,
minSplit = 5, maxDepth = 2,
nTree = 2, seed = 0,
maxNumBins = 200, computeOobError = -1,
reportProgress = 2, verbose = 0,
scheduleOnce = TRUE, timesToRun = 8)
In this example, using scheduleOnce can be up to 45 times faster than the default, and so we recommend it for decision forests with small to moderate-sized data sets.
Similarly, the rxPredict methods for rxDForest and rxBTrees objects include the scheduleOnce argument, and should be used when using these methods on small to moderate-sized data sets.
Using Hive data
There are multiple ways to access and use data from Hive for analyses with RevoScaleR. Here are some general recommendations, assuming in each case that the data for analysis is defined by the results of a Hive query.
- If running from a remote client or edge node, and the data is modest, then use RxOdbcData to stream results, or land them as XDF in the local file system, for subsequent analysis in a local compute context.
- If the data is large, then use the Hive command-line interface (hive or beeline) from an edge node to run the Hive query with results spooled to a text file on HDFS for subsequent analysis in a distributed fashion using the HadoopMR or Spark compute contexts.
Here’s how to get started with each of these approaches.
Accessing Hive data via ODBC
Start by following your Hadoop vendor’s recommendations for accessing Hive via ODBC from a remote client or edge node. Once you have the prerequisite software installed and have run a smoke test to verify connectivity, then accessing data in Hive from Machine Learning Server is just like accessing data from any other data source.
mySQL = "SELECT * FROM CustData"
myDS <- RxOdbcData(sqlQuery = mySQL, connectionString = "DSN=HiveODBC")
xdfFile <- RxXdfData("dataFromHive.xdf")
rxImport(myDS, xdfFile, stringsAsFactors = TRUE, overwrite=TRUE)
Accessing Hive data via an Export to Text Files
This approach uses the Hive command-line interface to first spool the results to text files in the local or HDFS file systems and then analyze them in a distributed fashion using local or HadoopMR/Spark compute contexts.
First, check with your Hadoop system administrator find out whether the ‘hive’ or ‘beeline’ command should be used to run a Hive query from the command line, and obtain the needed credentials for access.
Next, try out your hive/beeline command from the Linux command line to make sure it works and that you have the proper authentication and syntax.
Finally, run the command from within R using R’s system command.
Here are some sample R statements that output a Hive query to the local and HDFS file systems:
Run a Hive query and dump results to a local text file (edge node)
system('hive –e "select * from emp" > myFile.txt')
Run the same query using beeline’s csv2 output format
system('beeline -u "jdbc:hive2://.." --outputformat=csv2 -e "select * from emp" > myFile.txt')
Run the same query but dump the results to CSV in HDFS
system('beeline -u "jdbc:hive2://.." –e "insert overwrite directory \'/your-hadoop-dir\' row format delimited fields terminated by \',\' select * from emp"')
After you’ve exported the query results to a text file, it can be streamed directly as input to a RevoScaleR analysis routine via use of the RxTextdata data source, or imported to XDF for improved performance upon repeated access. Here’s an example assuming output was spooled as text to HDFS:
hiveOut <-file.path(bigDataDirRoot,"myHiveData")
myHiveData <- RxTextData(file = hiveOut, fileSystem = hdfsFS)
rxSummary(~var1 + var2+ var3+ var4, data=myHiveData)
Clean up Data
You can run the following commands to clean up data in this tutorial:
rxHadoopRemoveDir("/share/AirlineDemoSmall")
rxHadoopRemoveDir("/share/airOnTime12")
rxHadoopRemoveDir("/share/claimsXdf")
rxHadoopRemoveDir("/user/RevoShare/<user>/AirlineOnTime2012")
rxHadoopRemoveDir("/user/RevoShare/<user>/airDataCsv")
rxHadoopRemoveDir("/user/RevoShare/<user>/airDataCsvRows")
rxHadoopRemoveDir("/user/RevoShare/<user>/newAirData")
rxHadoopRemoveDir("/user/RevoShare/<user>/airDataPred")