How to merge data using rxMerge in RevoScaleR
Important
This content is being retired and may not be updated in the future. The support for Machine Learning Server will end on July 1, 2022. For more information, see What's happening to Machine Learning Server?
Merging allows you to combine the information from two data sets into a third data set that can be used for subsequent analysis. One example is merging account information such as account number and billing address with transaction data such as account number and purchase details to create invoices. In this case, the two files are merged on the common information, that is, the account number.
In RevoScaleR, you merge .xdf files and/or data frames with the rxMerge function. This function supports a number of types of merge that are best illustrated by example. The available types are as follows:
- Inner
- Outer: left, right, and full
- One-to-One
- Union
We describe each of these types in the following sections.
Inner merge
In the default inner merge type, one or more merge key columns is specified, and only those observations for which the specified key columns match exactly are combined to create new observations in the merged data set.
Suppose we have the following data from a dentist’s office:
AccountNo Billee Patient
0538 Rich C 1
0538 Rich C 2
0538 Rich C 3
0763 Tom D 1
1534 Kath P 1
We can create a data frame with this information:
# Merging Data
acct <- c(0538, 0538, 0538, 0763, 1534)
billee <- c("Rich C", "Rich C", "Rich C", "Tom D", "Kath P")
patient <- c(1, 2, 3, 1, 1)
acctDF<- data.frame( acct=acct, billee= billee, patient=patient)
Suppose further we have the following information about procedures performed:
AccountNo Patient Procedure
0538 3 OffVisit
0538 2 AdultPro
0538 2 OffVisit
0538 3 2SurfCom
0763 1 OffVisit
0763 1 AdultPro
0763 2 OffVisit
This data is put into another data frame:
acct <- c(0538, 0538, 0538, 0538, 0763, 0763, 0763)
patient <- c(3, 2, 2, 3, 1, 1, 2)
type <- c("OffVisit", "AdultPro", "OffVisit", "2SurfCom", "OffVisit", "AdultPro", "OffVisit")
procedureDF <- data.frame(acct=acct, patient=patient, type=type)
Then we use rxMerge to create an inner merge matching on the columns acct and patient:
rxMerge(inData1 = acctDF, inData2 = procedureDF, type = "inner",
matchVars=c("acct", "patient"))
acct billee patient type
1 538 Rich C 2 AdultPro
2 538 Rich C 2 OffVisit
3 538 Rich C 3 OffVisit
4 538 Rich C 3 2SurfCom
5 763 Tom D 1 OffVisit
6 763 Tom D 1 AdultPro
Because the patient 1 in account 538 and patient 1 in account 1534 had no visits, they are omitted from the merged file. Similarly, patient 2 in account 763 had a visit, but does not have any information in the accounts file, so it to is omitted from the merged data set. Also, note that the two input data files are automatically sorted on the merge keys before merging.
Outer merge
There are three types of outer merge: left, right, and full. In a left outer merge, all the lines from the first file are present in the merged file, either matched with lines from the second file that match on the key columns, or if no match, filled out with missing values. A right outer merge is similar, except all the lines from the second file are present, either matched with matching lines from the first file or filled out with missings. A full outer merge includes all lines in both files, either matched or filled out with missings. We can use the same dentist data to illustrate the various types of outer merge:
rxMerge(inData1 = acctDF, inData2 = procedureDF, type = "left",
matchVars=c("acct", "patient"))
acct billee patient type
1 538 Rich C 1 <NA>
2 538 Rich C 2 AdultPro
3 538 Rich C 2 OffVisit
4 538 Rich C 3 OffVisit
5 538 Rich C 3 2SurfCom
6 763 Tom D 1 OffVisit
7 763 Tom D 1 AdultPro
8 1534 Kath P 1 <NA>
rxMerge(inData1 = acctDF, inData2 = procedureDF, type = "right",
matchVars=c("acct", "patient"))
acct billee patient type
1 538 Rich C 2 AdultPro
2 538 Rich C 2 OffVisit
3 538 Rich C 3 OffVisit
4 538 Rich C 3 2SurfCom
5 763 Tom D 1 OffVisit
6 763 Tom D 1 AdultPro
7 763 <NA> 2 OffVisit
rxMerge(inData1 = acctDF, inData2 = procedureDF, type = "full",
matchVars=c("acct", "patient"))
acct billee patient type
1 538 Rich C 1 <NA>
2 538 Rich C 2 AdultPro
3 538 Rich C 2 OffVisit
4 538 Rich C 3 OffVisit
5 538 Rich C 3 2SurfCom
6 763 Tom D 1 OffVisit
7 763 Tom D 1 AdultPro
8 763 <NA> 2 OffVisit
9 1534 Kath P 1 <NA>
## One-to-one merge
In the one-to-one merge type, the first observation in the first data set is paired with the first observation in the second data set to create the first observation in the merged data set, the second observation is paired with the second observation to create the second observation in the merged data set, and so on. The data sets must have the same number of rows. It is equivalent to using *append=*"*cols*" in a data step.
For example, suppose our first data set contains three observations as follows:
1 a x 2 b y 3 c z
Create a data frame with this data:
myData1 <- data.frame( x1 = 1:3, y1 = c("a", "b", "c"), z1 = c("x", "y", "z"))
Suppose our second data set contains three different variables:
101 d u
102 e v
103 f w
Create a data frame with this data:
myData2 <- data.frame( x2 = 101:103, y2 = c("d", "e", "f"),
z2 = c("u", "v", "w"))
A one-to-one merge of these two data sets combines the columns from the two data sets into one data set:
rxMerge(inData1 = myData1, inData2 = myData2, type = "oneToOne")
x1 y1 z1 x2 y2 z2
1 1 a x 101 d u
2 2 b y 102 e v
3 3 c z 103 f w
Union merge
A union merge is simply the concatenation of two files with the same set of variables. It is equivalent to using append="rows" in a data step.
Using the example from one-to-one merge, we rename the variables in the second data frame to be the same as in the first:
names(myData2) \<- c("x1", "x2", "x3")
Then use a union merge:
rxMerge(inData1 = myData1, inData2 = myData2, type = "union")
x1 y1 z1
1 1 a x
2 2 b y
3 3 c z
4 101 d u
5 102 e v
6 103 f w
Using rxMerge with .xdf files
You can use rxMerge with a combination of .xdf files or data frames. For example, you specify the two the paths for two input .xdf files as the inData1 and inData2 arguments, and the path to an output file as the outFile argument. As a simple example, we can stack two copies of the claims data using the union merge type as follows:
claimsXdf <- file.path(rxGetOption("sampleDataDir"), "claims.xdf")
rxMerge(inData1 = claimsXdf, inData2 = claimsXdf, outFile = "claimsTwice.xdf",
type = "union")
A new .xdf file is created containing twice the number of rows of the original claims file.
You can also merge an .xdf file and data frame into a new .xdf file. For example, suppose that you would like to add a variable on state expenditure on education into each observation in the censusWorkers sample .xdf file. First, take a quick look at the state variable in the .xdf file:
censusWorkers <- file.path(rxGetOption("sampleDataDir"), "CensusWorkers.xdf")
rxGetVarInfo(censusWorkers, varsToKeep = "state")
Var 1: state
3 factor levels: Connecticut Indiana Washington
We can create a data frame with per capita educational expenditures for the same three states. (Note that because R alphabetizes factor levels by default, the factor levels in the data frame will be in the same order as those in the .xdf file).
educExp <- data.frame(state=c("Connecticut", "Washington", "Indiana"),
EducExp = c(1795.57,1170.46,1289.66 ))
Now use rxMerge, matching by the variable state:
rxMerge(inData1 = censusWorkers, inData2 = educExp,
outFile="censusWorkersEd.xdf", matchVars = "state", overwrite=TRUE)
The new .xdf file has an additional variable, EducExp:
rxGetVarInfo("censusWorkersEd.xdf")
Var 1: age, Age
Type: integer, Low/High: (20, 65)
Var 2: incwage, Wage and salary income
Type: integer, Low/High: (0, 354000)
Var 3: perwt, Type: integer, Low/High: (2, 168)
Var 4: sex, Sex
2 factor levels: Male Female
Var 5: wkswork1, Weeks worked last year
Type: integer, Low/High: (21, 52)
Var 6: state
3 factor levels: Connecticut Indiana Washington
Var 7: EducExp, Type: numeric, Low/High: (1170.4600, 1795.5700)
See Also
Machine Learning Server
Install Machine Learning Server on Windows
Install Machine Learning Server on Linux
Install Machine Learning Server on Hadoop