How to sort data using rxSort in RevoScaleR
Important
This content is being retired and may not be updated in the future. The support for Machine Learning Server will end on July 1, 2022. For more information, see What's happening to Machine Learning Server?
Many analysis and plotting algorithms require as a first step that the data be sorted. Sorting a massive data set is both memory-intensive and time-consuming, but the rxSort function provides an efficient solution. The rxSort function allows you to sort by one or many keys. A stable sorting routine is used, so that, in the case of ties, remaining columns are left in the same order as they were in the original data set.
As a simple example, we can sort the census worker data by age and incwage. We will sort first by age, using the default increasing sort, and then by incwage, which we will sort in decreasing order:
# Sorting Data
censusWorkers <- file.path(rxGetOption("sampleDataDir"), "CensusWorkers.xdf")
outXDF <- "censusWorkersSorted.xdf"
rxSort(inData = censusWorkers, outFile = outXDF,
sortByVars=c("age", "incwage"), decreasing=c(FALSE, TRUE))
The first few lines of the sorted file can be viewed as follows:
rxGetInfo(outXDF, numRows=10)
File name: C:\YourOutputPath\censusWorkersSorted.xdf
Number of observations: 351121
Number of variables: 6
Number of blocks: 6
Compression type: zlib
Data (10 rows starting with row 1):
age incwage perwt sex wkswork1 state
1 20 336000 3 Male 40 Washington
2 20 336000 23 Male 46 Washington
3 20 336000 11 Male 52 Washington
4 20 314000 33 Male 52 Indiana
5 20 168000 13 Male 24 Indiana
6 20 163000 16 Male 26 Washington
7 20 144000 27 Female 24 Indiana
8 20 96000 21 Male 48 Washington
9 20 93000 24 Male 24 Indiana
10 20 90000 6 Male 52 Washington
If the sort keys contain missing values, you can use the missingsLow flag to specify whether they are sorted as low values (missingsLow=TRUE, the default) or high values (missingsLow=FALSE).
Remove duplicates during sort
In many situations, you are sorting a large data set by a particular key, for example, userID, but are looking for a sorted list of unique userIDs. The removeDupKeys argument to rxSort allows you to remove the duplicate entries from a sorted list. This argument is supported only for type="auto" and type="mergeSort"; it is ignored for type="varByVar".
When you use removeDupKeys=TRUE, the first record containing a unique combination of the sortByVars is retained; subsequent matching records are omitted from the sorted results, but, if desired, a count of the matching records is maintained in a new dupFreqVar output column. For example, the following artificial data set simulates a small amount of transaction data, with a user name, a state, and a transaction amount. When we sort by the variables users and state and specify removeDupKeys=TRUE, the transAmt shown for duplicate entries is the transaction amount for the first transaction encountered:
set.seed(17)
users <- sample(c("Aiden", "Ella", "Jayden", "Ava", "Max", "Grace", "Riley",
"Lolita", "Liam", "Emma", "Ethan", "Elizabeth", "Jack",
"Genevieve", "Avery", "Aurora", "Dylan", "Isabella",
"Caleb", "Bella"), 100, replace=TRUE)
state <- sample(c("Washington", "California", "Texas", "North Carolina",
"New York", "Massachusetts"), 100, replace=TRUE)
transAmt <- round(runif(100)*100, digits=3)
df <- data.frame(users=users, state=state, transAmt=transAmt)
rxSort(df, sortByVars=c("users", "state"), removeDupKeys=TRUE,
dupFreqVar = "DUP_COUNT")
Number of rows written to file: 66, Variable(s): users, state, transAmt, DUP_COUNT, Total number of rows in file: 66
Time to sort data file: 0.100 seconds
users state transAmt DUP_COUNT
1 Aiden New York 11.010 1
2 Aiden North Carolina 73.307 1
3 Aiden Texas 8.037 2
4 Aurora California 8.787 1
5 Aurora Massachusetts 55.187 1
6 Aurora New York 91.648 1
7 Aurora Texas 30.566 1
8 Aurora Washington 27.374 1
9 Ava California 70.638 2
10 Ava Massachusetts 82.916 2
11 Ava New York 45.683 2
12 Ava North Carolina 51.748 1
13 Ava Texas 52.674 1
14 Avery California 9.756 4
15 Avery Massachusetts 63.715 1
16 Avery New York 93.430 1
17 Avery North Carolina 1.889 2
18 Bella California 60.258 2
19 Bella Texas 32.684 1
20 Bella Washington 60.230 2
21 Caleb Massachusetts 94.527 1
22 Caleb North Carolina 89.259 2
23 Dylan California 73.665 1
24 Dylan North Carolina 98.384 1
25 Dylan Texas 27.067 1
26 Dylan Washington 82.141 3
27 Elizabeth California 95.497 2
28 Elizabeth New York 35.546 3
29 Elizabeth North Carolina 18.892 2
30 Ella California 11.644 1
31 Ella North Carolina 72.289 1
32 Ella Washington 66.453 2
33 Emma Massachusetts 28.502 1
34 Emma North Carolina 63.067 1
35 Ethan Massachusetts 31.480 1
36 Ethan New York 95.639 1
37 Ethan North Carolina 6.561 1
38 Ethan Texas 29.963 1
39 Ethan Washington 44.187 2
40 Genevieve Massachusetts 90.783 1
41 Grace New York 18.232 1
42 Grace North Carolina 5.355 2
43 Grace Washington 91.084 1
44 Isabella New York 4.115 1
45 Isabella North Carolina 12.942 2
46 Isabella Texas 2.227 2
47 Jack California 40.905 1
48 Jack Massachusetts 98.080 2
49 Jack New York 8.071 2
50 Jack North Carolina 11.304 3
51 Jack Texas 18.795 2
52 Jayden Massachusetts 83.949 1
53 Jayden North Carolina 67.769 1
54 Jayden Washington 4.360 1
55 Liam California 3.300 1
56 Liam Massachusetts 87.585 1
57 Liam New York 96.599 1
58 Liam North Carolina 32.997 1
59 Lolita California 18.102 1
60 Lolita Washington 30.649 2
61 Max Massachusetts 21.683 2
62 Max New York 14.852 2
63 Max North Carolina 79.982 2
64 Max Texas 66.749 2
65 Max Washington 67.326 2
66 Riley New York 20.527 2
Removing duplicates can be a useful way to reduce the size of a data set without losing information of interest. For example, consider an analysis of using data from the sample AirlineDemoSmall xdf file. It has 600,000 observations and contains the variables DayOfWeek, CRSDepTime, and ArrDelay. We can create a smaller data set reducing the number of observations, and adding a variable that contains the frequency of the duplicated observation.
sampleDataDir <- rxGetOption("sampleDataDir")
airDemo <- file.path(sampleDataDir, "AirlineDemoSmall.xdf")
airDedup <- file.path(tempdir(), "rxAirDedup.xdf")
rxSort(inData = airDemo, outFile = airDedup,
sortByVars = c("DayOfWeek", "CRSDepTime", "ArrDelay"),
removeDupKeys = TRUE, dupFreqVar = "FreqWt")
rxGetInfo(airDedup)
The new data file contains about 1/3 of the observations and one additional variable:
File name: C:\YourTempDir\rxAirDedup.xdf
Number of observations: 232451
Number of variables: 4
Number of blocks: 2
Compression type: zlib
By using the frequency weights argument, we can use many of the RevoScaleR analysis functions on this smaller data set and get same results as we would using the full data set. For example, a linear model for Arrival Delay can be specified as follows, using the fweights argument:
linModObj <- rxLinMod(ArrDelay~CRSDepTime + DayOfWeek, data = airDedup,
fweights = "FreqWt")
summary(linModObj)
Call:
rxLinMod(formula = ArrDelay ~ CRSDepTime + DayOfWeek, data = airDedup,
fweights = "FreqWt")
Linear Regression Results for: ArrDelay ~ CRSDepTime + DayOfWeek
File name: C:\YourTempDir\rxAirDedup.xdf
Frequency weights: FreqWt
Dependent variable(s): ArrDelay
Total independent variables: 9 (Including number dropped: 1)
Sum of weights of valid observations: 582628
Number of missing observations: 3503
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.19458 0.20413 -15.650 2.22e-16 ***
CRSDepTime 0.97862 0.01126 86.948 2.22e-16 ***
DayOfWeek=Monday 2.08100 0.18602 11.187 2.22e-16 ***
DayOfWeek=Tuesday 1.34015 0.19881 6.741 1.58e-11 ***
DayOfWeek=Wednesday 0.15155 0.19679 0.770 0.441
DayOfWeek=Thursday -1.32301 0.19518 -6.778 1.22e-11 ***
DayOfWeek=Friday 4.80042 0.19452 24.679 2.22e-16 ***
DayOfWeek=Saturday 2.18965 0.19229 11.387 2.22e-16 ***
DayOfWeek=Sunday Dropped Dropped Dropped Dropped
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 40.39 on 582620 degrees of freedom
Multiple R-squared: 0.01465
Adjusted R-squared: 0.01464
F-statistic: 1238 on 7 and 582620 DF, p-value: < 2.2e-16
Condition number: 10.6542
Using the full data set, we get the following results:
linModObjBig <- rxLinMod(ArrDelay~CRSDepTime + DayOfWeek, data = airDemo)
summary(linModObjBig)
Call:
rxLinMod(formula = ArrDelay ~ CRSDepTime + DayOfWeek, data = airDemo)
Linear Regression Results for: ArrDelay ~ CRSDepTime + DayOfWeek
File name: C:\YourSampleDir\AirlineDemoSmall.xdf
Dependent variable(s): ArrDelay
Total independent variables: 9 (Including number dropped: 1)
Number of valid observations: 582628
Number of missing observations: 17372
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.19458 0.20413 -15.650 2.22e-16 ***
CRSDepTime 0.97862 0.01126 86.948 2.22e-16 ***
DayOfWeek=Monday 2.08100 0.18602 11.187 2.22e-16 ***
DayOfWeek=Tuesday 1.34015 0.19881 6.741 1.58e-11 ***
DayOfWeek=Wednesday 0.15155 0.19679 0.770 0.441
DayOfWeek=Thursday -1.32301 0.19518 -6.778 1.22e-11 ***
DayOfWeek=Friday 4.80042 0.19452 24.679 2.22e-16 ***
DayOfWeek=Saturday 2.18965 0.19229 11.387 2.22e-16 ***
DayOfWeek=Sunday Dropped Dropped Dropped Dropped
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 40.39 on 582620 degrees of freedom
Multiple R-squared: 0.01465
Adjusted R-squared: 0.01464
F-statistic: 1238 on 7 and 582620 DF, p-value: < 2.2e-16
Condition number: 10.6542
The rxQuantile Function and the Five-Number Summary
Sorting data is, in the general case, a prerequisite to finding exact quantiles, including medians. However, it is possible to compute approximate quantiles by counting binned data then computing a linear interpolation of the empirical cdf. If the data are integers, or can be converted to integers by exact multiplication, and integral bins are used, the computation is exact. The RevoScaleR function rxQuantile does this computation, and by default returns an approximate five-number summary:
# The rxQuantile Function and the Five-Number Summary
readPath <- rxGetOption("sampleDataDir")
AirlinePath <- file.path(readPath, "AirlineDemoSmall.xdf")
rxQuantile("ArrDelay", AirlinePath)
Rows Processed: 600000
0% 25% 50% 75% 100%
-86 -9 0 16 1490
See Also
Machine Learning Server
Install Machine Learning Server on Windows
Install Machine Learning Server on Linux
Install Machine Learning Server on Hadoop