RxHiveData, RxParquetData, RxOrcData {RevoScaleR}: Generate Hive, Parquet or ORC Data Source Object
These are constructors for Hive, Parquet and ORC data sources, which extend RxDataSource. These three data sources can be used only in an RxSpark compute context.
RxHiveData(query = NULL, table = NULL, colInfo = NULL, saveAsTempTable = FALSE, cache = FALSE, writeFactorsAsIndexes = FALSE)
RxParquetData(file, colInfo = NULL, fileSystem = "hdfs", cache = FALSE, writeFactorsAsIndexes = FALSE)
RxOrcData(file, colInfo = NULL, fileSystem = "hdfs", cache = FALSE, writeFactorsAsIndexes = FALSE)
query: character string specifying a Hive query, e.g. "select * from sample_table". Cannot be used with 'table'.
table: character string specifying the name of a Hive table, e.g. "sample_table". Cannot be used with 'query'.
colInfo: list of named variable information lists. Each variable information list contains one or more of the named elements given below (see rxCreateColInfo for more details). Currently available properties for a column information list are:
- type: character string specifying the data type for the column. Supported types are:
  - "logical" (stored as uchar)
  - "integer" (stored as int32)
  - "int16" (alternative to integer for smaller storage space)
  - "float32" (stored as FloatType)
  - "numeric" (stored as float64)
  - "character" (stored as string)
  - "factor" (stored as uint32)
  - "Date" (stored as Date, i.e. float64)
  - "POSIXct" (stored as POSIXct, i.e. float64)
- levels: character vector containing the levels when type = "factor". If the "levels" property is not provided, factor levels will be determined by the values in the source column. If levels are provided, any value that does not match a provided level will be converted to a missing value.
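As a rough illustration of the levels rule, the following base-R sketch builds a colInfo list and mimics the described conversion with factor(). The column name DayOfWeek comes from the examples on this page. The level values and the sample vector are assumptions made for illustration, and base R's factor() is only an analogy for how unmatched values are treated, not the data source's actual implementation.

```r
# A colInfo entry is a plain named list; no RevoScaleR call is needed to build it.
# The level values below are assumed for illustration.
colInfo <- list(
  DayOfWeek = list(
    type   = "factor",
    levels = c("Monday", "Tuesday", "Wednesday")
  )
)

# Hypothetical source values; "Funday" is not among the provided levels.
raw <- c("Monday", "Funday", "Wednesday")

# Base-R analogy: a value that matches no provided level becomes NA (missing).
converted <- factor(raw, levels = colInfo$DayOfWeek$levels)
print(converted)
print(is.na(converted))
```

Omitting the levels element from the list would correspond to the other case described, where the levels are taken from the values found in the source column.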
saveAsTempTable: logical. Only applicable when the data source is used as output with the table parameter. If TRUE, registers a temporary Hive table in the Spark memory system; otherwise, generates a persistent Hive table. The temporary Hive table is always cached in the Spark memory system.
file: character string specifying a file path, e.g. "/tmp/AirlineDemoSmall.parquet" or "/tmp/AirlineDemoSmall.orc".
fileSystem: character string "hdfs" or RxFileSystem object indicating the type of file system. Native HDFS and other HDFS-compatible systems, e.g. Azure Blob and Azure Data Lake, are supported. The local file system is not supported.
cache: [Deprecated] logical. If TRUE, data will be cached in the Spark application's memory system after the first use.
writeFactorsAsIndexes: logical. If TRUE, when writing to an output data source, underlying factor indexes will be written instead of the string representations.
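To make the distinction concrete, this base-R sketch shows the two representations of a factor column that such a flag chooses between. It is only an analogy: the sample values are assumed, and the exact index values RevoScaleR writes (for example, whether they start at 0 or 1) are not stated on this page.

```r
# Hypothetical factor column with three declared levels.
dayOfWeek <- factor(c("Tuesday", "Monday", "Tuesday"),
                    levels = c("Monday", "Tuesday", "Wednesday"))

# String representation of each value.
labels <- as.character(dayOfWeek)

# Underlying integer codes; in base R these are 1-based positions in levels().
indexes <- as.integer(dayOfWeek)

print(labels)
print(indexes)
```

Writing indexes is more compact, but a reader of the output then needs the level list to recover the original strings.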
Returns an RxHiveData, RxParquetData or RxOrcData object, depending on the constructor called.
## Not run:
myHadoopCluster <- rxSparkConnect()
colInfo = list(DayOfWeek = list(type = "factor"))
### import from a Parquet file
ds1 <- RxParquetData(file = "/tmp/AirlineDemoSmall.parquet",
                     colInfo = colInfo)
rxImport(ds1)
### import from an ORC file
ds2 <- RxOrcData(file = "/tmp/AirlineDemoSmall.orc",
                 colInfo = colInfo)
rxImport(ds2)
### import from a Hive query
ds3 <- RxHiveData(query = "select * from hivesampletable")
rxImport(ds3)
### import from a Hive persistent table
ds4 <- RxHiveData(table = "AirlineDemo")
rxImport(ds4)
### output to an ORC file
out1 <- RxOrcData(file = "/tmp/AirlineDemoSmall.orc",
                  colInfo = colInfo)
rxDataStep(inData = ds1, outFile = out1, overwrite = TRUE)
### output to a Hive temporary table
out2 <- RxHiveData(table = "AirlineDemoSmall", saveAsTempTable = TRUE)
rxDataStep(inData = ds1, outFile = out2)
## End(Not run)