revoscalepy package
The revoscalepy module is a collection of portable, scalable, and distributable Python functions for importing, transforming, and analyzing data at scale. You can use it for descriptive statistics, generalized linear models, logistic regression, classification and regression trees, and decision forests.
Functions run on the revoscalepy interpreter, which is built on open-source Python and engineered to leverage the multithreaded and multinode architecture of the host platform.
Package details | Information |
---|---|
Current version: | 9.4 |
Built on: | Anaconda 4.2 distribution of Python 3.5 |
Package distribution: | Machine Learning Server 9.x<br/>SQL Server 2017 Machine Learning Services<br/>SQL Server 2017 Machine Learning Server (Standalone) |
How to use revoscalepy
The revoscalepy module is installed when you add Python to a Machine Learning Server or SQL Server Machine Learning installation. You get the full collection of proprietary packages plus a Python distribution with its modules and interpreter.
You can use any Python IDE to write Python script that calls functions in revoscalepy, but the script must run on a computer that has the proprietary modules. For a review of common tasks, see How to use revoscalepy with Spark.
Run it locally
This is the default. The revoscalepy library runs locally on all platforms. On a standalone Linux or Windows system, data and operations are local to the machine. On Spark, a local compute context means that data and operations are local to the current execution environment (typically, an edge node).
Run in a remote compute context
In a remote compute context, the script running on a local Machine Learning Server shifts execution to a remote Machine Learning Server on Spark or SQL Server. For example, script running on Windows might shift execution to a Spark cluster to process data there.
On Spark, set the compute context to RxSpark and specify the cluster. In this context, if you call a function that can run in parallel, the task is distributed across data nodes in the cluster, where the operation is co-located with the data.
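A minimal sketch of this pattern, assuming it runs from a node that can reach the cluster with default connection settings:

```python
# A minimal sketch: open a persistent Spark connection from an edge node and
# make it the active compute context. The no-argument call assumes default
# cluster settings; your environment may require explicit parameters.
from revoscalepy import rx_spark_connect, rx_spark_disconnect, rx_set_compute_context

cc = rx_spark_connect()      # returns an RxSpark compute context
rx_set_compute_context(cc)   # parallelizable functions now run on the cluster

# ... run revoscalepy analysis functions co-located with the data ...

rx_spark_disconnect(cc)      # close the persistent connection when finished
```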
On SQL Server, set the compute context to RxInSqlServer. There are two primary use cases for a remote compute context, the second of which is illustrated in the sketch after this list:
Call Python functions in T-SQL script or stored procedures running on SQL Server.
Call revoscalepy functions in Python script executing in a SQL Server compute context. In your script, you can set a compute context to shift execution of revoscalepy operations to a remote SQL Server instance that has the revoscalepy interpreter.
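A minimal sketch of the second case; the connection string, table, and column names below are placeholders for your own:

```python
# A minimal sketch: shift revoscalepy execution to a remote SQL Server
# instance. All connection details and names below are placeholders.
from revoscalepy import RxInSqlServer, RxSqlServerData, rx_set_compute_context, rx_summary

conn_str = "Driver=SQL Server;Server=myserver;Database=mydb;Trusted_Connection=True"

cc = RxInSqlServer(connection_string=conn_str)
rx_set_compute_context(cc)   # subsequent revoscalepy operations run in-database

table = RxSqlServerData(table="dbo.flights", connection_string=conn_str)
print(rx_summary(formula="~ ArrDelay", data=table))  # summary computed on the server
```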
Functions by category
The library includes data transformation and manipulation, visualization, predictions, and statistical analysis functions. It also includes functions for controlling jobs, serializing data, and performing common utility tasks.
This section lists the functions by category to give you an idea of how each one is used. The table of contents lists functions in alphabetical order.
Note
Some function names begin with rx_ and others with Rx. The Rx prefix is used for class constructors for data sources and compute contexts.
1-Compute context functions
Function | Description |
---|---|
RxInSqlServer | Creates a compute context for running revoscalepy analyses inside a remote Microsoft SQL Server. |
RxLocalSeq | Creates the default local compute context. Call it to switch back to local execution if your script runs in multiple compute contexts. Computations using rx_exec are processed sequentially. |
rx_get_compute_context | Returns the current compute context. |
rx_set_compute_context | Changes the compute context to the one you specify. |
RxSpark | Creates a compute context for running revoscalepy analyses in a remote Spark cluster. |
rx_get_pyspark_connection | Gets a connection to a PySpark data set, in support of revoscalepy and PySpark interoperability. |
rx_spark_connect | Creates a persistent Spark connection. |
rx_spark_disconnect | Closes the connection. |
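As a quick illustration, this sketch inspects the active compute context and reverts to the default local one:

```python
# A minimal sketch: inspect the active compute context, then revert to the
# default sequential local context.
from revoscalepy import RxLocalSeq, rx_get_compute_context, rx_set_compute_context

print(rx_get_compute_context())        # the compute context currently in effect
rx_set_compute_context(RxLocalSeq())   # run subsequent computations locally and sequentially
```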
2-Data source functions
Data sources are used by microsoftml functions as well as revoscalepy.
Function | Compute Context | Description |
---|---|---|
RxDataSource | All | Base class for all revoscalepy data sources. |
RxHdfsFileSystem | Local, RxSpark | Data source is accessed through HDFS instead of the native (Linux) file system. |
RxNativeFileSystem | Local, RxSpark | Data source is accessed through the native (Linux) file system instead of HDFS. |
RxHiveData | Local, RxSpark | Generates a data source object from a Hive data file. |
RxTextData | Local, RxSpark | Generates a data source object from a text data file. |
RxXdfData | All | Generates a data source object from an XDF data source. |
RxOdbcData | All | Generates a data source object from an ODBC data source. |
RxOrcData | Local, RxSpark | Generates a data source object from an ORC data file. |
RxParquetData | Local, RxSpark | Generates a data source object from a Parquet data file. |
RxSparkData | Local, RxSpark | Generates a data source object from a Spark data source. |
RxSparkDataFrame | Local, RxSpark | Generates a data source object from a Spark data frame. |
rx_get_partitions | Local, RxSpark | Gets the partitions of a partitioned XDF data source. |
rx_partition | Local, RxSpark | Partitions input data sources by key values and saves the results to a partitioned .xdf file on disk. |
rx_spark_cache_data | Local, RxSpark | Generates a data source object from cached data. |
rx_spark_list_data | Local, RxSpark | Lists the cached Spark data source objects. |
rx_spark_remove_data | Local, RxSpark | Deletes the Spark cached data source object. |
RxSqlServerData | Local, RxInSqlServer | Generates a data source object from a SQL table or query. |
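As a quick illustration, this sketch wraps two files as data source objects (the paths are placeholders) and prints basic information about one of them:

```python
# A minimal sketch: declare data source objects over existing files.
# The file paths are placeholders.
from revoscalepy import RxTextData, RxXdfData, rx_get_info

csv_source = RxTextData("data/flights.csv")   # delimited text data source
xdf_source = RxXdfData("data/flights.xdf")    # XDF data source
print(rx_get_info(csv_source))                # basic information about the source
```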
3-Data manipulation (ETL) functions
Function | Compute Context | Description |
---|---|---|
rx_import | All | Imports data into an .xdf file or data frame. |
rx_data_step | All | Transforms data from an input data set to an output data set. |
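A typical pattern is to import once into the XDF format and run subsequent steps against that file. A minimal sketch, with placeholder paths and column names:

```python
# A minimal sketch: import a CSV file into an .xdf file, then extract a
# subset of columns as a data frame. Paths and names are placeholders.
from revoscalepy import RxTextData, rx_import, rx_data_step

xdf = rx_import(input_data=RxTextData("data/flights.csv"),
                output_file="data/flights.xdf", overwrite=True)

# With no output_file argument, rx_data_step returns a data frame.
df = rx_data_step(input_data=xdf, vars_to_keep=["ArrDelay", "DayOfWeek"])
print(df.head())
```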
4-Analytic functions
Function | Compute Context | Description |
---|---|---|
rx_exec_by | Local, RxSpark | Executes an arbitrary function in parallel on multiple data nodes. |
rx_summary | All | Produces univariate summaries of objects in revoscalepy. |
rx_lin_mod | All | Fits linear models on small or large data. |
rx_logit | All | Fits logistic regression models for small or large data. |
rx_dtree | All | Fits classification and regression trees on an .xdf file or data frame for small or large data, using a parallel external memory algorithm. |
rx_dforest | All | Fits classification and regression decision forests on an .xdf file or data frame for small or large data, using a parallel external memory algorithm. |
rx_btrees | All | Fits stochastic gradient boosted decision trees on an .xdf file or data frame for small or large data, using a parallel external memory algorithm. |
rx_predict_default | All | Computes predicted values and residuals from rx_lin_mod and rx_logit objects. |
rx_predict_rx_dforest | All | Calculates predicted or fitted values for a data set from an rx_dforest or rx_btrees object. |
rx_predict_rx_dtree | All | Calculates predicted or fitted values for a data set from an rx_dtree object. |
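A minimal end-to-end sketch: fit a linear model on an XDF file and score the same data. The path and column names are placeholders, and rx_predict dispatches to the model-specific prediction functions listed above:

```python
# A minimal sketch: fit a linear model and compute predicted values.
# File path and column names are placeholders.
from revoscalepy import RxXdfData, rx_lin_mod, rx_predict

data = RxXdfData("data/flights.xdf")
model = rx_lin_mod(formula="ArrDelay ~ DayOfWeek", data=data)
predictions = rx_predict(model, data=data)   # predicted values for each row
print(predictions)
```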
5-Job functions
In an RxSpark context, job management is built in. You only need job functions if you want to manually control the YARN queue.
Function | Compute Context | Description |
---|---|---|
rx_exec | All | Allows distributed execution of a function in parallel across the nodes (computers) or cores of a compute context, such as a cluster. |
rx_cancel_job | All | Cancels an existing distributed job and removes all job-related artifacts from the distributed computing resources, including any job results. |
rx_cleanup_jobs | All | Removes the artifacts for a specific job. |
RxRemoteJob class | All | Closes the remote job, purging all associated job-related data. |
RxRemoteJobStatus | All | Represents the execution status of a remote Python job. |
rx_get_job_info | All | Returns complete information on the job's compute context, as well as other information needed by the distributed computing resources. |
rx_get_job_output | All | Returns console output for the nodes participating in a distributed computing job. |
rx_get_job_results | All | Returns the results of the run, or a message stating why results are not available. |
rx_get_job_status | All | Obtains the processing status of the specified distributed computing job. |
rx_get_jobs | All | Returns a list of job objects associated with the given compute context and matching the specified parameters. |
rx_wait_for_job | All | Blocks on an existing distributed job until completion, effectively turning a non-blocking job into a blocking job. |
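As a quick illustration, this sketch runs a trivial function several times through rx_exec. The times_to_run parameter is an assumption here; check the signature in your version:

```python
# A minimal sketch: execute a function repeatedly through rx_exec. In a
# cluster compute context the calls are distributed across nodes; in a
# local sequential context they simply run one after another.
from revoscalepy import rx_exec

def report_host():
    import platform          # import inside the function so remote nodes resolve it
    return platform.node()   # name of the node the call executed on

results = rx_exec(function=report_host, times_to_run=4)  # times_to_run is assumed
print(results)
```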
6-Serialization functions
Function | Compute Context | Description |
---|---|---|
rx_serialize_model | All | Serializes a given Python model. |
rx_read_object | All | Retrieves an object from an ODBC data source. |
rx_read_xdf | All | Reads data from an .xdf file into a data frame. |
rx_write_object | All | Stores an object in an ODBC data source. |
rx_delete_object | All | Deletes an object from an ODBC data source. |
rx_list_keys | All | Enumerates all keys or versions for a given key, depending on the parameters. |
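A minimal sketch of the round trip: serialize a trained model to bytes and store it through an ODBC data source. The connection string, table, and key are placeholders, and model stands for a trained revoscalepy model object such as the rx_lin_mod result in the earlier sketch:

```python
# A minimal sketch: serialize a trained model and persist it in an ODBC
# data source. Connection string, table, and key are placeholders; 'model'
# is a trained revoscalepy model object (for example, an rx_lin_mod result).
from revoscalepy import rx_serialize_model, rx_write_object, rx_read_object, RxOdbcData

payload = rx_serialize_model(model, realtime_scoring_only=False)  # raw bytes

store = RxOdbcData(connection_string="DSN=models", table="model_store")
rx_write_object(store, key="flights_linmod", value=payload)   # save the model
restored = rx_read_object(store, key="flights_linmod")        # load it back later
```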
7-Utility functions
Function | Compute Context | Description |
---|---|---|
RxOptions | All | Specifies and retrieves options needed for revoscalepy computations. |
rx_get_info | All | Gets basic information about a revoscalepy data source or data frame. |
rx_get_var_info | All | Gets variable information for a revoscalepy data source or data frame, including variable names, descriptions, and value labels. |
rx_get_var_names | All | Reads the variable names for a data source or data frame. |
rx_set_var_info | All | Sets the variable information for an .xdf file, including variable names, descriptions, and value labels, or sets attributes for variables in a data frame. |
RxMissingValues | All | Provides missing values for various NumPy data types, which you can use to mark missing values in a sequence of data in an ndarray. |
rx_privacy_control | All | Opts out of usage data collection. |
rx_hadoop_command | Local, RxSpark | Execute arbitrary Hadoop commands and perform standard file operations in Hadoop. |
rx_hadoop_copy_from_local | Local, RxSpark | Wraps the Hadoop fs -copyFromLocal command. |
rx_hadoop_copy_to_local | Local, RxSpark | Wraps the Hadoop fs -copyToLocal command. |
rx_hadoop_copy | Local, RxSpark | Wraps the Hadoop fs -cp command. |
rx_hadoop_file_exists | Local, RxSpark | Wraps the Hadoop fs -test -e command. |
rx_hadoop_list_files | Local, RxSpark | Wraps the Hadoop fs -ls or -lsr command. |
rx_hadoop_make_dir | Local, RxSpark | Wraps the Hadoop fs -mkdir -p command. |
rx_hadoop_move | Local, RxSpark | Wraps the Hadoop fs -mv command. |
rx_hadoop_remove_dir | Local, RxSpark | Wraps the Hadoop fs -rm -r or fs -rm -r -skipTrash command. |
rx_hadoop_remove | Local, RxSpark | Wraps the Hadoop fs -rm or fs -rm -skipTrash command. |
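As a quick illustration of the Hadoop wrappers, with placeholder paths:

```python
# A minimal sketch of the Hadoop file-system wrappers; all paths are
# placeholders. Each call shells out to the indicated 'hadoop fs' command.
from revoscalepy import (rx_hadoop_make_dir, rx_hadoop_copy_from_local,
                         rx_hadoop_file_exists, rx_hadoop_list_files)

rx_hadoop_make_dir("/user/me/flights")                              # fs -mkdir -p
rx_hadoop_copy_from_local("data/flights.csv", "/user/me/flights")   # fs -copyFromLocal
print(rx_hadoop_file_exists("/user/me/flights/flights.csv"))        # fs -test -e
rx_hadoop_list_files("/user/me/flights")                            # fs -ls
```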
Next steps
For Machine Learning Server, try a quickstart as an introduction to revoscalepy.
For SQL Server, add both Python modules to your computer by running setup.
Follow the SQL Server tutorials for hands-on experience.
See also
Machine Learning Server
SQL Server Machine Learning Services with Python
SQL Server Machine Learning Server (Standalone)