revoscalepy package

The revoscalepy module is a collection of portable, scalable and distributable Python functions used for importing, transforming, and analyzing data at scale. You can use it for descriptive statistics, generalized linear models, logistic regression, classification and regression trees, and decision forests.

Functions run on the revoscalepy interpreter, which is built on open-source Python and engineered to leverage the multithreaded and multinode architecture of the host platform.

Package details
Current version: 9.4
Built on: Anaconda 4.2 distribution of Python 3.5
Package distribution: Machine Learning Server 9.x; SQL Server 2017 Machine Learning Services; SQL Server 2017 Machine Learning Server (Standalone)

How to use revoscalepy

The revoscalepy module is found in Machine Learning Server or SQL Server Machine Learning when you add Python to your installation. You get the full collection of proprietary packages plus a Python distribution with its modules and interpreter.

You can use any Python IDE to write Python script that calls functions in revoscalepy, but the script must run on a computer that has the proprietary modules installed. For a review of common tasks, see How to use revoscalepy with Spark.

Run it locally

This is the default. The revoscalepy library runs locally on all platforms. On a standalone Linux or Windows system, data and operations are local to the machine. On Spark, a local compute context means that data and operations are local to the current execution environment (typically, an edge node).

Run in a remote compute context

In a remote compute context, a script running on a local Machine Learning Server shifts execution to a remote Machine Learning Server on Spark or SQL Server. For example, a script running on Windows might shift execution to a Spark cluster to process data there.

On Spark, set the compute context to RxSpark and specify the cluster name. In this context, if you call a function that can run in parallel, the task is distributed across data nodes in the cluster, where the operation is co-located with the data.

On SQL Server, set the compute context to RxInSqlServer. There are two primary use cases for a remote compute context:

  • Call Python functions in T-SQL script or stored procedures running on SQL Server.

  • Call revoscalepy functions in Python script executing in a SQL Server compute context. In your script, you can set a compute context to shift execution of revoscalepy operations to a remote SQL Server instance that has the revoscalepy interpreter.
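
The second use case can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the helper names build_connection_string and run_in_sql_server, the server name, and the database name are all hypothetical, and the sketch assumes RxInSqlServer accepts a connection_string argument. The revoscalepy import is deferred into the function so that the snippet loads on a machine without the proprietary modules; only the call to run_in_sql_server requires them.

```python
def build_connection_string(server, database):
    """Build a trusted-connection ODBC string (hypothetical helper)."""
    return ("Driver=SQL Server;Server={};Database={};"
            "Trusted_Connection=True;").format(server, database)


def run_in_sql_server(connection_string, work):
    """Call work() with SQL Server as the compute context, then restore
    whatever compute context was active before (hypothetical helper)."""
    # Deferred import: revoscalepy ships only with Machine Learning Server
    # or SQL Server Machine Learning Services.
    from revoscalepy import (RxInSqlServer, rx_get_compute_context,
                             rx_set_compute_context)

    previous = rx_get_compute_context()
    # Assumption: RxInSqlServer takes a connection_string keyword argument.
    rx_set_compute_context(RxInSqlServer(connection_string=connection_string))
    try:
        return work()
    finally:
        # Switch back so later revoscalepy calls run where they did before.
        rx_set_compute_context(previous)


# "myserver" and "mydb" are placeholder names, not real resources.
conn = build_connection_string("myserver", "mydb")
print(conn)
```

Restoring the previous context in a finally block is a design choice of this sketch; it keeps a failure inside work() from leaving the session pointed at the remote server.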

Functions by category

The library includes data transformation and manipulation, visualization, predictions, and statistical analysis functions. It also includes functions for controlling jobs, serializing data, and performing common utility tasks.

This section lists the functions by category to give you an idea of how each one is used. The table of contents lists functions in alphabetical order.

Note

Some function names begin with rx_ and others with Rx. The Rx function name prefix is used for class constructors for data sources and compute contexts.
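
As a small illustration of this naming convention, the sketch below sorts a few names taken from the tables in this article by prefix. The helper kind_of_name is hypothetical and exists only for this example; it inspects the spelling of a name and does not import or query revoscalepy.

```python
def kind_of_name(name):
    """Classify a revoscalepy name by its prefix (hypothetical helper)."""
    if name.startswith("rx_"):
        return "function"
    if name.startswith("Rx"):
        return "class constructor"
    return "unknown"


# Names taken from the tables in this article.
for name in ["RxSpark", "rx_set_compute_context", "RxXdfData", "rx_import"]:
    print(name, "->", kind_of_name(name))
```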

1-Compute context functions

Function | Description
RxInSqlServer | Creates a compute context for running revoscalepy analyses inside a remote Microsoft SQL Server.
RxLocalSeq | This is the default, but you can call it to switch back to a local compute context if your script runs in multiple compute contexts. Computations using rx_exec are processed sequentially.
rx_get_compute_context | Returns the current compute context.
rx_set_compute_context | Changes the compute context to a different one.
RxSpark | Creates a compute context for running revoscalepy analyses in a remote Spark cluster.
rx_get_pyspark_connection | Gets a connection to a PySpark data set, in support of revoscalepy and PySpark interoperability.
rx_spark_connect | Creates a persistent Spark connection.
rx_spark_disconnect | Closes the Spark connection.

2-Data source functions

Data sources are used by microsoftml functions as well as revoscalepy.

Function | Compute Context | Description
RxDataSource | All | Base class for all revoscalepy data sources.
RxHdfsFileSystem | Local, RxSpark | Data source is accessed through HDFS instead of the Linux file system.
RxNativeFileSystem | Local, RxSpark | Data source is accessed through the Linux file system instead of HDFS.
RxHiveData | Local, RxSpark | Generates a data source object from a Hive data file.
RxTextData | Local, RxSpark | Generates a data source object from a text data file.
RxXdfData | All | Generates a data source object from an XDF data source.
RxOdbcData | All | Generates a data source object from an ODBC data source.
RxOrcData | Local, RxSpark | Generates a data source object from an ORC data file.
RxParquetData | Local, RxSpark | Generates a data source object from a Parquet data file.
RxSparkData | Local, RxSpark | Generates a data source object from a Spark data source.
RxSparkDataFrame | Local, RxSpark | Generates a data source object from a Spark data frame.
rx_get_partitions | Local, RxSpark | Gets the partitions of a partitioned XDF data source.
rx_partition | Local, RxSpark | Partitions input data sources by key values and saves the results to a partitioned .xdf on disk.
rx_spark_cache_data | Local, RxSpark | Generates a data source object from cached data.
rx_spark_list_data | Local, RxSpark | Generates a data source object from a list.
rx_spark_remove_data | Local, RxSpark | Deletes the Spark cached data source object.
RxSqlServerData | Local, RxInSqlServer | Generates a data source object from a SQL table or query.

3-Data manipulation (ETL) functions

Function | Compute Context | Description
rx_import | All | Import data into an .xdf file or data frame.
rx_data_step | All | Transform data from an input data set to an output data set.

4-Analytic functions

Function | Compute Context | Description
rx_exec_by | Local, RxSpark | Execute an arbitrary function in parallel on multiple data nodes.
rx_summary | All | Produce univariate summaries of objects in revoscalepy.
rx_lin_mod | All | Fit linear models on small or large data.
rx_logit | All | Fit logistic regression models on small or large data.
rx_dtree | All | Fit classification and regression trees on an .xdf file or data frame for small or large data using a parallel external memory algorithm.
rx_dforest | All | Fit classification and regression decision forests on an .xdf file or data frame for small or large data using a parallel external memory algorithm.
rx_btrees | All | Fit stochastic gradient boosted decision trees on an .xdf file or data frame for small or large data using a parallel external memory algorithm.
rx_predict_default | All | Compute predicted values and residuals using rx_lin_mod and rx_logit objects.
rx_predict_rx_dforest | All | Calculate predicted or fitted values for a data set from an rx_dforest or rx_btrees object.
rx_predict_rx_dtree | All | Calculate predicted or fitted values for a data set from an rx_dtree object.
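
To make the fit-then-predict pattern concrete, here is a hedged sketch that fits a linear model on a tiny in-memory data frame and scores the same rows. It is a sketch under assumptions rather than a reference implementation: the function name fit_and_predict and the toy data are hypothetical, and it assumes that rx_lin_mod accepts a formula string plus a data argument and that rx_predict is the entry point behind the rx_predict_* entries above. When revoscalepy or pandas is not installed, the function returns None instead of failing.

```python
def fit_and_predict():
    """Fit y ~ x with rx_lin_mod and score the same rows (hypothetical helper).

    Returns None when revoscalepy or pandas is not available.
    """
    try:
        import pandas as pd
        from revoscalepy import rx_lin_mod, rx_predict
    except ImportError:
        return None

    # Toy data, invented for this example only.
    df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0],
                       "y": [2.1, 3.9, 6.2, 7.8]})

    # Assumption: formula string first, data frame passed as `data`.
    model = rx_lin_mod("y ~ x", data=df)
    return rx_predict(model, data=df)


result = fit_and_predict()
if result is None:
    print("revoscalepy is not available on this machine")
else:
    print(result)
```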

5-Job functions

In an RxSpark context, job management is built in. You only need job functions if you want to manually control the YARN queue.

Function | Compute Context | Description
rx_exec | All | Allows distributed execution of a function in parallel across nodes (computers) or cores of a "compute context" such as a cluster.
rx_cancel_job | All | Removes all job-related artifacts from the distributed computing resources, including any job results.
rx_cleanup_jobs | All | Removes the artifacts for a specific job.
RxRemoteJob | All | Closes the remote job, purging all associated job-related data.
RxRemoteJobStatus | All | Represents the execution status of a remote Python job.
rx_get_job_info | All | Contains complete information on the job's compute context as well as other information needed by the distributed computing resources.
rx_get_job_output | All | Returns console output for the nodes participating in a distributed computing job.
rx_get_job_results | All | Returns results of the run or a message stating why results are not available.
rx_get_job_status | All | Obtains the distributed computing processing status for the specified job.
rx_get_jobs | All | Returns a list of job objects associated with the given compute context and matching the specified parameters.
rx_wait_for_job | All | Blocks on an existing distributed job until completion, effectively turning a non-blocking job into a blocking job.

6-Serialization functions

Function | Compute Context | Description
rx_serialize_model | All | Serialize a given Python model.
rx_read_object | All | Retrieves an object from an ODBC data source.
rx_read_xdf | All | Read data from an .xdf file into a data frame.
rx_write_object | All | Stores an object in an ODBC data source.
rx_delete_object | All | Deletes an object from the ODBC data source.
rx_list_keys | All | Enumerates all keys or versions for a given key, depending on the parameters.

7-Utility functions

Function | Compute Context | Description
RxOptions | All | Specify and retrieve options needed for revoscalepy computations.
rx_get_info | All | Get basic information about a revoscalepy data source or data frame.
rx_get_var_info | All | Get variable information for a revoscalepy data source or data frame, including variable names, descriptions, and value labels.
rx_get_var_names | All | Read the variable names for a data source or data frame.
rx_set_var_info | All | Set the variable information for an .xdf file, including variable names, descriptions, and value labels, or set attributes for variables in a data frame.
RxMissingValues | All | Provides missing values for various NumPy data types, which you can use to mark missing values in a sequence of data in an ndarray.
rx_privacy_control | All | Opt out of usage data collection.
rx_hadoop_command | Local, RxSpark | Execute arbitrary Hadoop commands and perform standard file operations in Hadoop.
rx_hadoop_copy_from_local | Local, RxSpark | Wraps the Hadoop fs -copyFromLocal command.
rx_hadoop_copy_to_local | Local, RxSpark | Wraps the Hadoop fs -copyToLocal command.
rx_hadoop_copy | Local, RxSpark | Wraps the Hadoop fs -cp command.
rx_hadoop_file_exists | Local, RxSpark | Wraps the Hadoop fs -test -e command.
rx_hadoop_list_files | Local, RxSpark | Wraps the Hadoop fs -ls or -lsr command.
rx_hadoop_make_dir | Local, RxSpark | Wraps the Hadoop fs -mkdir -p command.
rx_hadoop_move | Local, RxSpark | Wraps the Hadoop fs -mv command.
rx_hadoop_remove_dir | Local, RxSpark | Wraps the Hadoop fs -rm -r or fs -rm -r -skipTrash command.
rx_hadoop_remove | Local, RxSpark | Wraps the Hadoop fs -rm or fs -rm -skipTrash command.
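
The correspondence between the rx_hadoop_* wrappers and Hadoop shell commands can be summarized as a lookup table. The sketch below is pure Python and does not call revoscalepy or Hadoop; it only builds the command line that each wrapper is described as wrapping in the table above. The dictionary HADOOP_EQUIVALENTS, the helper equivalent_command, and the example path are hypothetical, and for wrappers with two listed variants only the first variant is shown.

```python
# Wrapper name -> Hadoop shell arguments, as described in the table above.
HADOOP_EQUIVALENTS = {
    "rx_hadoop_copy_from_local": "fs -copyFromLocal",
    "rx_hadoop_copy_to_local": "fs -copyToLocal",
    "rx_hadoop_copy": "fs -cp",
    "rx_hadoop_file_exists": "fs -test -e",
    "rx_hadoop_list_files": "fs -ls",
    "rx_hadoop_make_dir": "fs -mkdir -p",
    "rx_hadoop_move": "fs -mv",
    "rx_hadoop_remove_dir": "fs -rm -r",
    "rx_hadoop_remove": "fs -rm",
}


def equivalent_command(function_name, *paths):
    """Return the shell command a wrapper corresponds to (hypothetical helper)."""
    parts = ["hadoop", HADOOP_EQUIVALENTS[function_name]] + list(paths)
    return " ".join(parts)


# "/user/demo/data" is a placeholder path.
print(equivalent_command("rx_hadoop_make_dir", "/user/demo/data"))
```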

Next steps

For Machine Learning Server, try a quickstart as an introduction to revoscalepy.

For SQL Server, add both Python modules to your computer by running setup.

Follow the SQL Server tutorials for hands-on experience.

See also

Machine Learning Server
SQL Server Machine Learning Services with Python
SQL Server Machine Learning Server (Standalone)