Recent Updates to the Microsoft Data Science Virtual Machine

Posted by Gopi Kumar, Principal Program Manager in the Microsoft Data Group.

It's been over 9 months since we first released the Data Science Virtual Machine (DSVM), a custom virtual machine image we published in the Azure Marketplace with a host of popular data science tools pre-installed and pre-configured. We've made a few updates since then, and now offer the DSVM in both Windows and Linux editions. There's been a tremendous response to this offering by the data analytics community across the globe and we continue to iterate and improve the experience. This post provides a quick update on some of the newer features that should further improve your productivity and let you accomplish more with the DSVM.

Windows Edition
SQL Server 2016 Developer edition now replaces SQL Server 2014 Express edition on the VM. SQL Server 2016 Developer is a full-featured edition, licensed for development and test purposes only, of Microsoft's industry-leading OLTP database and top-performing data warehouse.

It also includes R Services that support in-database analytics using Microsoft R, enabling large-scale analytics to be run closer to your data using ScaleR, Microsoft's distributed scalable library in R that is fully compatible with open source R packages and supports parallel algorithms.

The DSVM also packages an end-to-end data science tutorial featuring SQL Server R Services as a Jupyter notebook along with a preloaded dataset in the SQL database. You can also run R Server standalone outside the database.

In addition to libraries for working with Azure ML, the VM also provides a few popular open source ML and deep neural network/AI toolkits locally, such as xgboost, Vowpal Wabbit, Rattle, CNTK and mxnet, with samples to get you started.

Other notable updates to the VM include the Azure CLI and Visual Studio Community 2015 Update 3, which comes with several language tools, including R, Python and Node.js, as well as pre-installed plugins that make it easier to work with data and analytics technologies such as SQL Server, Azure HDInsight (Hadoop) and Azure Data Lake.

You can run several Linux command-line tools (awk, sed, find, wget, perl and others) right in the Windows command prompt or in Git Bash. Data movement tools on the VM support moving data to and from relational databases, Azure storage accounts, Azure DocumentDB and Azure Data Lake. The Microsoft Data Management Gateway installed on the VM allows you to set up data pipelines from on-premises sources to the cloud using Azure Data Factory.

Linux Edition

Microsoft R Server Developer edition, for non-production use only, is now available on the Linux DSVM, allowing you to build models at scale in R using Microsoft's ScaleR libraries. Previously we supported only Microsoft R Open, which uses Open Source libraries that can only process data that fit in memory.
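
To make the difference between "fits in memory" and "at scale" a bit more concrete, here is a minimal Python sketch of chunked (out-of-core) processing, the general idea that lets a library work on data larger than RAM. This is not ScaleR code and it does not use any ScaleR API; the names `read_in_chunks` and `chunked_mean` are hypothetical and exist only for this illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def read_in_chunks(values: Iterable[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield successive lists holding at most chunk_size values each."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(values)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_mean(values: Iterable[float], chunk_size: int = 1000) -> float:
    """Compute the mean while keeping only one chunk in memory at a time."""
    count = 0
    total = 0.0
    for chunk in read_in_chunks(values, chunk_size):
        # Only small running totals survive between chunks.
        count += len(chunk)
        total += sum(chunk)
    if count == 0:
        raise ValueError("cannot compute the mean of an empty input")
    return total / count


if __name__ == "__main__":
    # A generator stands in for a data source too large to load at once.
    stream = (float(i) for i in range(1, 1_000_001))
    print(chunked_mean(stream, chunk_size=10_000))
```

The point of the sketch is that memory use depends on the chunk size rather than on the total number of rows, which is the trade-off an in-memory-only library cannot make.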

Another major update on the Linux VM is support for JupyterHub, a multi-user server for Jupyter notebooks. In our experience, JupyterHub has been particularly useful in education and training scenarios, where a single VM instance can support multiple users, each working independently on their own single-user notebook server instance with OS authentication.

We have also added support for working with the Julia language, both from the command line and as a Jupyter notebook kernel. All the ML tools mentioned above in the Windows section, with the exception of mxnet, are also available on the Linux DSVM.

The slide below captures the key software components available in each of the DSVM editions:

[Image: DSVM Edition Side by Side Comparison]

With the Data Science VM, you have a comprehensive set of tools for the full range of data science activities, including data movement, data storage, data exploration and visualization, modeling with ML and AI algorithms, and operationalization, using multiple languages in both Linux and Windows environments.

You can find much more information in the resources listed below. Do give the DSVM a spin for your next data science or analytics project or training session. As always, we'd love to hear your feedback so we can continue to improve your experience.

Gopi

 

Windows Edition:

Linux Edition:

Webinar:

Comments

  • Anonymous
    September 13, 2016
    Why isn't the Data Science VM covered by Azure credits from an MSDN subscription?
    • Anonymous
      September 15, 2016
      @Gabe - Were you able to successfully create the DSVM with the Azure credits from your MSDN subscription?
  • Anonymous
    September 13, 2016
    Actually, the Data Science VM should be covered by Azure credits, since there are only compute charges (and no software charges). The message shown when provisioning the VM about being unable to apply credits is a bit misleading: it applies only to the software price, which is not an issue for the Data Science VM since its software price is $0. You can ignore that message and continue provisioning the VM (I will report the misleading message). Personally, I have used my MSDN Azure credits to spin up Data Science VMs. Let me know if you have any issues using your MSDN subscription Azure credits. Thanks for raising this question, Gabe.
  • Anonymous
    September 13, 2016
    Hi Gabe, I just tried with my MSDN credits and, as Gopi mentioned, it is working as expected.
  • Anonymous
    September 19, 2016
    Thank you for this. This will be great for creating a how-to installation guide for hybrid deployments, with some services in Hyper-V on premises and others in Azure. Thanks again.
    • Anonymous
      September 19, 2016
      Thanks @JLSF. Can you provide a concrete sample scenario for your hybrid setup? We will look at creating some guidance for use cases such as yours.