Data Science in a Box using IPython: SciPy and Scikit-Learn (3/4)

In the first two posts of this series, we installed IPython Notebook with only the minimum requirements.

This third post walks you through some of the common packages used for data science.

SciPy and NumPy are usually mentioned together.  At this point, we have not installed SciPy.  SciPy is a collection of numerical packages, including the linear solvers that we used in a previous post, "Enter the Big Data Matrix: analyzing meanings and relations of everything (2/2)".

To install the package, type: sudo apt-get install python-scipy
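
Once the package is installed, a quick way to confirm that it works is to call one of the linear solvers mentioned above. The following is a minimal sketch, not part of the original samples; the matrix and vector values are made up purely for illustration.

```python
import numpy as np
from scipy import linalg

# A small, made-up linear system A x = b (illustrative values only).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

# Solve for x with SciPy's dense linear solver.
x = linalg.solve(A, b)

# Multiplying back should reproduce b if the solver worked.
print("x = %s" % x)
print("A.dot(x) matches b: %s" % np.allclose(A.dot(x), b))
```

If the import fails, SciPy is not installed for the Python interpreter that IPython is using.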

Scikit-Learn is a fantastic Python-based machine learning package.  It includes algorithms for both supervised and unsupervised learning, along with sample datasets, data import tools, and model evaluation.
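
To give a feel for how these pieces fit together, here is a small sketch that loads a bundled sample dataset, trains a supervised model, and evaluates it. It is not one of the samples from this series, and it assumes a recent Scikit-Learn release (for example, train_test_split lives in sklearn.model_selection in current versions). It can only be run after the package has been installed.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Sample dataset that ships with the package.
iris = load_iris()

# Hold out 30% of the rows so the model is scored on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# Supervised learning: a support vector classifier with default settings.
clf = SVC()
clf.fit(X_train, y_train)

# Model evaluation: fraction of held-out rows classified correctly.
accuracy = accuracy_score(y_test, clf.predict(X_test))
print("Test accuracy: %.2f" % accuracy)
```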

Scikit-Learn is included with your Ubuntu distribution, but the default is about two versions behind.  The best way to install Scikit-Learn is to use pip.

Type: pip install scikit-learn

The installation process builds many of the packages from scratch, since much of the code base is written in C.  Check the installation output for errors.  You can verify the install by looking for the new sklearn files in /usr/local/lib/python2.7/dist-packages.
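
The installation can also be checked from Python itself. The sketch below uses only the standard library; installed_version is a hypothetical helper name introduced here for illustration, not part of any of the packages.

```python
import importlib


def installed_version(package_name):
    """Return the package's version string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(package_name)
    except ImportError:
        return None
    # Some modules do not expose __version__; report that instead of failing.
    return getattr(module, "__version__", "unknown")


# "sklearn" is the import name of the scikit-learn package.
for name in ("numpy", "scipy", "sklearn"):
    version = installed_version(name)
    print("%-8s %s" % (name, version if version else "NOT INSTALLED"))
```

Run it with the same interpreter that serves your notebooks, otherwise the result may describe a different Python installation.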


Getting samples

It is easy to find samples and run them in IPython Notebook.  You can get them from various websites and tutorials.  To save you time, I've made a small collection at:  https://github.com/wenming/BigDataSamples

Get the package by typing:  wget https://github.com/wenming/BigDataSamples/archive/master.zip 

On your Ubuntu box, you might have to install unzip by typing:  sudo apt-get install unzip 

Unzip master.zip, then copy the contents of BigDataSamples-master/ipythonMLsamples into your IPython directory. A sample command may look like:

cp /home/azureuser/samples/BigDataSamples-master/ipythonMLsamples/* /home/azureuser/.ipython/

Check to make sure the files have been copied.
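
If you would rather script the unzip and copy steps, the same thing can be sketched in Python with the standard library. install_samples is a hypothetical helper written for this post; the default subdir matches the layout of master.zip described above, and the paths in the commented-out call are the example paths from the cp command, so adjust them to your own setup.

```python
import os
import shutil
import zipfile


def install_samples(zip_path, notebook_dir,
                    subdir="BigDataSamples-master/ipythonMLsamples"):
    """Extract zip_path next to itself and copy the files found in subdir
    into notebook_dir. Returns the sorted list of copied file names."""
    extract_dir = os.path.dirname(os.path.abspath(zip_path))
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(extract_dir)

    source_dir = os.path.join(extract_dir, *subdir.split("/"))
    if not os.path.isdir(notebook_dir):
        os.makedirs(notebook_dir)

    copied = []
    for name in sorted(os.listdir(source_dir)):
        source_path = os.path.join(source_dir, name)
        if os.path.isfile(source_path):  # like cp without -r: files only
            shutil.copy(source_path, notebook_dir)
            copied.append(name)
    return copied


# Example call (assumed paths, taken from the cp command above):
# print(install_samples("/home/azureuser/samples/master.zip",
#                       "/home/azureuser/.ipython"))
```

The returned list doubles as the check that the files were copied.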


Running the samples

Go back to the IPython Notebook site in your browser and log in.  The copied files should now be listed in the root directory.  Click on "K-Means clustering on the handwritten digits data".

 


Click on the Play button to run the machine learning sample.


The code uses the K-means algorithm with 3 different types of initialization, then plots the results.
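
For readers without the notebook open, the idea can be sketched as follows. This is a condensed approximation, assuming the sample follows the scikit-learn gallery example of the same name, which compares k-means++, random, and PCA-based initialization; it is not a copy of the notebook code, and it assumes a recent Scikit-Learn release.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Handwritten digits: 64 pixel features per image, 10 digit classes.
digits = load_digits()
data = scale(digits.data)
n_digits = len(np.unique(digits.target))

# PCA components serve as deterministic starting centers, so one run suffices.
pca = PCA(n_components=n_digits).fit(data)

estimators = {
    "k-means++": KMeans(init="k-means++", n_clusters=n_digits,
                        n_init=10, random_state=0),
    "random": KMeans(init="random", n_clusters=n_digits,
                     n_init=10, random_state=0),
    "PCA-based": KMeans(init=pca.components_, n_clusters=n_digits, n_init=1),
}

# Fit each variant and record its inertia (within-cluster sum of squares).
inertias = {}
for name, estimator in estimators.items():
    estimator.fit(data)
    inertias[name] = estimator.inertia_
    print("%-10s inertia = %.0f" % (name, estimator.inertia_))
```

Lower inertia means tighter clusters, which gives a rough way to compare the three starting strategies.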

 


The sample also includes the code that draws the colored scatter plot of the clusters.
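
A simplified sketch of such a plot is shown below. It is an approximation rather than the notebook's own code: it assumes the data is first reduced to two dimensions with PCA so it can be drawn, and the output file name kmeans_digits.png is an arbitrary choice. Inside the notebook you would normally drop the Agg backend lines and display the figure inline instead.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Reduce the 64-dimensional digits data to 2 dimensions for plotting.
data = scale(load_digits().data)
reduced = PCA(n_components=2).fit_transform(data)

# Cluster the reduced points and get one cluster label per point.
kmeans = KMeans(init="k-means++", n_clusters=10, n_init=10, random_state=0)
labels = kmeans.fit_predict(reduced)
centers = kmeans.cluster_centers_

# One color per cluster, with the cluster centers marked by black crosses.
fig, ax = plt.subplots()
ax.scatter(reduced[:, 0], reduced[:, 1], c=labels, s=8, cmap="tab10")
ax.scatter(centers[:, 0], centers[:, 1], marker="x", s=120, color="black")
ax.set_title("K-means clustering on the digits dataset (PCA-reduced)")
fig.savefig("kmeans_digits.png")  # assumed output name
```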


Additional samples to explore

These samples are also included; feel free to explore them on your own.

  • A demo of K-Means clustering on the handwritten digits data
  • A demo of structured Ward hierarchical clustering on Lena image
  • Faces dataset decompositions
  • Gaussian Processes regression
  • Manifold learning
  • Non-linear SVM
  • Handwriting recognition using SVM
  • Hierarchical clustering: structured vs. unstructured Ward
  • Demo 2 of the K-Means clustering algorithm
  • Weighted SVM
  • Visualizing the stock market structure

 

 

Conclusion

IPython Notebook gives us a quick and easy way to share compute resources through its web-based interface.  Scikit-Learn, NumPy, and SciPy all work out of the box with IPython Notebook.  This simple yet powerful combination lets users focus on learning and on getting the data analysis done.

In the next post, we will introduce additional Python packages that can be used for data analysis, including scaling out using clustering.