Welcome to the Setting-up-and-Running-SystemML wiki!

Short Name

Learn how to set up your environment, run Apache SystemML on your computer, and perform nonnegative matrix factorization in a Jupyter Notebook.

Offering Type

Data Analytics & Cloud

Introduction

Apache SystemML is an open source machine learning tool that can run locally or in conjunction with big data tools such as Apache Spark. This journey highlights how to set up your environment for SystemML, as well as an initial use case focusing on nonnegative matrix factorization. In that use case, SystemML processes the mathematical functions quickly on large data, adding to an efficient pipeline. This journey is built for beginners who are not as familiar with Apache Spark or Jupyter Notebooks and who are brand new to Apache SystemML. It will demonstrate how you can use Spark and SystemML in tandem to achieve your data science goals!

Author

by Madison J. Myers

Code

https://github.com/MadisonJMyers/Setting-up-and-Running-SystemML

Demo

N/A

Video

N/A

Overview

Apache SystemML and Apache Spark are invaluable big data tools, but they can be confusing to set up and use. In this journey I will demonstrate how to set up your environment for Apache Spark and Apache SystemML and incorporate both into a Jupyter Notebook. I will then show you how to run some math with SystemML directly in the notebook.

When you have completed this journey, you will understand how to:

Set up your environment for Apache Spark and SystemML
Download data into a Jupyter Notebook using SystemML
Use PySpark to load the data into a DataFrame (a sketch of this step follows this list)
Use SystemML to define a Poisson nonnegative matrix factorization (PNMF)
Plot your results
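
For example, the data-loading step looks roughly like the sketch below. This is a minimal illustration, not the journey's exact notebook code: the URL, file name, and column names are hypothetical placeholders for whichever dataset you use.

```python
# A minimal sketch of downloading a dataset and loading it into a PySpark
# DataFrame. The URL and columns are hypothetical placeholders.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SystemML-journey").getOrCreate()

# Read the raw file into pandas first (placeholder URL), then hand it to Spark.
pdf = pd.read_csv("https://example.com/data/ratings.csv")  # hypothetical dataset
df = spark.createDataFrame(pdf)

df.printSchema()
df.show(5)
```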

Flow

The user sets up their environment following the highlighted steps.
The user opens a new Jupyter Notebook with Spark and SystemML using the code provided.
The user downloads the data and loads it into a DataFrame.
The user starts a SystemML context.
The user defines a kernel for Poisson nonnegative matrix factorization (PNMF) in DML (a sketch follows this list).
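
Concretely, the last three steps might look like the following sketch in a notebook cell. It assumes the systemml Python package is installed, that spark is an existing SparkSession, and that df is a Spark DataFrame whose columns are all numeric counts; the rank, iteration count, and the PNMF script itself (written along the lines of the PNMF example in the SystemML documentation) are illustrative, not the journey's exact code.

```python
# A sketch of starting an MLContext and running a PNMF kernel written in DML.
# Assumes: `spark` is a SparkSession and `df` is an all-numeric Spark DataFrame.
from systemml import MLContext, dml

ml = MLContext(spark)  # the SystemML context backed by the Spark session

# Poisson NMF via multiplicative updates, written in DML (R-like syntax).
pnmf = """
X = X + 1e-15                                    # avoid division by zero
W = rand(rows=nrow(X), cols=rank, min=0, max=1)  # factor matrices start random
H = rand(rows=rank, cols=ncol(X), min=0, max=1)
for (i in 1:max_iter) {
  H = H * (t(W) %*% (X / (W %*% H))) / t(colSums(W))
  W = W * ((X / (W %*% H)) %*% t(H)) / t(rowSums(H))
}
"""

script = dml(pnmf).input(X=df, rank=10, max_iter=50).output("W", "H")
W, H = ml.execute(script).get("W", "H")  # factor matrices, e.g. for plotting
```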

Included Components

Get familiar with Apache SystemML and Apache Spark while using a Jupyter Notebook.

Featured Technologies

Jupyter Notebooks: An open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

Apache SystemML: A machine learning platform optimized for big data.

Apache Spark: A fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing.

Links

https://spark.apache.org/
https://www.ibm.com/analytics/us/en/technology/spark/
http://researcher.watson.ibm.com/researcher/view_group.php?id=3174
http://systemml.apache.org/

Blog by Madison J. Myers

0 to Life-Changing Application with SystemML

A “life-changing app”? You may be asking yourself: who is this person, and how are they so sure they are going to change lives?

Well, let me introduce myself.

Before joining the Spark Technology Center as an intern working with SystemML, I was a student and a researcher and a restaurant manager and an undergraduate admissions ambassador and a barista and...the list goes on, but my passion has always been the social sciences and social good. I studied global politics and philosophy as an undergraduate at NYU, then went on to study foreign policy, focusing on South Asia, for my master’s degree at King’s College London. Jumping forward a few years, several countries, and several jobs, I spontaneously moved out to San Francisco to see what all the buzz was about. I worked as a journalist and as a health researcher, but I wanted something to really dig my teeth into. That’s when I discovered data science. Though I have no computer science background and am driven only by my thirst for knowledge, I have jumped headfirst into the world of data, programming, and machine learning as a UC Berkeley data science grad student.

That brings us back to now where IBM’s STC has given me the assignment of my dreams: learn SystemML from scratch, brainstorm a real-world problem, help build an application using SystemML, then sit back and see lives being changed. Well, that’s the plan anyway.

As you can guess, this experience of learning SystemML from scratch and then building an application with it will be interesting at the very least. That’s why I am going to blog about every step along the way. This way, we can build our SystemML applications together, and I can spare you some of the troubleshooting along the way.

Why SystemML?

At UC Berkeley, we’re taught R and Python, and SystemML’s scripting languages (DML and PyDML) look just like them. Being new to computer science and wanting to jump straight into the data doesn’t allow me much time to hack into Spark and figure out how to write high-level math with big data. With SystemML, you can write the math no matter how big the data is! Because I can access algorithms from files, it’s easier to go from formulas and R code to big data problems. Now let’s get to my first dive into SystemML, where I’ll focus on overcoming assumptions.
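To make that concrete, here is a tiny hypothetical example. The one-line script below is plain R-looking matrix math, and SystemML decides whether to execute it in local memory or as distributed Spark jobs depending on the size of X. The names and values are just for illustration, and it assumes the systemml package and an existing SparkSession named spark.

```python
# A toy illustration: the same DML expression runs whether X is a small
# NumPy array (as here) or a huge Spark DataFrame; SystemML picks the plan.
import numpy as np
from systemml import MLContext, dml

ml = MLContext(spark)  # assumes an existing SparkSession named `spark`

script = dml("s = sum(X %*% t(X))").input(X=np.random.rand(100, 10)).output("s")
print(ml.execute(script).get("s"))
```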

While I may still be very new to the tech world and all of its wonderful tutorials, one issue I have consistently noticed is the long list of assumptions made in any step-by-step guide, particularly when it comes to setting up your environment. Many developers, data scientists, and researchers are so advanced that they have forgotten what it’s like to be new! When writing tutorials, they assume that everything is set up and ready to go, but that’s not always the case. No need to worry with SystemML: I am here to help. Below is my very own step-by-step guide to running SystemML in a Jupyter Notebook (with little to no assumptions).