Apache Zeppelin

Apache Zeppelin is a web-based notebook that gives end users an easy-to-use browser interface for data analytics, without requiring them to manage the hardware and cluster setup that hosts the processing engines and databases. Users write code in the notebook and see the results in real time.

What is a Data Science Notebook?

Just as we have Word documents for text, spreadsheets for working with numbers and PowerPoint for presentations that combine text, data and graphics, a data science notebook is an application for visualizing data and performing analytics.[1] Most data analytics work is an iterative process: we query the database for some information, formulate another query based on what we get, and repeat until we find what we were looking for.[2] An interactive notebook is well suited to this kind of work. A notebook not only lets us interact easily with very large databases, it also lets us visualize data as graphs, charts and other visual aids that help in research and decision making. A notebook can be written to and saved just like a regular document, and its browser-based interface makes it highly portable.

History

Apache Zeppelin started as a commercial data analysis product in December 2012. It was conceived and developed by Moon Soo Lee, the CTO of NF Labs. It was made open source in October 2013, after which it was heavily adopted by data scientists and businesses. At the end of 2014 it was donated to the Apache Software Foundation and became an Apache Incubator project. Its first release under the Apache license (version 0.5.0) was in July 2015. The latest release as of September 2016 (version 0.6.1) was on 15 August 2016.[3]

Developer Community and Codebase

Zeppelin has a growing community of developers across the globe, and its source code is hosted in a GitHub repository. As of September 2016, it had 148 contributors and 1,915 stars on GitHub. It is available for download as a binary package that contains the application, a web server, library files and utility scripts. The intention behind donating the project to Apache was to gain more contributors and to bring the project into a well-defined and transparent development environment.[4] One of the major factors behind its widespread adoption by users and developers is its backend-independent design, which lets it run on top of not only Apache Spark but almost any other data processing engine.


Collaboration

Zeppelin is well suited to collaboration. Because it has a web-based interface, a notebook's URL can be shared among multiple users working on the same project, and changes are reflected in real time, much like Google Docs.[5] It can also share a URL for only the results section of a document. This lets you share results with people who should see the output but should not have access to edit the code.

Backend Independent Design and Support for Multiple Processing Engines

This has been the major strength of the tool and a key reason for its widespread adoption by the user and developer community. It is made possible by the Zeppelin interpreter, a plugin that enables Zeppelin to use a particular language or backend processing engine. The available interpreters include Scala with Apache Spark, Python with Spark, Shell, JDBC and SparkSQL. There are community-managed interpreters as well as third-party interpreters. By default, Zeppelin comes with built-in support for Apache Spark, which works out of the box without any plugins or additional modules.
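In a notebook, a paragraph usually names its interpreter with a `%name` directive on its first line, for example `%sh` for the shell interpreter. The sketch below only illustrates that idea and is not Zeppelin's own dispatch code: the function name `interpreter_for` is hypothetical, and the fallback to `spark` when no directive is present is an assumption made for the example.

```shell
# Hypothetical sketch: read the %directive on the first line of a paragraph
# and print the interpreter name it selects.
# $1 = paragraph text, $2 = fallback interpreter (assumed default: spark)
interpreter_for() {
  local first_line
  first_line="$(printf '%s\n' "$1" | head -n 1)"
  case "$first_line" in
    %*)
      # Strip the leading '%' and anything after the first space.
      first_line="${first_line#"%"}"
      echo "${first_line%% *}"
      ;;
    *)
      echo "${2:-spark}"
      ;;
  esac
}

interpreter_for '%sh
ls -l'
# prints: sh
```

A paragraph with no directive, such as `interpreter_for 'val x = 1'`, falls through to the fallback name.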


Configuration of Zeppelin

This section walks through installing Apache Zeppelin by building it from source. The following software must be installed beforehand.

  1. Git
  2. Apache Maven
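Before cloning, it can help to confirm that both tools are actually on the `PATH`. The following is a minimal sketch; the helper name `missing_tools` is hypothetical, and it assumes the tools are invoked as `git` and `mvn`.

```shell
# Print the name of each required tool that cannot be found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    # `command -v` succeeds only when the shell can resolve the tool.
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}

missing="$(missing_tools git mvn)"
if [ -n "$missing" ]; then
  # Left unquoted on purpose so several names print on one line.
  echo "Missing prerequisites:" $missing
else
  echo "All prerequisites found."
fi
```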

First, clone the source repository onto the system.

git clone https://github.com/apache/zeppelin.git

Build Zeppelin[6]

Next, build Apache Zeppelin using Maven:

mvn clean package -DskipTests [Options]

Each Zeppelin interpreter requires different build options. An example for the Spark interpreter is given below:

mvn clean package -Pspark-1.6
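Because the build options are passed as Maven profiles (`-P...`), the command line can be assembled from a list of profile names. The sketch below is illustrative only: the function name `zeppelin_build_cmd` is hypothetical, it combines `-DskipTests` from the general command with the profile from the Spark example, and `spark-1.6` is the only profile name taken from the text. The function prints the command rather than running it.

```shell
# Build a Maven command line from a list of profile names.
# Only "spark-1.6" comes from the article; any other profile is an assumption.
zeppelin_build_cmd() {
  local cmd="mvn clean package -DskipTests"
  local profile
  for profile in "$@"; do
    cmd="$cmd -P$profile"
  done
  echo "$cmd"
}

zeppelin_build_cmd spark-1.6
# prints: mvn clean package -DskipTests -Pspark-1.6
```

Printing the command first makes it easy to review before pasting it into the shell.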

Starting Zeppelin

Start Zeppelin from the bash shell with:

./bin/zeppelin-daemon.sh start

To stop Zeppelin, type the following command into the bash shell:

./bin/zeppelin-daemon.sh stop

After running the start command, open http://localhost:8080 in a browser (Zeppelin listens on port 8080 by default). This opens the Zeppelin notebook interface.
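The server may take a few seconds to come up after the start command returns, so a script may want to wait until the port answers. The following is a minimal sketch that assumes `curl` is installed; the helper name `wait_for_url` and the one-second polling interval are choices made for this example.

```shell
# Hypothetical helper: poll a URL until it answers or the attempts run out.
# $1 = URL to poll, $2 = maximum number of attempts (default 30)
wait_for_url() {
  local url="$1"
  local attempts="${2:-30}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -s silences progress output, -f makes curl fail on HTTP errors.
    if curl -sf -o /dev/null "$url"; then
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  return 1
}
```

For example, `wait_for_url http://localhost:8080 30 && echo "Zeppelin is reachable"` prints the message only once the server responds.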

References

  1. Osipov, Dan. "The Rise of Data Science Notebooks". https://www.datanami.com/. Retrieved 15 September 2016.
  2. Hellerstein, Joseph. "Interactive Data Analysis: The Control Project" (PDF). http://control.cs.berkeley.edu/. IEEE. Retrieved 15 September 2016.
  3. "Apache Wiki page". https://wiki.apache.org/. Retrieved 15 September 2016.
  4. "Apache Wiki page". https://wiki.apache.org/. Retrieved 15 September 2016.
  5. https://zeppelin.apache.org https://zeppelin.apache.org/docs/0.6.1/. Retrieved 15 September 2016.
  6. Zeppelin, Apache. "Apache Zeppelin Official Website".