StateOfJupyter 2017


The state of Jupyter

How Project Jupyter got here and where we are headed.

By Fernando Pérez, Brian Granger — January 26, 2017

In this post, we’ll look at Project Jupyter and answer three questions:

  1. Why does the project exist? That is, what are our motivations, goals, and vision?
  2. How did we get here?
  3. Where are things headed next, in terms of both Jupyter itself and the context of data and computation it exists in?

Project Jupyter aims to create an ecosystem of open source tools for interactive computation and data analysis, where the direct participation of humans in the computational loop—executing code to understand a problem and iteratively refine their approach—is the primary consideration.

Anchoring Jupyter around humans is key to the project; it helps us both narrow our scope in some directions (e.g., we are not building generic frameworks for graphical user interfaces) and generalize in others (e.g., our tools are language agnostic despite our team’s strong Python heritage). In service of this goal, we:

  1. Explore ideas and develop open standards that try to capture the essence of what humans do when using the computer as a companion to reasoning about data, models, or algorithms. This is what the Jupyter messaging protocol (https://jupyter-client.readthedocs.io) and the Notebook format (https://nbformat.readthedocs.io) provide for their respective problems, for example.
  2. Build libraries that support the development of an ecosystem, where tools interoperate cleanly without everyone having to reinvent the most basic building blocks. Examples include tools for creating new Jupyter kernels, the components that execute the user’s code (https://jupyter-client.readthedocs.io/en/latest/wrapperkernels.html), and for converting Jupyter notebooks to a variety of formats with nbconvert (https://nbconvert.readthedocs.io); a short sketch of these libraries in use follows this list.
  3. Develop end-user applications that apply these ideas to common workflows that recur in research, education, and industry. This includes tools ranging from the now-venerable IPython command-line shell (https://ipython.org), which continues to evolve and improve, and our widely used Jupyter Notebook (https://jupyter-notebook.readthedocs.io), to new tools like JupyterHub (https://jupyterhub.readthedocs.io) for organizations and our next-generation, modular and extensible JupyterLab interface (https://github.com/jupyterlab/jupyterlab). We strive to build highly usable, very high-quality applications, but we focus on specific usage patterns: for example, the architecture of JupyterLab is optimized for a web-first approach, while other projects in our ecosystem target desktop usage, like the open source nteract client (https://nteract.io) or the support for Jupyter Notebooks in the commercial PyCharm IDE.
  4. Host a few services that facilitate the adoption and usage of Jupyter tools. Examples include NBViewer, our online notebook sharing system, and the free demonstration service try.jupyter.org. These services are themselves fully open source, enabling others either to deploy them in custom environments or to build new technology based on them, such as mybinder.org (https://mybinder.org/), which provides single-click hosted deployment of GitHub repositories with custom code, data, and notebooks.
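
Since the post is language agnostic in spirit but our own stack is Python, here is a minimal sketch (ours, not part of the original text) of two of these ecosystem libraries in use: nbformat to read a notebook document and nbconvert to render it as HTML. The file name is hypothetical, and exact API details may vary between library versions.

    # Minimal sketch: read a notebook with nbformat, convert it with nbconvert.
    import nbformat
    from nbconvert import HTMLExporter

    # Read a notebook file as a version-4 document (the current format).
    nb = nbformat.read("analysis.ipynb", as_version=4)

    # A notebook is just a list of typed cells.
    code_cells = [c for c in nb.cells if c.cell_type == "code"]
    print("{} code cells out of {} total".format(len(code_cells), len(nb.cells)))

    # Convert the same document to a standalone HTML page.
    exporter = HTMLExporter()
    body, resources = exporter.from_notebook_node(nb)
    with open("analysis.html", "w", encoding="utf-8") as f:
        f.write(body)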

Some cairns along the trail

This is not a detailed historical retrospective; instead, we’ll highlight a few milestones along the way that signal the arrival of important ideas that continue to be relevant today.

Interactive Python and the SciPy ecosystem. Jupyter evolved from the IPython project (https://ipython.org), which focused on interactive computing in Python, tuned to the needs and workflow of scientific computing. From its start in 2001, IPython combined an ethical commitment to building an open source project (so research could be shared without barriers) with a recognition that the features of Python could make it a challenger to the proprietary powerhouses common in science at the time. This meant that IPython grew in tandem with the scientific Python ecosystem, providing the “doorway” to the use of NumPy, SciPy, Matplotlib, pandas, and the rest of this powerful stack. From the start we found a good division of labor, where IPython could focus on the human-in-the-loop problems while other projects provided data structures, algorithms, visualization, and more. The various projects share code freely under a common licensing structure, enabling each to grow its own team while providing tools that, together, create a powerful system for end users.

Open protocols and formats for the IPython Notebook. Around 2010, after multiple experiments in building a notebook for IPython, we took the first steps toward the architecture we have today. We wanted a design that kept the “IPython experience,” meaning that all the features and workflow of the terminal were preserved, but one that operated over a network protocol so that a client could connect to a server providing the computation, regardless of where each was located. Using the ZeroMQ networking library, we defined a protocol that captured all the actions we were familiar with in IPython, from executing code to tab-completing an object’s name (an introspection action). This led, in a little over a year, to the creation of a graphical client (the still-used Qt Console, https://qtconsole.readthedocs.io) and the first iteration of today’s Jupyter Notebook (then named the IPython Notebook), released in the summer of 2011 (more details about this process can be found in this blog post: http://blog.fperez.org/2012/01/ipython-notebook-historical.html).
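
To make the client/kernel decoupling concrete, here is a minimal sketch (ours, not from the original text) using the jupyter_client library, which implements the messaging protocol described above: it starts a local Python kernel, sends it code over the protocol, and reads back the reply. Timeouts and exact method names can vary slightly across jupyter_client versions.

    # Minimal sketch: talk to a kernel over the Jupyter messaging protocol.
    from jupyter_client import KernelManager

    km = KernelManager(kernel_name="python3")
    km.start_kernel()                  # launch the kernel process

    kc = km.client()
    kc.start_channels()                # open the ZeroMQ channels to the kernel

    # Send an execute_request and wait for the reply on the shell channel;
    # any output arrives separately on the iopub channel.
    msg_id = kc.execute("1 + 1")
    reply = kc.get_shell_msg(timeout=10)
    print(reply["content"]["status"])  # 'ok' if execution succeeded

    kc.stop_channels()
    km.shutdown_kernel()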

From IPython to Jupyter. The IPython Notebook was rapidly adopted by the SciPy community, but it was immediately clear that the underlying architecture could be used with any programming language that could be used interactively. In rapid succession, kernels for languages other than Python (Julia, Haskell, R, and many more: https://github.com/jupyter/jupyter/wiki/Jupyter-kernels) were created; we had a hand in some, but most were independently developed by users of those languages. This cross-language usage forced us to carefully validate our architecture to remove any accidental dependencies on IPython, and in 2014 led us to rename most of the project to Jupyter. The name is inspired by Julia, Python, and R (the three open languages of data science) but represents the general ideas that go beyond any specific language: computation, data, and the human activities of understanding, sharing, and collaborating.

The view from today’s vantage point

The ideas that have taken Jupyter this far are woven into a larger fabric of computation and data science that we expect to have significant impact in the future. The following are six trends we are seeing in the Jupyter ecosystem:

  1. Interactive computing as a real thing. Data-oriented computing has exposed a much larger group of practitioners to the idea of interactive computing. Folks in the scientific computing community have long been familiar with this human-in-the-loop computing through programs like Matlab, IDL, and Mathematica. However, when we first started working on IPython in the early 2000s, this workflow was mostly foreign to developers in the traditional software engineering world. Languages such as Python and Ruby offered interactive shells, but they were limited in features and meant for lightweight experimentation rather than serving as first-class working environments. When the first version of IPython was created in 2001, it was an attempt to make interactive computing with Python pleasant for those who did it full time. Tools such as Jupyter, RStudio, Zeppelin, and Databricks have pushed this further with web-based interactive computing. As a result, millions of statisticians, data scientists, data engineers, and artificial intelligence/machine learning practitioners are doing interactive computing on a daily basis. Traditional integrated development environments (IDEs) are being replaced by interactive computing environments; Jupyter/JupyterLab and RStudio are preeminent examples of this trend. This is accompanied by the formalization, identification, and development of building blocks for interactive computing: kernels (processes in which to run code), network protocols (a formal message specification for sending code to kernels and getting back results), user interfaces (that provide a human interface to the kernels), MIME-based outputs (representations of results of any type beyond simple text), and so on.
  2. Widespread creation of computational narratives. Live code, narrative text, and visualizations are all being integrated together into documents that tell stories using code and data. These computational narratives are being leveraged to produce and share technical content across a wide range of audiences and contexts in books, blog posts, peer-reviewed academic publications, data-driven journalism, etc. Document formats such as the Jupyter Notebook and R Markdown are encoding these computational narratives into units that are sharable and reproducible. However, the practice of computational narratives is spreading far beyond these open formats to many interactive computational platforms.
  3. Programming for specific insight rather than generalization. The overarching goal of computer science is generalization and abstraction, and software engineering focuses on the design of libraries and applications that can be reused for multiple problems. With the rise of interactive computing as a practice and the capture of this process into computational narratives—what we refer to as Literate Computing (http://blog.fperez.org/2013/04/literate-computing-and-computational.html)—we now have a new population who uses programming languages and development tools with a different purpose. They explore data, models, and algorithms, often with very high specificity, perhaps even spending vast effort on a single data set, but ask complex questions and extract insights that can then be shared, published, and extended. Since data is pervasive across disciplines, this represents a dramatic expansion of the audience for programming languages and tools, but this audience’s needs and interests are different from those of “traditional” software engineers.
  4. Individuals and organizations embracing multiple languages. When working with data, many individuals and organizations recognize the benefits of leveraging the strengths of multiple programming languages. It is not uncommon to see the usage of Python, R, Java, and Scala in a single data-focused research group or company. This pushes everyone to develop and build protocols (Jupyter message specification), file formats (Jupyter Notebook, Feather, Parquet, Markdown, SQL, JSON), and user interfaces (Jupyter, nteract) that can work in a unified manner across languages and maximize interoperability and collaboration.
  5. Open standards for interactive computing. A decade ago, the focus was on creating open standards for the internet, such as HTML, HTTP, and their surrounding machinery. Today, we are seeing the same type of standards developed for interactive, data-oriented computing. The Jupyter Notebook format is a formal specification of a JSON document format for computational narratives (a minimal example follows this list). Markdown is a standard for narrative text (albeit a slippery one). The Jupyter message specification is an open standard that allows any interactive computing client to talk to any language kernel. Vega and Vega-Lite are JSON schemas for interactive visualizations. These open standards enable a wide range of tools and languages to work together seamlessly.
  6. Sharing data with meaning. Open data initiatives by governments and organizations provide rich sources of data for people and organizations to explore, reproduce experiments and studies, and create services for others. But data only comes alive with the right tools: Jupyter, nteract, RStudio, Zeppelin, etc., allow users to explore these data sets and share their results, humanizing the process of data analysis, supporting collaboration, and surfacing meaning from the data with narrative and visualization.
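
As a concrete illustration of the open notebook standard mentioned in point 5 (this sketch is ours, not part of the original text), the following builds a tiny computational narrative programmatically with nbformat and shows that the saved document is plain JSON; the cell contents and file name are made up for the example.

    # Minimal sketch: a notebook is a JSON document made of typed cells.
    import json
    from nbformat import v4, write

    nb = v4.new_notebook()
    nb.cells = [
        v4.new_markdown_cell("# A tiny computational narrative"),
        v4.new_code_cell("x = [1, 2, 3]\nsum(x)"),
    ]

    # Persist it as a standard .ipynb file...
    with open("narrative.ipynb", "w", encoding="utf-8") as f:
        write(nb, f)

    # ...which any tool that speaks the open format can read back as JSON.
    with open("narrative.ipynb", encoding="utf-8") as f:
        doc = json.load(f)
    print(doc["nbformat"], [c["cell_type"] for c in doc["cells"]])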

The question, then, is: do all of these trends sketch a larger pattern? We think they all point to code, data, and UIs for computing being optimized for human interaction and comprehension.

In the past, humans had to bend over backward to satisfy the various constraints of computers (networks, memory, CPU, disk space, etc.). Today, these prior constraints have been sufficiently relaxed that we can enjoy high-level languages (Python, R, Julia) and rich, network-aware interfaces (web browsers and JavaScript frameworks). We can build incredibly powerful distributed systems with well-designed browser-based user interfaces that let us access computational resources and data regardless of their geographical location. We can now start optimizing for our most important resource: human time.

The relaxation of these prior constraints didn’t magically trigger the creation of human-oriented computing systems, but it opened the door. The real impetus was probably the explosion of data across every imaginable organization and activity. That created a deep need for humans to interact with code and data in a more significant and meaningful way. Without that, Jupyter would still exist, but it would likely be focused on the much smaller academic scientific computing community.


Organizations need to start focusing on humans as they develop their data strategies. The big success of Jupyter in organizations hasn’t come from the top-level managers making purchasing decisions. It has come from the individual developers and data scientists who have to spend their days wrangling code and data. In the coming years, the tools and systems that put humans front and center, prioritizing design and usability as much as raw performance, will be the ones actually used and widely adopted. We built Jupyter the way it is because we wanted to use it ourselves, and we remain committed to these ideas as we move forward.



Acknowledgments

In this space, we can’t do justice to the many individuals who have made Jupyter possible, but we want to collectively thank all of you: users, developers, and participants in the online forums and events of the many communities we interact with. This project exists first and foremost to serve a world of openly shared ideas, tools, and materials, whether you are a high school teacher, a musicologist, a cancer researcher, or a developer building data science tools for your company. From our long-term developers to those bringing our tools to your new colleagues, thanks for engaging with the project.

Our work on Jupyter would be impossible without the funding agencies that have generously supported us: the Alfred P. Sloan Foundation, the Gordon and Betty Moore Foundation, the Helmsley Charitable Trust, and the Simons Foundation. Finally, we’d like to thank the industry partners that contribute funds, resources, and development effort to the project: Bloomberg, Continuum Analytics, Enthought, Google, IBM, MaxPoint Interactive, Microsoft, Netflix, and Rackspace.

We would like to thank Jamie Whitacre and Lisa Mann for valuable contributions to this post.

Fernando Pérez

Fernando Pérez is a staff scientist at Lawrence Berkeley National Laboratory and a founding investigator of the Berkeley Institute for Data Science at UC Berkeley, created in 2013. He received a PhD in particle physics from the University of Colorado at Boulder, followed by postdoctoral research in applied mathematics, developing numerical algorithms. Today, his research focuses on creating tools for modern computational research and data science across domain disciplines, with an emphasis on high-level languages, interactive and literate...
 
Brian Granger

Brian Granger is an Assistant Professor of Physics at Cal Poly State University in San Luis Obispo, CA. He has a background in theoretical atomic, molecular, and optical physics, with a Ph.D. from the University of Colorado. His current research interests include quantum computing, parallel and distributed computing, and interactive computing environments for scientific and technical computing. He is a core developer of the IPython project and is an active contributor to a number of other open source projects focused on scientific computing in Py...