If you’re in the data science or machine learning field, you have heard of and, probably, used Jupyter notebooks. People love to love them, people love to hate them.
Jupyter notebooks can be magnificent tools for data analysis, data story building and model testing. If you’ve set them up correctly, they can even serve as a development platform (see Netflix). There are also legitimate reasons to be concerned about Jupyter notebooks.
If you take a look at my personal GitHub repositories, almost 90% of them are Jupyter notebooks. I love to use notebooks for all the posts I do here on my blog because they make story-telling and data tutorials easy to convey. If you could look at some of my work repos, you would find either a combination of Jupyter notebooks and code files or just code files. It depends on the team I am working for. Am I in a data science team? Software team? Product team? I’ve been part of at least these three types of teams, and each team has had its preferences when it comes to dealing with Jupyter notebooks, reproducibility, what files go to prod, etc.
Hence, if I were to boil down the reason to choose or not to choose Jupyter notebooks, it would come down to what tech environment you belong to and what you are using them for.
I find that software engineering teams have more rigid demands for file handling and how development should be done. There is this idea, or so has been my experience, that notebooks shouldn’t be used at all for development, much less to put code in production. They’re seen as messy, disorganized and hard to maintain. “Proper” coding etiquette supposedly can’t be followed in notebooks.
Fully data science or product-oriented teams can be more relaxed as to what tools pass the purity test. The goals of a software engineering team vs. a data science team vs. a product team are vastly different. A software engineering team may be mostly concerned about efficiency while a data science team may be mostly concerned about data analysis. From a data science team standpoint, if you’re conducting a data analysis, you probably would benefit from the flexibility and transparency of the Jupyter notebook. If you’re in a software engineering team this might not be so.
Are any of these teams right or wrong? Yes and no.
yay vs. nay
From a software engineering perspective, you can, in theory, use Jupyter notebooks as a development platform. I do not think they would be the best tool for the job. There are much better alternatives for production code from a stability standpoint, especially if Python is not the only tool involved in the product. If you are going to write production code, you really should use a proper IDE instead.
I do think we should follow proper coding etiquette even while working inside notebooks. It doesn’t matter what IDE or editor we use, we should always use proper coding etiquette. Notebooks can get big and out of control, and that can also cause reproducibility issues. Let’s elaborate on these points a little bit more.
A common problem with Jupyter notebooks is, surprisingly, the lack of documentation. Jupyter makes it incredibly easy to build nice, organized projects, but often we use these notebooks as literal scratch pads, making it hard to know what we were measuring in the first place. Jupyter’s markdown cells let you prettify and organize your notebook with little effort, so the documentation issue truly is avoidable.
Another common problem is the size of these notebooks. Just like a code file, a given Jupyter notebook should have one purpose and one purpose only, not five. You shouldn’t do your data cleaning, analysis, modeling, validation, etc. in one single notebook. Each of these tasks should be its own process. Modularization can certainly help with the notebook size problem. There’s no reason why you shouldn’t be able to modularize your code in a notebook.
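To make the modularization point concrete, here is a minimal sketch of how I would pull helper functions out of a notebook. The module name `cleaning.py`, the function names and the sample data are all made up for illustration; the idea is only that the notebook cell ends up calling small, named functions instead of holding all the logic inline.

```python
# Contents of a hypothetical helper module, e.g. cleaning.py,
# kept next to the notebook (the file name is an assumption).
from statistics import mean


def drop_missing(rows, key):
    """Return only the rows that have a non-None value for `key`."""
    return [row for row in rows if row.get(key) is not None]


def average(rows, key):
    """Average the values stored under `key` across all rows."""
    return mean(row[key] for row in rows)


# In the notebook, a cell would then shrink to something like:
#   from cleaning import drop_missing, average
# Made-up sample data, only here so the sketch runs on its own.
raw = [{"price": 10.0}, {"price": None}, {"price": 14.0}]
clean = drop_missing(raw, "price")
print(average(clean, "price"))
```

A side benefit is that functions living in a plain `.py` file can be reused by several single-purpose notebooks and tested like any other code.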
Oftentimes, it can be hard to replicate a Jupyter notebook, mainly because there is no indication of the code’s dependencies. I fault the developer, not the Jupyter notebook. In fact, most of the issues I’ve described here are not inherent to the notebooks themselves, but to the developers and their bad programming habits.
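One low-effort habit that helps here is to record the environment in the notebook’s first cell. The snippet below is just a sketch under my own assumptions: the helper name `describe_environment` is hypothetical, and the package list is a placeholder for whatever your notebook actually imports. It relies only on the standard library’s `importlib.metadata` (Python 3.8+).

```python
import sys
from importlib import metadata


def describe_environment(packages):
    """Map each package name to its installed version, plus the Python version."""
    versions = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Flag missing packages instead of failing, so the cell always runs.
            versions[name] = "not installed"
    return versions


# Placeholder list: replace with the packages your notebook imports.
for name, version in describe_environment(["numpy", "pandas"]).items():
    print(f"{name}=={version}")
```

This is no substitute for a proper requirements or environment file, but it leaves the next reader at least a hint of what the notebook was run against.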
With that said, I think notebooks truly shine when it comes to prototyping. It is incredibly useful to be able to document your work and see the output of each line of code as you go. I love using them to play with new concepts, create visualizations, or write markdown reports. These notebook prototypes eventually become production code.
Notebooks are also really practical for situations where you want to work around the code’s output, not the code itself. This is the bread and butter of data science. Notebooks are also great training tools. They offer an excellent way to mix commands, their output, and explanations into a single document with little effort.
This, I think, is the true purpose of Jupyter notebooks and how we should use them.
In a nutshell, if you’re prototyping, 100% take advantage of Jupyter notebooks while still following coding standards. However, if your goal is to have production code, stick to a true IDE/editor.