
How to use Jupyter with Meltano

As of September 2022, Jupyter is not yet available as a utility on Meltano Hub, so you need to add it to your Meltano project as a (local) custom plugin. Even once it is available as a plugin on the Hub, you will probably want to do some customization, so it makes sense to create your own version.

The steps involved are:

  1. Add a (local) custom jupyter utility
  2. Add any Python libraries you will need
  3. Optional: Add database connection variables to be exposed into the environment
  4. Add papermill or nbconvert to execute your data transformations & customize commands for Jupyter to run the notebook
  5. Execute your notebooks & add them to a schedule

Steps 1-4 customize the meltano.yml to suit your setup. If you're comfortable with the process, skim them and go straight to step 5 to copy & customize the complete YAML block from there, and you're ready to go.

1. Add a (local) custom jupyter utility

You can add a custom plugin either via the CLI or via the YAML file. Since Jupyter serves multiple purposes, "utility" is the recommended plugin type.

For Jupyter, you can choose either the "classic notebook", installed via the pip package jupyter, or the newer JupyterLab, installed via the pip package jupyterlab.

Note: The code snippets use jupyterlab; if you want the classic notebook, just replace jupyterlab with jupyter. The "executable" stays the same.

Via the YAML file: add the following code block to your meltano.yml (drop the launch_ip0 command if you don't need it):

plugins:
  utilities: # meltano invoke jupyterlab:launch will start up the lab...
  - name: jupyterlab
    namespace: jupyterlab
    pip_url: jupyterlab
    executable: jupyter
    commands:
      launch_ip0: # important for Mac users running Meltano inside Docker
        args: lab --ip=0.0.0.0
        description: Start lab server on any IP range, for Mac users inside Docker.
      launch:
        args: lab
        description: Start lab server

Run meltano install to ensure your YAML file is correct. Then run meltano invoke jupyterlab:launch to launch the GUI.

Using the command line, you can also run meltano add --custom utility jupyterlab and interactively fill out these properties.

2. Add any Python libraries you will need

To work with Jupyter notebooks, you will end up using additional Python libraries, which generally fall into three categories:

  1. helper libraries like matplotlib or pandas
  2. connection libraries like SQLAlchemy (and psycopg2)
  3. nbconvert or papermill to execute the notebook; these are handled in step 4.

To add additional Python libraries, extend the meltano.yml definition with space-separated pip package names after jupyterlab in the pip_url. An example:

plugins:
  utilities: # meltano invoke jupyterlab:launch will start up the lab...
  - name: jupyterlab
    namespace: jupyterlab
    pip_url: jupyterlab pandas matplotlib sqlalchemy psycopg2-binary
    executable: jupyter
    commands:
      launch_ip0: # important for Mac users running Meltano inside Docker
        args: lab --ip=0.0.0.0
        description: Start lab server on any IP range, for Mac users inside Docker.
      launch:
        args: lab
        description: Start lab server

3. Optional: Expose database connection variables into the environment

To connect to data sources across different plugins, it is useful to expose the connection details as environment variables. Meltano supports this in the meltano.yml. Here is an example configuration using plain-text connection details:

version: 1
default_environment: dev
environments:
- name: dev
  config:
    env:
      PG_HOST: postgres
      PG_PORT: "5432"
      PG_DB: demo
      PG_USER: admin
      PG_PWD: password

You can read more on Meltano and environment variables in the Meltano documentation. These variables will then be accessible inside your Jupyter notebooks, e.g.:

import os

PG_HOST = os.getenv("PG_HOST", default=None)
PG_PORT = os.getenv("PG_PORT", default=None)
PG_DB = os.getenv("PG_DB", default=None)
PG_USER = os.getenv("PG_USER", default=None)
PG_PWD = os.getenv("PG_PWD", default=None)
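
Putting these variables to work with the SQLAlchemy and psycopg2 packages from step 2, opening a connection looks roughly like this. A minimal sketch, assuming a Postgres database; demo_table is a hypothetical table name:

import os

import pandas as pd
from sqlalchemy import create_engine

# Build a Postgres connection URL from the variables Meltano injected (step 3).
url = (
    f"postgresql+psycopg2://{os.getenv('PG_USER')}:{os.getenv('PG_PWD')}"
    f"@{os.getenv('PG_HOST')}:{os.getenv('PG_PORT')}/{os.getenv('PG_DB')}"
)
engine = create_engine(url)

# "demo_table" is a hypothetical table name; replace it with your own.
df = pd.read_sql("SELECT * FROM demo_table LIMIT 10", engine)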

4. Execute notebooks via nbconvert or papermill

Whether your notebook outputs machine learning accuracy metrics or a business intelligence report, or transforms data inside your data warehouse, you will likely want to execute it without the GUI, using either:

  • the CLI
  • or automatically via the Meltano scheduler.

Jupyter and JupyterLab both ship with nbconvert as the default option to execute notebooks. You can also use the additional Python package papermill. Papermill lets you parametrize notebooks for execution, while nbconvert only executes them.
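
Outside of Meltano, the two invocations look like this; a quick sketch using the same notebook path, output path, and parameter as the Meltano commands below:

# nbconvert executes the notebook and writes an executed copy of it
jupyter nbconvert --to notebook --execute notebook/sql_magic.ipynb

# papermill executes the notebook with a parameter override and writes the result to a new file
papermill notebook/sql_magic.ipynb output/output.ipynb -p price_1 1000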

If you want to use nbconvert, add an execute command to the plugin, replacing notebook/sql_magic.ipynb with your notebook path:

- name: jupyterlab
  namespace: jupyterlab
  pip_url: jupyterlab pandas matplotlib sqlalchemy psycopg2-binary papermill
  executable: jupyter
  commands:
    launch_ip0:
      args: lab --ip=0.0.0.0
      description: Start lab server on any IP range, for Mac users inside Docker.
    launch:
      args: lab
      description: Start lab server
    execute:
      args: nbconvert --to notebook --execute notebook/sql_magic.ipynb
      description: Execute the notebook with nbconvert

You can then execute it with meltano invoke jupyterlab:execute.

If you want to use papermill, the easiest option is plugin inheritance, which reuses the venv created for the parent plugin. That way, you will not need to install jupyterlab and all its dependencies twice. Here's an example YAML block:

- name: papermill
  inherit_from: jupyterlab
  executable: papermill
  commands:
    execute:
      args: notebook/sql_magic.ipynb output/output.ipynb -p price_1 1000
      description: Execute the notebook with papermill, overriding the price_1 parameter

You will need to adapt the "args" to your notebook path, output path, and parameters. The example notebook here has one cell with a defined parameter price_1, which we are able to override from the outside. For details, refer to the (pleasantly short) papermill documentation; it's a simple process.
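
For reference, papermill picks up defaults from a notebook cell tagged "parameters" and injects your -p overrides in a new cell directly after it. A minimal sketch of such a cell; the default value here is an assumption, not the example notebook's actual content:

# Cell tagged "parameters" (in JupyterLab, add the tag via the cell's property inspector).
# papermill injects the -p overrides in a new cell right after this one.
price_1 = 100  # hypothetical default; -p price_1 1000 overrides it at run time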

5. Execute & schedule your notebooks

Putting it all together, you will end up with a meltano.yml like this:

plugins:
  utilities: # meltano invoke jupyterlab:launch will start up the lab...
  - name: jupyterlab
    namespace: jupyterlab
    pip_url: jupyterlab pandas matplotlib sqlalchemy psycopg2-binary papermill
    executable: jupyter
    commands:
      launch_ip0:
        args: lab --ip=0.0.0.0
        description: Start lab server on any IP range, for Mac users inside Docker.
      launch:
        args: lab
        description: Start lab server
      execute:
        args: nbconvert --to notebook --execute notebook/sql_magic.ipynb
        description: Execute the notebook with nbconvert

  - name: papermill
    inherit_from: jupyterlab
    executable: papermill
    commands:
      execute:
        args: notebook/sql_magic.ipynb output/output.ipynb -p price_1 1000
        description: Execute the notebook with papermill, overriding the price_1 parameter

Then run meltano invoke papermill:execute to execute your notebook, or include it in a Meltano pipeline, as sketched below.
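
To run the notebook on a schedule, you can wrap the command in a job and attach a schedule in your meltano.yml. A minimal sketch, assuming the notebook runs on its own; notebook-job and daily-notebook are placeholder names:

jobs:
- name: notebook-job
  tasks:
  - papermill:execute

schedules:
- name: daily-notebook
  interval: '@daily'
  job: notebook-job

Run meltano schedule list to verify the schedule is registered.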