JupyterLab and Jupyter AI

Introduction

With the Machine Learning / Generative AI Tier you get access to a hosted JupyterHub system that provides a JupyterLab environment, where you can work with our Jupyter Notebook templates or develop your own code.

This chapter presents some useful tools and guides on how to get the most out of JupyterLab.

jupytext

All our notebooks are synced with a .py file via jupytext and written in the percent format (https://jupytext.readthedocs.io/en/latest/formats-scripts.html). This makes it possible to run the Python scripts directly on the command line, while the paired notebook lets you run the same code in a Jupyter notebook. Jupytext also supports other languages, which makes JupyterLab programming-language agnostic: https://jupytext.readthedocs.io/en/latest/languages.html

Currently we provide all our scripts as Python files. To install jupytext in a Python environment, use the following pip command:

pip install jupytext

Pair all your notebooks with percent-format scripts via a jupytext.toml configuration file in the notebooks folder:

[formats]
"notebooks/" = "ipynb"
"scripts/" = "py:percent"
or in a pyproject.toml at the root of your repo (given that your notebooks are located in a folder named 'notebooks'):

[tool.jupytext.formats]
"notebooks/" = "ipynb"
"scripts/" = "py:percent"

This allows you to write .py files and then sync them with a notebook via the corresponding navigation menu action. The Jupyter notebook is then created automatically, and from that point on the two files stay in sync: any change to one of them is also applied to the other. In this way you can write .py scripts, which are easier to version control, while the .ipynb file is kept up to date. It also allows you to run the Python scripts (.py files) in environments such as HPC terminals, where you do not have access to a JupyterLab environment.
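The folder-based pairing configured above can be pictured as a simple path mapping. The following is only a minimal sketch of that idea, assuming the layout notebooks/<name>.ipynb and scripts/<name>.py; the helper name paired_path is hypothetical and this is not how jupytext itself is implemented:

```python
from pathlib import PurePosixPath

# Assumed layout, mirroring the pairing configuration above:
# notebooks/<name>.ipynb  <->  scripts/<name>.py
PAIRED_FOLDERS = {"notebooks": ".ipynb", "scripts": ".py"}


def paired_path(path):
    """Return the path of the paired file for a notebook or a script.

    Hypothetical helper for illustration only; jupytext does this internally.
    """
    p = PurePosixPath(path)
    folder = p.parent.name
    if folder not in PAIRED_FOLDERS or p.suffix != PAIRED_FOLDERS[folder]:
        raise ValueError(f"{path} is not covered by the pairing configuration")
    # Pick the other folder of the pair together with its file extension.
    other = next(name for name in PAIRED_FOLDERS if name != folder)
    return str(p.parent.parent / other / (p.stem + PAIRED_FOLDERS[other]))


print(paired_path("notebooks/analysis.ipynb"))
print(paired_path("scripts/analysis.py"))
```

A file outside the two configured folders raises a ValueError, which reflects that only files matching the configured prefixes take part in the pairing.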

In your .py file you can define multiline markdown cells in the following way:

# %% [markdown]
# This is a multiline
# Markdown cell

And code cells are defined as follows:

# %%
# This is a code cell
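To make the cell markers above more concrete, here is a simplified sketch of how a percent-format script can be split into cells. It only handles the two marker forms shown above ("# %%" and "# %% [markdown]"); cell titles and metadata, which the real format also allows, are ignored. The function name split_percent_cells is an assumption made for this illustration and is not part of jupytext:

```python
import re

# A cell marker is a line starting with "# %%", optionally followed by
# a cell type in square brackets such as "[markdown]".
CELL_MARKER = re.compile(r"^# %%(?:\s+\[(\w+)\])?\s*$")


def split_percent_cells(text):
    """Split a percent-format script into (cell_type, source) tuples.

    Simplified illustration, not jupytext's own parser.
    Lines before the first cell marker are ignored.
    """
    cells = []
    cell_type, lines = None, []
    for line in text.splitlines():
        match = CELL_MARKER.match(line)
        if match:
            # A new marker closes the previous cell, if there is one.
            if cell_type is not None:
                cells.append((cell_type, "\n".join(lines).strip()))
            cell_type, lines = match.group(1) or "code", []
        elif cell_type is not None:
            if cell_type == "markdown":
                # Markdown text is stored as comments; drop the "# " prefix.
                line = re.sub(r"^# ?", "", line)
            lines.append(line)
    if cell_type is not None:
        cells.append((cell_type, "\n".join(lines).strip()))
    return cells


script = """# %% [markdown]
# This is a multiline
# Markdown cell

# %%
print("This is a code cell")
"""

for kind, source in split_percent_cells(script):
    print(kind, repr(source))
```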

If you want to set an environment variable, it looks as follows:

# %%
# %env ENVNAME=...

Shell commands can be issued with an exclamation mark:

! pip install jupyter-ai

And question marks allow you to print out the documentation of a specific function:

np.argmax?

So-called line magics (%; a single line in a cell) and cell magics (%%; the entire cell), e.g. %%ai from jupyter-ai, which we discuss in the next section, are commented out by default by jupytext. Thus, if you wish to run a magic command provided in a notebook, you first need to uncomment it. If you do not want this default behaviour, you can set comment_magics=false in the jupytext.toml or pyproject.toml file.

Making use of Large Language Models with Jupyter AI

The JupyterLab environment of the Machine Learning / Generative AI Tier lets you apply Large Language Models (LLMs) to your coding projects via Jupyter AI. This blog article gives an overview of how you can use the MQS Dashboard and MQS Infrastructure to get started with LLMs with minimal effort and without having to set up your own cloud infrastructure.

In the following we showcase some applications for quantum chemists, chemical engineers and chemists that show how to use the MQS tool stack to create sophisticated workflows.

jupyter-ai

Jupyter AI allows you to use generative AI / LLMs within Jupyter notebooks via the magic command %%ai, followed by the model and argument settings:

%%ai PROVIDER:MODEL_ID

In the next subsections we show how different models can be used, together with some simple examples we have tested. This documentation page provides an overview of the model providers currently supported: https://jupyter-ai.readthedocs.io/en/latest/users/index.html#model-providers

The output format is defined via the --format (-f) argument and the output formats available are:

  • code

  • image (for Hugging Face Hub’s text-to-image models only)

  • markdown

  • math

  • html

  • json

  • text

If no format option is provided, the output defaults to markdown.
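If you generate prompts programmatically, for example from a paired .py script, the format rules above can be captured in a small helper. This is only a sketch under stated assumptions: the function name build_ai_cell is hypothetical and not part of Jupyter AI; it merely assembles the cell text from the model, the format options listed above, and the prompt:

```python
# Output formats listed above; "markdown" is the default.
OUTPUT_FORMATS = ("code", "image", "markdown", "math", "html", "json", "text")


def build_ai_cell(model, prompt, fmt="markdown"):
    """Return the text of a notebook cell that calls the %%ai magic.

    Hypothetical helper for illustration; it only builds a string.
    """
    if fmt not in OUTPUT_FORMATS:
        raise ValueError(f"unknown format {fmt!r}; expected one of {OUTPUT_FORMATS}")
    header = f"%%ai {model}"
    # The default format does not need to be spelled out.
    if fmt != "markdown":
        header += f" -f {fmt}"
    return f"{header}\n{prompt}"


print(build_ai_cell("chatgpt", "Write a function that reverses a string.", fmt="code"))
```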

Here are the first notebook lines for setting up your notebook for jupyter-ai:

! pip install jupyter-ai
%load_ext jupyter_ai
%env HUGGINGFACEHUB_API_TOKEN=YOUR_TOKEN
%env OPENAI_API_KEY=YOUR_TOKEN

If you would like to set your default provider and model, you can add the following line:

%config AiMagics.default_language_model = "PROVIDER:MODEL_ID"

Now you can apply the following magic command:

%%ai chatgpt -f code
Write a Python code module which parses the output of a Quantum Espresso pw.x calculation to retrieve the final geometry of the molecular structure. 

OpenAI's ChatGPT model

In the following we show how to use the OpenAI API to generate CAS numbers for a class of chemicals and then search the MQS database with these CAS numbers.

In a Jupyter Notebook you can send a text string to OpenAI in the following way:

%%ai chatgpt -f json
Generate a list of CAS identifiers for polychlorinated biphenyl (PCB) compounds with 12 C atoms. Print the set of CAS in a Python list.

If you are working with Python scripts, the following lines allow you to communicate with the OpenAI API:

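import openai  # legacy interface of the openai package (versions before 1.0); reads OPENAI_API_KEY

prompt = ("Generate a list of CAS identifiers for polychlorinated biphenyl (PCB) "
          "compounds with 12 C atoms. Print the set of CAS in a Python list.")
response = openai.ChatCompletion.create(model='gpt-3.5-turbo',
                                        messages=[{"role": "user", "content": prompt}])
print(response["choices"][0]["message"]["content"])  # text of the model's reply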

As this example shows, we tested generating CAS numbers of polychlorinated biphenyls (PCBs; Figure 4).

Figure 4: Left: Polychlorinated biphenyl (PCB) structure

The following text string was sent to the gpt-3.5-turbo LLM and gpt-4.0:

"Generate a list of CAS identifiers for polychlorinated biphenyls (PCBs) with 12 C atoms."

The results can look like this (you will probably find that the results differ for each request you send to the OpenAI API):

When querying this list of CAS identifiers against the MQS database we found ... of the ... molecules in the database.

Either many of the compounds generated by gpt-3.5-turbo and gpt-4.0 are not available in the database, or the two LLMs have generated wrong information. Screening against a database is a useful countercheck of how well an LLM reproduces structures that are already well documented.
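Before querying a database, a cheap first filter is to validate the check digit of each generated CAS number. The sketch below is not part of the MQS tool stack; the function names are chosen for illustration. It relies on the publicly documented CAS check-digit rule: the digits before the check digit are weighted 1, 2, 3, ... from right to left, and the sum modulo 10 must equal the check digit. Passing this test only shows that a number is well formed, not that it belongs to a PCB or exists in any database:

```python
import re

# CAS Registry Numbers have the form: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")


def is_valid_cas(cas):
    """Check the format and the check digit of a CAS Registry Number."""
    match = CAS_PATTERN.fullmatch(cas.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... from right to left and sum them up.
    checksum = sum(i * int(d) for i, d in enumerate(reversed(digits), start=1))
    return checksum % 10 == int(match.group(3))


def extract_cas(text):
    """Find CAS-like strings in an LLM reply and split them by validity."""
    candidates = ["-".join(m.groups()) for m in CAS_PATTERN.finditer(text)]
    valid = [c for c in candidates if is_valid_cas(c)]
    invalid = [c for c in candidates if not is_valid_cas(c)]
    return valid, invalid


# Made-up reply for illustration: the first entry has a correct check digit,
# the second one does not.
reply = '["1336-36-3", "1234-56-7"]'
valid, invalid = extract_cas(reply)
print("valid:", valid)
print("invalid:", invalid)
```

Only the numbers in the valid list are worth sending to the database; the invalid ones can be discarded as malformed without any lookup.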

For novel, undocumented molecules generated by an LLM, no countercheck against the database is possible. Generating novel molecules therefore requires a whole set of consistency checks (e.g. based on quantum chemistry calculations) and a methodology in place.

Feel free to reach out if you want to talk more about such a molecular design pipeline tailored to your use-case applications ([email protected]). We have different approaches and implemented ML/AI models for generating chemical structures.

Claude

We now compare the same request with Anthropic's Claude model:

%%ai anthropic:claude-v1.2 -f json
Generate a list of CAS identifiers for polychlorinated biphenyl (PCB) compounds with 12 C atoms. Print the set of CAS in a Python list.

Hugging Face Hub

Hugging Face is a great platform for easily testing the available open-source LLMs. The following gives a quick introduction on how to use Hugging Face with JupyterLab. For this you need to have already set up a Hugging Face user account (https://huggingface.co/) and generated an Access Token.

Via Hugging Face Hub you can get access to a large number of open-source models.

Let's test the same example as before but this time with a 1.56bit model which we access through Hugging Face Hub:

%%ai huggingface_hub:

MQS Python SDK

The MQS Python SDK gives you easy access to the MQS molecules database, which holds around 200 million molecules with their identifiers and data.

You can, for example, retrieve the SMILES identifiers for a set of CAS identifiers: