Using polyglot notebooks (dotnet interactive) as input #806

Closed
kMutagene opened this issue Mar 7, 2023 · 21 comments · Fixed by #874

@kMutagene
Contributor

kMutagene commented Mar 7, 2023

Emitting notebooks with this tool is a killer feature. Since the internal model is already compatible with the ipynb format, I suggest another notebook-based feature: using notebooks as input. I would love to help implement this, but I would need some pointers on how to navigate the code base.

This has huge advantages over working with literate scripts:

  • One can work on a single notebook in an editor such as VS Code and evaluate it interactively (e.g. check that the markdown renders correctly, or inspect plots and other interactive output) without having to run the tool in watch mode, which for example takes more than 10 minutes to start up for a project with many literate scripts such as plotly.net.
  • It makes collaborating on docs and blog-post-style documents much easier, since nobody has to get the actual project running, with its intricacies regarding the individual build chains. You just have to commit a single notebook file.

This would be similar to Python's nbconvert, and I think such a tool is a critical missing piece in the .NET notebook landscape. Maybe a standalone tool would be better, but that is up for debate; I focused on this repo since it can already generate notebooks.

Note that you can already use nbconvert to convert polyglot notebooks to html (which is what I currently do), but that means you have to install and maintain a Python environment instead of being able to use a dotnet tool.

@nojaf
Collaborator

nojaf commented Mar 7, 2023

Hello,

Thank you for bringing up this compelling feature request. While I am interested in supporting this idea, I would like to gather more information to better understand its implementation. As I do not have prior experience with notebooks, I would appreciate it if you could provide more details about the technical aspects of the proposed feature.

For instance, would this feature require a new file format or would it be embedded within an existing one? I would also appreciate it if you could describe the input and output process for an end user. This would enable me to gain a better understanding of the technical side of the feature and provide better feedback.

Thank you, and I look forward to hearing more about this feature request.

@kMutagene
Contributor Author

kMutagene commented Mar 7, 2023

Sure, I'll try my best.

What are notebooks?

Notebooks are interactive documents popularized by the Python project Jupyter. The file extension is .ipynb (from "IPython Notebook", Jupyter's predecessor), but you can use many languages in this format, more on that below. They are especially useful for all kinds of iterative data science applications, where a data transformation <-> visualization loop is common.

Notebooks are rendered as an interactive document that can contain code cells and markdown. Code cells can be executed, and the last object in a code cell is usually displayed as formatted output below the code cell. Markdown cells can be used to add formatted text annotations to contextualize code. You can maybe already see that there are many parallels between notebooks and literate scripts with embedded output via fsdocs. The only real difference is that notebooks are interactive, meaning you have a "play" button next to each code cell, while fsdocs is usually a tool that you run once and then host the output somewhere.

An example

For an example of what such a document looks like, take a look at Jupyter's official example here or at an F# notebook generated by fsdocs for the plotly.net docs here (plotly.net is a data visualization library that I maintain; note that you have to execute the cells in that notebook to get interactive output).

The file format

.ipynb files are basically json files that contain the cells, their outputs, and document metadata (full specs: https://nbformat.readthedocs.io/en/latest/). For example this file:

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "polyglot_notebook": {
     "kernelName": "fsharp"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div class=\"dni-plaintext\"><pre>42</pre></div><style>\r\n",
       ".dni-code-hint {\r\n",
       "    font-style: italic;\r\n",
       "    overflow: hidden;\r\n",
       "    white-space: nowrap;\r\n",
       "}\r\n",
       ".dni-treeview {\r\n",
       "    white-space: nowrap;\r\n",
       "}\r\n",
       ".dni-treeview td {\r\n",
       "    vertical-align: top;\r\n",
       "    text-align: start;\r\n",
       "}\r\n",
       "details.dni-treeview {\r\n",
       "    padding-left: 1em;\r\n",
       "}\r\n",
       "table td {\r\n",
       "    text-align: start;\r\n",
       "}\r\n",
       "table tr { \r\n",
       "    vertical-align: top; \r\n",
       "    margin: 0em 0px;\r\n",
       "}\r\n",
       "table tr td pre \r\n",
       "{ \r\n",
       "    vertical-align: top !important; \r\n",
       "    margin: 0em 0px !important;\r\n",
       "} \r\n",
       "table th {\r\n",
       "    text-align: start;\r\n",
       "}\r\n",
       "</style>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "let a = 42\n",
    "\n",
    "a"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".NET (F#)",
   "language": "F#",
   "name": ".net-fsharp"
  },
  "polyglot_notebook": {
   "kernelInfo": {
    "defaultKernelName": "fsharp",
    "items": [
     {
      "aliases": [
       "f#",
       "F#"
      ],
      "languageName": "F#",
      "name": "fsharp"
     },
     {
      "aliases": [
       "frontend"
      ],
      "languageName": null,
      "name": "vscode"
     }
    ]
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}

is rendered by vscode like this:

image
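
To make the structure concrete, here is a minimal sketch (not part of fsdocs or any existing library) that reads the cells of an nbformat v4 document using only System.Text.Json. The names notebookJson and cellSource are made up for illustration, and the sample is trimmed to two cells without outputs. nbformat allows "source" to be either a single string or a list of lines, so the sketch handles both.

```fsharp
open System.Text.Json

// A tiny nbformat v4 document (hypothetical sample, outputs omitted).
let notebookJson = """
{
  "cells": [
    { "cell_type": "markdown", "metadata": {}, "source": ["# Title"] },
    { "cell_type": "code", "execution_count": 1, "metadata": {},
      "outputs": [], "source": ["let a = 42\n", "\n", "a"] }
  ],
  "metadata": {},
  "nbformat": 4,
  "nbformat_minor": 2
}
"""

// "source" may be one string or a list of strings; normalise it to one string.
let cellSource (cell: JsonElement) =
    let source = cell.GetProperty("source")
    if source.ValueKind = JsonValueKind.Array then
        [ for line in source.EnumerateArray() -> line.GetString() ] |> String.concat ""
    else
        source.GetString()

let doc = JsonDocument.Parse(notebookJson)

// (cell_type, source text) for every cell, in document order.
let cells =
    [ for cell in doc.RootElement.GetProperty("cells").EnumerateArray() ->
        (cell.GetProperty("cell_type").GetString(), cellSource cell) ]

for (kind, source) in cells do
    printfn "[%s]\n%s\n" kind source
```

Reading only cell_type and source like this is the "rudimentary" level of parsing; the outputs array needs more care because each output can carry several MIME types.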

Notebooks in .NET

Jupyter can use any kind of Kernel, which is basically a program that tells jupyter how to run code of a specific language. The official .NET kernel that can run F#, C#, and many more is contained in dotnet interactive

dotnet interactive allows library authors to distribute custom renderers, which can take a .NET object and transform it into output in the notebook. We have this for example for plotly.net:
image

Answering your question

For instance, would this feature require a new file format or would it be embedded within an existing one

Since I am not sure how fsdocs works internally, I cannot answer this for certain. My personal use case would be converting .ipynb files to html, meaning I would write a notebook with markdown and code cells, with the markdown being converted to html tags as usual, and the code cells being converted into code blocks, with the output embedded below.

For further demonstration, here is a simple page that I work on for fun, where I convert F# notebooks to html via nbconvert (a Python package) and parse and embed the html output via fornax: https://kmutagene.github.io/the-dotnet-graph-gallery/graphs/distribution/histogram/histogram-fsharp.html

It would be awesome if I could do that using only .NET tools instead of relying on a Python tool for that conversion.

There are many similarities between this literate file using fsdocs fsi evaluation:

let a = 42
(***include-value: a***)

and this cell and output combination in a notebook:

image

So my naive approach would be just parsing code cell content and output and emitting that into html?
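
As a rough illustration of that naive approach, the sketch below renders one code cell plus its saved HTML output into an HTML fragment. The CodeCell record and renderCodeCell function are hypothetical names, not fsdocs or nbconvert APIs, and it assumes the outputs have already been extracted from the notebook's "text/html" entries.

```fsharp
open System.Net

// Hypothetical minimal model: a code cell with its source and saved HTML outputs.
type CodeCell =
    { Language: string
      Source: string
      HtmlOutputs: string list }

let renderCodeCell (cell: CodeCell) =
    // The source is HTML-encoded so that characters like '<' survive inside <code>.
    let code =
        sprintf "<pre><code class=\"language-%s\">%s</code></pre>"
            cell.Language (WebUtility.HtmlEncode cell.Source)
    // Saved outputs are already HTML, so they are passed through unchanged.
    String.concat "\n" (code :: cell.HtmlOutputs)

let cell =
    { Language = "fsharp"
      Source = "let a = 42\n\na"
      HtmlOutputs = [ "<div class=\"dni-plaintext\"><pre>42</pre></div>" ] }

printfn "%s" (renderCodeCell cell)
```

Passing outputs through verbatim keeps the sketch small, but it also means any styles or scripts embedded in the output end up in the page as-is.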

@nojaf
Collaborator

nojaf commented Mar 7, 2023

Thank you very much for this elaborate response. Much appreciated!
I'm starting to see where this is going and some bits are making sense.

However, I'm a bit lost about the execution part of things.
The fsi evaluation happens once right now and you make it seem like the notebook can be edited and then run. (Which I believe is the whole point).
How does that magic work? Pressing that play button in the let a = 42 ; a snippet. I assume that would be in the browser and the F# code is evaluated somewhere?

Otherwise, I believe we can add support for a new input format. We would need to take a good look at what needs to be done to have parity with scripts for example. We'll definitely need to refine things.

And last but not least, I'd like to know if my fellow maintainers would be on board with this as well.
@dsyme, @nhirschey and @baronfel thoughts?

@nhirschey
Collaborator

Hi @kMutagene, this is a feature that dsyme requested to be added by the F# data science community back when he rebooted fslab and gave your team admin control (2020?). It's in that old "discussion" thread in the fslab repo where Don is replying to you fslaborg/FsLab#3 (comment), also fslaborg/FsLab#3 (reply in thread)

So, Don approves also and I would really like it and it's been on a mental "todo" of mine for this library. Getting closer to nbdev functionality would be nice.

At a high level, the notebook is JSON with code and markdown cells denoted by metadata, so it's even easier than parsing an .fsx or .md file. We get the markdown and code paragraphs for free. Here is ParseScript.fs; we'd want a ParseIpynb.

I had a recent discussion related to this on the dotnet/interactive repo, they have a parsing library that they suggested we could use. The issue would be whether we want to support full polyglot notebooks with potential .js, powershell, C#, F# code ... and magic variables, etc. I kinda feel like that's out of scope and we should parse it ourselves and restrict it for now to parsing pure F# notebooks that only have the same features as .fsx files (i.e., no fancy polyglot notebook features).

@nhirschey
Collaborator

nhirschey commented Mar 7, 2023

However, I'm a bit lost about the execution part of things...

@nojaf Right now, if you add a _template.ipynb, FSharp.Formatting will generate notebooks that readers can download via links at the top of an F# library's documentation. See the "download notebook" button at the top of https://plotly.net/.

That notebook is generated by fsdocs by converting .fsx -> FSharp.Formatting literate document -> .ipynb. I think the idea is to add .ipynb parsing so that you can have f# index.ipynb -> FSharp.Formatting literate document -> .fsx/.html/.md/.tex. So in the docs folder for input to fsdocs build you could use index.ipynb to generate documentation rather than index.fsx.

@kMutagene
Contributor Author

kMutagene commented Mar 7, 2023

However, I'm a bit lost about the execution part of things.
The fsi evaluation happens once right now and you make it seem like the notebook can be edited and then run. (Which I believe is the whole point).
How does that magic work? Pressing that play button in the let a = 42 ; a snippet. I assume that would be in the browser and the F# code is evaluated somewhere?

The interactive part is only relevant while working with the notebook. Once saved, the notebook keeps the execution results in the output part of the JSON. After that, the file is static (the interactive work is done). Then we can take that saved notebook file and generate the fsdocs model from it, with a direct mapping of markdown, code, and output. The workflow I'd imagine here is 'playing around' in the notebook until I am satisfied with the code/markdown, saving the file, and using that saved file as input for fsdocs.

I think the idea is to add .ipynb parsing so that you can have f# index.ipynb -> FSharp.Formatting literate document -> .fsx/.html/.md/.tex

exactly, however direct conversion to html skipping the translation into the literate model would be enough for me personally if that's easier to implement

The issue would be whether we want to support full polyglot notebooks with potential .js, powershell, C#, F# code ... and magic variables, etc. I kinda feel like that's out of scope and we should parse it ourselves and restrict it for now to parsing pure F# notebooks that only have the same features as .fsx files (i.e., no fancy polyglot notebook features).

Not sure if any magic in the notebook matters at all - ultimately the notebook can be executed and saved by the respective tools that support the magic stuff. If the notebook environment takes care of all the execution, the only thing fsdocs would need to do is parsing code blocks and output and embedding it 'as-is' - but maybe that would go more in the direction of a standalone tool that does not need to care about integrating into an existing codebase. However, if execution is needed there might also be the option of using https://github.com/jonsequitur/dotnet-repl#-run-a-notebook-script-or-code-file-and-then-exit for that.

@kMutagene
Contributor Author

So I had a first go at a library for parsing ipynb files (nbformat v4): https://github.com/kMutagene/FSharp.Data.NBFormat

I would love it to be a very granular nuget package, but I am not sure about the namespace yet.
Should it be FSharp.Formatting.NBFormat or FSharp.Data.NBFormat?

However, please let me know if that modelling of nbformat is helpful for implementing ParseIpynb.fs in this library, and how you suggest to move forward ;)

@nhirschey
Collaborator

Hi @kMutagene, nice work.

Is your objective with the library only to help with FSharp.Formatting, or are you working on your parser for other purposes too?

Have you seen https://www.nuget.org/packages/Microsoft.DotNet.Interactive.Documents? See description here dotnet/interactive#2685 (comment).

If we rely on an external nuget library for ‘ParseIpynb.fs’, it might make sense to use the one mentioned above that ‘dotnet/interactive’ already provides because it would presumably incorporate the special dotnet interactive features.

But it might be better to not take an external dependency (previously dsyme did not want to take an external dependency for an html dsl).

@kMutagene
Contributor Author

Have you seen https://www.nuget.org/packages/Microsoft.DotNet.Interactive.Documents? See description here dotnet/interactive#2685 (comment).

I have not seen that, might have just used that instead of writing things from scratch, but here we are 💀

Is your objective with the library only to help with FSharp.Formatting, or are you working on your parser for other purposes too?

That's the reason why I was initially not sure about the namespace - I would have no problem with just committing it into this library (meaning it would become FSharp.Formatting.NBFormat). However, that might be unnecessary because my approach is not without external dependencies either, as it uses System.Text.Json and an F# extension library for that.

I think we can expect the main way of writing F# notebooks being polyglot notebooks, and therefore it might make sense to just use their parser/writer library. I think i'll stick to the FSharp.Data.NBFormat namespace in that case.

Another reason why I'd like this kind of nuget package is that I would like to create a .NET tool equivalent of nbconvert, since that is something the dotnet interactive ecosystem does not seem to offer at the moment, and using the OG nbconvert has problems with their new polyglot format (for example no syntax highlighting for F# and C# due to the language_info name being polyglot-notebook)

@kMutagene
Contributor Author

I think I should however add, for the sake of completeness, that I have not found a simple way of just parsing a notebook file using the API surface of Microsoft.DotNet.Interactive.Documents, but I might just not be versed enough in doing things the C# way.

It seems like the way to use that library is creating a parsing server which reacts to Parse requests - something that totally makes sense for the dotnet interactive tool, but might not be what we need here. At least for my purposes this seems kind of convoluted when i could just have a simple 'ipynb in, document model out' function.

@nhirschey
Collaborator

Another reason why I'd like this kind of nuget package is that I would like to create a .NET tool equivalent of nbconvert

It's a good use case. The original version of this library that tomasp made was more in this spirit, writing .fsx and then the tools converted to html, latex, pdf. The code for this is still here, but dotnet fsdocs is not exposing it for that purpose.

FYI, not sure if you've tried pandoc (I believe it's what nbconvert uses under the hood), e.g., pandoc .\notebook.ipynb -s -o notebook.html

image

@kMutagene
Contributor Author

hey @nojaf @nhirschey

I just wanted to try to summarize what we talked about at Data Science in F# on this topic. If I forgot or misremembered something, feel free to add to this.

  • There are multiple ways to approach this, some fitting better than others to the current model of FSharp.Formatting.
  • One way of doing it would be mapping markdown cells to markdown blocks, and code cells to code blocks in the model
    • This fits well to the current model
    • This ignores notebook cell output
    • cells would have to be re-evaluated when running with --eval
    • the ipynb format only needs rudimentary parsing, because only the cells are needed, and they are far less complex than the cell outputs.
    • what to do with non-f# code cells? map them to a markdown code block?
  • The other way would be a more direct transformation of the notebook content. Markdown is treated as usual, but code cells (of any language) are handled together with their output.
    • This is more of a specialized approach to this type of input
    • --eval would either do nothing or mean running the input notebooks using dotnet-interactive before parsing them.
    • the ipynb format has to be parsed thoroughly for this approach, as outputs are formatted really weird in that json schema
    • the most direct comparison to how things work in the current tool is the (***include-it-raw***) command for literate documents, which includes any kind of string output directly in the output document. Something like this would need to be done for cell outputs.

In general, I am working on a .NET port of a subset of nbformat/nbconvert features that aims to basically only support .ipynb -> html conversion over at https://github.com/fslaborg/NBFormat.NET. What is FSharp.Formatting's approach to external dependencies? It would be possible, for example, to re-use the ipynb parser from that library. I think I'll improve that library a bit until it reaches a 'works for me' stage, and then try to integrate useful features into FSharp.Formatting gradually.

Any thoughts/additions on this?

@nojaf
Collaborator

nojaf commented Oct 30, 2023

Hi @kMutagene, this is more or less how I remember it. I think the dotnet-interactive route was the most preferred. It would be ok for the tool to have another dependency I think.

@kMutagene
Contributor Author

It would be ok for the tool to have another dependency I think

If that is the case, I think we are already pretty far along in solving this. The basic parsing and converting is implemented in NBFormat.NET. However, that lib currently uses prism.js for syntax highlighting. I think FSharp.Formatting has custom syntax highlighting, is there a way to incorporate this or apply it to a string post-conversion?

@kMutagene
Contributor Author

On another note, NBFormat.NET also uses Markdig for markdown conversion. I think FSharp.Formatting also has a markdown parser/converter. So it might be more reasonable to actually include the notebook conversion in the codebase here, to prevent unnecessary duplication of the markdown and syntax highlighting pipelines.

@nhirschey
Collaborator

Hi @kMutagene,

I think FSharp.Formatting has custom syntax highlighting, is there a way to incorporate this or apply it to a string post-conversion?

Yeah, the simplest thing (and, I believe, the way nbconvert does it) is to transform the ipynb to markdown, then have FSharp.Formatting ingest the notebook as a markdown file. I'm partway to testing this; let me check a few things and get back to you soon.

@kMutagene
Contributor Author

kMutagene commented Nov 3, 2023

transform the ipynb to markdown

what happens to the cell output when doing it this way?

@nhirschey
Collaborator

nhirschey commented Nov 3, 2023

what happens to cell output?

This is quick and dirty, but taking the fslab blog post cytoscape example, ipynb -> md and then using the md as input is shown below. I’m passing through the ipynb html outputs “as is” (i.e., markdown convention is pass html through unmodified). There's more work to do it properly, but it's at least proof of concept? :

ipynb input

@kMutagene
Contributor Author

kMutagene commented Nov 5, 2023

Yeah that looks good actually. If i understand correctly, you do the conversion to md via pandoc though?

So it looks to me like we can just parse the notebook model via NBFormat.NET, and create a markdown output with it. That would just mean leaving markdown cells as-is, putting code cells into markdown code tags with the correct language, and leaving cell output as-is. Does that input have to be an actual markdown file, or could it be done in-memory?

Advantage of this would be no non-.NET dependencies
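
The conversion described above (markdown cells as-is, code cells wrapped in a fenced block tagged with their language, outputs passed through) could be sketched as follows. The Cell type and both functions are hypothetical and independent of NBFormat.NET or any other library; a real implementation would build the cells from a parsed notebook.

```fsharp
// Hypothetical cell model; a real implementation would map from a parsed notebook.
type Cell =
    | Markdown of source: string
    | Code of language: string * source: string * htmlOutputs: string list

// The three-backtick fence is built at runtime to avoid a literal fence in this example.
let fence = String.replicate 3 "`"

let cellToMarkdown cell =
    match cell with
    // Markdown cells are emitted unchanged.
    | Markdown source -> source
    // Code cells become a fenced block; HTML outputs follow as raw HTML.
    | Code (language, source, htmlOutputs) ->
        let block = sprintf "%s%s\n%s\n%s" fence language source fence
        String.concat "\n\n" (block :: htmlOutputs)

let notebookToMarkdown (cells: Cell list) =
    cells |> List.map cellToMarkdown |> String.concat "\n\n"

let markdown =
    notebookToMarkdown
        [ Markdown "# Title"
          Code ("fsharp", "let a = 42", [ "<pre>42</pre>" ]) ]

printfn "%s" markdown
```

This relies on the markdown convention that raw HTML blocks are passed through unmodified, which is why each output is separated from its neighbours by a blank line.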

@nhirschey
Collaborator

nhirschey commented Nov 6, 2023

Yeah that looks good actually.

Great!

Does that input have to be an actual markdown file, or could it be done in-memory?

FSharp.Formatting expects everything to be files. Even Literate.ParseMarkdownString writes the string to a file before proceeding. So to minimize "surgery" on core functions, it's easiest to write to a temp file and send the temp file down the path.
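
A minimal sketch of the temp-file step, assuming the markdown has already been generated in memory. writeTempMarkdown is a hypothetical helper, not an FSharp.Formatting function, and the call into the existing pipeline is deliberately left out.

```fsharp
open System
open System.IO

// Writes generated markdown to a uniquely named temp .md file and returns its path.
let writeTempMarkdown (markdown: string) =
    let path = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N") + ".md")
    File.WriteAllText(path, markdown)
    path

let tempPath = writeTempMarkdown "# Title\n\nSome text."
printfn "%s" tempPath

// The path would then be handed to the existing file-based markdown pipeline (not shown).
// Clean up once the pipeline is done with the file.
File.Delete tempPath
```

Using a GUID-based name rather than Path.GetTempFileName keeps the .md extension without leaving an extra empty .tmp file behind.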

If i understand correctly, you do the conversion to md via pandoc though?

I didn't use pandoc. I used ~130 lines of F# code to parse and convert to FSharp.Formatting markdown. Quick and dirty, but zero dependencies. The gist is here.

You could also use NBFormat.NET. Or maybe it makes sense to use Microsoft.DotNet.Interactive.Documents? I noticed that DotNet.Interactive once again changed their metadata. And if we take a dependency on Microsoft.DotNet.Interactive to execute the notebook, maybe it makes sense to also use it to parse ipynb and also for producing ipynb so that ipynb produced by this library always have the correct metadata.

For reference, to parse using dotnet interactive see code below. I'm working on mapping this version to my above-linked gist converter so we can compare.

#r "nuget: Microsoft.DotNet.Interactive.Documents, *-*"

open Microsoft.DotNet.Interactive.Documents

let nb = Jupyter.Notebook.Parse(System.IO.File.ReadAllText("post.ipynb"))

@kMutagene
Contributor Author

kMutagene commented Nov 6, 2023

Microsoft.DotNet.Interactive.Documents

The last time I looked into this, it seemed to me that you are expected to create a daemon-like service that can be sent notebooks to parse, which was the reason I decided to write my own lib in the end. Looking at your code sample, it seems I just did not find the correct API, so I guess using that one is ideal because it will keep up with changes to the polyglot notebook format, as you said. They are doing some slightly unexpected stuff (e.g. how languages are named in the code cells), so having their original model seems like the way to go here.

So to summarize, it looks to me like the pipeline would be

ipynb_file
|> Parse as notebook domain type (e.g. Microsoft.DotNet.Interactive.Documents)
|> Re-format as single temporary .md file
|> Use .md file for standard FSharp.Formatting pipeline

looks like almost everything is there, once the mapping of the Jupyter.Notebook.Parse result to your initial script is done. That sounds to me like the cleanest approach with the minimal amount of code necessary in FSharp.Formatting 👍
