-
Sounds like a topological sort to me. You start by constructing a DAG of symbols. Symbols with no dependencies can be scheduled by a master thread for resolution in parallel. The master thread then updates the DAG and reduces it to contain only the nodes that are still unresolved. The step is repeated on the updated DAG, and eventually every node will have been resolved and the graph will be empty. The DAG construction process will exclude any symbols that are not needed during the rendering process. I need to read more on this to see how parallel topological sorting works.
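For what it's worth, here is a minimal sketch of that layer-by-layer idea in Rust, with invented `Name`/`resolve` placeholders rather than actual fastn types: the master thread repeatedly peels off the symbols whose dependencies are all already resolved and hands that layer to worker threads.

```rust
use std::collections::HashMap;

// Hypothetical symbol table: each symbol maps to the symbols it depends on.
type Name = String;

fn resolve(_symbol: &Name) {
    // placeholder for the real per-symbol resolution work (type checking etc.)
}

fn resolve_in_layers(mut deps: HashMap<Name, Vec<Name>>) {
    while !deps.is_empty() {
        // A symbol whose dependencies are all already resolved (i.e. no longer in
        // `deps`) belongs to the next layer; layer members are mutually independent.
        let layer: Vec<Name> = deps
            .iter()
            .filter(|(_, ds)| ds.iter().all(|d| !deps.contains_key(d)))
            .map(|(name, _)| name.clone())
            .collect();

        if layer.is_empty() {
            panic!("dependency cycle among the remaining symbols");
        }

        // Resolve the whole layer in parallel.
        std::thread::scope(|s| {
            for name in &layer {
                s.spawn(move || resolve(name));
            }
        });

        // The master thread shrinks the graph to the still-unresolved symbols.
        for name in &layer {
            deps.remove(name);
        }
    }
}
```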
-
Summary of Call Today
Async vs Continuation
We have renamed our current fastn 0.5 to fastn-lang, as it is probably going to be a full compiler. We can call it a compiler as well. We want our compiler to be agnostic of where the files are read from. A lot of compilers are written with direct access to the file system; we want to create an indirection, so files can come from the file system or a DB (and other logic, like access control, can happen outside of the compiler's concern). We have a continuation/state-machine based approach, where we call continue on the compiler in an infinite loop till the compiler says all done; otherwise it keeps asking us for more files etc. We discussed that we could also use an async trait. Since the very beginning, 2020 or so, we have been following this (continuation based) approach, and at that time async was not in as good a shape as it is right now. So we can revisit this using a trait (we will call it the document store trait). A rough sketch of both shapes is at the end of this comment.
fastn-lang vs fastn-resolver vs other crates
I was also thinking about our fastn-core etc. crates in 0.4, and wondering how things would be in future, or how we can simplify things. Came up with the idea of … Other crates wise, we probably want some fastn-terminal for rendering to the terminal, and I jokingly suggested (secretly hoping it happens) that we consider the terminal our primary target, so we can start using the terminal renderer UI as the spec.
File Source
So the data structures we are creating right now extensively use … Or we make this discussion moot and …
SQLite
So we have resolved and unresolved symbols in bags, and we definitely want to keep track of the symbol dependency graph so we can do incremental compilation; it would mean we have to persist these data structures and incrementally update them on every file change. One possibility: we create a bunch of Rust types, and let … We can let the fastn compiler etc. work off of a sqlite db, and create a …
SQLite As Our Package Format
Since after each … We have considered an elaborate partial package, downloading only the specific files that are needed instead of the whole package etc., but I feel this may be better. This removes the cleverness and gives us a more standard version management problem: solve for dependency versioning, manage sqlite files, etc. We would be a lot more like Python/Rust in that sense; we only have to figure out if we are more Python like, where only one version of any package can be added as a dependency (direct or transitive), or Rust/Node like, where multiple versions of the same package can live in the dependency tree. Which is anyway an independent question.
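A rough sketch of the two shapes we discussed, with made-up type and method names (this is not the actual fastn-lang API): a continuation-style loop driven by the caller, and the async document store trait alternative.

```rust
// Invented shapes, not the actual fastn-lang API: the compiler never touches the file
// system itself; the caller keeps calling `continue_` until the compiler is done.
struct CompiledDocument;
struct Compiler;

enum CompilerOutput {
    // the compiler needs the source of these documents before it can proceed
    StuckOnDocuments(Vec<String>),
    Done(CompiledDocument),
}

impl Compiler {
    fn continue_(&mut self, _new_sources: Vec<(String, String)>) -> CompilerOutput {
        // ...ingest the supplied sources, make progress, report what is still missing...
        CompilerOutput::Done(CompiledDocument)
    }
}

// The indirection lives in `read`: it can hit the file system, a DB, or apply access
// control, all outside the compiler's concern.
fn drive(mut compiler: Compiler, read: impl Fn(&str) -> String) -> CompiledDocument {
    let mut sources = Vec::new();
    loop {
        match compiler.continue_(sources) {
            CompilerOutput::Done(doc) => return doc,
            CompilerOutput::StuckOnDocuments(names) => {
                sources = names.iter().map(|n| (n.clone(), read(n.as_str()))).collect();
            }
        }
    }
}

// The async alternative would instead hand the compiler a "document store" trait:
trait DocumentStore {
    async fn read(&self, name: &str) -> Option<String>;
}
```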
-
Yesterday I had a call with Arpita and discussed a few ideas.
Concurrent Compilation
In 0.4 we did not do any concurrent processing, but I feel we have opportunities to do so in 0.5.
This is how our parsing plan looks right now. We first create Sections, as described by our "p1" grammar. We are going to rename it to "fastn section grammar" from now on, as p1 is not a great name. We do very little work during section parsing, and it cannot be made trivially concurrent: we would have to partition the source `&str` or `Vec<char>` into multiple parts, but without parsing we cannot cleanly divide it into sections.
Given this input:
We get an array of the following (we also get errors and spans so we can do powerful syntax highlighting etc., but ignoring all that for simplicity):
The next phase is converting these raw sections into this (ignoring some errors, comments, etc):
We go through each section, and first check if it is an "import", "definition" or "content".
`Definition` is:
Now this pass can be done in parallel. We can go through each section and classify them concurrently. The output of any classification does not depend on any other classification. Each classification action can result in zero or more errors, so our "classification coordinator" can accept the vec from each of them and join them together at the end. This way we can also avoid any locks. This is what we are going for: opportunities for concurrency without locks, as these are typically obviously superior to a non-concurrent implementation. If locks/atomics/syncing gets involved then it's no longer obviously superior, and we should maybe try both implementations to see which performs better; in the interest of getting the first launch out, we will prefer a non-concurrent approach.
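A lock-free sketch of this classification pass, with invented `Section`/`Classified`/`Error` placeholders: each worker returns its own vec of classifications and errors, and the coordinator simply concatenates them, so no shared mutable state is needed.

```rust
// Invented placeholder types; the real ones live in the fastn section parser.
struct Section;
enum Classified { Import, Definition, Content }
struct Error;

fn classify(_s: &Section) -> (Classified, Vec<Error>) {
    // ...decide import / definition / content, reporting any errors found...
    (Classified::Content, Vec::new())
}

fn classify_all(sections: &[Section]) -> (Vec<Classified>, Vec<Error>) {
    let n_workers = 4;
    let chunk = sections.len().div_ceil(n_workers).max(1);

    std::thread::scope(|s| {
        // Each worker owns its output; no shared mutable state, hence no locks.
        let handles: Vec<_> = sections
            .chunks(chunk)
            .map(|part| s.spawn(move || part.iter().map(classify).collect::<Vec<_>>()))
            .collect();

        // The coordinator joins the per-worker vecs together at the end.
        let mut classified = Vec::new();
        let mut errors = Vec::new();
        for handle in handles {
            for (c, mut e) in handle.join().unwrap() {
                classified.push(c);
                errors.append(&mut e);
            }
        }
        (classified, errors)
    })
}
```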
Resolved Vs UnResolved
The output of the last pass is actually un-resolved, meaning, for example, that if the type of some field is "foo.bar", we have not yet resolved whether "foo.bar" exists. In the unresolved state we only do minimal syntax parsing: the Section is not fully parsed, and the unresolved document parser leaves a lot of strings un-analysed.
This brings us to the third (planned) phase of our compilation: we will take the unresolved stuff, try to type-check everything, and create resolved stuff.
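To illustrate the difference, a tiny sketch (names invented, not the real fastn types) of what an unresolved vs resolved field type could look like:

```rust
// Illustrative only: in the unresolved state a field type is just the text the user
// wrote; the resolution phase later checks that "foo.bar" actually names something.
struct ResolvedType; // placeholder for a fully type-checked reference

enum FieldType {
    Unresolved(String),     // e.g. "foo.bar", straight from the source
    Resolved(ResolvedType), // verified against the resolved bag
}
```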
Content vs definition
When the main document is being rendered, the "content" of that document is rendered. Content is component invocations. All the data that is referenced, directly or indirectly, by the content of the main document is also instantiated. Other documents can provide definitions, but their top-level content is ignored.
When the parser starts, it creates two "bags" (bags are maps from names, from now on called symbols, to the corresponding `Definition`): the unresolved bag and the resolved bag. It then loads the main document and does the section-level parsing. Then it creates unresolved symbols. The second step can happen concurrently. The second phase returns a list of (unresolved) content and (unresolved) definitions. For each unresolved definition, we prepend the document name, `<document-name>#<symbol-name>`, and put them in the unresolved bag.
Then we start resolving by going through the unresolved content of the main document. We also keep track of import aliases for each document, and keep a set of "used symbols", which is a set of every symbol we have used so far while processing the resolved content of the main document.
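A sketch of the two bags and the symbol naming, with guessed type shapes (the real definitions carry the parsed sections, spans, and so on):

```rust
use std::collections::{HashMap, HashSet};

// Guessed placeholder shapes; the real definitions hold the parsed sections.
struct UnresolvedDefinition;
struct ResolvedDefinition;

// A symbol is the document name plus the definition name: "<document-name>#<symbol-name>".
type Symbol = String;

struct CompilerState {
    unresolved_bag: HashMap<Symbol, UnresolvedDefinition>,
    resolved_bag: HashMap<Symbol, ResolvedDefinition>, // kernel components are pre-seeded here
    // import aliases, per document: alias used in the source -> full module name
    aliases: HashMap<String, HashMap<String, String>>,
    // every symbol used so far while processing the main document's content
    used_symbols: HashSet<Symbol>,
}

fn symbol(document_name: &str, symbol_name: &str) -> Symbol {
    format!("{document_name}#{symbol_name}")
}
```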
When resolving, we go to each unresolved content; each content is basically a component invocation. We first resolve the component name itself. All the built-in components, the kernel components, are added to resolved-symbols at the beginning of the process. If a non-kernel component is found, we first do the alias analysis (by looking at imports) and find the full name of the component being referred to; this is the component name resolution step. Once we have the full component name, we see if it is already resolved or not. If it is not already resolved, we try to find the component definition in the unresolved bucket.
So we first resolve the component definition, which entails going through the type of every argument of the component and resolving them (if they are not already resolved); we then resolve the variables referred to as default values, and then we go through the body of the component, find all the components used there, and resolve them.
Continuation Based Concurrent Resolution Algorithm
If we want to resolve concurrently, then since the process of resolution requires us to mutate the resolved bag, we can pass a sync version of the resolved bag to all threads. We can do this concurrently as each unresolved content can technically be resolved independently. But if both section 1 and section 2 find, say, an unresolved component `foo`, both may try to resolve it, and waste work.
So we use a continuation based concurrent resolution algorithm. We spawn n threads to do resolution, pass each section to a thread, and let each return either success or a "stuck at". We turn the unresolved content tree into a resolved/unresolved enum based one, i.e. each item of the tree is wrapped inside an enum with two variants; initially the entire tree will be unresolved.
The coordinator will try to resolve the tree by passing each branch to a resolver thread, and all threads will have a read-only reference to all the bags. The response can be either a resolved component, a "stuck on symbol", or a "stuck on document".
The coordinator will update the tree wherever a resolved component was found, and go through every stuck-on document, request its source, and do the first two phases of parsing described above, creating a bunch of new symbols to be added to the unresolved symbols bag.
It will then look at the stuck-on symbols list and try to resolve them concurrently, by passing them the read-only version of all the bags. Each thread will return a resolved symbol, or stuck on another symbol, or stuck on a document. Whatever is resolved will be added to the resolved bucket, and whatever document is needed is parsed and added to the unresolved bucket.
We will then try to resolve the content once again, and this time some of the stuck-on symbols will have been resolved and some more of the documents will have been processed.
This process will happen in a loop till the document is fully resolved.
Note that at each step we will also get errors, and the coordinator will keep collecting all the errors as well.
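Putting the above together, a condensed sketch of the coordinator loop, with invented names throughout; the concurrent pieces are stubbed out, but the shape of the enum responses and of the retry loop is the point.

```rust
// Invented placeholder types and helpers; the real resolver also carries spans and errors.
struct UnresolvedInvocation;
struct ResolvedComponent;

// Each item of the content tree is wrapped in a two-variant enum; initially everything
// is Unresolved.
enum Node {
    Unresolved(UnresolvedInvocation),
    Resolved(ResolvedComponent),
}

// What a resolver thread reports back for one branch.
enum Outcome {
    Resolved(ResolvedComponent),
    StuckOnSymbol(String),   // needs this symbol resolved first
    StuckOnDocument(String), // needs this document's source first
}

// Stubs standing in for the concurrent phases described above.
fn resolve_branches_concurrently(_tree: &[Node]) -> Vec<(usize, Outcome)> { Vec::new() }
fn parse_into_unresolved_bag(_document: &str) {}
fn resolve_symbol_concurrently(_symbol: &str) {}

fn resolve_document(tree: &mut [Node]) {
    loop {
        // Hand every still-unresolved branch to a resolver thread; the threads only get
        // read-only references to the bags, so this step needs no locks.
        let outcomes = resolve_branches_concurrently(tree);

        let mut stuck = false;
        for (index, outcome) in outcomes {
            match outcome {
                // Only the coordinator mutates the tree, writing successes back in.
                Outcome::Resolved(component) => tree[index] = Node::Resolved(component),
                // Request the missing document's source, run the first two phases on it,
                // and add its definitions to the unresolved bag.
                Outcome::StuckOnDocument(doc) => {
                    stuck = true;
                    parse_into_unresolved_bag(&doc);
                }
                // Try to resolve the missing symbol (again concurrently, with read-only
                // bags); it may itself get stuck on further symbols or documents.
                Outcome::StuckOnSymbol(symbol) => {
                    stuck = true;
                    resolve_symbol_concurrently(&symbol);
                }
            }
        }

        // Each pass resolves a bit more of the document; the coordinator also collects
        // the errors reported at every step. Stop once nothing is stuck.
        if !stuck {
            return;
        }
    }
}
```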
Fully Resolved Document
At the end of the previous process, we will have a fully resolved document tree, and we will have a list of symbols that were referred to when resolving that document. We will also have the resolved and unresolved bags. Each resolved component invocation and each symbol used can then be converted to corresponding JS code, and this is the final JS.
The Three JS Files
We are hopeful that this process is fast enough that it can analyse all the documents of a fastn package in one go, and create a list of symbols used across all documents in the fastn package. At this point, everything in the resolved bag is there because it was needed at some point, and the JS for each of them will be created in a `deps.js` file. We will have `fastn.js`, the fastn runtime, which only changes when a new fastn release is created. And for each document we will have a `<doc>.js`, which will contain the content of that document.
File Watcher / Incremental Build
We can write some sort of file watcher, which will keep updating the `deps.js`; actually it would be `deps-<hash>.js`, a content-hashed version, so that if a document is updated while another document is being requested in parallel, we do not clobber `deps.js`.
We can keep the resolved and unresolved symbols in the `.fastn` folder, and update them instead of parsing everything from scratch, to achieve faster incremental builds.
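A minimal sketch of the content-hashed file name idea, using std's `DefaultHasher` purely for illustration (a real build would want a stable, stronger hash):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Name the generated file after its own content so an in-flight request for the old
// deps file is never clobbered by a rebuild; illustration only, not the real scheme.
fn deps_file_name(deps_js: &str) -> String {
    let mut hasher = DefaultHasher::new();
    deps_js.hash(&mut hasher);
    format!("deps-{:016x}.js", hasher.finish())
}
```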