This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

RFC: Link encoding in IPLD #70

Closed

Conversation

Contributor

@mikeal mikeal commented Aug 28, 2018

This is a bit different than what we initially discussed in ipld/ipld#44

After implementing dag-json I felt comfortable enough writing up a solid set of recommendations for codec implementations.

I think this strikes the right balance of flexibility and interoperability. It avoids restricting a developer's ability to use language and encoding features but still requires enough support for a canonical serialization that we can trans-encode nodes between codecs.

Contributor

@Stebalien Stebalien left a comment

I'm really not happy baking this into IPLD. {'/': ...} was a hack to get JSON working.

This will need some strong arguments/motivations.

Links.md Outdated
+--------------------+ +---------------------+
```

A codec may represent object types and tree structures any way it wishes.
Contributor

"s/codec/format"?

Links.md Outdated
etc) or even new custom serializations. We will refer to this as the
**representation**.

Therefore, a **format** is the standardized representation of IPLD Links and Paths in a given **representation**.
Contributor

@Stebalien Stebalien Aug 29, 2018

Currently, the "format" describes how to translate between structured data and binary.

# Canonical Link Representation

Codec **serializers** MUST reserve the following canonical
representation of link encoding. The canonical representation is an object with a single key of `"/"` and a base encoded string of the link's CID.
Contributor

We originally said that no objects can have slashes in keys (keys must be valid path components) but backed off when we realized that wasn't going to work. At this point, I'm not sure if we can introduce a restriction like this. CBOR objects definitely can have a single "/" key.

We really do need to sit down and think through what can and can't go into an IPLD object because I think we're getting closer and closer to "everything goes". That might be fine but we need to address it explicitly...

Contributor

Also note, this was never intended to be the canonical representation. It was a hack to get JSON working.

Contributor Author

I was somewhat aware of the history. But I think that we do need some form of canonical representation that can be represented in pure JSON in order to open a path for people to encode objects from one format to another.

However, I don't think the deserializer should use the canonical representation by default, and I am actively trying to change that in dag-cbor. It's a horrible pain to work with and, while I want to reserve it for interop, I don't want it to be in common use but instead buried in the implementations.


We originally said that no objects can have slashes in keys (keys must be valid path components) but backed off when we realized that wasn't going to work

It is worth noting that CIDs encoded in base64 can contain slashes. I wonder whether it was really a good idea to allow that encoding; it will mess up paths like /ipfs/<base 64 cid>/file.txt.
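
To make the concern concrete, here is a small sketch for Node.js using only the built-in `Buffer`. The three bytes are arbitrary illustration data, not a real CID; they were picked only because their standard base64 encoding happens to contain a `/`.

```javascript
// Arbitrary bytes chosen for illustration; this is NOT a real CID.
const bytes = Buffer.from([0x69, 0xbf, 0xdd])

// Standard base64 uses '/' as one of its 64 characters.
const encoded = bytes.toString('base64')

// Naively placing the encoded string in a path adds an extra segment.
const path = `/ipfs/${encoded}/file.txt`
const segments = path.split('/').filter(Boolean)

console.log(encoded) // 'ab/d'
console.log(segments) // [ 'ipfs', 'ab', 'd', 'file.txt' ]
```

A path resolver that splits on `/` would see four segments here instead of the intended three.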

@Stebalien
Contributor

So, it feels like this is trying to work around the fact that js-ipld has a single dag.put(anythingGoes) function. That is a nice function but I wonder if there's a better solution.

In go, we have typed nodes. dag.put(...) takes a typed node (with a codec attached, etc.) and moves on from there. JavaScript could also do that. That is, one could say that untyped nodes are assumed to be raw objects. If the node has bytes() and cid() methods (on the prototype), the dag will use those instead (there may be more "javascripty" ways to do this).
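
A rough sketch of that duck-typing idea, under loud assumptions: `put`, `isTypedNode`, `encodeRaw`, `ExampleNode`, and the in-memory `store` are all hypothetical names invented for illustration, not js-ipld APIs, and the keys are placeholders rather than real CIDs.

```javascript
// Hypothetical sketch; none of these names come from js-ipld.
const isTypedNode = (node) =>
  node !== null && typeof node === 'object' &&
  typeof node.bytes === 'function' && typeof node.cid === 'function'

// Stand-in default encoding for untyped values: plain JSON bytes.
const encodeRaw = (value) => Buffer.from(JSON.stringify(value))

// A toy block store keyed by a string identifier.
const store = new Map()

function put (node) {
  if (isTypedNode(node)) {
    // The node carries its own codec, so defer to its methods.
    const key = String(node.cid())
    store.set(key, node.bytes())
    return key
  }
  // Untyped values are assumed to be raw objects.
  const bytes = encodeRaw(node)
  const key = 'placeholder-' + bytes.toString('hex') // placeholder key, not a real CID
  store.set(key, bytes)
  return key
}

// A typed node: bytes() and cid() live on the prototype.
class ExampleNode {
  constructor (value) { this.value = value }
  bytes () { return Buffer.from('typed:' + JSON.stringify(this.value)) }
  cid () { return 'example-cid' } // placeholder, not a real CID
}

const typedKey = put(new ExampleNode({ hello: 'world' }))
const rawKey = put({ hello: 'world' })
```

The same input data ends up stored differently depending on whether the caller attached a type, which is the behavior the paragraph above describes.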

@mikeal
Contributor Author

mikeal commented Aug 29, 2018

So, it feels like this is trying to work around the fact that js-ipld has a single dag.put(anythingGoes) function. That is a nice function but I wonder if there's a better solution.

I'm not actually thinking much about the dag.put() function right now. Most of what I've been doing lately is writing graphs with a whole bunch of nodes in memory and then dumping them all out to the block store.

What I'm mostly thinking about is how to define interop between implementations, specifically dag-json and dag-cbor, and how we can create better APIs for creating and working with nodes.

In go, we have typed nodes. dag.put(...) takes a typed node (with a codec attached, etc.) and moves on from there. JavaScript could also do that. That is, one could say that untyped nodes are assumed to be raw objects. If the node has bytes() and cid() methods (on the prototype), the dag will use those instead (there may be more "javascripty" ways to do this).

What I've tried to do is leave the door totally open to people doing this kind of stuff in the serializer/deserializer. What I'm not comfortable with is defining the types that must be used in a particular language or serializer/deserializer. There's a whole lot of preferences and opinions I'd prefer to just not step on or potentially exclude.

I really don't like working with the {"/": cid-string} representation but I find that it does provide a very nice compatibility vector across implementations without having to use a custom object or type and force all the implementations to use it.

For instance, dag-json's deserializer uses CID instances for links but provides a stringify() function that returns stringified JSON using the canonical representation. It's nice to know that I can always do something along these lines:

let node = dagJSON.from(block || buffer)
let transcoded = dagSomeFormat.serialize(JSON.parse(dagJSON.stringify(node)))

That only works if we have some canonical representation each serializer/deserializer has reserved. If we want to just completely give up on that, we can, but we won't have a good way to transcode nodes.

@mikeal
Contributor Author

mikeal commented Aug 29, 2018

Pushed some fixes for the other comments.

I also removed the yaml example because I find that it just complicates the messaging. The purpose of the form reservation isn't for expressing in the DSL but for expression in code between codecs.

@warpfork
Contributor

and how we can create better APIs for creating and working with nodes.

FWIW, on that front: I've been playing with some fresh takes on go-ipld APIs in a little sandbox off to the side, and one of the ideas I'm playing with that might have merit turned up these ideas:

  • ipldcbor.Node implements an interface, has roughly what you'd expect
  • ipldbind.Node implements that same interface, and works by binding to an existing Go type.
    • it can be traversed for reading, like other nodes...
    • attempts to mutate it via the Node API might work, or might be rejected: you won't be capable of putting an int into a field that's of string type on the struct{..} that's bound, for obvious reasons.
    • and most interestingly... it doesn't have a serializable form. You'd have to convert it to another kind of Node which is serializable if you wanted to generate a CID for it.

Now, like I said, that's just in a little toy experiment somewhere, and I don't actually know if it's a good idea. But maybe it's interesting food for thought, as another example of how {the way we operate on the data} versus {the codec we use for the data and hashing it} can be distinct.

/2c, I'll go back to lurking now :)

@mikeal
Contributor Author

mikeal commented Sep 10, 2018

I don't think that @Stebalien and @diasdavid are in alignment about the future of the canonical JSON representation. @diasdavid could you please weigh in so that we can move forward.

Member

@daviddias daviddias left a comment

Overall I'm on board but I would like to see some examples to ensure that we and our future selves are on the same page.

implementation of `dag-json` includes a method called `stringify()` which
returns a standard JSON string with links encoded in the canonical format.
This makes trans-encoding of nodes into other formats much easier since
they are required to accept the canonical format.
Member

@mikeal can you add a few examples to this RFC that show how objects with links will be serialized and deserialized (and then again serialized and deserialized) by the dag-json and dag-cbor formats?

It will provide a ton of clarity to implementers and users about what the expected behavior is and how dag-json differs from dag-cbor and from plain JavaScript objects (!== JSON).

@mikeal
Contributor Author

mikeal commented Sep 24, 2018

Ok, I think we're closer to alignment now, but after some recent conversations I'm thinking about re-naming/re-scoping this document.

Essentially, what we care about here is a JSON representation that can be used to convert between implementations. It's not just about links; we may want to reserve space for converting between other types in the future. To that end, I'd like to rename this to something like "Canonical JSON Representation" and also take a crack at standardizing a binary form, possibly something along the lines of {"/":{type: "binary", "base64": base64}}.
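
As a sketch only, this is roughly what round-tripping bytes through such a binary form could look like. The object shape is the tentative one floated above and was never standardized in this thread; the helper names `toReservedBinary`, `isReservedBinary`, and `fromReservedBinary` are invented for illustration.

```javascript
// Hypothetical helpers; the shape follows the tentative proposal above.
const toReservedBinary = (buffer) => ({
  '/': { type: 'binary', base64: buffer.toString('base64') }
})

const isReservedBinary = (value) =>
  value !== null && typeof value === 'object' &&
  value['/'] !== null && typeof value['/'] === 'object' &&
  value['/'].type === 'binary' && typeof value['/'].base64 === 'string'

const fromReservedBinary = (value) => {
  if (!isReservedBinary(value)) throw new TypeError('not a reserved binary form')
  return Buffer.from(value['/'].base64, 'base64')
}

// Round trip through plain JSON text, which cannot carry raw bytes itself.
const original = Buffer.from([0, 1, 2, 253, 254, 255])
const wire = JSON.stringify({ data: toReservedBinary(original) })
const decoded = fromReservedBinary(JSON.parse(wire).data)

console.log(wire)
console.log(decoded.equals(original)) // true
```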

@Stebalien
Contributor

let node = dagJSON.from(block || buffer)
let transcoded = dagSomeFormat.serialize(JSON.parse(dagJSON.stringify(node)))

The issue here is that you're using the normal JSON parser. A dagJSON deserializer should, IMO, turn the CIDs into a special link type that the dagSomeFormat serializer would understand.

One could write:

let serialized = dagJSON.stringify({
  "thing": new Cid("QmId"),
})
assert(dagJSON.parse(serialized)["thing"] instanceof Cid)
let transcoded = dagSomeFormat.serialize(dagJSON.parse(serialized))

(Cid could even have a toJSON method that converts it to {"/": ...})
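
A minimal sketch of that idea, relying on the standard behavior that `JSON.stringify` calls an object's `toJSON()` method and that `JSON.parse` accepts a reviver function. The `Cid` class here is a bare stand-in written for this example, not the real CID implementation, and `stringify`/`parse` are hypothetical stand-ins for the dagJSON functions.

```javascript
// Stand-in Cid class for illustration only; it does no validation.
class Cid {
  constructor (text) { this.text = text }
  // JSON.stringify calls toJSON() automatically when it is present.
  toJSON () { return { '/': this.text } }
}

// True for an object whose only key is "/" with a string value.
const isLinkForm = (value) =>
  value !== null && typeof value === 'object' && !Array.isArray(value) &&
  Object.keys(value).length === 1 && typeof value['/'] === 'string'

// Hypothetical dagJSON-style functions built on the standard JSON object.
const stringify = (node) => JSON.stringify(node)
const parse = (text) =>
  JSON.parse(text, (key, value) => (isLinkForm(value) ? new Cid(value['/']) : value))

const serialized = stringify({ thing: new Cid('QmId') })
const parsed = parse(serialized)

console.log(serialized) // {"thing":{"/":"QmId"}}
console.log(parsed.thing instanceof Cid) // true
```

The consumer never touches the `{"/": ...}` form directly; it only exists on the wire.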

That only works if we have some canonical representation each serializer/deserializer has reserved. If we want to just completely give up on that, we can, but we won't have a good way to transcode nodes.

I see. So you're not saying that the format necessarily needs to use this, just that if I hand {"/": Cid} to, e.g. the CBOR serializer, it should turn it into a normal CID?


This is really looking like a JavaScript UX issue, not an IPLD format issue. We do need a consistent way to represent IPLD objects in-memory in javascript, but that doesn't have to conform to the DagJSON.

@Stebalien
Contributor

Stebalien commented Sep 26, 2018

At the end of the day, my objection to this is in the motivation. If we had an "IPLD needs this" motivation and we couldn't find a reasonable alternative, I'd be fine with it (albeit really unhappy as it (the {"/": ...} syntax) really is an ugly hack). However, the current motivation is "JavaScript wants this" "JavaScript developers expect to work with JSON". That is, JavaScript developers expect the following workflow:

  1. Fetch a JSON blob from some API endpoint.
  2. Deserialize the JSON blob with JSON.parse.
  3. Stick the JSON blob into some datastore/database.

The catch is that we're working with DagJSON, not JSON.

Brainstorming solutions:

  1. Bare JavaScript objects use the DagJSON format. To use a "/" key, one would have to write something like: new Node({"/": "not a cid", "link": new Link("QmId...")}). The dag would have to detect if a node is a raw javascript object and handle it appropriately.
  2. Move away from JSON. That is, come up with an IPLD textual format. This will probably just piss off users but it'll definitely get rid of the confusion.
  3. Teach users to use DagJSON.parse(...). Kind of a footgun but possible.

Personally, I prefer option 1 but there are probably more we haven't considered.
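
Option 1 might look roughly like the sketch below. Everything in it is an assumption made for illustration: `Link`, `Node`, and `interpret` are hypothetical, and the choice that only the top level of a `Node` is taken literally (nested bare objects still follow the DagJSON rule) is one possible reading, not something settled above.

```javascript
// Hypothetical stand-ins; neither class is an existing js-ipld API.
class Link {
  constructor (cid) { this.cid = cid }
}
class Node {
  constructor (fields) { this.fields = fields }
}

const isBareObject = (value) =>
  value !== null && typeof value === 'object' &&
  Object.getPrototypeOf(value) === Object.prototype

const mapValues = (object, fn) =>
  Object.fromEntries(Object.entries(object).map(([key, value]) => [key, fn(value)]))

function interpret (value) {
  if (value instanceof Link) return value
  // An explicit Node keeps its own keys literally, including "/".
  if (value instanceof Node) return mapValues(value.fields, interpret)
  if (Array.isArray(value)) return value.map(interpret)
  if (isBareObject(value)) {
    const keys = Object.keys(value)
    // Bare objects follow the DagJSON rule: {"/": string} is a link.
    if (keys.length === 1 && keys[0] === '/' && typeof value['/'] === 'string') {
      return new Link(value['/'])
    }
    return mapValues(value, interpret)
  }
  return value
}

const fromBare = interpret({ link: { '/': 'QmId...' } })
const fromNode = interpret(new Node({ '/': 'not a cid', link: new Link('QmId...') }))

console.log(fromBare.link instanceof Link) // true
console.log(fromNode['/']) // 'not a cid'
```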


FYI:

still requires enough support for a canonical serialization that we can trans-encode nodes between codecs.

In general, we still won't be able to transcode between formats until we get a type system. There was an endeavor to try to find a set of primitives to allow for this (see: #56) but this hit a dead-end (see the comment I just added). Basically, we agreed on a set of primitives and then realized that they wouldn't quite cut it, rinse, repeat, until we realized it just wasn't going to work. Unfortunately, without a concrete set of primitives, translating between formats isn't going to happen.

@mikeal
Contributor Author

mikeal commented Sep 26, 2018

The issue here is that you're using the normal JSON parser.

You're right, this is my mistake and I shouldn't have done this.

Rather, what this should be is something close to a standard toJSON() method, which does not return a string but instead returns a value encoded only into native types that can be encoded into JSON.

I see. So you're not saying that the format necessarily needs to use this, just that if I hand {"/": Cid} to, e.g. the CBOR serializer, it should turn it into a normal CID?

Exactly. How the codec decides to encode links is completely at the codec's discretion. The codec is also free to take any object it could interpret as a Link and encode it into the Link format it chooses. All we're asking is that, if any codec serializer sees this representation {"/": String}, it should also interpret it as a Link and encode it into the standard internal format that the serializer is using for Links.

To recap:

  • We are not dictating a codec's internal representation.
  • We are not limiting the forms a codec serializer interprets as a Link.
  • We are not dictating the form a codec deserializer uses for Links.
  • We are asking that codec serializers interpret a particular expression of a Link in simple types, {"/": String}, as a Link.

However, because we are reserving the interpretation of this form in the serializer it will necessarily make it impossible to use the same form to represent something that is not a Link.
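
The recap can be sketched with a toy serializer. The names are all hypothetical: `Link` stands in for whatever native link type a codec prefers, and the `$link` key is an invented placeholder for the codec's private internal link format, not anything dag-json or dag-cbor actually uses.

```javascript
// Hypothetical native link type that this toy codec also accepts.
class Link {
  constructor (cid) { this.cid = cid }
}

// The reserved simple-types form: an object with the single key "/" and a string value.
const isReservedLink = (value) =>
  value !== null && typeof value === 'object' &&
  !Array.isArray(value) && !(value instanceof Link) &&
  Object.keys(value).length === 1 && typeof value['/'] === 'string'

// Convert every accepted link form into the codec's own internal form.
function prepare (value) {
  if (value instanceof Link) return { $link: value.cid }
  if (isReservedLink(value)) return { $link: value['/'] }
  if (Array.isArray(value)) return value.map(prepare)
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, child]) => [key, prepare(child)])
    )
  }
  return value
}

const serialize = (node) => JSON.stringify(prepare(node))

const fromNative = serialize({ child: new Link('QmId') })
const fromReserved = serialize({ child: { '/': 'QmId' } })

console.log(fromNative === fromReserved) // true
console.log(fromReserved) // {"child":{"$link":"QmId"}}
```

The last output also shows the cost noted above: once the form is reserved, `{ '/': 'some string' }` can no longer be stored as an ordinary map.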

However, the current motivation is "JavaScript wants this" "JavaScript developers expect to work with JSON".

Again, my apologies for relying on the standard parser in my example.

I think that this use case, transcoding nodes from one codec to another, is a broader need than just in JS.

The closest thing we have to a cross-language basic type system is JSON. Every language supports JSON and has a way to represent JSON types as types native in that language and encode those same types back into JSON. In a way, this isn't actually a "canonical JSON representation" it's a "canonical simple types representation." We're saying, "the language you write a serializer in will support these basic types, please interpret this encoding of links in those simple types as a link."

If we said that "IPLD Types: Level 0" is just the types that are in JSON, we would describe this with language along the lines of "this is how you describe a Link in strictly L0 types."

I hope that clears things up. This conversation makes it clear that this particular document needs to be scrapped as the approach the document has taken is confusing.

Instead, I'm going to define IPLD terminology generally in a document, which will include the definitions at the top of this document related to codec serializers and formats. Once that lands I'll take a pass at an RFC for how to support transcoding links (and maybe binary).

@mikeal
Contributor Author

mikeal commented Sep 28, 2018

Closing.

Canonical representations are out. ipld/ipld#50

5 participants