-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dilated File Transfer #23
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a first quick initial pass on it. There is a lot of conceptual overlap with my transfer-v2 proposal, and I think a few features of the latter could easily be integrated in here.
dilated-file-transfer.md
Outdated
They are actually an encoded `Map` with `String` keys (to use `msgpack` lingo) and values as per the pseudo-code. | ||
|
||
All control-channel messages contain a integer "kind" field describing the sort of message it is. | ||
(That is, `"kind": 1` for example, not the single-byte tag used for subchannel messages) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am strongly in favor of using strings instead of integers for all enums and other type discriminants, here and below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because integers are just another layer of indirection. You need to specify which integer means what, instead of giving things a declarative name. Many serialization frameworks (e.g. serde, gson) support automatic naming based on the identifier name in the code, so this reduces both programming effort and possibilities for mistakes. Furthermore, the space of strings is a lot larger than just single byte integers, so we don't need to do things like having to reserve IDs for purposes and stuff like that.
The only advantage of using integers I can see is size/efficiency, and I don't think this is a concern for us here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a fixed-length field for this in subchannel messages does have advantages since the vast majority of data down those channels will simply be payload bytes of file data. So, you save an additional pass over that data and a second copy in memory.
No matter what, the spec still needs to define a mapping from "thing" to "meaning" whether "thing" is 0x02
or "payload-data"
.
Ideally we'd just get rid of this part entirely and base the meaning of records on their position (e.g. "first one is the offer", etc) which would allow writing the file-data blobs directly. The implications of this if a stream got 'out of sync' -- e.g. one side thinks "this is msgpack data" while other thinks "it's file data" could be interesting to consider -- but also, that shouldn't happen (unless e.g. metadata is wrong about the file size etc).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just had a quick read through and it looks sensible to me. I'd personally be in favour of including compression (more specifically ZSTD compression) as part of the protocol at this time. However, I understand that it also might make sense to handle compression later on.
thanks @Jacalz
I did look at that, and agree that Zstandard sounds like a good approach. I left it for a future enhancement (likely as a "feature") for several reasons:
I would like to have compression though. If others agree that's "not too much more complexity" I could spec that out as well in this draft (as the one and only "feature" for now)...? Vote with emojis:
|
If compression is done as a "feature" it still allows implementations to "upgrade slowly". Idea being, you can implement just "mode=send" and "mode=receive" for essentially feature-parity with classic transfer (this should mean minimal changes to UX etc). This still means multiple files, if the sender offers more than one, so it's not precisely the same... Then, you can implement "mode=connect" (possibly with very different UX) for multiple, bi-directional transfers. Some tools may never implement this mode (e.g. (If we specify compression) Separate from both the above, you can implement compression if desired. It could also be "implemented" but not used in some situations (e.g. smaller computers) as it will be a "feature". Perhaps the above should be a section in the spec? |
In #1, I proposed a simple "format" option for compression. This is fairly simple, under the assumption that the compressed data can be decompressed without requiring any information about the compression settings (this is usually the case). It uses the usual feature intersection which makes it both opt-in and future proof for format additions. The only real open question is how to deal with folders. Compressing folders as one stream is very likely to be more efficient (no new dictionary per file, deduplication across files), so we probably want that. The downside is that the individual files cannot easily be separated prior to decompressing. This is not a huge deal IMO, but it makes the "send file header then bytes" approach unfeasible. I propose to skip the individual file headers in directory transfers and just concatenate the (compressed) bytes. The order of the file is deterministic and dictated by the send offer. The checksum message might be modified so that it is one message with a list of checksums, one for each file. |
No, you can still separate the "compressed bytes for file 1" from the "compressed bytes for file 2" as zstandard allows a "flush" operation. I have tested this. Many compressions also offer their own container format (as does zstd) but you don't have to use these.' ...and yes, we'll want to use the same compression context for as much as feasible as that gives better compression (usually). (This is one for the "there are subtle things" point and partly why I wanted to skip this until later .. I don't think we should include more than one offer per compression context, because some implementations may process each offer / subchannel in a thread and insisting on one compression-context wouldn't allow that). |
I agree, and actually I never thought about that before. To me, different transfers are independent in every sense of the word, and sharing a compression context would create interdependence between them. In theory, one could make them share compression without making them dependent by providing some kind of pre-trained dictionary prior to the transfer. I don't know how common this is across compression formats, but at least zstd supports it. However, I'm against this for the same reason I rejected that approach for sending individual files: It adds quite a bit of complexity which I'd like to avoid. |
Just a random side-note here. Should we consider using |
The problem with ProtoBuf is that it does not have first-class support for self-describing messages: https://developers.google.com/protocol-buffers/docs/techniques#self-description msgpack is conceptually a drop-in replacement from JSON, whereas trying to use ProtoBuf would be an entirely different beast. I haven't looked at ProtoBuf (& friends) a lot though, so I'd appreciate some analysis on how it could be integrated into Magic Wormhole. |
As specified, we're only using JSON for the "already built" parts that currently must use it (e.g. version exchange). That has to allow open-ended contents (and anyway would be a new mailbox protocol too). For protocol messages related to just this application protocol the preliminary conclusion is "msgpack". CBOR2 is also similar (and has better performance, IIRC). I would not consider ProtocolBuffers; it uses pre-compiled specs and generated code, but without the advantages of Cap'n'Proto (essentially its successor) or flatbuffers (more-actually the successor because it's Google too). It supports fewer languages than either. For "no-parse" options, I'd say "flatbuffers" is probably the front-runner. It supports many more languages, is "no-parse" and doesn't bundle other concerns like cap-n-proto does. It still has the disadvantage of requiring generated code though. We could use specified messages like this, but consider "features" that have optional fields: then we'd have to have different versions of the flatbuffer specs for each feature. On the one hand, those conceptually exist anyway -- but can be handled by "if" statements easily instead of completely separate flatbuffers specs. (e.g. consider the "thumbnail" example in the "protocol expansion" section -- with msgpack/cbor2 one can do something like "if thumbnails on, include thumbnail in the Offer message" whereas with flatbuffers you'd need |
```python | ||
class FileOffer: | ||
filename: str # filename (no path segments) | ||
timestamp: int # Unix timestamp (seconds since the epoch in GMT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the timestamp
the mtime
from stat()
? If so, may be we can add some text in that regard?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah .. added some words near here to clarify that.
Peers MUST remember the stat information (at least size + modification-time) of each file in the offer. | ||
When sending file data peers MUST compare the current size + modification-time to that recorded when making the offer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so looks like the above timestamp
value is indeed mtime
.
filename: str # filename (no path segments) | ||
timestamp: int # Unix timestamp (seconds since the epoch in GMT) | ||
bytes: int # total number of bytes in the file | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just thinking about file permissions here. Do we want to also send file permissions (the mode
bits in stat()
) or whether we should just rely on the user's umask for the file creation permissions and attempt to use a sane default like 0644
or something. The current file transfer protocol does not send file permissions. For directories, I think it is left to the clients on the permissions of the base directory. The python client currently uses the zipfile
module to extract the zip and each file's permission stored in the extended bits in the file header for each file in the zip is used to restore the original permissions. Storing permission bits in extended bits is not really specified by zip spec, just that zipfile
module uses those bits for this purpose.
Just thinking aloud here, perhaps a file/directory transfer program shouldn't bother about permissions etc and just leave it to the client's discretion..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't stop with permissions though. There are also access control lists (acl). It should support either nothing or everything. Maybe it could be an optional extension.
Permissions are also different for every platform. Windows doesn't use the POSIX permission model at all for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. This is kind of a complex topic and gets a little bit into the "philosophy" here -- that is, this is "a file-transfer protocol" at heart (and not, for example, a "filesystem synchronization protocol" such as rsync
).
I did consider for example something like encoding "execute": True
or similar (e.g. "some" of the permissions) but even that seems a bit open for abuse. On the flip side, if you're sending a bunch of shell scripts from one computer to another it might be boring/surprising to have to chmod +x *.sh
after -- but ultimately safer.
And, as @felinira notes, different platforms have different permissions (and other attributes) available.
These same philosophic considerations come into play with the "symlink" question too -- where the classic protocol takes a more "file transfer" approach versus "synchronize filesystems" (by sending the contents of links).
Using that as guidance, it probably makes sense to not try and transmit permissions? I'm still a little torn about "execute" just from a convenience point of view, but ...overall I think pushing this to "possible future "feature"
" seems best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking further, even something "simple" like transferring "group read permission": it's quite likely that the user on one computer is a different user (with different group-membership) on the other computer -- and, it could even be an OS / filesystem that doesn't even have the concept of "group".
So, yet more reasons to not try and transfer these sorts of attributes. Client implementations should be encouraged to select good defaults for their OS/etc conditions.
I have a very rough proof-of-concept implementation in the Python client and here post updated Automat-produced state-machine diagrams. These are obviously tied to a particular implementation but I think are still generic enough to be useful. In case it's not obvious, a This only supports File transfers right now, not directories or any control messages (i.e. "text messages"). |
Checking back in here, although the PoC code doesn't support multiple files yet I think we should decide on magic-wormhole/magic-wormhole-mailbox-server#31 before finalizing this transfer spec (and, perhaps that ticket would really be better in this repository in hindsight). |
@abitrolly this refers to the "Dilation" feature in Magic Wormhole -- which I guess is supposed to get at the notion that you "make the wormhole bigger and more useful"...? See magic-wormhole/magic-wormhole#445 for WIP / PoC |
@meejah that's nice, but also it doesn't explain anything AT ALL. :D |
@abitrolly can you state that as a question? I thought you were asking why "Dilated" is in the title? |
Specification for a new file-transfer protocol based on Dilation.