[Feature]: Generate blob parts to support embedding blobs (aka files) #57
Comments
cbor-x is generally synchronous because there are significant performance regressions involved in making things asynchronous. However, one idea is that you could possibly allocate buffer space in an encoding and then stream data into (and out of?) that allocated space (assuming you know the size a priori). Also, are you potentially looking for a way to not hold the entire encoding in memory at once (treat it as a stream)? That often goes hand in hand with the need for asynchronicity.
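For illustration, a minimal sketch (plain JS, not cbor-x API) of what "reserve space up front, then stream into it" could look like when the blob's size is known a priori; the function name and shape here are assumptions:

async function fillReservedRegion(target, offset, blob) {
  // `target` is a pre-allocated Uint8Array holding the encoding and
  // `offset` is where the reserved byte-string region for this blob starts
  let pos = offset
  // assumes a runtime where blob.stream() is async-iterable (e.g. Node.js)
  for await (const chunk of blob.stream()) {
    target.set(chunk, pos)
    pos += chunk.byteLength
  }
  return target
}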
I now know exactly how i want this to work instead! The Blob object represents a blob of immutable, raw data; blobs can represent data that isn't necessarily in a JavaScript-native format, and the File interface expands it to support files on the user's system. So when you get a file from e.g. a file input:

const file = input.files[0]
console.log(file.size) // 2 GiB

const blob = new Blob([file, file])
console.log(blob.size) // 4 GiB (but you still haven't allocated any memory)
// the blob just holds 2 references pointing to where it should read the content from

So, I really don't want to read or allocate memory for the content of the blob if i don't need to. When i want to embed a file/blob into a cbor payload, then i would like it to simply be the same thing as if i did:

const file = await getFileFromSomewhere()
const arrayBuffer = await file.arrayBuffer()
cbor.encode({ content: arrayBuffer })
// would be the same thing as if i did
cbor.encode({ content: file })

...so no special tag attribute that says it's a Blob or a File. So when i would encode this:

uint8 = cbor.encode({
  content: new Uint8Array([97,98,99]).buffer
})
/*
Then i would get back:
b9000167636f6e74656e7443616263
b9 0001 # map(1)
67 # text(7)
636f6e74656e74 # "content"
43 # bytes(3)
616263 # "abc"
*/
// Same result could be produced using a blob instead:
cbor.encode({
content: new Blob([ new Uint8Array([97,98,99]).buffer ])
}).toString('hex') === 'b9000167636f6e74656e7443616263'

But instead of giving me back a single large buffer and having to read the content of the blob, i would like to get back blob parts: an array of either Uint8Arrays or blobs.

const uint8 = new Uint8Array([97,98,99]) // abc
const blob = new Blob([ uint8 ])
const chunks = cbor.encode({ content: blob })
console.assert(chunks.length === 2, 'number of chunks getting back from encoding is 2')
console.assert(chunks[0] instanceof Uint8Array, 'first piece is a uint8array')
console.assert(chunks[1] === blob, 'the 2nd piece is the same blob i encoded with')

The result of encoding would be:

[
  new Uint8Array([ 185, 0, 1, 103, 99, 111, 110, 116, 101, 110, 116, 67 ]),
  new Blob([ 'abc' ])
]

And nothing would ever have to be read into memory, it would not even have to allocate any memory for those blobs. you would get an ultra fast encoder for adding things such as blobs into the mix. now it would of course be up to you to send this response payload to a server or a client and/or save it somewhere:

const blob = input.files[0]
// console.log(await blob.text()) // abc
const blobParts = cbor.encode({ content: blob })
console.log(Array.isArray(blobParts)) // true
const finalCborData = new Blob(blobParts)
// const arrayBuffer = await finalCborData.arrayBuffer()
fetch('/upload', { method: 'POST', body: finalCborData })
Sorry to write so much, but i wanted to over-simplify stuff to tell you what i want to happen. I want this logic so badly right now! Blob support exists now in both Deno and NodeJS, and for NodeJS i have also built the fetch-blob library that lets you get blobs from the file system without ever having to read the content of the files until you really need it. NodeJS also plans on implementing something where you are able to get a blob from the file system...
The result could even be something like:

[ subArray1, blob, subArray2 ] // blob parts

where subArray1.buffer === subArray2.buffer, i.e. the subarrays use the same underlying ArrayBuffer.
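As a tiny illustration of that last point (not cbor-x output, just the underlying typed-array mechanism):

const backing = new Uint8Array([1, 2, 3, 4, 5, 6, 7, 8])
const subArray1 = backing.subarray(0, 4) // view, no copy
const subArray2 = backing.subarray(4, 8) // view, no copy
console.log(subArray1.buffer === subArray2.buffer) // true, both share the same ArrayBuffer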
This does seem like a good idea. I think one problem with this approach is that I don't necessarily like the type inconsistency of having encode usually return a Buffer except when there is a Blob somewhere in the object graph. Instead, I think it might make more sense to have an encodeAsBlob function that always returns a Blob.
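A hedged sketch of how that could be used (the encodeAsBlob name comes from the comment above; its exact behaviour here is an assumption):

// assuming `input` is an <input type=file> element and `cbor` is the cbor-x module
const file = input.files[0]
// hypothetically returns a Blob whose parts reference `file` instead of copying its bytes
const payload = cbor.encodeAsBlob({ content: file })
// the Blob could then be uploaded without materializing the file in memory
fetch('/upload', { method: 'POST', body: payload })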
Yea, i would like that.
Yea, i figured as much. i just wanted to describe what i wanted to be able to do. And after that we could discuss the method name and/or options to the constructor. I can give a few suggestions:

opt 1: I was thinking of maybe: what if, instead of a Buffer, it returned a blob directly?

blob = cbor.encodeAsBlob({ content: blob })

opt 2: if we are going to return a

new Blob([
  new Uint8Array([ 185, 0, 1, 103, 99, 111, 110, 116, 101, 110, 116, 67 ]),
  new Blob([ 'abc' ])
])

then maybe a different name for it would fit better.

opt 3: if we are going to have a new name for encoding to this new format, why not just go ahead with something that's more stream/RAM friendly right from the bat?

// Somewhere in core
cbor.encodeAsBlobIterator = function * (content) {
while (encoding) {
yield blobPart
}
}
const iterable = cbor.encodeAsBlobIterator({ content: blob })
// stream.Readable.from(iterable).pipe(dest) // node solution
// globalThis.ReadableStream.from(iterable).pipeTo(dest) // web stream solution
// plain iterator
for (const blobPart of iterable) {
  // blobPart would be either an ArrayBuffer, a Uint8Array, or a Blob
  // or even a subarray of a Uint8Array
}

This way you could also potentially reset the allocated buffer, reuse the same buffer, and reset the offset to zero again.

chunks = []
for (const blobPart of iterable) {
  if (blobPart instanceof Uint8Array) {
    chunks.push(blobPart)
  } else {
    // it's a blob
  }
}
chunks[0] === chunks[1] // true

This ☝️ would maybe be confusing for most users, as they would have to manually slice/copy the chunks themselves if they need to keep them... but it would be a good performance win. Also, with the new iterator helpers proposal you wouldn't simply be able to do iterator.toArray() for instance; instead you would have to copy each chunk yourself (see the sketch below).
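A sketch of the copying a consumer would need if the encoder reused its internal buffer between yields (assuming the hypothetical iterable from opt 3; blobPart.slice() copies the bytes out of the shared buffer):

const safeChunks = []
for (const blobPart of iterable) {
  if (blobPart instanceof Uint8Array) {
    // copy, because the encoder may overwrite this region before the next yield
    safeChunks.push(blobPart.slice())
  } else {
    // blobs are immutable references and safe to keep as-is
    safeChunks.push(blobPart)
  }
}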
I am not sure I understand the difference between your opt 1 & 2. I was suggesting that I return a Blob instead of an array (that can be passed to the Blob constructor), and that looks like your opt 1 & 2 (opt 2 is showing the array passed to the Blob, but still returns a Blob). And yes, opt 3 would be really good, and I have other uses for this as well (with embedded iterators). This is certainly more complicated though. I am doubtful that I can convert the encode code to generators without significant performance regression (for current encodings). I think it may be viable to have a mechanism that detects embedded blobs/iterators, and throws and reruns with specific knowledge of where a generator is needed (and caches the plan), but again, this will require some experimentation. Also, I don't think the iterator would need to return anything other than Uint8Arrays (that's what transitively streaming embedded Blobs would yield, at least by default, right?)
opt 1 would return a blob directly and you wouldn't have to pass it into a blob constructor to get a blob
don't know what "transitively streaming" means. iterators don't return anything, they yield stuff. but yea, i guess they would yield Uint8Arrays. it could also yield blobs. here is some pseudo code:

/**
* @return {Iterable<Uint8Array | Blob>} returns an iterator that yields either a uint8array or a blob
*/
cbor.encodeAsBlobPartsIterator = function * (content) {
let offset = x
...
if (current_value instanceof Blob) {
// Append bytes that says it's a byte array and what the size is
current_serialized_uint8array.set(...)
// yield what we already have in store
yield current_serialized_uint8array.subarray(x, y)
// yield a blob or a file (could also do `yield current_value.slice()`, but what's the point?)
yield current_value
// reset the offset to 0 and start filling in the buffer again to reuse an already allocated arrayBuffer
offset = 0
}
...
}

ofc you could do something like:

// stream the blob
yield * current_value.stream()

Then it would only yield Uint8Arrays, but that also means the generator function would have to be async, which is not really what i would want, and i don't think you want that either.
are generators really that slow? i haven't bench tested it or anything... maybe it could be done with callbacks instead? I think generators add functionality and extra features at the expense of some performance loss. they make it easy to create readable streams out of iterators when you need them.
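For what it's worth, a rough micro-benchmark sketch of generator vs callback overhead (plain JS; the numbers will vary a lot by engine and by how much work each chunk represents):

function* genChunks(n) {
  for (let i = 0; i < n; i++) yield i
}
function callbackChunks(n, cb) {
  for (let i = 0; i < n; i++) cb(i)
}

const N = 10_000_000
let sum = 0

console.time('generator')
for (const v of genChunks(N)) sum += v
console.timeEnd('generator')

console.time('callback')
callbackChunks(N, v => { sum += v })
console.timeEnd('callback')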
@kriszyp do you have any plans to start working on this?
I do have plans to start working on this (actually already have started a little bit), as we actually need similar functionality for new features at my job too (more for encoding from iterators, but they will be treated roughly the same as blobs). Of course you are welcome to take a stab at it. FWIW, #61 would also make a helpful PR if you think having a (separate) module to do re-sorting of objects would be useful.
cool, then i will wait for a new feature release 👍 🙂
You can take a look at the associated commit for my first pass at an implementation (there is some cleanup to do, but I believe it basically works).
tested it out just now. it's able to handle blobs just fine, as i wished for. i just don't know what output would be expected for something like:

const blob = new Blob(['abc'])
encodeAsIterator([blob])
encodeAsAsyncIterator([blob])

i got a bit of a concern about this when i saw Lines 762 to 780 in 3f2e2a2 and having that ☝️ at the very top. Many things have Symbol.iterator (even arrays, strings, Map and Set, typed arrays) and those things have a known length. so the following produces two different results:

const it = encodeAsIterator([123])
for (let chunk of it) console.log(chunk)
console.log('----')
console.log(encode([123]))
I do not see any reason why the iterator could not produce the same result as plain encode for values with a known length... or even one single Uint8Array (as long as it has some room for it). this one was a bit unexpected:

const blob = new Blob(['abc'])
// const ab = await blob.arrayBuffer()
const sync = encodeAsIterator(blob)
const asy = encodeAsAsyncIterator(blob)
for await (const c of sync) console.log(c)
console.log('---')
for await (const c of asy) console.log(c)

the sync and async iterators gave me different results than i expected.

I think you forgot to do this in the sync iterator as well: Line 829 in 3f2e2a2
there is just one thing i'm wondering about... right now when i use the iterator i get back lots of small chunks. I bet some things could be sped up if cbor just gave me one single uint8array that could include many values at once.
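For reference, a small consumer-side helper that concatenates the Uint8Array chunks into a single buffer (a workaround sketch, not part of cbor-x):

function concatChunks(chunks) {
  const total = chunks.reduce((sum, chunk) => sum + chunk.byteLength, 0)
  const out = new Uint8Array(total)
  let offset = 0
  for (const chunk of chunks) {
    out.set(chunk, offset)
    offset += chunk.byteLength
  }
  return out
}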
just taking a quick glance at the commit, it feels like quite a lot of logic is duplicated. The async iterator could just be a thin wrapper around the sync iterator:

async function * encodeAsAsyncIterator(value) {
  const syncIterator = encodeAsIterator(value)
  for (const chunk of syncIterator) {
    if (chunk instanceof Uint8Array) yield chunk
    else if (chunk instanceof Blob) {
      yield * readableStreamToIterator(chunk.stream())
      // or just yield * chunk.stream() if it were supported... (only works in NodeJS atm)
      // or alternatively add a polyfill for ReadableStream[Symbol.asyncIterator]
    }
    else {
      // or maybe it's a read function that returns an asyncIterator? where something could have `yield blob.stream`
      // const read = chunk
      // yield * read()
      // or maybe it's another async iterator
      // const reader = chunk
      // yield * reader
    }
  }
}

fyi, here is a nice polyfill: https://github.com/ThaUnknown/fast-readable-async-iterator/blob/main/index.js
Yes, you are right, I had not intended to use indefinite length for arrays, that should be fixed.
Yes, that should be fixed.
Yes, you are right, that was very inefficient. The latest version should be much more efficient about collecting bytes and returning them in larger chunks.
If the goal is to reduce unnecessary code, it doesn't seem like the 12 lines of code in the polyfill are an improvement on my 5 lines of code. Or is there something else you are wanting here?
👍
Hmm, discard that, i just saw that you already do kind of what i suggested:

async function* encodeObjectAsAsyncIterator(value, iterateProperties) {
  for (let encodedValue of encodeObjectAsIterator(...)) {
I'm happy with where it's at.
Now it is very simple to encode large files in a cbor format:

new Blob([ ...encodeAsIterator(fileList) ]).stream().pipeTo(dest)
fetch('/upload', { method: 'post', body: new Blob([ ...encodeAsIterator(fileList) ]) })

It's a good alternative to the old FormData that needs a boundary. Decoding formdata is also pretty easy in nodejs v18 now, since you no longer need any library to decode formdata payloads:

const fd = await new Response(incomingReq, { headers: incomingReq.headers }).formData()
for (const entry of fd) { ... }
Now, decoding cbor with 4+ GiB large files, on the other hand... could that be improved somehow? i see two options; i'm thinking something like:

const ab = new ArrayBuffer(1024)
const blob = new Blob([ab])
const cborPayload = new Blob([ encode({ content: blob }) ])
// design 1, solution of reading a blob
decode(cborPayload, (t) => {
if (t.token === 'byteArray') {
// return a slice of the original cborPayload
// and skip reading `t.size` --- sets: t.offset += t.size
return cborPayload.slice(t.offset, t.offset + t.size)
}
}).then(result => { ... })
// design 2, solution of reading a blob
decode(cborPayload, async (t) => {
if (t.token === 'byteArray') {
const iterator = t.createReadableIterator()
const rs = ReadableStream.from(iterator)
const root = await navigator.storage.getDirectory()
const fileHandle = await root.getFileHandle(t.key, { create: true })
const wr = await fileHandle.createWritable()
await rs.pipeTo(wr)
return fileHandle.getFile()
}
}).then(result => { ... })

I do not know... maybe you want to read a stream or iterator instead...

decode(cborPayload.stream(), async (t) => { ... })

Either way, some kind of way to fuzzy search and skip reading x amount of bytes would be a cool lower level solution for anyone who wishes to have some kind of way to search inside of cbor. I'm just shooting out ideas, but this is a topic for another issue/feature idea.
i realize i should probably try to write my own tag extension to really learn how cbor-x works and what it's capable of. haven't done that yet, and i probably should. i think i want to try and write a tag extension for Blob/File representation now:

addExtension({
Class: File,
tag: 43311, // register our own extension code (a tag code)
encode (file, encode) {
// define how your custom class should be encoded
encode([file.name, file.lastModified, file.type, file.slice()]);
},
decode([ name, lastModified, type, arrayBuffer ]) {
// define how your custom class should be decoded
return new File([arrayBuffer], name, { type, lastModified } )
}
});

haven't tried this ☝️ yet, but i assume that's how you write extensions.
Would it be a circular problem if i tried to write something like this, now that cbor-x supports encoding blobs as per this particular new feature that you have implemented?

addExtension({
Class: Blob,
tag: 43311, // register our own extension code (a tag code)
encode (blob, encode) {
encode([ blob.type, blob.slice() ]);
},
decode([ type, arrayBuffer ]) {
return new Blob([arrayBuffer], { type } )
}
})
The encodeAsIterator (and encodeAsAsyncIterator) should be published in v1.5.0. And yes, I think it would be nice to eventually support iterative decoding in the future as well, which could allow for decoding streams/iterators with >4GB of data (and I think another valuable use case could be progressively decoding remote content without having to wait for all data to download). And yes, it would make sense that if you were decoding a stream or blob, any embedded binary data would likewise be returned as a stream or blob.
As in partial HTTP range requests 🙂
Hi again. i got one small request, if you wouldn't mind. i tried using my fetch-blob impl, and the issue with it is that it's more arbitrary than other native blob implementations (in the way that you can create blobs that are backed by the filesystem, or any blob look-a-like), so they are not really instances of NodeJS's own Blob class. I have thought about extending NodeJS's built-in Blob and overriding all properties, but i can't really do that. So i was wondering if you could maybe do duck-type checking to check if it matches a blob signature?

import { Blob } from 'node:buffer'
import { blobFromSync } from 'fetch-blob'
/** @returns {object is Blob} */
const isBlob = object => /^(Blob|File)$/.test(object?.[Symbol.toStringTag])
const readme = blobFromSync('./package.json')
isBlob( readme ) // true
isBlob( new Blob() ) // true
readme instanceof Blob // false

I do really wish nodejs/node#37340 got resolved, or that something like Blob.from(...) ever becomes a solution.
fyi, i just want to share that NodeJS has shipped ...
scroll down to my 3rd comment: #57 (comment)
Original post
If i have something that needs to be read asynchronously or with a stream, can i do that then? I'm thinking of ways to best support very large Blob/File tags...
Here is some wishful thinking: