A schema for collections? #40

bmcfee · 2015-06-09T14:09:02Z

Going back to this comment, we punted on the idea of managing extrinsic data (e.g., file paths) explicitly from within a JAMS object. Now that the dust has settled a bit on the JAMS schema, I'm wondering if we can come up with a better solution than sandboxing this stuff.

I bring this up because maintaining links between audio content and annotations is still kind of a pain, and I'd prefer to not solve it over and over again.

How do people feel about introducing an interface/schema for managing collections of jamses? At the most basic level, this would provide a simple index of audio content, jams content, and collection-level information. (It might also be useful to index which annotation namespaces are present in each jams file.) This kind of thing can spiral out of control easily, so if we do it, we should keep it tightly scoped.

ejhumphrey · 2015-07-14T11:44:23Z

How's about a FileManager object that inherits from a dict or list, depending on whether or not key or integer-based indexing makes sense (I typically use, and prefer, key-based indexing so you're robust to shuffling / partitioning), and contains a FileCollection, consisting of fields which point to any number of file paths.

As an added bonus, we / the user could register different load / open methods with filetypes for transparent (lazy) loading, i.e. "npz" -> np.load, "jams" -> jams.load, etc.
For example...

fmgr = FileManager()
fmgr['my_song'] = FileCollection(
    audio='/path/to/my/song.wav', 
    annotation='/a/different/file.jams',
    features='/data/features/my_song.npz')

# Assuming 'npz' -> np.load by default
data = fmgr['my_song'].features.load()
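A minimal sketch of how the proposal above might look, under stated assumptions: `FileManager`, `FileCollection`, `LazyFile`, `register_loader`, and `LOADERS` are hypothetical names that do not exist in jams, and `json.load` stands in for `np.load` / `jams.load` so the example needs only the standard library.

```python
import json
import os

# Hypothetical registry mapping a file extension to a loader callable.
LOADERS = {}


def register_loader(ext, func):
    """Associate a file extension (with or without the dot) with a loader."""
    LOADERS[ext.lower().lstrip('.')] = func


class LazyFile:
    """Holds a path; nothing is read from disk until load() is called."""

    def __init__(self, path):
        self.path = path

    def load(self):
        ext = os.path.splitext(self.path)[1].lstrip('.').lower()
        if ext not in LOADERS:
            raise KeyError('no loader registered for .{}'.format(ext))
        return LOADERS[ext](self.path)


class FileCollection:
    """Named fields, each pointing at one file path."""

    def __init__(self, **paths):
        for field, path in paths.items():
            setattr(self, field, LazyFile(path))


class FileManager(dict):
    """Key-based index of FileCollection objects."""


def load_json(path):
    with open(path) as fh:
        return json.load(fh)


# Stand-in registration; 'npz' -> np.load or 'jams' -> jams.load
# would be registered the same way.
register_loader('json', load_json)
```

With this in place, `fmgr['my_song'].annotation.load()` reads the file only at call time, and the extension of the stored path picks the loader.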

Additionally, if everything inherits from JObject, then this database-style object can be saved / loaded just as easily.

Thoughts?

bmcfee · 2015-07-14T13:36:19Z

> How's about a FileManager object that inherits from a dict or list, depending on whether or not key or integer-based indexing makes sense (I typically use, and prefer, key-based indexing so you're robust to shuffling / partitioning)

I'd argue that int-based indexing never makes sense, unless the int is actually treated as a key (e.g., in GTZAN).

It may also be worth looking at something like asdf for inspiration, since they have many of the same problems we do.

> As an added bonus, we / the user could register different load / open methods with filetypes for transparent (lazy) loading, i.e. "npz" -> np.load, "jams" -> jams.load, etc.

I like this idea, but transparent loading seems a little tricky to get exactly right. Ideally, I'd want to be able to clobber load arguments (such as audio sampling rate). This could be supported pretty easily by setting defaults on kwargs, but the resulting api may be kind of a mess.
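One way the "defaults on kwargs" idea could be sketched, assuming a hypothetical `Loader` wrapper; `fake_audio_load` is a stand-in that only echoes its arguments, not a real audio loader.

```python
class Loader:
    """Wrap a loader function with default keyword arguments.

    Keyword arguments given at call time override the stored defaults.
    """

    def __init__(self, func, **defaults):
        self.func = func
        self.defaults = defaults

    def __call__(self, path, **overrides):
        kwargs = dict(self.defaults)
        kwargs.update(overrides)  # call-time values clobber the defaults
        return self.func(path, **kwargs)


def fake_audio_load(path, sr=22050, mono=True):
    # Hypothetical stand-in: returns its arguments instead of reading audio.
    return {'path': path, 'sr': sr, 'mono': mono}


# Collection-wide default sampling rate, still overridable per call.
audio_loader = Loader(fake_audio_load, sr=44100)
```

Here `audio_loader('song.wav')` uses `sr=44100`, while `audio_loader('song.wav', sr=8000)` clobbers it for that one call. The messy part alluded to above is deciding where such overrides are passed once loading is hidden behind attribute access.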

Maybe we should ponder on that a bit.

bmcfee · 2015-09-14T15:06:36Z

Circling back on this after a bit of pondering.

 fmgr = FileManager()
 fmgr['my_song'] = FileCollection(
     audio='/path/to/my/song.wav', 
     annotation='/a/different/file.jams',
     features='/data/features/my_song.npz')

This looks exactly like a dataframe to me.

 # Assuming 'npz' -> np.load by default
 data = fmgr['my_song'].features.load()

How about something a little less objecty? I like your idea of having a dispatch object that can map a key (e.g., features) to a loader function (np.load). Why does that need to be attached to the object? We could just as easily construct the dispatcher as an object, and feed it a data frame where keys correspond to samples, and each column is a field that can be loaded via dispatch.

This way, we don't have to worry about schematizing the whole thing, and it becomes much easier to import data sets on the fly. (We can also tag along non-loadable fields at the same level, such as an artist id for split filtering.)
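A rough sketch of the dispatcher-over-a-table idea, with assumptions labeled: `Dispatcher` is a hypothetical name, the loaders are toy stand-ins, and a plain dict of rows stands in for the data frame so the example has no third-party dependency (with pandas, the lookup would be something like `df.loc[key, column]`).

```python
class Dispatcher:
    """Map column names to loader functions.

    Columns without a registered loader are returned as stored,
    which covers non-loadable fields such as an artist id.
    """

    def __init__(self, loaders):
        self.loaders = dict(loaders)

    def load(self, table, key, column):
        value = table[key][column]
        loader = self.loaders.get(column)
        return value if loader is None else loader(value)


# Stand-in for a data frame: one row per sample, one entry per column.
table = {
    'my_song': {
        'audio': '/path/to/my/song.wav',
        'features': '/data/features/my_song.npz',
        'artist_id': 17,
    },
}

# Toy loaders that tag the path instead of reading a file.
dispatch = Dispatcher({
    'audio': lambda path: ('audio', path),
    'features': lambda path: ('features', path),
})
```

The table stays plain data that can be filtered or split by `artist_id` before anything is loaded, and the loading behavior lives entirely in the dispatcher.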

bmcfee · 2016-02-01T16:05:32Z

Punting this to #98

bmcfee · 2018-05-31T15:30:12Z

Having thought on this for years at this point, I think the reasonable course of action here is as follows:

1. Implement the unified schema refactor proposed in "RFC: more rigid, but simpler schema validation" (#178).
2. Expose the schema over the web with proper versioning and references.
3. Any collection-level schema can be built using references to (2). Objects can be sharded and linked by uuids at the collection level, but the objects themselves do not need to contain identifiers. This way, the schema can be backward-compatible.
4. Provide a standard implementation / example schema for managing jams collections in mongodb (the famed jamongo) using the above.
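To make the collection-level idea concrete, here is a small sketch under assumptions: `JAMS_SCHEMA_URL` is a made-up placeholder for wherever a versioned schema would be hosted, and `COLLECTION_SCHEMA` / `make_collection` are hypothetical names. The point it illustrates is that uuids live only in the collection index, so the JAMS objects themselves are stored unchanged.

```python
import uuid

# Hypothetical location of a versioned, web-hosted JAMS schema.
JAMS_SCHEMA_URL = 'https://example.org/schema/jams/0.0.0/jams.json'

# Collection-level schema: a mapping from uuid strings to JAMS objects,
# where each object is validated by reference to the hosted schema.
COLLECTION_SCHEMA = {
    'type': 'object',
    'properties': {
        'objects': {
            'type': 'object',
            'additionalProperties': {'$ref': JAMS_SCHEMA_URL},
        },
    },
    'required': ['objects'],
}


def make_collection(jams_objects):
    """Index JAMS-like dicts by freshly generated uuid strings.

    The objects are stored as given; no identifier is written into them.
    """
    return {'objects': {str(uuid.uuid4()): obj for obj in jams_objects}}
```

Because identifiers are assigned only at the collection level, existing JAMS files would remain valid against the object schema without modification.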

bmcfee · 2018-06-05T13:45:01Z

> Provide a standard implementation / example schema for managing jams collections in mongodb (the famed jamongo) using the above.

Of course, it couldn't be that simple. MongoDB does not support $ref in JSON Schema (?!).
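One possible workaround, which is not proposed in the thread itself, would be to inline references before handing a schema to a validator that lacks `$ref`. The sketch below is only an illustration: `inline_refs` and `inline_schema` are hypothetical helpers, they handle only local `#/...` references, they ignore JSON Pointer escape sequences, and they reject recursive definitions.

```python
def inline_refs(node, root, _seen=()):
    """Return a copy of node with local '$ref' entries replaced by their targets."""
    if isinstance(node, dict):
        ref = node.get('$ref')
        if isinstance(ref, str):
            if not ref.startswith('#/'):
                raise ValueError('only local references are handled: ' + ref)
            if ref in _seen:
                raise ValueError('recursive reference cannot be inlined: ' + ref)
            # Walk the pointer path from the schema root to the target.
            target = root
            for part in ref[2:].split('/'):
                target = target[part]
            return inline_refs(target, root, _seen + (ref,))
        return {k: inline_refs(v, root, _seen) for k, v in node.items()}
    if isinstance(node, list):
        return [inline_refs(v, root, _seen) for v in node]
    return node


def inline_schema(schema):
    """Inline all local references, then drop the unused top-level definitions."""
    flat = inline_refs(schema, schema)
    flat.pop('definitions', None)
    return flat
```

The cost of this approach is duplication: every reference site gets its own copy of the definition, and a schema with recursive definitions cannot be flattened this way at all.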

bmcfee added enhancement question labels Jun 9, 2015

bmcfee modified the milestone: 0.2.1 Jul 18, 2015

bmcfee modified the milestones: 0.2.1, 0.2.2 Oct 13, 2015

bmcfee mentioned this issue Nov 6, 2015

RFE: One file/one jam paradigm not suited for large datasets #86

Open

bmcfee mentioned this issue Dec 14, 2015

Schema refactor #92

Open

bmcfee modified the milestones: 0.3.0, 0.2.2 Feb 1, 2016

bmcfee modified the milestones: 0.3.0, 0.4.0 May 11, 2017

lostanlen mentioned this issue Jun 25, 2019

How should we handle big datasets? mir-dataset-loaders/mirdata#20

Closed

bmcfee added the schema Issues pertaining to schema definitions label Aug 12, 2019

bmcfee mentioned this issue Apr 22, 2020

Next generation jams #208

Open
