Genome folds #1

Closed
casblaauw opened this issue Sep 2, 2023 · 1 comment

@casblaauw

Dear developers,

Thanks for releasing your model. I'm sure I speak for many in the community when I say it looks hugely impressive!
To use and validate it, I'd like to know which regions of the genome are in each of the test/validation folds that were used for the four models. For Enformer/Basenji, that was easily reconstructed from the helpfully shared sequences_[human|mouse].bed files in the public Google Storage bucket with 'supplementary' small files here, but I don't believe that's available for Borzoi yet?

Of course, it could be reconstructed from the large training dataset files, but given that I'm only looking for the genomic coordinates rather than the fully processed tracks corresponding to those, I was hoping there is an easier way.

On a related note, none of the files in the borzoi-paper bucket currently seem to be accessible; requests return the following error:

<Error>
 <Code>UserProjectMissing</Code>
 <Message>
  Bucket is a requester pays bucket but no user project provided.
 </Message>
 <Details>
  Bucket is a requester pays bucket but no user project provided.
 </Details>
</Error>

Although I'm hoping to not need those files at the moment, I figured I'd still mention it to let you know.

I'm sure the public release has left everyone swamped with questions coming in and issues popping up, so I appreciate any bit of time you are willing to spend on this!

davek44 added a commit that referenced this issue Sep 2, 2023
@davek44
Contributor

davek44 commented Sep 2, 2023

Thanks for your interest! I added the sequences and targets files to a data/ directory in the GitHub repo, too, so you don't have to figure out GCP for that.

For model f0, sequences labeled fold0 form the test set and fold1 form validation.
For model f1, sequences labeled fold1 form the test set and fold2 form validation, and so on for the remaining models.
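
For anyone who wants to script this, here is a minimal Python sketch of the rule above. It assumes the sequences file is a tab-separated BED whose fourth column holds the fold label (e.g. `fold0`); that column layout is an assumption, not something confirmed in this thread. The helper names `split_labels` and `regions_by_label` are hypothetical, and the coordinates are toy data rather than real regions.

```python
import csv
import io
from collections import defaultdict


def split_labels(model_index):
    """Return (test_label, valid_label) for model f<model_index>.

    Follows the rule stated above: model f0 tests on fold0 and
    validates on fold1, model f1 tests on fold1 and validates on fold2.
    """
    return f"fold{model_index}", f"fold{model_index + 1}"


def regions_by_label(bed_handle):
    """Group (chrom, start, end) tuples by the label in the 4th column.

    Assumes a tab-separated BED-like file; shorter rows are skipped.
    """
    regions = defaultdict(list)
    for row in csv.reader(bed_handle, delimiter="\t"):
        if len(row) < 4:
            continue
        chrom, start, end, label = row[:4]
        regions[label].append((chrom, int(start), int(end)))
    return regions


# Toy stand-in for a sequences file (made-up coordinates).
toy_bed = io.StringIO(
    "chr1\t0\t100\tfold0\n"
    "chr1\t100\t200\tfold1\n"
    "chr2\t0\t100\tfold2\n"
)

regions = regions_by_label(toy_bed)
test_label, valid_label = split_labels(0)
test_regions = regions[test_label]
valid_regions = regions[valid_label]
```

To use it on a real file, replace the `io.StringIO` object with an opened file handle for the sequences file in the data/ directory.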
