Reproduce the Enformer's input sequences split #190

Open
sararb opened this issue Mar 5, 2024 · 1 comment

Comments

sararb commented Mar 5, 2024

I would like to regenerate the input sequences for Enformer/Basenji2 using basenji_data.py. I am running the following command:

python basenji_data.py -g hg38.gaps.bed -u umap_k36_t10_l32_hg38.bed -b hg38.blacklist.rep.bed -l 131072 -crop_bp 8192 -break_t 786432 -s 65599 -t .1 -v .1 -w 128 -o data/input_mseqs -p 8 targets.txt

However, I am observing differences when compared to the sequences.bed file stored here

Can you please confirm if I am using the right options to generate the same sequence split?
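
To quantify the differences mentioned above, the two BED files could be compared interval by interval. The following is only a minimal sketch, not part of basenji: the function names `read_bed` and `compare_beds` are hypothetical, and it assumes both files are whitespace-separated with columns chrom, start, end, and an optional fourth column holding the split label (for example train/valid/test).

```python
from collections import Counter


def read_bed(lines):
    """Parse BED lines into a dict mapping (chrom, start, end) -> split label.

    Assumes whitespace-separated columns; the 4th column (split label) is
    optional and stored as None when absent. Comment and short lines are skipped.
    """
    intervals = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or line.startswith("#"):
            continue
        label = fields[3] if len(fields) > 3 else None
        intervals[(fields[0], int(fields[1]), int(fields[2]))] = label
    return intervals


def compare_beds(mine, reference):
    """Summarize how two interval -> label mappings differ."""
    shared = mine.keys() & reference.keys()
    return {
        # Intervals present in only one of the two files.
        "only_mine": len(mine.keys() - reference.keys()),
        "only_reference": len(reference.keys() - mine.keys()),
        # Identical coordinates in both files.
        "shared": len(shared),
        # Identical coordinates but assigned to a different split.
        "label_mismatch": sum(1 for k in shared if mine[k] != reference[k]),
        # Number of sequences per split label in each file.
        "mine_per_split": Counter(mine.values()),
        "reference_per_split": Counter(reference.values()),
    }
```

A file on disk could be passed as `read_bed(open(path))`, since an open text file iterates over its lines. A large `only_mine`/`only_reference` count would suggest that the contigs or strides differ, whereas a high `shared` count with many `label_mismatch` entries would point to a different train/valid/test assignment, for instance from a different random seed.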

Contributor

davek44 commented Mar 9, 2024

Hi Sara, can you say a little more about your goal? It'll influence how I can best help. It'd be a little tricky for me to track down the exact parameters, and basenji_data.py has changed over the years. Is it OK if the recipe is equivalent in quality but differs due to minor tweaks and random number seeds?
