"All objects to combine must have the same number of columns" should not be forced for intersect. #98
Thank you for finding this issue. We did implement a relaxed combine operation for scenarios like this (https://github.com/BiocPy/BiocFrame/blob/master/src/biocframe/BiocFrame.py#L1487). It should be straightforward for me to switch to that.
* As noted in #98, using `relaxed_combine_rows` makes it possible to perform operations on GenomicRanges objects that may contain different metadata columns.
* Set the numpy version to < 2.0, since a few operations are incompatible with the new release.
* Add tests.
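To illustrate what a "relaxed" combine means here, the sketch below row-binds column-oriented tables and pads any column missing from one table with `None`. It is a plain-Python illustration of the idea only, not BiocFrame's implementation; the helper name `relaxed_combine_rows_sketch` and the toy tables are hypothetical.

```python
from typing import Any, Dict, List

Columns = Dict[str, List[Any]]


def relaxed_combine_rows_sketch(*tables: Columns) -> Columns:
    """Row-bind column-oriented tables, padding missing columns with None.

    Hypothetical helper for illustration; not the BiocFrame API.
    """
    # Collect the union of column names in first-seen order.
    names: List[str] = []
    for table in tables:
        for name in table:
            if name not in names:
                names.append(name)

    combined: Columns = {name: [] for name in names}
    for table in tables:
        # Number of rows in this table (all columns are assumed equal length).
        n_rows = len(next(iter(table.values()))) if table else 0
        for name in names:
            # Use the real column if present, otherwise a column of None.
            combined[name].extend(table.get(name, [None] * n_rows))
    return combined


# Toy inputs with different metadata columns, e.g. a BED-like and a GTF-like table.
bed = {"seqnames": ["chr1", "chr2"], "score": [5, 7]}
gtf = {"seqnames": ["chr1"], "gene_id": ["g1"]}
merged = relaxed_combine_rows_sketch(bed, gtf)
print(merged)
```

A strict combine would reject these two inputs because their column sets differ; the relaxed variant keeps every column and leaves the gaps empty.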
@ghuls: Just released a new version, 0.4.19, that fixes this. Let me know if you run into any more issues. Thanks again for letting us know about this.
The command now works. But for my actual use case (intersecting a BED file with a bigWig file, using biobear for reading), GenomicRanges is way too slow.
import biobear as bb
import polars as pl
from genomicranges import GenomicRanges
session = bb.new_session()
bed = session.read_bed_file("consensus_peaks_bicnn.bed", bb.BEDReadOptions(n_fields=4))
%time bed_df = bed.to_polars()
CPU times: user 225 ms, sys: 17.9 ms, total: 243 ms
Wall time: 223 ms
bigwig = session.read_bigwig_file("pybigtools/Astro.bw")
%time bigwig_df = bigwig.to_polars()
CPU times: user 5.49 s, sys: 1.36 s, total: 6.85 s
Wall time: 4.9 s
%time bed_gr = GenomicRanges.from_polars(bed_df.rename({"reference_sequence_name": "seqnames", "start": "starts", "end": "ends"}))
%time bigwig_gr = GenomicRanges.from_polars(bigwig_df.rename({"name": "seqnames", "start": "starts", "end": "ends"}))
CPU times: user 29.5 s, sys: 2.94 s, total: 32.4 s
Wall time: 32.4 s
In [60]: bed_df.shape
Out[60]: (546993, 4)
In [61]: bigwig_df.shape
Out[61]: (71164307, 4)
%time bed_bigwig_gr = bed_gr.intersect(bigwig_gr)
Not directly related to the above, but I noticed that IRanges uses the built-in max() on numpy arrays, which is much slower than the array's own .max() method, e.g.:
# Calculating max on a 1957 element numpy array.
In [43]: %timeit b = max(a)
25 ms ± 859 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [44]: %timeit b = a.max()
77.3 µs ± 3.77 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
# Calculating max on a 71164307 element numpy array.
In [47]: %time d = c.max()
CPU times: user 30.3 ms, sys: 445 µs, total: 30.8 ms
Wall time: 30.3 ms
In [48]: %time e = max(c)
CPU times: user 3.25 s, sys: 7.06 ms, total: 3.26 s
Wall time: 3.25 s
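The comparison above can be reproduced with a small self-contained script. This is a minimal sketch assuming only that numpy is installed; the array contents are random placeholder data, and absolute timings will differ by machine and numpy version.

```python
import timeit

import numpy as np

# Placeholder data: a 1957-element integer array, matching the size quoted above.
rng = np.random.default_rng(0)
a = rng.integers(0, 1_000_000, size=1957)

# Both approaches return the same value ...
builtin_result = max(a)   # iterates element by element in Python
method_result = a.max()   # single vectorised reduction inside numpy
assert builtin_result == method_result

# ... but the built-in max() has to box and compare every element in Python.
builtin_time = timeit.timeit(lambda: max(a), number=100)
method_time = timeit.timeit(lambda: a.max(), number=100)
print(f"max(a): {builtin_time:.4f} s for 100 calls")
print(f"a.max(): {method_time:.4f} s for 100 calls")
```

The same reasoning applies to `min`, `sum`, `any`, and `all`: on numpy arrays, the array methods or the `np.*` functions are generally the better choice.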
Interesting, thank you for digging deeper. Also, can you let me know which numpy version you have?
@ghuls Can you share links to the two files you mentioned ("consensus_peaks_bicnn.bed" and "pybigtools/Astro.bw"), if they are publicly available?
I updated to the latest iranges. intersect is now at least able to finish within an hour:
In [7]: %time bed_bigwig_gr = bed_gr.intersect(bigwig_gr)
/software/anaconda3/envs/genomicranges/lib/python3.11/site-packages/genomicranges/SeqInfo.py:348: UserWarning: 'seqnames' is deprecated, use 'get_seqnames' instead
warn("'seqnames' is deprecated, use 'get_seqnames' instead", UserWarning)
CPU times: user 14min 48s, sys: 34.8 s, total: 15min 23s
Wall time: 16min 42s
In [2]: import numpy as np
In [3]: np.__version__
Out[3]: '1.26.4'
In [4]: import genomicranges
In [5]: genomicranges.__version__
Out[5]: '0.4.20'
In [6]: import iranges
In [7]: iranges.__version__
Out[7]: '0.2.9'
Test files are here:
https://github.com/BiocPy/IRanges/blob/master/src/iranges/IRanges.py#L117-L127
_sanitize_start and _sanitize_end can be rewritten like this, I think:
From
Not great yet, but this has now dropped to 3 minutes by switching to NCLS to perform the interval operation (
It has now dropped to a minute and a half, but I'm going to look into integrating with C/Rust-based indexing libraries for faster performance. Thank you for bringing this up. Going to close this for now.
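For readers wondering why an index helps, here is a minimal standard-library sketch of overlap finding that sorts the subject intervals once and then uses binary search to narrow the candidates for each query. It is not NCLS and not what genomicranges does internally; the function name `find_overlaps_sketch`, the half-open `[start, end)` convention, and the toy intervals are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), assumed convention


def find_overlaps_sketch(
    queries: List[Interval], subjects: List[Interval]
) -> List[Tuple[int, int]]:
    """Return (query_index, subject_index) pairs for overlapping intervals."""
    if not subjects:
        return []
    # Sort subjects by start once; remember their original positions.
    order = sorted(range(len(subjects)), key=lambda i: subjects[i][0])
    starts = [subjects[i][0] for i in order]
    max_len = max(end - start for start, end in subjects)

    pairs: List[Tuple[int, int]] = []
    for qi, (q_start, q_end) in enumerate(queries):
        # An overlapping subject must start before the query ends ...
        hi = bisect_left(starts, q_end)
        # ... and cannot start more than max_len before the query starts.
        lo = bisect_right(starts, q_start - max_len)
        for pos in range(lo, hi):
            si = order[pos]
            if subjects[si][1] > q_start:  # subject ends after query starts
                pairs.append((qi, si))
    return pairs


subjects = [(0, 5), (3, 8), (10, 12), (11, 20)]
queries = [(4, 11), (12, 13)]
pairs = find_overlaps_sketch(queries, subjects)
print(pairs)
```

The pruning is only effective when subject intervals are short relative to the genome, as bigWig bins typically are; dedicated structures such as nested containment lists or interval trees avoid that limitation.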
All objects to combine must have the same number of columns
should not be forced for intersect, as this makes intersecting e.g. two BED files with different kinds of additional columns (or a GTF and a BED file) impossible.