bedgraph loading is very slow #13

Phlya · 2018-03-15T13:24:23Z

Hi, I understand that bigWigs are preferred over bedgraphs for high-resolution data and should be much faster, but loading bedgraphs currently seems unreasonably slow. A ~360 MB file takes a few seconds to load with pandas, but after waiting a few minutes for it to load as a BedGraphTrack I had to interrupt it. Is this something that might be improved? Thanks!


fidelram · 2018-03-15T13:28:55Z

Honestly, it takes less time to convert the bedgraph to bigWig once than to wait each time you want to plot something. We use the UCSC tools wigToBigWig or bedGraphToBigWig, and they are fast.

Pandas has highly optimized parsers that can read tables very fast. But besides reading the file, pyGenomeTracks needs to build an interval tree from the bedgraph file in order to index it, and this adds further time.

Another solution, besides the bigWig idea, is to split your bedgraph per chromosome. For a mammalian-size genome, that should reduce the loading time about 20-fold (e.g. 1 min vs. 20).
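
The per-chromosome split suggested above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not part of pyGenomeTracks; the function name `split_bedgraph_by_chrom` and the `<chrom>.bedgraph` output naming are assumptions made for this example.

```python
from pathlib import Path


def split_bedgraph_by_chrom(bedgraph_path, out_dir):
    """Write one bedgraph file per chromosome and return {chrom: Path}.

    Hypothetical helper for illustration only; output files are named
    <chrom>.bedgraph inside out_dir.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handles = {}
    paths = {}
    try:
        with open(bedgraph_path) as src:
            for line in src:
                # Skip blank lines and the header lines a bedgraph may carry.
                if not line.strip() or line.startswith(("track", "browser", "#")):
                    continue
                # The chromosome name is the first whitespace-separated field.
                chrom = line.split(None, 1)[0]
                if chrom not in handles:
                    paths[chrom] = out_dir / f"{chrom}.bedgraph"
                    handles[chrom] = open(paths[chrom], "w")
                handles[chrom].write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return paths
```

Calling `split_bedgraph_by_chrom("signal.bedgraph", "by_chrom")` would produce files such as `by_chrom/chr1.bedgraph`, and the track configuration can then point at the file for the chromosome being plotted. The sketch keeps one output handle open per chromosome, so an assembly with thousands of contigs could hit the operating system's open-file limit.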

fidelram · 2018-04-05T11:19:40Z

@Phlya I added tabix support to the bedgraph track. If you convert the bedgraph file to tabix, loading is quite fast. To convert to tabix you need to have samtools installed and run:

$ sort -k1,1 -k2,2n bedgraph_file | bgzip > bedgraph_file_sorted.bg.bgz
$ tabix -p bed bedgraph_file_sorted.bg.bgz

Phlya · 2018-04-05T12:03:52Z

Awesome, thanks!

BenoitM-I2BC · 2020-08-28T11:36:23Z

Hi, I recently discovered pyGenomeTracks and I'm very pleased to start using it.
This thread is rather old, but my question is related.

The tabix support for bedgraph does speed up loading the data.
Yet the initialization of the track (before the loading bar appears) is rather slow. An 800 MB tabix-indexed bedgraph can take more than 1 min to initialize (and 2-3 s to retrieve the data).

Is this normal behavior? Am I missing something obvious?
Thanks.

====== track.ini file =====
[test bedgraph tabix]
file = Data.bg.gz
title = tabix-indexed bedgraph
type = fill
summary_method = max
number_of_bins = 3000
file_type = bedgraph
===========================

(python 3.8)
(pyGenomeTracks 3.5)

lldelisle · 2020-08-28T11:53:49Z

Hi,
First of all, we are happy to have new users...
For the moment, the code only considers a bedgraph a potential tabix file if its name ends with .bgz. So, if I am not wrong, the fact that you tabix-indexed your file is not being used.
To test this, you can simply rename your file (don't forget to rename the index as well) and see if it is quicker.
Since version 3.5, we do an intersection (with bedtools) between the bedgraph and the regions to plot, so it should be quicker than before. Can you give us the output you get during the initialization (especially the progress bar)?

Thanks
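
The rename suggested above has to cover both the data file and its index. Here is a minimal sketch, assuming the index sits next to the data file under the usual tabix convention of appending `.tbi` to the data file name; the helper name `rename_to_bgz` is hypothetical and not part of pyGenomeTracks.

```python
from pathlib import Path


def rename_to_bgz(data_path):
    """Rename X.gz to X.bgz, and X.gz.tbi to X.bgz.tbi if that index exists.

    Hypothetical helper for illustration; returns the new data path.
    """
    data_path = Path(data_path)
    if data_path.suffix != ".gz":
        raise ValueError(f"expected a file ending in .gz, got {data_path.name}")
    new_data = data_path.with_suffix(".bgz")  # Data.bg.gz -> Data.bg.bgz
    if new_data.exists():
        raise FileExistsError(f"{new_data} already exists")
    # Assumed index location: data file name plus ".tbi".
    old_index = Path(str(data_path) + ".tbi")
    data_path.rename(new_data)
    if old_index.exists():
        old_index.rename(Path(str(new_data) + ".tbi"))
    return new_data
```

After `rename_to_bgz("Data.bg.gz")`, the `file =` entry in the track configuration would need to point at `Data.bg.bgz`.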

BenoitM-I2BC · 2020-08-28T12:13:50Z

Hi,
Awesome! That's what I was missing.
Indeed, the file name needs to end with .bgz for the track to be initialized almost instantly (0.04 s versus 79 s).

====== Using a .bgz file =======
INFO:pygenometracks.tracksClass:initialize 1. [test bedgraph tabix]
INFO:pygenometracks.tracksClass:initialize 2. [x-axis]
INFO:pygenometracks.tracksClass:time initializing track(s):
INFO:pygenometracks.tracksClass:0.045243263244628906
DEBUG:pygenometracks.tracksClass:Figure size in cm is 40 x 7.3138297872340425. Dpi is set to 72

====== Using a .bg.gz file =======
INFO:pygenometracks.tracksClass:initialize 1. [test bedgraph tabix]
100%|█████████████████████████| 65475/65475 [00:02<00:00, 24004.00it/s]
INFO:pygenometracks.tracksClass:initialize 2. [x-axis]
INFO:pygenometracks.tracksClass:time initializing track(s):
INFO:pygenometracks.tracksClass:79.6509759426117
DEBUG:pygenometracks.tracksClass:Figure size in cm is 40 x 7.3138297872340425. Dpi is set to 72

Out of curiosity: I thought that one could use the file_type parameter to specify the type of files irrespective of the file extension. Should I conclude that's not possible for (at least) tabix-indexed bedgraphs?

Thanks for this very nice piece of work,
and thanks so much for your super fast answer!

lldelisle · 2020-08-28T12:44:30Z

The file_type in fact specifies the type of track to display. Each file_type comes with its own set of parameters that you can customize. The file_type can take a limited number of values, which are described here as a list (https://pygenometracks.readthedocs.io/en/latest/content/all_tracks.html) or here as the columns of a table with the possible parameters (https://pygenometracks.readthedocs.io/en/latest/content/possible-parameters.html).
Here it is more subtle: only the way the data is loaded differs between a bedgraph and a tabix-indexed bedgraph, and this is not customizable by the user.
I proposed a change where loading as tabix would be tested independently of the extension (#276).
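
One way to test for a tabix-indexed file without relying on the extension is to inspect the file itself. The sketch below is an assumption about how such a check could look, not the code from #276: it tests whether the first gzip block carries the BGZF `BC` extra subfield written by bgzip, and whether a `.tbi` index sits next to the file. The function names are hypothetical.

```python
import struct
from pathlib import Path


def looks_like_bgzf(path):
    """Return True if the file starts with a BGZF (bgzip) block header."""
    with open(path, "rb") as fh:
        header = fh.read(12)
        if len(header) < 12:
            return False
        # Fixed gzip header: ID1, ID2, CM, FLG, MTIME, XFL, OS, XLEN.
        id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack("<4BI2BH", header)
        # gzip magic, deflate method, and the FEXTRA flag (bit 2) must be set.
        if (id1, id2, cm) != (0x1F, 0x8B, 8) or not flg & 4:
            return False
        extra = fh.read(xlen)
    # Walk the extra subfields looking for 'B', 'C' with a 2-byte payload.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<2BH", extra, pos)
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False


def can_load_as_tabix(path):
    """Hypothetical check: BGZF-compressed data plus a neighbouring .tbi index."""
    return looks_like_bgzf(path) and Path(str(path) + ".tbi").is_file()
```

With a check of this kind, `Data.bg.gz` from the example above would be recognized as tabix-indexed regardless of its name, while a file compressed with plain gzip would fall back to the slower loading path.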

fidelram closed this as completed Mar 16, 2018

fidelram mentioned this issue Jul 26, 2018

feature request: visualize chromatin state epilogos #6

Closed

lldelisle reopened this Aug 28, 2020

lldelisle mentioned this issue Aug 28, 2020

Test tabix independently of extension #276

Merged

lldelisle closed this as completed Aug 28, 2020

lldelisle mentioned this issue Oct 8, 2020

Update version 3.5.1 #290

Merged
