Suggestion: Positron needs to improve its stability when viewing ‘large-scale’ data #4474

ZhimingYe · 2024-08-24T17:33:13Z

This might be a topic composed of several issues. I am raising it as a discussion mainly out of concern that the development team lacks test scenarios for heavy data loads (sorry, but please allow me to say this). This gap may have been overlooked during development, making Positron extremely fragile when dealing with "large" data. The RPC frequently reports errors, and the connection between the backend and frontend keeps failing, similar to the earlier issue #3628, which was also triggered by previewing relatively large tables and crashed the entire R session.

However, this kind of fragility isn't limited to the R environment; even Python is affected. Positron, built on VSCode, should ideally support such analytical needs as well as VSCode does. Yet, as a dedicated data science IDE, Positron actually performs worse than VSCode, which isn't even specifically developed for data science. Today VSCode provides line-by-line execution for Python, code autocompletion, image preview and saving, a variables pane, integration with Jupyter notebooks, and data table preview through Data Wrangler. These are the features that Positron is supposed to have.

Given this, what exactly makes Positron unique?

This might sound a bit harsh, but I have immense respect for the work you do. Since the first day I stepped into data science, I have greatly benefited from your company's outstanding products, especially RStudio, ggplot2, and Reticulate, and RStudio and Reticulate in particular. Moreover, Posit selflessly develops open-source software, and we really shouldn't make so many critical demands.

I'm concerned that the communication method between Positron's frontend and backend may have systematically overlooked scenarios with heavy data preview loads, leading to issues like #3628 and a series of related problems. That's why I want to point this out early in the project's development. I may be overstepping here, so please forgive me.

System details:

Positron and OS details:

Positron Version: 2024.08.0 (Universal) build 48
Code - OSS Version: 1.91.0
Commit: ed616b3
Date: 2024-08-19T04:26:51.868Z
Electron: 29.4.0
Chromium: 122.0.6261.156
Node.js: 20.9.0
V8: 12.2.281.27-electron.0
OS: Darwin arm64 23.5.0

Interpreter details:

Python 3.9.18

Example 1

When creating multi-faceted scatter plots with Matplotlib containing more than 40k points, Positron's RPC may intermittently report timeouts during figure preview, especially when the commands are submitted multiple times. This could be because the images are sometimes too large and exceed the rendering wait time limit. Resizing the plot pane, switching between images, or doing both can freeze the entire console. RStudio's default handling is preferable here: the lag occurs only the first time, after which resizing and re-rendering the image are separated from code execution, so the code keeps running smoothly.

Code to reproduce:

# wget "https://datasets.cellxgene.cziscience.com/981bcf57-30cb-4a85-b905-e04373432fef.h5ad"
import scanpy as sc

test = sc.read_h5ad("/home/yezhiming/redoSC/981bcf57-30cb-4a85-b905-e04373432fef.h5ad")
genes = ['ENSG00000081237', 'ENSG00000119888', 'ENSG00000261371', 'ENSG00000164692', 'ENSG00000107796']
# Submit the same plot command several times in a row; repeated submissions make the RPC timeouts more likely.
sc.pl.umap(test, color=genes, use_raw=True, legend_loc="on data")
sc.pl.umap(test, color=genes, use_raw=True, legend_loc="on data")
sc.pl.umap(test, color=genes, use_raw=True, legend_loc="on data")
sc.pl.umap(test, color=genes, use_raw=True, legend_loc="on data")

Example 2

Also, when trying to open a large pandas table, Positron usually displays an error like this:

VSCode's Data Wrangler scrolls smoothly even when viewing a large table with 20k+ rows, without any errors.

Example 3

Positron's Data Viewer can only scroll one column at a time (see the video), and it is sometimes slow.

PositronBehavior.mp4

VSCode's Data Wrangler is much better at this:

VscodeDW.mp4

With Positron, every time you click on any preview, you worry about an RPC crash freezing the entire IDE. What I’m trying to say is, Positron, based on VSCode and supposedly designed for data science, doesn’t seem more suitable for it than VSCode itself—maybe even less so.

Thank you again for your work. I hope Positron continues to improve and wish you all the best.

jthomasmock (Collaborator) · 2024-08-26T13:41:01Z

Thanks for the feedback here!

I'd like to split up this discussion into a few different areas, and thus will open a few different threads here:

- Data Explorer
- Internal RPC (in various areas, including the plot pane)

jthomasmock (Collaborator) · 2024-08-26T14:06:46Z

Data Explorer

We've made an explicit design decision to always move row by row and column by column in the Data Explorer. The long-term goal here is that keyboard movement/selection and mouse-scroll are consistent and that the display is snapped to a grid. It sounds like maybe you're not a fan of this behavior?
Do you have specific example datasets where the Data Explorer is struggling? Or, more generally, a description of such a dataset (dimensions, data types, etc.)?
We do some large-scale data testing and definitely want to ensure that the Data Explorer is highly scalable.
I've personally tested 3 to 30 million rows by 24 columns. This generally works the same as data scales, although I've seen some bugs for, say, 40 million rows. There are some of the slowdowns/errors you showed at that scale.
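
For anyone answering the question about dimensions and data types, a minimal sketch of how such a description could be collected with pandas is shown here. The helper name `describe_frame` and the small stand-in table are assumptions made for illustration; they are not part of Positron or of this thread.

```python
import numpy as np
import pandas as pd


def describe_frame(df):
    """Summarize the size and column types of a DataFrame for a bug report."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        # How many columns there are of each dtype, keyed by dtype name.
        "dtype_counts": df.dtypes.astype(str).value_counts().to_dict(),
        # deep=True also counts the contents of string/object columns.
        "memory_bytes": int(df.memory_usage(deep=True).sum()),
    }


# Small stand-in table (hypothetical); swap in the table that misbehaves.
rng = np.random.default_rng(0)
example = pd.DataFrame({
    "id": np.arange(1_000),
    "value": rng.normal(size=1_000),
    "label": rng.choice(["a", "b", "c"], size=1_000),
})
summary = describe_frame(example)
print(summary)
```

Pasting the printed dictionary into a report gives the row count, column count, dtype mix, and approximate in-memory size without sharing the data itself.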

ZhimingYe (Author) · 2024-08-26T14:54:44Z

@jthomasmock
Thank you for your response! In fact, you may notice that this issue can be easily triggered, and it tends to occur more frequently when previewing large datasets consecutively. Meanwhile, I believe Positron could improve the user experience in this area by considering the following points:

When loading large dataset previews:
- (1) RStudio will proactively pause and notify us that the preview is in progress, preventing multiple preview clicks that could lead to a queue of preview events in the background. Positron, however, sometimes encounters a series of errors if multiple clicks occur in quick succession.
- (2) After a brief wait, RStudio will automatically display the table preview. While the table might not always load (which is understandable, as we often don't expect such large datasets to be previewed, and sometimes it’s triggered accidentally), this doesn’t affect the execution of other commands in the console or the stability of the entire session.
In the case of image previews, especially when resizing images:
- Positron’s Python session seems tightly coupled with the console, whereas in RStudio, once the image loading is complete, resizing the image is completely decoupled from the console. Resizing an image doesn’t occupy the session’s time or affect the execution of commands in the console.
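
As an illustration of the decoupling described above, here is a minimal sketch in which resize events only record the newest size and a background thread does the rendering. It is not how Positron or RStudio is implemented; the class `LatestSizeRenderer` and the `fake_render` callback are hypothetical names, and the sleep stands in for a slow re-render.

```python
import threading
import time


class LatestSizeRenderer:
    """Render only the most recently requested size, off the caller's thread."""

    def __init__(self, render):
        self._render = render      # hypothetical callable: (width, height) -> None
        self._cond = threading.Condition()
        self._pending = None       # newest size that has not been rendered yet
        self._closed = False
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def resize(self, width, height):
        """Record the new size and return at once; older pending sizes are dropped."""
        with self._cond:
            self._pending = (width, height)
            self._cond.notify()

    def close(self):
        """Let the worker render any final pending size, then stop it."""
        with self._cond:
            self._closed = True
            self._cond.notify()
        self._worker.join()

    def _run(self):
        while True:
            with self._cond:
                while self._pending is None and not self._closed:
                    self._cond.wait()
                if self._pending is None:
                    return  # closed and nothing left to render
                size, self._pending = self._pending, None
            self._render(*size)  # slow work happens outside the lock


rendered = []


def fake_render(width, height):
    time.sleep(0.05)  # stand-in for an expensive re-render
    rendered.append((width, height))


renderer = LatestSizeRenderer(fake_render)
for width in range(400, 800, 10):  # a burst of 40 drag events
    renderer.resize(width, 300)    # returns immediately each time
renderer.close()
print(len(rendered), "renders; last size:", rendered[-1])
```

The caller (standing in for the console) never waits on a render, intermediate sizes are skipped, and the final size is always drawn.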

To summarize, my points are:

(1) Limit multiple RPC triggers that require significant communication time to avoid affecting the console.
(2) Decouple image resizing, RPC communication errors, and the console. We don’t want these additional events to interfere with our ongoing work.
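
Point (1) could be approached by coalescing repeated requests, so that a second click on the same preview reuses the request already in flight instead of queueing another one. The sketch below is only an assumption about one possible design, not Positron's RPC layer; `PreviewCoalescer` and `slow_preview` are hypothetical names.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class PreviewCoalescer:
    """Run at most one preview per key; repeat clicks reuse the pending result."""

    def __init__(self, max_workers=1):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()
        self._in_flight = {}  # key -> Future of the preview that is still running

    def request(self, key, build_preview):
        """Return the Future for `key`, starting `build_preview` only if needed."""
        with self._lock:
            pending = self._in_flight.get(key)
            if pending is not None:
                return pending  # a preview for this key is already running
            future = self._pool.submit(build_preview)
            self._in_flight[key] = future
        # Registered outside the lock: if the future is already done, the
        # callback runs right away in this thread and needs the lock itself.
        future.add_done_callback(lambda _done: self._forget(key, future))
        return future

    def _forget(self, key, future):
        with self._lock:
            if self._in_flight.get(key) is future:
                del self._in_flight[key]

    def shutdown(self):
        self._pool.shutdown(wait=True)


calls = []


def slow_preview():
    calls.append("started")
    time.sleep(0.2)  # stand-in for an expensive table preview
    return "preview ready"


coalescer = PreviewCoalescer()
first = coalescer.request("test", slow_preview)
second = coalescer.request("test", slow_preview)  # rapid second click
result = first.result()
coalescer.shutdown()
print(result, "- previews started:", len(calls))
```

Because the key is forgotten once the preview finishes, a later click still triggers a fresh preview; only clicks that arrive while one is running are merged.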

In short, I want to emphasize that while previews are not essential, the stability of the data being processed should always be the top priority. Before the development of similar IDEs, we primarily conducted data analysis in terminals without these features or by using Jupyter Lab. So, we’re already accustomed to working without the additional features that Positron provides. What I want to emphasize is that the added functionalities in Positron should ensure the stability of the Console or Session. Thank you for your work.

On another point, you wrote: "We've made an explicit design decision to always move row by row and column by column in the Data Explorer. The long-term goal here is that keyboard movement/selection and mouse-scroll are consistent and that the display is snapped to a grid. It sounds like maybe you're not a fan of this behavior?"

For example, if a data table has cells containing lengthy information, you might expand a cell (taking up half of the IDE window) to view its contents. When you then want to look back at the previous columns or check the columns ahead, scrolling that jumps a whole column at a time is not well suited to the situation. Continuous scrolling has been used in RStudio for many years, and Microsoft Data Wrangler has adopted it too. More broadly, Excel supports cell-by-cell movement with the keyboard, but its default scrolling behavior is still not step-by-step scrolling.

However, this issue isn’t actually that important in this discussion thread. Everyone can maintain their own preferences; there’s no right or wrong. What I’m more interested in is how Positron behaves under heavy load. From a software engineering perspective, rendering large-scale data is undoubtedly challenging. You really don’t have to render it (I feel it’s too difficult), but it’s crucial to ensure the stability of the working session in the Console. It’s really not necessary to compromise stability just to render all the previews.

Thank you for your work @jthomasmock . The reason I’m bringing this up is that RStudio is incredibly stable, no matter the workload. As the next-generation IDE introduced by Posit PBC, I genuinely hope this characteristic can be maintained.

2 replies

jthomasmock Aug 27, 2024
Collaborator

Thanks @ZhimingYe ! I will open up a few issues relevant to this thread about performance, blocking operations, and RPC timeouts.

Can you do me a favor and let me know the dimensions of the data you are seeing the Data Explorer issues with? With pandas, I can use about 15 million rows x 24 columns, but things start to go a bit wonky after about 30 million rows x 24 columns and we will explore.

I do see that the "https://datasets.cellxgene.cziscience.com/981bcf57-30cb-4a85-b905-e04373432fef.h5ad" dataset is about 10 GB, which is bigger than our smoke tests at this point.

ZhimingYe Sep 1, 2024
Author

@jthomasmock Thank you! Last week I was busy with my work, sorry. These are several separate requirements, one of which is what you mentioned (but not for this specific data file). You've already seen this issue. I think Positron simply needs to inform the user in such cases that "This request is taking too long! You should consider using other methods to overview the data." Large datasets like this should be managed with a database rather than handled in pandas, let alone rendered. For the other requests, I've already opened two issues (#4546 and #4547) to address the most urgent problems. However, I still recommend that Positron create relevant prompts (e.g., warning windows) and smoke tests for such extreme scenarios.
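
The kind of prompt suggested above could look roughly like the following sketch, which waits for a preview up to a deadline and otherwise returns the warning text. The function `preview_with_deadline`, the one-worker pool, and the timeout values are assumptions for illustration, not Positron behavior. Note that a Python thread cannot be killed, so the slow work keeps running in the background; the sketch only stops the caller from waiting on it.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

TOO_SLOW_MESSAGE = (
    "This request is taking too long! "
    "You should consider using other methods to overview the data."
)

# Single background worker (an assumption); previews never run on the caller's thread.
_pool = ThreadPoolExecutor(max_workers=1)


def preview_with_deadline(build_preview, timeout_s):
    """Return (ok, value): the preview if it finishes in time, else a warning."""
    future = _pool.submit(build_preview)
    try:
        return True, future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # only has an effect if the task has not started yet
        return False, TOO_SLOW_MESSAGE


def quick_table():
    return "5 rows x 3 columns"


def huge_table():
    time.sleep(0.5)  # stand-in for a preview that takes far too long
    return "never shown"


quick = preview_with_deadline(quick_table, timeout_s=1.0)
slow = preview_with_deadline(huge_table, timeout_s=0.1)
print(quick)
print(slow)
```

The caller gets control back after the deadline either way, which is the property the console needs in order to stay responsive.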
