-
-
Notifications
You must be signed in to change notification settings - Fork 8.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[rust] Tracking Selenium Manager usage through Plausible (#11211) #13173
Conversation
642a543
to
ee3cbb8
Compare
Excellent. Have we tested the output? What does it look like on Plausible? |
deed17c
to
06b3b2d
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## trunk #13173 +/- ##
=======================================
Coverage 58.03% 58.03%
=======================================
Files 88 88
Lines 5335 5335
Branches 224 224
=======================================
Hits 3096 3096
Misses 2015 2015
Partials 224 224 ☔ View full report in Codecov by Sentry. |
06b3b2d
to
d4392f3
Compare
@bonigarcia is this ready? |
Yes, it is. |
Is all still the same after you have placed the call in a non blocking thread? |
The call to Plausible is now executed in a different thread, so the time overhead is null. |
@bonigarcia can you resolve conflicts here? |
fe5c236
to
14e6f72
Compare
Done. |
14e6f72
to
676196a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great. |
cc1641a
to
606111e
Compare
|
We're going to reduce the frequency of reports in the next version, but I'm not sure I understand your underlying concern. |
The underlying concern is some of us use Selenium in more sensitive networks and now we have more work to protect the privacy of what we do to disable this "feature" that was opt-out vs opt-in. I understand the desire to collect such metrics but a more diplomatic way of doing it, in the face of ever growing invasive privacy and tracking issues, would have been "we've added telemetry collection on an opt-in basis. It would help us to know where to focus our efforts if you'd turn it on" vs "" Currently downgraded the version used until I can put in a place to permanently block all telemetry data as the above mentioned environment variable doesn't seem to be checked within the code anywhere and doesn't appear to be a known one for other systems either. |
Regarding privacy, it is worth mentioning that we use https://plausible.io/, which is privacy first and GDPR compliant. |
Thanks for clarifying. I was a little confused because the number of times the same information is sent seemed irrelevant to the point you were making. Three potential concerns as I see it.
We talked about a several different ways to implement this, some more complex and some more simple, and this was the best compromise we came up with. The binary itself has not been opt-in for a while now. The new behavior (automatic downloads) is currently opt-in, but this will change soon as well. |
It seems my concerns weren't clear enough. Regardless of Plausible's stance on privacy, that does not mean all of us wish to share such data for a variety of reasons and this change caused several alarms to flag on our end during testing due to unexpected requests being made that we did not authorize. I don't deny that the data you request is valuable for the project as a whole, but such analytics should be Opt In vs Opt Out. Ask the community as a whole to opt-in to new analytics and tell them why, don't just turn it on with barely a mention in the readme. This issue caused additional billable time to my client to ensure the data is not sent so this decision is having monetary damages to individuals and companies. |
Thank you for providing visibility into this metric gathering. Due to some sensitive work/contracts, I can't allow any data to escape. Could you let me know if there's a way to turn this off for such projects that require it? At the moment, we get |
I understand your concerns, and at the very least it could have been communicated better. Opt-in is typically for personally identifiable info or tracking data; we view this as essentially the same GDPR compliant data we collect by default from all of our website visitors. Everything collected is publicly available. If we had more resources and more assistance we might have found other ways of implementing or communicating things, but we're doing the best we can do with the resources we have. |
Yes, you can disable it by setting environment variable SE_AVOID_STATS to "true" or with the config file |
We've found that using the environment variable does somehow work despite not being checked in code and are using that IN ADDITION to other methods. I'm not dismissing the work you're doing, we do appreciate it, but even with it being GDPR compliant collection it is still something that should have been opt-in. |
It is checked by the binary (generated from the Rust code). @diemol / @bonigarcia do either of you have time to write a blog post and update our docs? I can add it to my list if not. |
@titusfortner FYI on the release blog post, it was categorized as 2023 by mistake, may wish to move that to the 2024 folder so it's more obvious and visible. I agree with others that it should have been communicated more widely in advance, and also that the data collected is valuable. However, an opt-in approach is always best when collecting data. But primarily, communication on the intent should have been made before the release. |
Using GitHub search only shows it in three places and looking through what some would consider to be obvious places does not show that environment variable being checked anywhere. Even extending to all of GitHub still does not show where it is being checked. Again, I agree the data is valuable in general and for the future of the project, and I do appreciate the hard work y'all put into this. In a world where people don't like being tracked and reported on, especially developers, opt-in is ALWAYS the best approach EVEN when the data is compliant with all laws. The data being shared via a browser making a request is expected, that same data being harvested and sent to a third-party we didn't agree to, even if benign, is not. |
I'm not sure why it matters where in the code it is set. I'm not disagreeing with you, your viewpoint is valid, but all software is compromises and tradeoffs based on resources and requirements and the best I can say is that this implementation was at the confluence of our consensus evaluation. |
The location matters to those of us who may now have to do additional auditing to ensure the dependencies we include in our projects don't introduce undesired effects such as this feature. Might I suggest changing the default to be You still get your data, eventually, and can do a re-do on the communication front. Every decision is a compromise and a set of trade offs regardless of it is software or IRL. Having this be opt-out for tracking was a bad decision. |
Alternately, if you specify the location of the driver you want in the service class, the binary won't be used at all. |
For those using the Ruby gem, configure a custom browser and pass an options package to it with this argument.
Still suggest putting in extra measure on your network to block the tracking as a backup however. |
That... shouldn't work. Selenium Manager needs the args and those options arguments get passed to the driver when starting the browser. |
Oddly it does. Would love a more elegant solution to ensure this feature is disabled on deployments. Currently have a custom hosts file, environment variables set, and this to disable sending of data using the selenium-webdrivers gem. |
Hmm, really think is has to be something else and/or this behavior will get changed unexpectedly for you. Can you turn on logging ( |
I saw it report it as unrecognized but it does not attempt to send the analytics. https://gist.github.com/rjhancock/718994edf7756392336f07ae1a05d950 This is only 8 of 1800+ tests. |
It's not compliant, because when you install the package consent is not triggered. Thus when someone installs without reading README but by using standard Python tools they don't see a warning nor consent on stealing sensitive data (and just having an app receive IP address of client without explicit consent is enough for breaking PI laws). As long as you can install selenium without triggering EXPLICIT consent - it is not legal. Moreover - there are use cases when subdomain (non public) can be a privately identifiable (and remember in GDPR Article 4 is identifiable) information. Besides that neither manual nor README includes a statement about telemetry. I see lawsuits written all over this |
@andrew-azarov I am not a lawyer, but SFC approved this tool mainly because of its focus on data privacy.
Check their content about it. https://plausible.io/privacy-focused-web-analytics |
The issue is that USING the PI collecting tool makes you liable for the data you collect. So there should be liability disclosure ie privacy policy, BEFORE you collect the data, doesn't matter how or for what, it can be a local app collecting mouse movement on your PC without sending it to a third party - it still must disclose that this data is collected and saved so the user is aware that this happens. Otherwise it is by all definition a malware. This is the first breach of GDPR. So what should have been done. There is a reason Google switched to consent 2.0 recently. That is because GDPR was written by lawyers and not technical people, and lawyers don't understand that even ping command exposes your IP address. |
We looked at https://plausible.io/data-policy, and that gave us an understanding of what is done with the data. We have also placed this in the docs: https://www.selenium.dev/documentation/selenium_manager/#data-collection, but from what you are saying, is it not enough? Again, I am not a lawyer, and we rely on the advice of others at SFC. However, would you be willing to help us improve based on your comments? We just want to make Selenium better by understanding global usage; we are not interested in individual data. I am just confused because the tone you are using is kind of threatening, so I wonder if you would like to help us improve. |
So you implemented and launched a data collection feature without ever consulting a lawyer or data protection expert, but claim it is GDPR compliant without ever having verified this? That seems very negligent to me. Simply saying 'I am not a lawyer' is not a solution either. When you implement such critical features, you have to ensure compliance before making any claims. |
I did not say that; I said the tool was reviewed by SFC, which owns Selenium and has a legal team. Could you tell me why you suggested that when I did not say that explicitly? I am having trouble understanding the threatening tone. If there is something that needs improvement, we can work it out with your help. This is the type of behavior that burns down OSS maintainers; instead of collaborating, we get harsh comments. |
As I stated previously, the issue is that data started being collected and sent without informed consent from us developers and the only notice that was given was a line item on the release notes. It was not clearly stated what was going on or what was going to happen. Having it be an opt-in situation vs opt-out would have solved this problem AS WELL AS a plea to the community for why you want the data and requesting they turn it on. I recently had this discussion with another OS project about adding Sentry monitoring. Everyone was on board for including the package as it will help, but we decided to very strictly have it off by default. We've already started notifying users MONTHS before we implement it, why we want to collect the data, and requesting they turn it on. In a world where data is king, and is generally taken without consent, be the bigger firm and specifically REQUEST it BEFORE taking it. I've wasted several hours ensuring all of my systems have extra protections in place because I've yet to find a suitable replacement. |
No PII is retained, and we make all information we collect public — https://plausible.io/manager.selenium.dev I do understand your frustration; I also wish there were a better option, but there is no way to get meaningful information and have it be opt-in. So the options were either not to collect any information or to do so the way we have. I get that you would make another decision than we did, that's fine, but it doesn't mean we didn't sufficiently consider the options or that we would have chosen differently if only you explained it louder/harder at us. This conversation is not being helpful to anyone at this point, so I'm locking it. |
Description
This PR implements a mechanism to track the usage of Selenium Manager through Plausible.
To that, the Rust code sends a
pageview
request to plausible using the events API to track visitors. As requested in #11211, the data we want to collect is the following:For "language binding", a new flag called
--language-binding
has been introduced in this PR. Bindings should use this property to report it (e.g.,--language-binding java
,--language-binding ruby
, etc.).All of this information is gathered through custom properties (called
props
in the Plausible Event API). The (fake) page to track these visitors will be the following:https://selenium.dev/sm-stats
An example of the equivalent POST request to track a page view is the following:
The data gathered by Plausible will be available on:
https://plausible.io/selenium.dev
To configure the custom properties (e.g., for filtering), these properties should be enabled:
https://plausible.io/selenium.dev/settings/properties
I tested this feature using a free Plausible account and my personal domain (bonigarcia.dev):
This PR is already pointing to selenium.dev. @diemol @titusfortner If you want to check how data is gathered, you can trigger the CI Rust workflow using the branch
sm_plausible
.NOTE: I was concerned about the impact of this feature on the overall performance of Selenium Manager. To get some numbers, I calculated the average time of running the Selenium Manager tests using the
trunk
branch (i.e., without this feature) compared to the branchsm_plausible
.For that, I executed the following command in Linux:
I executed this command (with every asset in the cache) 10 times for each branch, and I calculated the average total time.
average_total_time_trunk = 11.1399 seconds
average_total_time_plausible = 12.4713 seconds
Therefore, the overhead introduced to the requests made to Plausible averages 0.7341 seconds. Since 23 page views were requested during the tests, each
pageview
request added an overhead of around 32 ms, which seems reasonable.Motivation and Context
This PR resolves #11211.
Types of changes
Checklist