Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[rust] Tracking Selenium Manager usage through Plausible (#11211) #13173

Merged
merged 14 commits into from
Dec 13, 2023

Conversation

bonigarcia
Copy link
Member

@bonigarcia bonigarcia commented Nov 20, 2023

Description

This PR implements a mechanism to track the usage of Selenium Manager through Plausible.

To that, the Rust code sends a pageview request to plausible using the events API to track visitors. As requested in #11211, the data we want to collect is the following:

  • Selenium version
  • Language binding
  • Browser
  • Browser version
  • Operating system
  • Architecture

For "language binding", a new flag called --language-binding has been introduced in this PR. Bindings should use this property to report it (e.g., --language-binding java, --language-binding ruby, etc.).

All of this information is gathered through custom properties (called props in the Plausible Event API). The (fake) page to track these visitors will be the following:

https://selenium.dev/sm-stats

An example of the equivalent POST request to track a page view is the following:

curl -i -X POST https://plausible.io/api/event \
  -H 'User-Agent: Selenium Manager 0.4.16' \
  -H 'Content-Type: application/json' \
  --data '{"name":"pageview","url":"https://selenium.dev/sm-stats","domain":"selenium.dev","props":{"browser":"chrome","browser_version":"119.0.6045.105","os":"linux","arch":"x86_64","lang":"java","selenium_version":"4.16"}}'

The data gathered by Plausible will be available on:

https://plausible.io/selenium.dev

To configure the custom properties (e.g., for filtering), these properties should be enabled:

https://plausible.io/selenium.dev/settings/properties

I tested this feature using a free Plausible account and my personal domain (bonigarcia.dev):

image

This PR is already pointing to selenium.dev. @diemol @titusfortner If you want to check how data is gathered, you can trigger the CI Rust workflow using the branch sm_plausible.

NOTE: I was concerned about the impact of this feature on the overall performance of Selenium Manager. To get some numbers, I calculated the average time of running the Selenium Manager tests using the trunk branch (i.e., without this feature) compared to the branch sm_plausible.

For that, I executed the following command in Linux:

time cargo test

I executed this command (with every asset in the cache) 10 times for each branch, and I calculated the average total time.

average_total_time_trunk = 11.1399 seconds
average_total_time_plausible = 12.4713 seconds

Therefore, the overhead introduced to the requests made to Plausible averages 0.7341 seconds. Since 23 page views were requested during the tests, each pageview request added an overhead of around 32 ms, which seems reasonable.

Motivation and Context

This PR resolves #11211.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • I have read the contributing document.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

rust/src/lib.rs Outdated Show resolved Hide resolved
@titusfortner
Copy link
Member

Excellent. Have we tested the output? What does it look like on Plausible?

@codecov-commenter
Copy link

codecov-commenter commented Nov 23, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (5b1b449) 58.03% compared to head (c7a81fa) 58.03%.

❗ Current head c7a81fa differs from pull request most recent head d4473e4. Consider uploading reports for the commit d4473e4 to get more accurate results

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@           Coverage Diff           @@
##            trunk   #13173   +/-   ##
=======================================
  Coverage   58.03%   58.03%           
=======================================
  Files          88       88           
  Lines        5335     5335           
  Branches      224      224           
=======================================
  Hits         3096     3096           
  Misses       2015     2015           
  Partials      224      224           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@titusfortner
Copy link
Member

@bonigarcia is this ready?
@diemol do we see what the output looks like in Plausible?

@bonigarcia
Copy link
Member Author

is this ready?

Yes, it is.

@diemol
Copy link
Member

diemol commented Dec 5, 2023

@bonigarcia

NOTE: I was concerned about the impact of this feature on the overall performance of Selenium Manager. To get some numbers, I calculated the average time of running the Selenium Manager tests using the trunk branch (i.e., without this feature) compared to the branch sm_plausible.

Is all still the same after you have placed the call in a non blocking thread?

@bonigarcia
Copy link
Member Author

Is all still the same after you have placed the call in a non blocking thread?

The call to Plausible is now executed in a different thread, so the time overhead is null.

@titusfortner
Copy link
Member

@bonigarcia can you resolve conflicts here?

@bonigarcia
Copy link
Member Author

can you resolve conflicts here?

Done.

Copy link
Member

@diemol diemol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the domain because I created one to only have the SM stats and avoid mixing them with the website.

image

rust/src/stats.rs Outdated Show resolved Hide resolved
@bonigarcia
Copy link
Member Author

I changed the domain because I created one to only have the SM stats and avoid mixing them with the website.

Great.

rust/src/stats.rs Show resolved Hide resolved
@rjhancock
Copy link

  1. I wish this was communicated more publicly before it was put in place. I do understand the desire and need to know the information, but make it opt-in vs opt-out.

  2. The method described for opting out appears to be a no-op unless it is applied in a library that is not included within the entire SeleniumHQ organization as it only appears thrice and not a single place to disable sending actually occurs within the code.

  3. It sends those reports multiple times. Just within my test suite it has attempted to send the info at least a dozen times across 1800 tests. Since it is also used in production, it would be reporting thousands of times A DAY just from our use case.

@titusfortner
Copy link
Member

We're going to reduce the frequency of reports in the next version, but I'm not sure I understand your underlying concern.

@rjhancock
Copy link

The underlying concern is some of us use Selenium in more sensitive networks and now we have more work to protect the privacy of what we do to disable this "feature" that was opt-out vs opt-in.

I understand the desire to collect such metrics but a more diplomatic way of doing it, in the face of ever growing invasive privacy and tracking issues, would have been "we've added telemetry collection on an opt-in basis. It would help us to know where to focus our efforts if you'd turn it on" vs ""

Currently downgraded the version used until I can put in a place to permanently block all telemetry data as the above mentioned environment variable doesn't seem to be checked within the code anywhere and doesn't appear to be a known one for other systems either.

@diemol
Copy link
Member

diemol commented Feb 1, 2024

Regarding privacy, it is worth mentioning that we use https://plausible.io/, which is privacy first and GDPR compliant.

@titusfortner
Copy link
Member

Thanks for clarifying. I was a little confused because the number of times the same information is sent seemed irrelevant to the point you were making.

Three potential concerns as I see it.

  1. Performance — shouldn't be affected the http call made by a separate thread
  2. Privacy — there is nothing personally identifiable, browser name, version, os, architecture, calling language. Plausible itself gathers IP addresses for locations, but it is only aggregated: https://plausible.io/privacy-focused-web-analytics
  3. Security — If there is a proxy or other network restrictions, the call won't go anywhere. Also we're still working on getting these signed

We talked about a several different ways to implement this, some more complex and some more simple, and this was the best compromise we came up with. The binary itself has not been opt-in for a while now. The new behavior (automatic downloads) is currently opt-in, but this will change soon as well.

@rjhancock
Copy link

It seems my concerns weren't clear enough. Regardless of Plausible's stance on privacy, that does not mean all of us wish to share such data for a variety of reasons and this change caused several alarms to flag on our end during testing due to unexpected requests being made that we did not authorize.

I don't deny that the data you request is valuable for the project as a whole, but such analytics should be Opt In vs Opt Out. Ask the community as a whole to opt-in to new analytics and tell them why, don't just turn it on with barely a mention in the readme.

This issue caused additional billable time to my client to ensure the data is not sent so this decision is having monetary damages to individuals and companies.

@walksonair
Copy link

Thank you for providing visibility into this metric gathering. Due to some sensitive work/contracts, I can't allow any data to escape. Could you let me know if there's a way to turn this off for such projects that require it? At the moment, we get Error sending stats to Plausible on some networks. Again, it would help if we could opt out of this metric collection for some projects. Thank you again!

@titusfortner
Copy link
Member

I understand your concerns, and at the very least it could have been communicated better. Opt-in is typically for personally identifiable info or tracking data; we view this as essentially the same GDPR compliant data we collect by default from all of our website visitors. Everything collected is publicly available. If we had more resources and more assistance we might have found other ways of implementing or communicating things, but we're doing the best we can do with the resources we have.

@titusfortner
Copy link
Member

Yes, you can disable it by setting environment variable SE_AVOID_STATS to "true" or with the config file

@rjhancock
Copy link

We've found that using the environment variable does somehow work despite not being checked in code and are using that IN ADDITION to other methods.

I'm not dismissing the work you're doing, we do appreciate it, but even with it being GDPR compliant collection it is still something that should have been opt-in.

@titusfortner
Copy link
Member

despite not being checked in code

It is checked by the binary (generated from the Rust code).
We decided that it was only worth collecting the data if it was opt-out.
That there would be concerns and frustrations was foreseeable and we could have communicated it better.

@diemol / @bonigarcia do either of you have time to write a blog post and update our docs? I can add it to my list if not.

@zfouts
Copy link

zfouts commented Feb 2, 2024

@titusfortner FYI on the release blog post, it was categorized as 2023 by mistake, may wish to move that to the 2024 folder so it's more obvious and visible.

I agree with others that it should have been communicated more widely in advance, and also that the data collected is valuable. However, an opt-in approach is always best when collecting data. But primarily, communication on the intent should have been made before the release.

@rjhancock
Copy link

despite not being checked in code

It is checked by the binary (generated from the Rust code).

Using GitHub search only shows it in three places and looking through what some would consider to be obvious places does not show that environment variable being checked anywhere.

Even extending to all of GitHub still does not show where it is being checked.

Again, I agree the data is valuable in general and for the future of the project, and I do appreciate the hard work y'all put into this.

In a world where people don't like being tracked and reported on, especially developers, opt-in is ALWAYS the best approach EVEN when the data is compliant with all laws. The data being shared via a browser making a request is expected, that same data being harvested and sent to a third-party we didn't agree to, even if benign, is not.

@titusfortner
Copy link
Member

I'm not sure why it matters where in the code it is set.
it parses environment variables with SE_*_* and cli values with --*-* and config file values, etc
https://github.com/SeleniumHQ/selenium/blob/selenium-4.17.0/rust/src/lib.rs#L1478-L1480
Everything we see is publicly available on plausible's site - https://plausible.io/manager.selenium.dev

I'm not disagreeing with you, your viewpoint is valid, but all software is compromises and tradeoffs based on resources and requirements and the best I can say is that this implementation was at the confluence of our consensus evaluation.

@rjhancock
Copy link

The location matters to those of us who may now have to do additional auditing to ensure the dependencies we include in our projects don't introduce undesired effects such as this feature.

Might I suggest changing the default to be true until the next major version and re-communicate to the community why the change was made and why you want the data?

You still get your data, eventually, and can do a re-do on the communication front.

Every decision is a compromise and a set of trade offs regardless of it is software or IRL. Having this be opt-out for tracking was a bad decision.

@titusfortner
Copy link
Member

Alternately, if you specify the location of the driver you want in the service class, the binary won't be used at all.

@rjhancock
Copy link

For those using the Ruby gem, configure a custom browser and pass an options package to it with this argument.

options.add_argument('--avoid-stats=true')

Still suggest putting in extra measure on your network to block the tracking as a backup however.

@titusfortner
Copy link
Member

pass an options package to it with this argument

That... shouldn't work. Selenium Manager needs the args and those options arguments get passed to the driver when starting the browser.

@rjhancock
Copy link

Oddly it does. Would love a more elegant solution to ensure this feature is disabled on deployments. Currently have a custom hosts file, environment variables set, and this to disable sending of data using the selenium-webdrivers gem.

@titusfortner
Copy link
Member

Hmm, really think is has to be something else and/or this behavior will get changed unexpectedly for you. Can you turn on logging (Selenium::WebDriver.logger.level = :debug) and link to a gist with the results?

@rjhancock
Copy link

I saw it report it as unrecognized but it does not attempt to send the analytics.

https://gist.github.com/rjhancock/718994edf7756392336f07ae1a05d950

This is only 8 of 1800+ tests.

@andrew-azarov
Copy link

andrew-azarov commented Jun 11, 2024

Regarding privacy, it is worth mentioning that we use https://plausible.io/, which is privacy first and GDPR compliant.

It's not compliant, because when you install the package consent is not triggered. Thus when someone installs without reading README but by using standard Python tools they don't see a warning nor consent on stealing sensitive data (and just having an app receive IP address of client without explicit consent is enough for breaking PI laws).

As long as you can install selenium without triggering EXPLICIT consent - it is not legal.

Moreover - there are use cases when subdomain (non public) can be a privately identifiable (and remember in GDPR Article 4 is identifiable) information.
For example you setup a catchall subdomain ala jack-coch-zx0vu80.clients.website.tld for ease of access to a private panel for customer and limit that to 1 ip address or subnet - this is by all definitions private space and private identifiable information as nobody but the script runner can see it and access it. And now this goes to plausible without ANY consent.

Besides that neither manual nor README includes a statement about telemetry.
EDIT: not even website has consent or Privacy policy. This must be reported asap to national PII protection services.
Any company which uses selenium - could have inadvertently compromised their own privacy policies and must report breach (https://commission.europa.eu/law/law-topic/data-protection/reform/rules-business-and-organisations/obligations/what-data-breach-and-what-do-we-have-do-case-data-breach_en)

I see lawsuits written all over this

@diemol
Copy link
Member

diemol commented Jun 11, 2024

@andrew-azarov I am not a lawyer, but SFC approved this tool mainly because of its focus on data privacy.

Built with privacy of your visitors in mind. No need to annoy them with a cookie/GDPR consent banner.

Check their content about it. https://plausible.io/privacy-focused-web-analytics

@andrew-azarov
Copy link

andrew-azarov commented Jun 11, 2024

@andrew-azarov I am not a lawyer, but SFC approved this tool mainly because of its focus on data privacy.

Built with privacy of your visitors in mind. No need to annoy them with a cookie/GDPR consent banner.

Check their content about it. https://plausible.io/privacy-focused-web-analytics

The issue is that USING the PI collecting tool makes you liable for the data you collect. So there should be liability disclosure ie privacy policy, BEFORE you collect the data, doesn't matter how or for what, it can be a local app collecting mouse movement on your PC without sending it to a third party - it still must disclose that this data is collected and saved so the user is aware that this happens. Otherwise it is by all definition a malware.
Just a note about data collection deep in the help docs is not enough, especially when you can completely skip the docs when installing via packages.

This is the first breach of GDPR.
Second one is that they SEND the data to plausible, inadvertently exposing IP addresses, library and OS data (which can include hostname in some cases, even hidden hostnames like Kernel compilation hostname of internal VM in case of BSD for example).
As there is no detailed description of what data is sent I can only speculate, but per GDPR if the bulk of data is enough for personal identification - it is not compliant, it must be anonymized beforehand.
Plausible in this case receives IP (because selenium connects to it) and browser, OS, whatever else there. It's enough to identify a person even without explicit personal information available like in case of hostname I wrote about previously.

So what should have been done.
Selenium MUST have a privacy policy with clearly described and written PI it uses. Also this document must disclose all of the third parties processing that data. The consent must be explicitly taken upon installation of software, if no consent and selenium wants to keep telemetry on by default - then selenium MUST NOT be able to install.

There is a reason Google switched to consent 2.0 recently. That is because GDPR was written by lawyers and not technical people, and lawyers don't understand that even ping command exposes your IP address.

@diemol
Copy link
Member

diemol commented Jun 11, 2024

We looked at https://plausible.io/data-policy, and that gave us an understanding of what is done with the data.

We have also placed this in the docs: https://www.selenium.dev/documentation/selenium_manager/#data-collection, but from what you are saying, is it not enough?

Again, I am not a lawyer, and we rely on the advice of others at SFC. However, would you be willing to help us improve based on your comments? We just want to make Selenium better by understanding global usage; we are not interested in individual data. I am just confused because the tone you are using is kind of threatening, so I wonder if you would like to help us improve.

@Schnurres
Copy link

So you implemented and launched a data collection feature without ever consulting a lawyer or data protection expert, but claim it is GDPR compliant without ever having verified this?

That seems very negligent to me. Simply saying 'I am not a lawyer' is not a solution either. When you implement such critical features, you have to ensure compliance before making any claims.

@diemol
Copy link
Member

diemol commented Jun 11, 2024

I did not say that; I said the tool was reviewed by SFC, which owns Selenium and has a legal team. Could you tell me why you suggested that when I did not say that explicitly?

I am having trouble understanding the threatening tone. If there is something that needs improvement, we can work it out with your help. This is the type of behavior that burns down OSS maintainers; instead of collaborating, we get harsh comments.

@rjhancock
Copy link

As I stated previously, the issue is that data started being collected and sent without informed consent from us developers and the only notice that was given was a line item on the release notes.

It was not clearly stated what was going on or what was going to happen. Having it be an opt-in situation vs opt-out would have solved this problem AS WELL AS a plea to the community for why you want the data and requesting they turn it on.

I recently had this discussion with another OS project about adding Sentry monitoring. Everyone was on board for including the package as it will help, but we decided to very strictly have it off by default. We've already started notifying users MONTHS before we implement it, why we want to collect the data, and requesting they turn it on.

In a world where data is king, and is generally taken without consent, be the bigger firm and specifically REQUEST it BEFORE taking it. I've wasted several hours ensuring all of my systems have extra protections in place because I've yet to find a suitable replacement.

@titusfortner
Copy link
Member

No PII is retained, and we make all information we collect public — https://plausible.io/manager.selenium.dev
If you can find lawyers that think they can win an argument that this constitutes a violation of personal data retention, they can talk to our lawyers.

I do understand your frustration; I also wish there were a better option, but there is no way to get meaningful information and have it be opt-in. So the options were either not to collect any information or to do so the way we have. I get that you would make another decision than we did, that's fine, but it doesn't mean we didn't sufficiently consider the options or that we would have chosen differently if only you explained it louder/harder at us.

This conversation is not being helpful to anyone at this point, so I'm locking it.

@SeleniumHQ SeleniumHQ locked as too heated and limited conversation to collaborators Jun 11, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[🚀 Feature]: Implement download tracking in selenium manager
9 participants