supermaven-nvim sends the entire buffer to the server even when ignore_filetypes is configured to skip the file #85

Open
cfal opened this issue Aug 26, 2024 · 36 comments

Comments

@cfal

cfal commented Aug 26, 2024

supermaven-nvim adds a TextChanged autocmd here which calls binary:on_update

binary:on_update(buffer, file_name, "text_changed")

BinaryLifecycle:on_update sends everything to stdin which i assume ends up writing to the server (it's a closed source binary that is fetched so i can't easily check):

this code path never seems to hit poll_once which is the only place where ignore_filetypes seems to be checked:

if config.ignore_filetypes[vim.bo.filetype] then

it seems misleading that ignore_filetypes doesn't actually ignore files of that filetype and instead will send everything in every buffer backed by a file.

@cfal
Author

cfal commented Aug 26, 2024

seems like this is a dupe of https://github.com/supermaven-inc/supermaven-nvim/pull/35/files from 3 months ago which isn't even merged. what an incredible lack of urgency for a huge privacy issue.

@sm-victorw
Collaborator

https://github.com/supermaven-inc/supermaven-nvim/pull/35/files does not address the issue being raised here, as the sm-agent binary automatically includes files in the git repo as part of the context, even if they are not opened.

If a file contains sensitive information it should be listed in .gitignore, as Supermaven does not send files matched by .gitignore to the server even if they are opened. Alternatively, you could add a .supermavenignore file; files matching the globs specified in it will also not be sent to the server.

This isn't clear from documentation, so I think we could change that...

@cfal
Author

cfal commented Aug 26, 2024

https://github.com/supermaven-inc/supermaven-nvim/pull/35/files does not address the issue being raised here, as the sm-agent binary automatically includes files in the git repo as part of the context, even if they are not opened.

this also needs to be clearly documented.

If a file contains sensitive information it should be listed in .gitignore, as Supermaven does not send files matched by .gitignore to the server even if they are opened. Alternatively, you could add a .supermavenignore file; files matching the globs specified in it will also not be sent to the server.

this is untenable for large or internal repos. imo there should be a way (allowlist and blocklist) to configure which repos to enable.

@ahmedelgabri

this is untenable for large or internal repos. imo there should be a way (allowlist and blocklist) to configure which repos to enable.

I think this #58 could solve it in a programmable way. Check the path of the file, and disable supermaven when needed. Because ignore_filetypes is very limited. But again, was open for 2 months and not merged yet.

@sm-victorw
Collaborator

I've merged both PRs mentioned, as they are useful in their own rights, and could seemingly help address some of the privacy concerns here, though as I mentioned earlier these don't address the underlying issue involving sm-agent, which .supermavenignore was intended to solve

@GitMurf

GitMurf commented Aug 26, 2024

...though as I mentioned earlier these don't address the underlying issue involving sm-agent, which .supermavenignore was intended to solve

Thank you for confirming because I was wondering the same thing. I believe your point is that the full context of the repository is sent at startup via the sm binary which has nothing to do with the neovim plugin / config? And the only way to prevent things is either in the .gitignore or .supermavenignore as the binary respects those by default out of the box (regardless of anything in the neovim plugin).

Do I have this correct @sm-victorw ?

@GitMurf

GitMurf commented Aug 26, 2024

@sm-victorw this brings up two further questions I have been wondering about:

  1. is there any command we can run to see exactly what supermaven is using as context and has sent to the servers?
    • If not, this would be a wonderful command to add to the neovim plugin to be able to log out all the files sent / in context
    • This would give users confidence / peace of mind on what is actually being sent
    • And to help them test their configuration to make sure it is doing what they want.
  2. what if I am in a github repo (cwd) in neovim but open up a buffer with a file from outside the repo? See example:
    • Like a markdown file or a .env file or even a code file from another repository (I think this is a common practice)... what will happen?
    • will it send / add that file to the context and send to the servers?
    • will the gitignore still apply even if the file is not in the repo?
      • For example if my gitignore has .env in it for my repo I am currently in, but I open up a .env file from outside the repo (somewhere else on my hard disk)
      • will it still respect that filter before sending it to the server?
      • or because it's outside the repo and a buffer opened separately, will it send it up regardless?

Thanks in advance for clearing these things up!

@sm-victorw
Collaborator

Thank you for confirming because I was wondering the same thing. I believe your point is that the full context of the repository is sent at startup via the sm binary which has nothing to do with the neovim plugin / config? And the only way to prevent things is either in the .gitignore or .supermavenignore as the binary respects those by default out of the box (regardless of anything in the neovim plugin).

Yes this is roughly what is happening, though depending on how large the repository is, the context might not include everything. Also note that the context is kept on the server for up to 7 days, as mentioned in the code policy (https://supermaven.com/code-policy)

is there any command we can run to see exactly what supermaven is using as context and has sent to the servers?

There isn't currently any way to see exactly what is being included in the context. If you are interested in which files are eligible to be included, the sm-agent binary (typically located at $HOME/.supermaven/binary/[version]/[platform]-[arch]/sm-agent) can be run with the list-files command to see what isn't being ignored, e.g. ./sm-agent list-files /path/to/repo

If you are interested in whether or not a file is being ignored, ./sm-agent check-ignore /path/to/file can be used as well
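To check several paths in one go, the check-ignore command mentioned above can be wrapped in a small shell function. This is only a sketch: the SM_AGENT variable and the check_paths name are made up for illustration, and since the output format of check-ignore is not described in this thread, the function just passes through whatever the binary prints.

```shell
#!/bin/sh
# Hypothetical helper: SM_AGENT must be set by the user to the path of the
# sm-agent binary. The real location depends on version and platform, so it
# is deliberately not guessed here.
check_paths() {
  if [ -z "${SM_AGENT:-}" ] || [ ! -x "$SM_AGENT" ]; then
    echo "set SM_AGENT to the path of the sm-agent binary" >&2
    return 2
  fi
  for target in "$@"; do
    printf '== %s\n' "$target"
    # check-ignore is the subcommand named in the comment above; its output
    # is printed unmodified because its format is not documented here.
    "$SM_AGENT" check-ignore "$target"
  done
}

# Example (paths are placeholders):
#   SM_AGENT=/path/to/sm-agent check_paths ~/notes/todo.md ./.env
```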

what if I am in a github repo (cwd) in neovim but open up a buffer with a file from outside the repo?

Whenever a file is sent to the binary, the only .gitignore/.supermavenignore considered are the ones inside the repository of the file in question. If you have multiple buffers open they could potentially be following different .gitignore rules. The .env in your scenario would be uploaded if it isn't part of a git repository. In general files which are not part of a git repository are uploaded when they are edited, with no additional context included.
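The rule described above (only the ignore files of the file's own repository apply, and files outside any repository get no ignore rules at all) can be approximated locally with a short shell sketch. The function names are hypothetical, and the upward walk looking for a .git directory is an assumption about how the binary locates the repository, not its actual implementation.

```shell
#!/bin/sh
# Walk upward from a file's directory until a .git directory is found.
# Prints the repository root, or returns 1 if no enclosing repository exists.
repo_root_for() {
  dir=$(cd "$(dirname "$1")" 2>/dev/null && pwd -P) || return 1
  while :; do
    if [ -d "$dir/.git" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
    [ "$dir" = "/" ] && return 1
    dir=$(dirname "$dir")
  done
}

# Report which repository's ignore files would be in scope for a path.
# "no repo" corresponds to the case where the file is uploaded when edited.
ignore_scope_for() {
  if root=$(repo_root_for "$1"); then
    printf 'repo: %s\n' "$root"
  else
    printf 'no repo: %s\n' "$1"
  fi
}
```

Running ignore_scope_for on each open buffer's path shows which files fall into the unprotected "no repo" case.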

The lack of control for non-git files is unfortunate, and should have a robust solution. ignore_filetypes was not intended for this use, and until now wasn't meant to be a privacy related feature. Ideally we will have an allow/blocklist of some kind that does not make this sort of determination based on file type.

@GitMurf

GitMurf commented Aug 26, 2024

Thank you for answering all my questions. Exactly what I needed.

I think the biggest "risk" are the files outside of the git repo. Personal markdown notes, internal docs etc.

Is this something handled in the nvim plugin? If so, I wonder whether, for the time being, a super conservative approach would be to just prompt the user in nvim for any file outside of the git repository, asking if they want it uploaded, since typically these will just be one-off files opened up ad hoc.

Another option that would be nice is a config option to just blanket disable uploading any files outside the git repo (if that's possible).

@ahmedelgabri

I think the biggest "risk" are the files outside of the git repo. Personal markdown notes, internal docs etc.

Wouldn't a single .supermavenignore in $HOME solve this?

  • Open .env inside repo (ideally this should be ignored because of <repo root>/.gitignore)
  • Open a new buffer for ~/myNotes/note.md (this should be ignored because of $HOME/.supermavenignore)

@GitMurf

GitMurf commented Aug 27, 2024

@ahmedelgabri thanks for the response.

  1. The .env (is just a common example) or any other sensitive info is not always going to be from the same repo root that I am currently cwd at. Often times I am flipping between repos and have to open up common files that would not be under that particular git repo. Based on the response above, it is only covered if the files in the .gitignore are actually within that repo.

  2. On windows it is not common to have your files (like notes etc.) under your "HOME" (I put in quotes because we don't really have a HOME ;) .... it is usually something like USERPROFILE) ... documents / notes are often not under that "HOME" path. But even if they were, I don't know that supermaven is looking that far up the tree looking for a supermaven ignore?

Is there any official documentation on using supermavenignore?

@leet0rz

leet0rz commented Aug 27, 2024

Is there a way for supermaven to just not do this in the first place out of the box or does it have to have this behavior? No one wants their personal information leaked.

@sm-victorw
Collaborator

sm-victorw commented Aug 27, 2024

@leet0rz Could you specify which behavior you are referring to? The uploading of non-repository files? Or the repository based indexing that the binary performs?

We could probably give the option to have the plugin disabled by default, and require a call to the api (.start()) before the binary is ever started, or something similar to this. I'm not sure if that's what you're proposing

@leet0rz

leet0rz commented Aug 27, 2024

@leet0rz Could you specify which behavior you are referring to? The uploading of non-repository files? Or the repository based indexing that the binary performs?

We could probably give the option to have the plugin disabled by default, and require a call to the api (.start()) before the binary is ever started, or something similar to this. I'm not sure if that's what you're proposing

I mean not entirely sure how this works but this does seem like a major privacy concern, as stated before obviously people will run this in all sorts of notes and would never want their personal information uploaded or leaked in any way and supermaven should not be uploading this sort of information in any way to anything ever. What I heard is that it uploads the entire buffer and I guess sources or creates information or "AI responses" or inputs that we can accept from that? If that is the case, is it possible to do this locally instead of uploading it (which is the privacy concern).

I hope I am doing an ok job explaining this and have actually understood what's going on?

@GitMurf

GitMurf commented Aug 28, 2024

@leet0rz the power comes from uploading. Most laptops are not powerful enough to do the type of processing it does and even if it could our laptops would be burning up high cpu/gpu/ram resources constantly. Also to be clear, this is how most of these AI code tools work including GitHub copilot. The difference is Supermaven is more powerful sending your entire repository to its models (more context). None of those things are the main problem. The main problem really is files that are not in your git repository but that you open in a buffer because those also are being sent up to the servers.

@GitMurf

GitMurf commented Aug 28, 2024

We could probably give the option to have the plugin disabled by default, and require a call to the api (.start()) before the binary is ever started...

@sm-victorw I think this would be great as step 1. But I think the other important thing should be changing the default of any files that are not part of your current opened git repository should be opt-in instead of opt-out. By default files outside git repo are not sent to servers unless you white list them... preferably a glob / glob array, or even better a callback function we can configure to return true if we want a file sent to servers (with the file path as an input parameter to the cb function).

Thoughts?

@leet0rz

leet0rz commented Aug 29, 2024

@leet0rz the power comes from uploading. Most laptops are not powerful enough to do the type of processing it does and even if it could our laptops would be burning up high cpu/gpu/ram resources constantly. Also to be clear, this is how most of these AI code tools work including GitHub copilot. The difference is Supermaven is more powerful sending your entire repository to its models (more context). None of those things are the main problem. The main problem really is files that are not in your git repository but that you open in a buffer because those also are being sent up to the servers.

What about usage outside of github, when you just use neovim to open personal files, which a lot of us do? Won't that still upload the entire buffer and cause a privacy concern? I use neovim to open any file I want to edit, outside of github-related things too. If I open a text document with sensitive information in it while supermaven is enabled by default, won't that cause said privacy concern?

@GitMurf

GitMurf commented Aug 29, 2024

@leet0rz yes that is the concern we have been discussing in this thread. It is definitely a concern. I was just explaining why the idea of doing anything local just on your machine is not an option.

@sm-victorw
Collaborator

@leet0rz Yes, both the pull requests mentioned earlier in this issue can help mitigate this issue, but as I mentioned earlier we are going to want a robust and clear approach for letting users specify which files they would like to exclude

@leet0rz

leet0rz commented Aug 29, 2024

@GitMurf @sm-victorw Cool thanks guys.

@dzirtusss

Another side of the problem is that if I create some temporary file, I first have to update .gitignore, and only then can I start doing something.

I mean, normally it is the opposite: I work in a project-local directory which is "safe", and only when committing do I think about what should be committed, what should be gitignored and what should be deleted.

I mean now, if I create any temporary and/or scratch file with some probable secret inside the repo folder, even when nvim runs in a different window, e.g. as script output (I usually do some script > 1.txt), it will be uploaded to supermaven. And supermaven will "like" that file because it is fresh.

Which is an even worse problem, because many tools "expect" to run from the project folder to pick up configuration.

Atm, I think I might do:

# .supermavenignore
*
!*.js
!*.jsx
...

This at least might prevent some surprises.
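An allowlist file like the one above can be generated from a list of extensions. The sketch below is hedged in two ways: make_allowlist is a hypothetical helper, and it assumes .supermavenignore follows gitignore pattern semantics, which this thread does not confirm. Under gitignore rules a bare * also excludes directories, and files inside an excluded directory cannot be re-included, so the sketch adds !*/ to keep subdirectories traversable.

```shell
#!/bin/sh
# Print an allowlist-style ignore file: ignore everything, then re-include
# only the given extensions. Assumes gitignore-style pattern semantics.
make_allowlist() {
  echo '# ignore everything by default'
  echo '*'
  echo '# keep directories traversable so nested files can be re-included'
  echo '!*/'
  for ext in "$@"; do
    printf '!*.%s\n' "$ext"
  done
}

# Example:
#   make_allowlist js jsx > .supermavenignore
```

After writing the file, the list-files command mentioned earlier in the thread (./sm-agent list-files /path/to/repo) is the way to confirm which files actually remain eligible.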

@dzirtusss

What might also be useful is a GLOBAL IGNORE, somewhere in ~/.supermaven: a system-wide set of rules followed by the binary regardless of whether a file is in a git repo or not. Maybe local supermavenignores should override it, maybe not.

@dreson4

dreson4 commented Sep 20, 2024

I have seen this issue again and again. I stopped using it for a while as it's a big issue.
I have files in .gitignore it works well on some projects on some it doesn't care simply sends everything.
On VSCode it works much better compared to other IDEs, this problem happens frequently on Jetbrains IDEs. I'm using Goland, you just have to pray for it to skip sometimes. On VSCode it almost always skips

@leet0rz

leet0rz commented Sep 20, 2024

I have seen this issue again and again. I stopped using it for a while as it's a big issue. I have files in .gitignore it works well on some projects on some it doesn't care simply sends everything. On VSCode it works much better compared to other IDEs, this problem happens frequently on Jetbrains IDEs. I'm using Goland, you just have to pray for it to skip sometimes. On VSCode it almost always skips

For me the issue is having to add files to ignore, I don't want to do that. I want non-code files to be ignored by default. I don't want to keep track of and ignoring every file except for my code files, that should be default behavior if it's not.

@sm-victorw
Collaborator

I have seen this issue again and again. I stopped using it for a while as it's a big issue. I have files in .gitignore it works well on some projects on some it doesn't care simply sends everything. On VSCode it works much better compared to other IDEs, this problem happens frequently on Jetbrains IDEs. I'm using Goland, you just have to pray for it to skip sometimes. On VSCode it almost always skips

Can you elaborate on what you mean it 'skips'? As in you get completions on files which are included in .gitignore? The intellij and neovim plugins are not responsible for deciding what is or isn't sent to the server, this is determined by the binary sm-agent which makes that determination based on the file path and any .gitignore it finds. Until somewhat recently all of these plugins used the same binary so the behavior shouldn't have been different

@dzirtusss

There is a way to guarantee that the binary only uses permitted files on macOS via sandboxing. This is a native OS feature, thus highly secure, and only a couple of text files are needed.

How to do:

  1. create a wrapper for the agent somewhere, e.g.:
#!/bin/sh
sandbox-exec -f /.../supermaven.sb /.../.supermaven/binary/v15/macosx-aarch64/sm-agent "$@"
  2. create a policy
(version 1)
(allow default)

(deny file-read*)
(allow file-read* (literal "/"))
(allow file-read* (subpath "/System/Volumes/Preboot/Cryptexes/OS"))
(allow file-read* (subpath "/dev"))
(allow file-read* (subpath "/Library/Preferences"))
(allow file-read* (subpath "/usr/share/icu"))
(allow file-read* (subpath "/private/var/db/timezone"))
(allow file-read* (subpath "/var"))

(allow file-read* (subpath "/Users/sergey/.supermaven"))

(allow file-read-metadata (subpath "/Users/sergey/projects"))

(allow file-read* (regex #"/\.git/"))
(allow file-read* (regex #"/\.gitignore$"))
(allow file-read* (regex #"/\.supermavenignore$"))

(allow file-read* (regex #"\.rb"))
(allow file-read* (regex #"\.lua"))

Here the first group of rules is needed to start the binary correctly (including all shared system libs); the next rules let it read its own folder, then the ignore files, and finally restrict reads to Ruby/Lua files.

  3. Fork the plugin and replace the binary with the wrapper (or if you don't wanna fork, use other ways, e.g. links)

This ^^^ is a fully working template, which I wanted to improve, but I don't have time atm. Thus I decided to post it AS IS so that somebody may pick it up. When/if I have more time to work on this, I will post an updated version.

The beauty of this way is that compliance is guaranteed by OS sandboxing (at least for the binary); the plugin is another story, as it may send whatever directly.

The system libs restrictions should definitely be fine-tuned more, but overall I don't care that much about that part, as it is just what any binary normally needs and doesn't relate much to personal sensitive info.

@dreson4

dreson4 commented Sep 22, 2024

I have seen this issue again and again. I stopped using it for a while as it's a big issue. I have files in .gitignore it works well on some projects on some it doesn't care simply sends everything. On VSCode it works much better compared to other IDEs, this problem happens frequently on Jetbrains IDEs. I'm using Goland, you just have to pray for it to skip sometimes. On VSCode it almost always skips

Can you elaborate on what you mean it 'skips'? As in you get completions on files which are included in .gitignore? The intellij and neovim plugins are not responsible for deciding what is or isn't sent to the server, this is determined by the binary sm-agent which makes that determination based on the file path and any .gitignore it finds. Until somewhat recently all of these plugins used the same binary so the behavior shouldn't have been different

Yes, as in it doesn't work: I get completions on files added to .gitignore. This happens on nvim and goland. When I use vscode it ignores the files, but when I open the same project on goland I get completions. Sometimes it works and sometimes it just doesn't; it's hit and miss, like 50% of the time it ignores fine and the other 50% it autocompletes.

I made sure I opened the project root, same as in vscode. I checked .gitignore and put both .env and *.env in it; it doesn't work and still autocompletes on the files. I have just disabled it on those two IDEs at the moment.

@zoltrain

I've been on the pro trial of Supermaven for the past week and came across this issue. If I were to have a preference as a vim user, I think it would be for file groups. Sort of like how prettier works with lazyvim. Named groups that can be 1 to N file extensions. Then you can give people presets, but they also can just enable explicitly what they need.

You'll also likely need a local override for repos where you want to enable "more" types. E.g. you might have an infrastructure repo with both terraform and pulumi in it, or k8s manifests you want to manage where you want to use the YAML integration. But I won't want supermaven to pay attention in the repo I'm in. Then it becomes DENY by default, with explicit ALLOW, instead of the current status quo of ALLOW by default, with explicit DENY.

In my case I tend to only work in a handful of languages, and always configure vim to only really care about those languages over enabling everything. If this is explicit on setup, and there are concrete examples and definitions of each "group", then it should be easy for a user to enable what they need. It would also put the risks front and centre, with it being part of the setup: a great place to educate users on the risks associated with enabling copilots.

@lucax88x

let's suppose we have this structure.

~/.config
~/repos/important-repo1
~/repos/important-repo2
~/repos/my-custom-project

1) if I add a .supermavenignore in ~ (so in the root), is it gonna make sm-agent stop working for all the child folders?

2) if in the .supermavenignore I add

!./repos/my-custom-project

will supermaven work only for that folder?

3) if I start neovim and supermaven, accidentally, in one of the important repos, is it gonna respect the .supermavenignore even if the cwd & git root is not ~?

can we use .supermavenignore as whitelist?

We have tons of professionals who could use and pay for these AI features, but we're held back by the "fear" of sending "important" repos to the servers. Why does nobody go for the "whitelist" approach?

@sm-victorw
Collaborator

@lucax88x Currently the sm-agent binary is expecting .supermavenignore to be contained in the git repo, it does not function as a whitelist and won't be respected if it is in the parent directory of the git repository being edited (the sm-agent binary makes the determination of where the git repository begins based on the presence of the .git directory)

Based on the discussion that has taken place so far, it sounds like a .supermavenignore_global that is always respected could be a potential solution to this issue? We can look into making such an addition if there is support for such a change among users.

@lucax88x

lucax88x commented Oct 1, 2024

@lucax88x Currently the sm-agent binary is expecting .supermavenignore to be contained in the git repo, it does not function as a whitelist and won't be respected if it is in the parent directory of the git repository being edited (the sm-agent binary makes the determination of where the git repository begins based on the presence of the .git directory)

Based on the discussion that has taken place so far, it sounds like a .supermavenignore_global that is always respected could be a potential solution to this issue? We can look into making such an addition if there is support for such a change among users.

if in this .supermavenignore_global I'm then allowed to do something like:

** <-- ignore everything

!./.repos/my-custom-project <-- whitelist this folder

then yes, I think it would be great, imho!

@leet0rz

leet0rz commented Oct 1, 2024

I guess we could just only launch supermaven if a certain filetype is open too and have it disabled otherwise, that would be a bit better for my usage and I could just do that myself with an autocmd.

@zoltrain

zoltrain commented Oct 1, 2024

@leet0rz even with an autocmd, wouldn't you need to pass these preferences to sm-agent? I think what @sm-victorw is pointing out is whatever is done, it's the agent that needs to know about inclusion rules. I also might be missing something with how sm-agent works.

So I'm not even sure writing a custom function in the setup like so

require("supermaven-nvim").setup({
  condition = function()
    return string.match(vim.fn.expand("%:t"), "foo.sh")
  end,
})

would stop files being added to the prompt context, I think this just stops the sm-agent from being started when the condition is met?

@sm-victorw if I were to add a rule to say "ignore js files" as a condition function, and then open up a markdown file that starts the agent, would files "ignored" this way in setup still be sent to the server if it's not in one of the ignore files?

@leet0rz

leet0rz commented Oct 18, 2024

@leet0rz even with an autocmd, wouldn't you need to pass these preferences to sm-agent? I think what @sm-victorw is pointing out is whatever is done, it's the agent that needs to know about inclusion rules. I also might be missing something with how sm-agent works.

So I'm not even sure writing a custom function in the setup like so

require("supermaven-nvim").setup({
  condition = function()
    return string.match(vim.fn.expand("%:t"), "foo.sh")
  end,
})

would stop files being added to the prompt context, I think this just stops the sm-agent from being started when the condition is met?

@sm-victorw if I were to add a rule to say "ignore js files" as a condition function, and then open up a markdown file that starts the agent, would files "ignored" this way in setup still be sent to the server if it's not in one of the ignore files?

I just set one up and it seems to work fine, I am getting the completion at least for the filetypes of my choosing and not in other files like txt or md where I do not want it running.

@ahmedelgabri

I just set one up and it seems to work fine, I am getting the completion at least for the filetypes of my choosing and not in other files like txt or md where I do not want it running.

That's not the main problem, the main problem is that foo.sh will still be scanned because it's part of the current project and you have bar.lua open for example. Because condition won't work in this case and the agent will run scanning the project for context.

Hence the .supermavenignore but this seems to be only project-specific, you can't have a global .supermavenignore or control the exclusions through the plugin.

Having a native way of doing this through the agent binary itself when it runs will eliminate these issues, it could be something like this:

ignore = {
  patterns = {'^.env.*'},
  filetypes = {'md', 'sh'}
}

Then passing these to the binary as flags for example

sm-agent --ignore-pattern '^.env.*' --ignore-filetype md --ignore-filetype sh

@leet0rz

leet0rz commented Oct 18, 2024

I just set one up and it seems to work fine, I am getting the completion at least for the filetypes of my choosing and not in other files like txt or md where I do not want it running.

That's not the main problem, the main problem is that foo.sh will still be scanned because it's part of the current project and you have bar.lua open for example. Because condition won't work in this case and the agent will run scanning the project for context.

Hence the .supermavenignore but this seems to be only project-specific, you can't have a global .supermavenignore or control the exclusions through the plugin.

Having a native way of doing this through the agent binary itself when it runs will eliminate these issues, it could be something like this:

ignore = {
  patterns = {'^.env.*'},
  filetypes = {'md', 'sh'}
}

Then passing these to the binary as flags for example

sm-agent --ignore-pattern '^.env.*' --ignore-filetype md --ignore-filetype sh

So even if I open it only on a certain filetype, sm will scan anything in that directory and all the subdirs? I could see that being bad if that is the case and in my opinion that should not be a default ever.


11 participants