
add dataproxy access to ebrains_drive #15

Merged

13 commits merged into HumanBrainProject:master from feat_dataproxyMVP on Aug 15, 2022

Conversation

xgui3783
Contributor

@xgui3783 xgui3783 commented Jun 30, 2022

TODO:

  • code documentation
  • add test coverage

@xgui3783 xgui3783 marked this pull request as ready for review July 11, 2022 14:36
@xgui3783 xgui3783 changed the title from "draft: add dataproxy access to ebrains_drive" to "add dataproxy access to ebrains_drive" Jul 11, 2022
@xgui3783
Contributor Author

@apdavison or @appukuttan-shailesh I have tentatively added dataproxy support in ebrains-drive.

Is there any appetite to incorporate this functionality into ebrains-drive, or would it, in your opinion, be more suitable to create a new package?

@appukuttan-shailesh
Collaborator

I personally think it makes sense to include it here.
@Cracky5457 might also want to weigh in?

(I was myself intending to implement this at some point, so I am happy to see this contribution! Thanks @xgui3783!)

@xgui3783
Contributor Author

xgui3783 commented Jul 12, 2022

We will also add support for HDG datasets in future commits to this PR.

I will convert this PR to a draft for the time being.

Please feel free to provide feedback all the same!

I will ping you all when the PR is ready for review.

@xgui3783 xgui3783 marked this pull request as draft July 12, 2022 14:11
@xgui3783 xgui3783 marked this pull request as ready for review July 13, 2022 06:19
@xgui3783
Contributor Author

xgui3783 commented Jul 13, 2022

@i-Zaak I have pushed a commit that should add support for HDG (see the README example).

I would welcome your review of the PR as a user as well.

Whilst not in scope, would support for public {collab,dataset} buckets be something we are interested in?

@Cracky5457
Collaborator

Hello, thanks for these developments.

We just need to ensure that the lib works properly in the different environments (prod, int, dev); putting the dataproxy and buckets on int is under discussion too.

@xgui3783
Contributor Author

We just need to ensure that the lib works properly in the different environments (prod, int, dev); putting the dataproxy and buckets on int is under discussion too.

Would be lovely if so.

I don't think I have access to the -int/-dev environments of data-proxy, so currently the client raises an exception if an env other than the empty string is passed.

@Cracky5457
Collaborator

Cracky5457 commented Jul 21, 2022

Were you able to try your changes, and do you want a new release version for them?

@appukuttan-shailesh
Collaborator

@xgui3783: I notice that you have provided some examples in the README.md file. But could you also list out all the functionality available inside doc.md?

@appukuttan-shailesh
Collaborator

Also, I do not use buckets much, so it is probably best to have someone who actively uses them test this out. Thanks again.

@xgui3783
Contributor Author

@xgui3783: I notice that you have provided some examples in the README.md file. But could you also list out all the functionality available inside doc.md?

I will add some documentation there. I had not noticed doc.md.

Also, I do not use buckets much, so it is probably best to have someone who actively uses them test this out. Thanks again.

I believe @i-Zaak uses the buckets? He promised me that he would check it once he is back from vacation =)

Were you able to try your changes, and do you want a new release version for them?

I will update doc.md per @appukuttan-shailesh's suggestion, and then I would love to have the changes merged and released.

@xgui3783
Contributor Author

Dear @appukuttan-shailesh and @Cracky5457, doc.md has been updated.

Let's wait for feedback from @i-Zaak on the functionality (or lack thereof) of the buckets.

@i-Zaak
Copy link

i-Zaak commented Jul 22, 2022

Hey, sorry for the delays. I've tested the buckets and datasets, and I think this is definitely going in the right direction!

I've tested the following (see the sketch after this list):

  • opening an existing private bucket (works)
  • listing files (works)
    • we might later consider adding switches for more "file-system-like" behavior, such as omitting the prefix or listing only files with an exact prefix (files in a "directory"), etc.
  • getting the file content (works)
    • I'd also suggest adding a function wrapping the content in BytesIO to simplify things like np.load(io.BytesIO(file_handle.get_content()), allow_pickle=True)
  • opening an unrestricted dataset, and listing files (works)
  • opening a dataset behind HDG: here I ran into issues
    • first I tried without asking for access; an exception is thrown as expected
    • then I ran with request_access=True; access was requested and an email was sent (I clicked the link and gained access)
    • but now I'm getting 401 again - even for the public dataset, which worked a few moments ago o_O
    • I can access all the datasets in the data-proxy gui
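
For reference, here is a minimal sketch of the kind of usage I tested. The token handling and the bucket/listing method names (buckets.get_bucket, ls, get_file) are my loose reading of the README examples and may not match the final API exactly:

import io

import numpy as np
from ebrains_drive import BucketApiClient

# token obtained elsewhere, e.g. from the Collaboratory
client = BucketApiClient(token="<EBRAINS token>")

# open an existing private bucket and list its files
bucket = client.buckets.get_bucket("my-private-bucket")  # bucket name is illustrative
for f in bucket.ls():
    print(f.name)

# get the file content; wrapping it in BytesIO makes it easy to feed into np.load
file_handle = bucket.get_file("path/to/array.npy")
data = np.load(io.BytesIO(file_handle.get_content()), allow_pickle=True)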

PS: there is a missing dependency for version_query

@xgui3783
Contributor Author

Re: missing dependency, that was there before I got here =S

I will investigate the HDG issue again.

Thanks for spending time on this! Greatly appreciated!

@xgui3783
Contributor Author

I have had a chance to look into the code again; re @i-Zaak's comments:

but now I'm getting 401 again - even for the public dataset, which worked a few moments ago o_O

Unfortunately, I am not able to reproduce this, but I have a few suspicions as to how it could have happened:

  • token expiring at the wrong time
  • dataproxy/CSCS object storage being unstable

I added a new commit that checks token expiry and raises when the token has expired. Unfortunately, there's not much I can do about the second one oO

we might later consider adding switches for more "file-system-like" behavior, such as omitting the prefix or listing only files with an exact prefix (files in a "directory"), etc.

Certainly reasonable. I would suggest adding this as an iterative improvement in a separate PR?

I can imagine we could even create a directory abstraction for data-proxy buckets.

I'd also suggest adding a function wrapping the content in BytesIO to simplify things like
np.load(io.BytesIO(file_handle.get_content()), allow_pickle=True)

For this I will need a little help: should we use the content type to decide how to decode the file, or the file extension, or a mixture?

I feel much more comfortable letting the user of the library determine how the files should be parsed.

Another approach might be to maintain a registry of functions: given such a file handle, this is how you decode the file content (rough sketch below).
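
A minimal sketch of what I mean, with purely hypothetical decoder choices keyed by content type or extension:

import io
import json

import numpy as np

# hypothetical registry: map a content type (or file extension) to a decoder function
DECODERS = {
    "application/json": lambda raw: json.loads(raw.decode("utf-8")),
    "text/plain": lambda raw: raw.decode("utf-8"),
    ".npy": lambda raw: np.load(io.BytesIO(raw), allow_pickle=True),
}

def decode_content(raw: bytes, key: str):
    # fall back to returning the raw bytes, so the user can still decide themselves
    return DECODERS.get(key, lambda b: b)(raw)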

PS: there is a missing dependency for version_query

Ping @appukuttan-shailesh: I think this dependency comes from the original Seafile Python client? I think it is only used when building the PyPI artefact and not needed otherwise? Still, it causes issues when users try to install the package via git, for example. Could we explicitly add it as a dependency?

@appukuttan-shailesh
Collaborator

I believe this dependency was added by Axel here.

I think we can/should add it to the deps to avoid such problems.

@i-Zaak

i-Zaak commented Jul 26, 2022

I just revisited my experiments, and it looks like "token expiring at the wrong time" was the cause. However, I suspect that the "wrong time" might not be completely independent of the interaction with HDG. Is the token perhaps reset after the user is granted access to an HDG dataset (I wasn't able to reproduce this)?

Is there a way to get some info on the current token (e.g. expiry date etc.)?

@xgui3783
Contributor Author

Is there a way to get some info on the current token (e.g. expiry date etc.)?

The latest commit now also checks the token expiry and will raise if the token has expired.

Note that this does not check the authenticity of the token, only its expiry.

I don't think new tokens are generated before or after granting access to HDG. Access is controlled purely at the dataproxy level, assuming a valid token is presented.
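
For reference, this is roughly how such an expiry check (without verifying authenticity) can work; it is a sketch, not necessarily the exact code in the commit:

import base64
import json
import time

def is_token_expired(token: str) -> bool:
    # JWTs are three base64url-encoded segments: header.payload.signature
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore the padding stripped by base64url
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    # compare the "exp" claim against the current time; the signature is not checked
    return payload["exp"] < time.time()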

@i-Zaak

i-Zaak commented Jul 26, 2022

My bad - I saw the code but misread it. I've seen that the tokens have a ~2.5 h lifetime, so it is easy to get caught out if one plays with the code for a while...

@appukuttan-shailesh
Collaborator

@xgui3783, @i-Zaak: A token lifetime of ~2.5 hours seems unusually short! I don't have much experience using the dataproxy... does it use the same token as the Seafile client? i.e. would a token obtained via the DriveApiClient be usable with the BucketApiClient? I don't recall the former having problems with token expiry (but it has possibly not been tested over longer durations).

@Cracky5457: is there a known workaround for getting longer-duration tokens?
I know that for JavaScript apps, different authentication processes led to tokens of different lifetimes (from hours to days... see PR and this commit). @xgui3783: can something along these lines be done here?

@i-Zaak

i-Zaak commented Jul 26, 2022

To correct myself, the expiry is actually 30 min; I was getting the token for testing from https://kg-editor.humanbrainproject.eu/ (a bad habit from previous times). A token from the Collaboratory is substantially more durable (almost a week).

@xgui3783
Contributor Author

i.e. would a token obtained via the DriveApiClient be usable with the BucketApiClient?

DriveApiClient does not currently fetch tokens. The user needs to obtain the token explicitly from somewhere (lab.ebrains.eu, kg.ebrains.eu, etc.). If the token has the correct scopes (I think clb.drive:write and clb.drive:read), it will be usable by DriveApiClient.

Side note: it is probably feasible to add a scope check to DriveApiClient.

can something along these lines be done here?

Token expiry depends entirely on how the token is requested; kg creates tokens with a short expiry. Since the client does not fetch tokens, there is not much I can do.

@appukuttan-shailesh
Collaborator

appukuttan-shailesh commented Jul 26, 2022

From what I remember, DriveApiClient can authenticate and get tokens via this.
Isn't that what happens when we do:

client = ebrains_drive.connect('hbp_username', 'password')

The same should be possible for BucketApiClient.

@xgui3783
Contributor Author

From what I remember, DriveApiClient can authenticate and get tokens via this. Isn't that what happens when we do:

client = ebrains_drive.connect('hbp_username', 'password')

The same should be possible for BucketApiClient.

IMO, authentication is out of scope for this PR. But since the auth logic has been moved to ClientBase, I believe the following code should work:

from ebrains_drive import BucketApiClient

client = BucketApiClient(username="user", password="password")

I did not modify the connect method for backwards-compatibility reasons. I could export a method such as connect_dataproxy, but I find using the constructor directly no more verbose than such a method would be.

@appukuttan-shailesh
Collaborator

That's fine. The above is what I was proposing.

Could @i-Zaak check whether the token expiration problem persists when the token is fetched via the above method, i.e. logging in with credentials (username, password), as opposed to providing a token obtained from elsewhere?

@i-Zaak

i-Zaak commented Jul 28, 2022

Hello, yes - the token obtained through username and password has a much longer lifetime (a week-ish). For my use case, I'm mainly looking into using the tokens obtained within the Collaboratory (clb_nb_utils.oauth.get_token) to avoid usage patterns that lead people to save their password in notebooks or REPL history.
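
For example, something along these lines, assuming the token keyword argument accepted by the client base class:

# inside a Collaboratory notebook: reuse the lab-provided token instead of a password
from clb_nb_utils import oauth
from ebrains_drive import BucketApiClient

token = oauth.get_token()
client = BucketApiClient(token=token)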

And one more thing - to be able to call

client = ebrains_drive.connect('hbp_username', 'password')

I had to move the _set_env call to before the super().__init__ call to avoid access to the non-existent iam_url. Bug or feature? :)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/izaak/local_repos/nostromo/ebrains-drive/ebrains_drive/client.py", line 131, in __init__
    super().__init__(username, password, token, env)
  File "/home/izaak/local_repos/nostromo/ebrains-drive/ebrains_drive/client.py", line 28, in __init__
    self._get_token()
  File "/home/izaak/local_repos/nostromo/ebrains-drive/ebrains_drive/client.py", line 48, in _get_token
    self.iam_url+'/auth/realms/hbp/protocol/openid-connect/token',
AttributeError: 'BucketApiClient' object has no attribute 'iam_url'

@xgui3783
Contributor Author

I had to move the _set_env call to before the super().__init__ call to avoid access to the non-existent iam_url. Bug or feature? :)

haha oops... Lemme fix that 😅

@xgui3783
Contributor Author

Just pushed a commit that should fix the iam_url issue.
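
For context, a runnable sketch (illustrative only, not the actual ebrains_drive code) of why the ordering mattered: the base constructor reads self.iam_url while fetching a token, so the subclass has to set it before calling super().__init__().

class ClientBase:
    def __init__(self, username=None, password=None, token=None, env=""):
        if token is None:
            token = self._get_token()  # touches self.iam_url
        self.token = token

    def _get_token(self):
        # stand-in for the real HTTP request to the IAM server
        return f"token-from-{self.iam_url}"


class BucketApiClient(ClientBase):
    def __init__(self, username=None, password=None, token=None, env=""):
        self._set_env(env)  # must run first, otherwise AttributeError: no iam_url
        super().__init__(username, password, token, env)

    def _set_env(self, env):
        suffix = f"-{env}" if env else ""
        self.iam_url = f"https://iam{suffix}.ebrains.eu"


client = BucketApiClient(username="user", password="password")
print(client.token)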

@appukuttan-shailesh
Collaborator

Hello, yes - the token obtained through username and password has a much longer lifetime (a week-ish). For my use case, I'm mainly looking into using the tokens obtained within the Collaboratory (clb_nb_utils.oauth.get_token) to avoid usage patterns that lead people to save their password in notebooks or REPL history.

That's great... as good as it gets, I guess. So the token issue seems resolved 👍

@xgui3783
Contributor Author

@appukuttan-shailesh and @Cracky5457, is there anything else we are waiting on for this PR?

I would love to see this PR merged and, hopefully, a new PyPI package published, so we can update our dependency appropriately.

@appukuttan-shailesh
Collaborator

All OK from my side (I haven't had a chance to test it, but @i-Zaak seems to have tried it out sufficiently).
@Cracky5457 handles the PRs and PyPI releases.

@Cracky5457 Cracky5457 merged commit 2d9f32a into HumanBrainProject:master Aug 15, 2022
@Cracky5457
Collaborator

Thanks to all of you, version 0.5.0 is released. I hope it works well ^^

@xgui3783 xgui3783 deleted the feat_dataproxyMVP branch August 15, 2022 08:33