Create a "Well documented" badge for crates in the top 20% #703

Closed
Tracked by #9
carols10cents opened this issue May 2, 2017 · 6 comments
Labels
C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works

Comments

@carols10cents
Member

Part of RFC 1824, rust-lang/rust#41616.

"Well Documented" badge implementation

For each crate published, in a background job, unpack the crate files and
calculate the ratio of lines of documentation to lines of code as follows:

  • Find the number of lines of documentation in Rust files:
    grep -r "//[!/]" --binary-files=without-match --include=*.rs . | wc -l
  • Find the number of lines in the README file, if specified in Cargo.toml
  • Find the number of lines in Rust files: find . -name '*.rs' | xargs wc -l

We would then add the lines in the README to the lines of documentation,
subtract the lines of documentation from the total lines of code, and divide
the lines of documentation by the lines of non-documentation in order to get
the ratio of documentation to code. Test code (and any documentation within
test code) is part of this calculation.
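
To make the arithmetic concrete, here is a minimal Rust sketch of the calculation described above. It is illustrative only, not existing crates.io code: it approximates the grep pattern by matching /// and //! only at the start of a line, and all names are made up.

    // Count doc comment lines and total lines in one Rust source file.
    // Approximates `grep "//[!/]"` by only matching line-leading markers.
    fn count_lines(source: &str) -> (u64, u64) {
        let mut doc_lines = 0;
        let mut total_lines = 0;
        for line in source.lines() {
            total_lines += 1;
            let trimmed = line.trim_start();
            if trimmed.starts_with("///") || trimmed.starts_with("//!") {
                doc_lines += 1;
            }
        }
        (doc_lines, total_lines)
    }

    // README lines count as documentation; doc comment lines are subtracted
    // from the total so they aren't also counted as code.
    fn documentation_ratio(doc_lines: u64, readme_lines: u64, total_lines: u64) -> f64 {
        let documentation = doc_lines + readme_lines;
        let code = total_lines.saturating_sub(doc_lines);
        if code == 0 {
            return f64::INFINITY; // a crate that is nothing but docs
        }
        documentation as f64 / code as f64
    }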

Any crate getting in the top 20% of all crates would get a badge saying "well
documented".

Please let me know if you have any questions, potential implementers!

@carols10cents
Member Author

Ok, so in thinking about the implementation of this a bit more I'm starting to wonder if this is worth implementing right now. It might be, but I have concerns.

So implementing this as proposed requires looking at all the files in a .crate archive. We currently have a script, render-readmes, that we can run to rerender readmes for all versions of all crates. The script currently downloads the .crate files from s3, untars them, and extracts the README file for rendering. We run this script whenever we change the implementation of the readme rendering, but not on a scheduled basis or anything.

Theoretically, once we implement however we're going to count documentation, that computation shouldn't change (at least I don't expect it to change as often as README rendering does). So I don't think we'll need a "re-assess documentation" script, or at least it won't be as essential as the render-readmes script is.

Ideally, processing a version to count its documentation should mean getting its .crate archive, untarring it, counting the lines of doc and code, and storing those numbers associated with the version, never having to do that again for that version of that crate. I'm not concerned with the resources needed to do that once, in the background.
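
For illustration, that per-version pass might look something like this in Rust, assuming the tar and flate2 crates and the count_lines helper sketched above. None of this is existing crates.io code:

    use std::io::Read;
    use flate2::read::GzDecoder;
    use tar::Archive;

    // Walk a downloaded .crate archive (a gzipped tarball) and total up
    // doc comment lines and overall lines across its Rust sources.
    fn count_crate_docs<R: Read>(crate_tarball: R) -> std::io::Result<(u64, u64)> {
        let mut archive = Archive::new(GzDecoder::new(crate_tarball));
        let (mut doc_total, mut line_total) = (0, 0);
        for entry in archive.entries()? {
            let mut entry = entry?;
            let is_rust =
                entry.path()?.extension().and_then(|e| e.to_str()) == Some("rs");
            if is_rust {
                let mut source = String::new();
                // Skip non-UTF-8 files rather than failing the whole job.
                if entry.read_to_string(&mut source).is_ok() {
                    let (doc, total) = count_lines(&source);
                    doc_total += doc;
                    line_total += total;
                }
            }
        }
        Ok((doc_total, line_total))
    }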

I don't necessarily want this to happen on the publish thread, though, and if this calculation fails, it shouldn't reject the publishing of a crate.

So perhaps what we need is a script that we'd schedule to run every so often (every few hours? once a day?), finds all the versions that don't have this data, computes it, and stores it.
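
That scheduled job could be as simple as the loop below; every helper named here is hypothetical, just to pin down the shape of the work:

    // Hypothetical backfill job: find versions with no stored counts,
    // compute them, and store the result. Failures are logged, never
    // propagated back to a publish.
    fn backfill_doc_counts(conn: &Connection) -> Result<(), Error> {
        for version in versions_missing_doc_counts(conn)? {
            match compute_doc_counts(&version) {
                Ok(counts) => store_doc_counts(conn, &version, &counts)?,
                Err(e) => log_counting_failure(&version, &e),
            }
        }
        Ok(())
    }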

Thinking longer term, though, there are potentially other analyses we'd like to do on the files in a crate, and it'd be nice if we only had to have/untar the .crate files once, then feed that to all the different analyses when a new crate is published. This could even be an entirely separate service! I think npm has this idea of "followers"/microservices that get a stream of data from the registry as packages are published to be able to do whatever with that data, without affecting the main registry. I don't know if we're at the scale to need that yet, but I hope we will be someday-- so does that mean we implement something like that now, so that we don't have to extract this later? I don't know! Tradeoffs!

TL;DR I'm struggling with whether adding the complexity of this feature to crates.io's codebase, or adding the complexity of support for this feature and others like it to crates.io's codebase, is a good idea at this point in crates.io's scale/maturity or not.

@vignesh-sankaran
Contributor

vignesh-sankaran commented Sep 4, 2017

So I've taken a look at this issue in RFC 1824, and the degree to which Rust code is well documented is expressed as a score out of 1, with scores closer to 1 being better documented. The calculation in the RFC was:

(No of doc comment lines + No of README lines) / (LOC in Rust - No of doc comment lines)

@carols10cents, I assume the LOC in Rust includes blank lines? I'm a bit worried that the RFC uses a precision of 2; I feel it should be higher. Maybe we could even use f32 or f64, since both are supported in Postgres.
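
To see what precision 2 costs, take some illustrative numbers: a crate with 200 doc comment lines, a 50-line README, and 1,000 total Rust lines scores (200 + 50) / (1000 − 200) = 0.3125, which rounds to 0.31. Near a percentile cutoff, crates differing only in the third decimal place would become indistinguishable.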

Regarding your points @carols10cents:

So implementing this as proposed requires looking at all the files in a .crate archive. We currently have a script, render-readmes, that we can run to rerender readmes for all versions of all crates. The script currently downloads the .crate files from s3, untars them, and extracts the README file for rendering. We run this script whenever we change the implementation of the readme rendering, but not on a scheduled basis or anything.

Think we could leave the recalculation script out of scope? Or did we want that as a task in this issue?

So perhaps what we need is a script that we'd schedule to run every so often (every few hours? once a day?), finds all the versions that don't have this data, computes it, and stores it.

Me being me, I was thinking we could run this script or service every time we published a crate to keep the metrics up to date ASAP :). Would that be a tad ambitious?

I can already see an issue: if we continuously update the DB with scores while new versions and crates are being published past a certain rate, it becomes difficult or impossible to track in real time which crates are in the top 20%, since newly published scores keep shifting the cutoff, especially for crates sitting just above it.

Thinking longer term, though, there are potentially other analyses we'd like to do on the files in a crate, and it'd be nice if we only had to have/untar the .crate files once, then feed that to all the different analyses when a new crate is published. This could even be an entirely separate service! I think npm has this idea of "followers"/microservices that get a stream of data from the registry as packages are published to be able to do whatever with that data, without affecting the main registry. I don't know if we're at the scale to need that yet, but I hope we will be someday-- so does that mean we implement something like that now, so that we don't have to extract this later? I don't know! Tradeoffs!

I would like to learn how to set up a microservice :). That and perhaps it'd make the crates.io backend less monolithic? However, would this also mean that it becomes harder to set this up on private crates.io instances and local development environments? If this is a problem, maybe we could use Docker or another containerisation technology to manage this for us.

Then again, a senior architect once told me of the YAGNI philosophy (you ain't gonna need it), which could be applied to this microservice idea. What other analyses were you thinking of running on extracted codebases?

npm's followers idea sounds a bit like this issue regarding push infrastructure for crates.io :). Would you say there's a link?

Hmm this task is starting to look bigger than I thought it'd be :). And yes, I'd be happy to take this task up once we've decided what needs to be done :).

@vignesh-sankaran
Contributor

I just finished my last exam :). What were your thoughts on this so far @carols10cents?

@vignesh-sankaran
Contributor

So it's been pointed out to me in IRC by insaneinside that the core logic is prone to being gamed by uploading a project with many empty doc comments, and that a potential way around that is to strip the whitespace out and then check whether anything remains on the line. However, that can itself be circumvented by someone uploading files with gibberish doc comments, e.g. lorem ipsum.
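
That whitespace-stripping countermeasure would be cheap to add to the line counter; a sketch (illustrative only):

    // Count a doc comment line only if something remains after stripping
    // the marker and whitespace, so empty `///` lines don't inflate the score.
    fn is_nonempty_doc_line(line: &str) -> bool {
        let trimmed = line.trim_start();
        for marker in ["///", "//!"] {
            if let Some(rest) = trimmed.strip_prefix(marker) {
                return !rest.trim().is_empty();
            }
        }
        false
    }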

@sgrif
Contributor

sgrif commented Jan 2, 2018

Until we see evidence of crates gaming this, do you think it's worth worrying about at all? Any crate that has a bunch of empty lines as documentation, or lorem ipsum, is going to end up avoided either way.

@carols10cents
Member Author

I don't think we're going to end up doing this.
