Proposal for Package Ranking #320
Comments
I'm in favor of adding more signal to godoc.org's ranking algorithm. As such, I'm generally in favor of this proposal. Let me comment on a few of the proposed metrics, and how they might be computed:
This requires integration with a continuous integration system. As such, it's probably the hardest one to gather data for. Defer until last. (Also, this is easily gamed; just add a function 100k lines long and write one test to invoke it.)
Sounds good, if we're not doing this already (can't recall). I implemented this in a separate project a long time ago.
How do we measure repository downloads? GitHub stars seem like good signals. Forks are ambiguous (@garyburd raises some interesting questions).
We already have this data. Seems like a no-brainer.
We're not currently gathering this data, but it's probably worth doing. When we decide to embark on implementing any of these specific metrics, please create a separate issue for that particular metric so that we can nail down the design before implementation. |
@garyburd That's a good point about low activity on an established, high-quality project. Do you agree with @adg's suggestion that GitHub stars are a reasonable proxy for "this is a good project"? One of the odd things about GitHub stars is that they're cumulative, with no decay function. You can star a project at a point when it is well maintained. A year goes by, it's been abandoned, but your star is still there. Hmmm. I'm having a hard time thinking of a better replacement. @adg's spot on about the percent test coverage metric. That would require real compute time (and real money). That being said, it would be one of the better signals of project quality. Regarding download counts: I do not think there is any real way of measuring this. Even the GitHub repos API does not report it. You do get stargazers_count though. |
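As a rough illustration of the star signal mentioned above, here is a minimal sketch that decodes the `stargazers_count` and `forks_count` fields from a GitHub repos API response body. The `repoInfo` type, the `parseRepo` helper, and the sample JSON are assumptions made for this example; they are not part of gddo.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// repoInfo holds the two popularity fields discussed in this thread.
// The type is hypothetical; only the JSON field names come from the GitHub API.
type repoInfo struct {
	Stars int `json:"stargazers_count"`
	Forks int `json:"forks_count"`
}

// parseRepo decodes a repos API response body into repoInfo.
func parseRepo(body []byte) (repoInfo, error) {
	var r repoInfo
	err := json.Unmarshal(body, &r)
	return r, err
}

func main() {
	// Made-up sample payload, trimmed to the fields of interest.
	sample := []byte(`{"name":"example","stargazers_count":42,"forks_count":7}`)
	r, err := parseRepo(sample)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(r.Stars, r.Forks)
}
```

Note that the count is cumulative, so on its own it says nothing about whether the project is still maintained.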
As you pointed out there are two goals for developers looking for packages.
Towards this end, the rankings based on imports and tests help. As far as tests go, I think the criteria need to be fairly liberal: is there at least one test, and does it cover at least 5% of the code? I don't think we should expect the percentage to be more than a single digit. I think the more important metric is usage in other projects. For quality, I think the percentage of documentation for the package and its public members is important, and the presence of examples, especially testable examples, even more so. We could also come up with a ranking based on the output of go fmt, go vet, and other code linting tools. But there is another audience as well: the package authors themselves.
A part of the original proposal document that went towards building these requests was left out by mistake.
Archive Expired Packages
An archive section for libraries that have been inactive for a given period of time and are not used by a number of other active projects. This would help keep the listing to only active projects. Inactive is defined as "Most recent commit > 365 days ago" and "number of imports < 10". The 365 and 10 are just placeholders and can be adjusted if needed. The idea here is that we also need to group or filter packages, so that there isn't an overwhelming amount of choice. I would like to reference #90 as another issue that is trying to solve the same problem: too many packages to see the forest for the trees. |
Note that this is related to #52 |
And #172 |
This is a terrific idea, @garyburd:
Implement the proposed filtering criteria, then test against the search results for a common term with a large result set, such as "web" or "sql" or "middleware". Generate side-by-side diff-able output with and without the filter. Once we're confident that the filtering is "fair", then move onto changes in the ranking. Again, with the intent of being able to compare current vs proposed ranking so we can vet the diffs. |
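One way to produce the diff-able comparison described above is to list every package whose position differs between the current and proposed result lists. The sketch below is an assumption about how such a comparison could look; the `change` type and `compareRankings` function are hypothetical names, not existing gddo code.

```go
package main

import "fmt"

// change describes how one package moved between two result lists.
// A Before or After of -1 means the package is absent from that list.
type change struct {
	Path          string
	Before, After int
}

// compareRankings returns one entry per package whose position differs
// between current and proposed, including packages that were filtered
// out or newly added.
func compareRankings(current, proposed []string) []change {
	pos := make(map[string]int, len(proposed))
	for i, p := range proposed {
		pos[p] = i
	}
	seen := make(map[string]bool, len(current))
	var out []change
	for i, p := range current {
		seen[p] = true
		j, ok := pos[p]
		if !ok {
			j = -1 // dropped by the filter
		}
		if j != i {
			out = append(out, change{Path: p, Before: i, After: j})
		}
	}
	for j, p := range proposed {
		if !seen[p] {
			out = append(out, change{Path: p, Before: -1, After: j})
		}
	}
	return out
}

func main() {
	current := []string{"a/sql", "b/sql", "c/sql"}
	proposed := []string{"b/sql", "a/sql"}
	for _, c := range compareRankings(current, proposed) {
		fmt.Printf("%s: %d -> %d\n", c.Path, c.Before, c.After)
	}
}
```

Running it once with only the filter applied and once with the new ranking would give two reports that can be reviewed separately.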
@garyburd Very much agreed. |
@garyburd I started building the command line: |
I think that's a great idea. One observation, though: it will have false positives for commands or libraries meant to be used at go generate time, since they're typically imported from other packages in |
I ran the tool that checks for expired packages on a database dump from 2015-10-01. It analyzed 132277 packages, which took 36h45m because of GitHub's rate limit policies. The results:
The unexpected status codes from GitHub are probably rate limit issues (403 Forbidden), which could be solved by adjusting the token bucket values or by analyzing the HTTP response headers from GitHub. The tool's algorithm currently makes two checks to identify whether a package should be archived:
We also got 6 connection timeouts. |
Indeed, the list of "should be archived" is crucial here. We need to sanity check it to ensure that no legitimately keep-worthy projects would get the archive treatment :-) @rafaeljusto, can you pastebin or gist it for us? |
I checked some of them to see if they were modified in the last 2 years, but I didn't check whether they were referenced by other packages; I'm trusting the gddo database information for that. Here is the list of packages to archive: |
Sure! I've created another program that reports the packages with score 0 (zero) from an input list. So, from the list of packages that should be archived we have:
The list of packages with a score that should be archived is here: |
Working on it. =) |
I've created a new filter that checks for forks with a maximum of 2 commits in the week after the fork date; I called them "fast forks". On the list of scored packages that should be archived, when applying this filter we got:
The list of packages after applying these 2 filters can be found below: |
I just spot-checked about 50 items from the new list, and I didn't see any false-positives (project that would have been archived but should not have been). So far, LGTM. |
Sounds like we're in general agreement that the new fast-fork filter is working well as an identifier of packages that should be considered "archived" and therefore not displayed in search results. Using that filtered set as the base, it makes sense to begin experimenting with the rankings (the primary goal of this proposal). Given that gddo already takes import counts into account, how about an experiment to apply the stored page view counts for a small set of common search terms ("sql" and "middleware"), with outputs that allow us to diff/compare the ordering of:
|
I ran the tool again, this time replacing the 2-year condition with the fast-fork check. It analyzed 132277 packages in 4h0m42s (we got many cache hits). The results:
We also got 9 connection timeouts. The new list of packages that could be archived is below: The number of packages to archive increased by 11.39% compared with the first result. I think we could apply both rules: we archive a package if it is a fast fork or has gone more than two years with no changes, provided that no other packages reference it. There are other cases where we got a 404 from the GitHub API that we could also archive, but these are only a few cases. I still need to work on the tool to avoid the rate limit and decrease the "unexpected status code from Github" percentage. PS: I will be offline for a week (hello vacations!) |
This writeup is relevant for this discussion: https://github.com/mikeal/go-stats/blob/master/README.md |
After a discussion, we decided that many packages in the gddo database could be suppressed from the search results. We are currently adopting 2 rules to suppress a package: 1. The package's project wasn't modified in the last 2 years and no other projects reference it. 2. The package's project is a fork with a small number of commits near the fork date (what we called a fast fork). A periodic job will check all the packages from the repository and send queries to the GitHub API to determine their current state. See golang#320
@carlisia I don't know why package count should be compared with other communities. It shouldn't matter. Some Go projects on GitHub divide their code into many small packages. That is the Go way to do it and helps make things go-gettable. In terms of ranking packages on godoc, you bring up a good point: we can use an "imported by" metric to rank packages. That could remove some of the noise added by some of these sub-packages. |
Overview
Rank by Objective Quality Standards
A mechanism to transparently and objectively rank projects by quality level.
We're proposing to avoid the use of a user-based rating or feedback system. It would require the creation and maintenance of accounts and would be subject to abuse. Instead, we propose using a published set of project quality standards that are not subject to simple manipulation:
This could use (via API or similar) the system implemented by GoReportCard, for example:
The results of the ranking score would be used as the primary sort mechanism when browsing packages by search or by category. The information would also be linked/displayed at the top of each documentation page on Godoc.org.
References:
Examples of quality assessment tools: https://medium.com/@jgautheron/quality-pipeline-for-go-projects-497e34d6567
Include Ranking in Display
Once the ranking system is in place, include a summary of the score (and link to view details/explanation) on both the documentation pages and in a second column on search results.
Contributors to this proposal
@carlisia
@gdey
@rafaeljusto