Skip to content
This repository has been archived by the owner on Sep 9, 2020. It is now read-only.

gps: source cache: protobuf integration #1127

Merged
merged 3 commits into from
Sep 22, 2017
Merged

gps: source cache: protobuf integration #1127

merged 3 commits into from
Sep 22, 2017

Conversation

jmank88
Copy link
Collaborator

@jmank88 jmank88 commented Sep 5, 2017

What does this do / why do we need it?

This change replaces most of the custom bolt cache encoding with Protocol Buffers, and includes some other optimizations as well (smaller and re-usable constant keys, smaller 'set' keys, Batch updates to bolt).

What should your reviewer look out for in this PR?

Optimizations. Correctness.

Which issue(s) does this PR fix?

Follow up from #1098 in the series towards #431


[[constraint]]
name = "github.com/jmank88/nuts"
version = "0.2.0"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have to vendor this if we don't want to pollute. Usages are few enough to copy in.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm fine with bringing it in

}
Type type = 1;
string value = 2;
//TODO strongly typed Semver field
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're set up to drop whatever sort of Semver message type we come up with into here (and in a forward compatible way, if we want to commit first).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool cool, ok - but right now, things are inoperable with this missing, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is operable - as much as the tests validate anyways. It is using string serialization into the value field for now.

@jmank88
Copy link
Collaborator Author

jmank88 commented Sep 5, 2017

Codeclimate does not like protobuf

@jmank88
Copy link
Collaborator Author

jmank88 commented Sep 8, 2017

CI checks are blocked by golang/go#21804 right now, but we can likely work around it by redefining some methods as functions, if we need to.
Edit: No longer blocked.

@sdboyer
Copy link
Member

sdboyer commented Sep 11, 2017

CI checks are blocked by golang/go#21804 right now, but we can likely work around it by redefining some methods as functions, if we need to.

if we feel the methods are better than the funcs for our design, then i'd prefer we specify an exception to staticcheck for this particular error. if you look back over previous versions of .travis.yml, i know we used to do this for other problems.

Copy link
Member

@sdboyer sdboyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple quick things, not a comprehensive review


[[constraint]]
name = "github.com/jmank88/nuts"
version = "0.2.0"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm fine with bringing it in

}
Type type = 1;
string value = 2;
//TODO strongly typed Semver field
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool cool, ok - but right now, things are inoperable with this missing, right?

}

// LockedProjectMsg is a serializable representation of LockedProject.
type LockedProjectMsg struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and this has to be exported, right? can we rearrange to an internal subpackage or something?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't be a problem to put it in an internal subpackage. That might indirectly dodge the go/types problem too.

@jmank88
Copy link
Collaborator Author

jmank88 commented Sep 12, 2017

@sdboyer Ok if I squash/rebase and force push to clean up the prune mess?
Can wait for a quick look at e78bc49 first if you want.

Edit: Did this

@jmank88
Copy link
Collaborator Author

jmank88 commented Sep 13, 2017

What are the license header rules regarding generated files? Should I find a way to inject a header into source_cache.pb.go? Or do we just exclude *.pb.go files from the license check?

This file does not have a license header, whatever that means: https://github.com/golang/protobuf/blob/master/proto/testdata/test.pb.go

@@ -8,5 +8,5 @@
set -e

go build ./hack/licenseok
find . -path ./vendor -prune -o -type f -name "*.go"\
find . -path ./vendor -prune -o -regex ".+\.pb\.go$" -prune -o -type f -regex ".*\.\(go\|proto\)$"\
Copy link
Collaborator Author

@jmank88 jmank88 Sep 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention was to include *.proto files and exclude the generated *.pb.go files (but not pb.go!). I stumbled my way into this, but perhaps there is a simpler or more readable way to do so.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i have no immediate ideas about better ways of doing this; i'm fine with a slightly wonky regex.

@@ -15,3 +15,4 @@ exclude_paths:
- internal/gps/_testdata
- cmd/dep/testdata
- testdata
- internal/gps/internal/pb
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like **.pb.go (if legal here) would be more general and continue to work as things change.

Copy link
Member

@sdboyer sdboyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few more incremental notes here

@@ -73,31 +74,31 @@ func (c *boltCache) close() error {
//
// 1) Versions buckets hold version keys with revision values:
//
// Bucket: "versions:<timestamp>"
// Keys: "branch:<branch>", "defaultBranch:<branch>", "ver:<version>"
// Bucket: "v<timestamp>"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it may be worth not having sub-buckets for each of these: boltdb/bolt#422 (comment)

for now i tend to prefer uniformity, but it's something we could explore with benchmarks later on, if we feel so inclined.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The primary motivation was being able to delete entire buckets in a single op, rather than prefix scanning and deleting each key. Another was avoiding repeated, long, complex key-prefixes. Perhaps the Revision-versions buckets have few enough entries to yield speed up from prefixes, but most other buckets are quite large and/or would have keys much longer than their values.

}
}
var (
cacheKeyC = []byte("c")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's fine to use the abbreviated key for efficiency reasons, but let's use full words rather than abbreviations in the consts.

Copy link
Collaborator Author

@jmank88 jmank88 Sep 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They don't map 1-1, which is why I opted to name them what they are, not how they are used.

Is this better?

var (
	cacheKeyComment = []byte("c")
	cacheKeyConstraint = cacheKeyComment
	cacheKeyError = []byte("e")
	cacheKeyHash = []byte("h")
	cacheKeyIgnored = []byte("i")
	cacheKeyImport = cacheKeyIgnored
	cacheKeyLock = []byte("l")
	cacheKeyName = []byte("n")
	cacheKeyOverride = []byte("o")
	cacheKeyPTree = []byte("p")
	cacheKeyRequired = []byte("r")
	cacheKeyRevision = cacheKeyRequired
	cacheKeyTestImport = []byte("t")

	cacheRevision = byte('r')
	cacheVersion = byte('v')
)

I wish this all could have dropped into a separate cache package to avoid all the annoying prefixes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Regarding splitting packages, IIRC it would require at least (maybe only?) pushing version stuff into a separate gps/version package to avoid cycles)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yeah, i see. in this case, though, a bit of duplication for the reader's sake is acceptable.

(Regarding splitting packages, IIRC it would require at least (maybe only?) pushing version stuff into a separate gps/version package to avoid cycles)

i've considered such a thing quite a bit, but i don't think we can currently manage it because of all the reliance on hidden properties. i'd be open to refactoring that and just exporting everything - i was probably defending against a bogeyman by not exporting it - but i don't want to make that a yak we have to shave for this.

Copy link
Collaborator Author

@jmank88 jmank88 Sep 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started a separate (from master) experimental conversion, leveraging aliases quite heavily for now to minimize churn, and it's not actually too bad - plus some of the API ends up cleaner. I dropped in a few exported helper functions, rather than exporting all the internals. Not that helper functions are necessarily a better solution than exporting the types, but it proved the concept without too much refactoring.

This can all be considered separately from this PR and caching in general though.

@@ -203,9 +174,13 @@ func cacheGetManifest(b *bolt.Bucket) (RootManifest, error) {
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

premature optimization question: does a cheap call exist that allows us to get the number of items in each of the buckets, so that we can provide a capacity hint on the make calls?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are all sorts of BucketStats available via Stats() but they have to be computed so I didn't explore that path.

For the ones using sequential integer keys (behaving like sets), we could theoretically inspect the last key to get the total items. I don't know how the costs of jumping the cursor around, or scanning backwards, compares to re-allocating the slice.

More generally, and perhaps simply, we could just store size hints in additional top level keys.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

size hints as additional top-level keys is interesting. seems like a basically zero-cost way of doing things - that is, as long as the use pattern continues to be that these buckets tend to get dropped and wholly regenerated, rather than incrementally updated. if incremental updates become more the norm, then i'd be concerned that we're create a write amplifier. that might not be that costly, especially if the workflow is heavily read-oriented (which, of course it should be), but...well, with all these considerations, we'd need benchmarks!

probably best to leave it as-is for now, though, and put a big TODO on it noting the possible future direction to explore.

@sdboyer
Copy link
Member

sdboyer commented Sep 21, 2017

ok yeah, add that TODO then we can merge

@sdboyer
Copy link
Member

sdboyer commented Sep 21, 2017

pre-emptively approving

@jmank88 jmank88 merged commit 59eca1b into golang:master Sep 22, 2017
@jmank88 jmank88 deleted the source_cache_proto branch September 22, 2017 13:55
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants