Bacalhau project report 20220916
We've all been very focused on improving the examples to make it easier to onboard users onto the project (by giving them examples relevant to their tech stack), run hackathons, and so on.
We settled on using Jupyter notebooks, true to the vision of Donald Knuth's literate programming, to document the examples because they can be:
- Automatically converted into Markdown and published to the documentation site for users to manually follow along
- Run by users locally
- Run by users (in many cases) in a free hosted notebook environment like Colab (this will be great for hackathons as it requires no local dependencies!)
- Run automatically by our CI system against our production environment to ensure that (a) the examples continue to work and (b) there are no regressions or outages in production!
Quadruple win! Phil and Enrico are busy implementing this now, porting all the examples over and writing new ones.
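For a flavour of how the CI half of this could work, here is a minimal sketch using nbclient and nbconvert; the examples/ and docs/examples/ paths, and the idea of exporting straight to Markdown in the same script, are assumptions for illustration rather than our actual pipeline:

```python
import sys
from pathlib import Path

import nbformat
from nbclient import NotebookClient
from nbconvert import MarkdownExporter


def run_and_export(notebook_path: Path, out_dir: Path) -> None:
    nb = nbformat.read(str(notebook_path), as_version=4)
    # Execute the notebook end-to-end against whatever environment the CI
    # job points at; raises CellExecutionError if any cell fails.
    NotebookClient(nb, timeout=600).execute()
    # Convert the executed notebook to Markdown for the docs site.
    body, _resources = MarkdownExporter().from_notebook_node(nb)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / notebook_path.with_suffix(".md").name).write_text(body)


if __name__ == "__main__":
    failures = []
    for path in sorted(Path("examples").rglob("*.ipynb")):
        try:
            run_and_export(path, Path("docs/examples"))
        except Exception as exc:  # keep going so CI reports every broken example
            failures.append((path, exc))
    for path, exc in failures:
        print(f"FAILED: {path}: {exc}", file=sys.stderr)
    sys.exit(1 if failures else 0)
```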
Prash took a big step forward by getting the Lotus/Filecoin dev environment working (we had previously been having issues with it) and then getting all our Lotus integration tests running against a real Lotus environment. This is awesome, because it means that when the first SP wants to do real Filecoin integration for writing Bacalhau results, we'll be confident that it should work the first time.
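As an illustration only (this is not our test code), gating an integration test on a real Lotus endpoint might look roughly like this in pytest; the LOTUS_API and LOTUS_TOKEN environment variable names are made up, while Filecoin.Version is a standard Lotus JSON-RPC method:

```python
import json
import os
import urllib.request

import pytest

LOTUS_API = os.environ.get("LOTUS_API")      # e.g. "http://127.0.0.1:1234/rpc/v0"
LOTUS_TOKEN = os.environ.get("LOTUS_TOKEN")  # API token from the Lotus node, if required


@pytest.mark.skipif(not LOTUS_API, reason="no real Lotus endpoint configured")
def test_lotus_is_reachable():
    # Call the Filecoin.Version JSON-RPC method to confirm the suite is
    # talking to a live Lotus daemon before running the real integration tests.
    payload = json.dumps(
        {"jsonrpc": "2.0", "method": "Filecoin.Version", "params": [], "id": 1}
    ).encode()
    headers = {"Content-Type": "application/json"}
    if LOTUS_TOKEN:
        headers["Authorization"] = f"Bearer {LOTUS_TOKEN}"
    req = urllib.request.Request(LOTUS_API, data=payload, headers=headers)
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = json.load(resp)
    assert "result" in body and "Version" in body["result"]
```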
Walid shipped a very awesome CloudWatch dashboard for continuously testing the Bacalhau network. He also landed Slack integration: we now have a #bacalhau-monitoring channel on the Filecoin Slack that pings if the canaries start failing.
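Not the actual canary implementation, but a minimal sketch of the kind of custom metric a canary run might publish to CloudWatch (the Bacalhau/Canary namespace and metric names are invented); a CloudWatch alarm on the Success metric can then notify an SNS topic that is wired into Slack:

```python
import boto3  # assumes AWS credentials are configured in the environment

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")  # region is illustrative


def report_canary_result(succeeded: bool, latency_seconds: float) -> None:
    # Publish one data point per canary run; an alarm on the Success metric
    # fires when runs start failing and forwards the alert towards Slack.
    cloudwatch.put_metric_data(
        Namespace="Bacalhau/Canary",  # hypothetical namespace
        MetricData=[
            {"MetricName": "Success", "Value": 1.0 if succeeded else 0.0, "Unit": "Count"},
            {"MetricName": "Latency", "Value": latency_seconds, "Unit": "Seconds"},
        ],
    )
```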
We also had an opportunity to test out the canaries: Walid came online on Thursday morning saying they were failing! You can follow the whole incident here on Slack, but in short we found three issues as a result of the outage, all of which are now prioritized near the top of the backlog and will be addressed soon:
- We have a bug in our startup scripts: they don't re-mount the data dir on reboot
- Node 0 rebooted for some reason
- libp2p isn’t actually resilient when N=3
The outage was repaired quickly as well, and the whole team picked up some tricks for diagnosing production issues.
We broke up some of the near-term work into user-facing milestones:
- (better-network-1) UX bugs in the current Bacalhau network will be solved.
- (better-examples-1) Better and more consistent examples of how to use the network will be published.
- (bigger-network-1) More servers will be added to the current Bacalhau network, with more CPU and memory, so the network will be a more tempting place to run workloads.
- (dag) Extend the network to support jobs described as pipelines, with the output of one job feeding into the input of the next (see the sketch after this list).
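To make the (dag) milestone a little more concrete, here is a toy sketch of a pipeline description and a topological walk over it; the stage names and the run_stage placeholder are hypothetical and not Bacalhau's actual job spec:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical pipeline description: each stage lists the stages whose
# output it consumes. Purely illustrative, not a committed job format.
pipeline = {
    "preprocess": [],              # no upstream inputs
    "train":      ["preprocess"],  # consumes the preprocess output
    "report":     ["train"],       # consumes the train output
}


def run_stage(name: str, input_cids: list[str]) -> str:
    # Placeholder for submitting a job and waiting for its result; returns
    # the output CID that downstream stages would mount as input.
    print(f"running {name} with inputs {input_cids}")
    return f"cid-of-{name}-output"


outputs: dict[str, str] = {}
for stage in TopologicalSorter(pipeline).static_order():
    outputs[stage] = run_stage(stage, [outputs[dep] for dep in pipeline[stage]])
```

The point of the sketch is only the shape of the feature: the scheduler walks the DAG in dependency order and threads output CIDs from one job into the inputs of the next.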
We then spent cycles figuring out some initial solutions to a puzzle we've had for a while: all of the big, hard projects to get us from here to a token issuance are long and, from a user-facing perspective, only "pay out" after the token is launched (SPs gain an incentive to run the software, users can pay for compute). We can get creative, though (please offer more suggestions/ideas & guidance, folks who have done this before!):
- (simulator) We’re writing a simulator to develop the verification protocol. We could run a public version of the simulator API server and encourage people to write clients to try to break the protocol!
- (bug-bounty) Similar to the above, but additionally offer bug bounties to anyone who can write a set of clients that breaks the network under standard BFT assumptions (up to 33% of nodes misbehaving).
- (external-incentives) Incentivize service providers to sign up with the promise of future tokens, perhaps a fixed amount per server per month that they connect to the network and keep running.
- (centralized) Running a centralized, paid compute network is easier than running a decentralized one. Our database could start out as a simple Postgres database on GCP rather than a blockchain. We can still exercise the verification protocol, allowing arbitrary compute nodes, requestor nodes and clients to show up and attempt to break the protocol, while keeping accounting centralized (initially!); see the sketch after this list.
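To illustrate the centralized option, here is a minimal sketch of the kind of Postgres ledger that could stand in for on-chain accounting at first; the job_ledger schema and the DATABASE_URL environment variable are purely illustrative:

```python
import os

import psycopg2  # assumes a plain Postgres instance, e.g. Cloud SQL on GCP

# One row per job: who asked for it, who ran it, whether verification
# passed, and what it cost. The schema is a sketch, not a committed design.
DDL = """
CREATE TABLE IF NOT EXISTS job_ledger (
    job_id        TEXT PRIMARY KEY,
    client_id     TEXT NOT NULL,
    compute_node  TEXT NOT NULL,
    verified      BOOLEAN NOT NULL DEFAULT FALSE,
    price_credits NUMERIC NOT NULL CHECK (price_credits >= 0)
);
"""


def record_job(conn, job_id, client_id, compute_node, verified, price_credits):
    # Upsert so a later run of the verification protocol can flip `verified`.
    with conn, conn.cursor() as cur:
        cur.execute(DDL)
        cur.execute(
            "INSERT INTO job_ledger VALUES (%s, %s, %s, %s, %s) "
            "ON CONFLICT (job_id) DO UPDATE SET verified = EXCLUDED.verified",
            (job_id, client_id, compute_node, verified, price_credits),
        )


if __name__ == "__main__":
    conn = psycopg2.connect(os.environ["DATABASE_URL"])  # hypothetical env var
    record_job(conn, "job-123", "client-abc", "node-0", True, 1.5)
```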
These are in addition to the more engineering-focused tasks in the previously published Master Plan - Part 2.
Work continues in the background on docs, smart contracts and the verification protocol simulator. These are bigger projects and will take a few more weeks to bear fruit.
Up next:
- Publish the first pass of examples in the new format and make them runnable (e.g. Colab links)
- Docs for SPs
- Smart contracts
- Verification protocol simulator
- Everything else in the prioritized backlog!