-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
collaborative cloud computing architecture (backend) #273
Comments
Assigned to @boyers to be cheeky. Obviously, this is a big thing that we'll all have to work on. |
Sweet, I have a ticket! |
It's like an entire business plan inside an issue :P |
Hey! This isn't at all like a business plan — it has actual meaningful content! |
We can treat this as an umbrella issue. A few months from now, an epic checkin will close it! |
The image link is broken. |
@tautologico: fixed. Thanks for the heads up. |
Hi, What's the status of this feature? I'm very interesting in Julia and cloud computing. Thanks Xianyi |
Non-existent. This was pretty much a design document. |
Hi @StefanKarpinski , We may obtain the funding to do some works in cloud computing. If we get the funding, we will improve Julia about this feature. Xianyi |
Excellent, let's keep in touch about that. We're working on / discussion similar things. |
The ipython work addresses this, and the scope of this issue is too wide. |
Here's a possible architecture that's simple and will scale well:
The top half of the diagram is a pretty standard scalable web stack design, with shared-nothing stateless web servers fronted by load balancers that just map HTTP requests randomly to the web servers. The databases and memcached instances are for persistent and transient shared non-computational state: user names, what sessions to map users to, which julia session servers are hosting those sessions, what the state of a session is in case someone wants to join midway though, including input and output history. Chat and other non-computational add-on services are also implemented entirely in the upper half of the diagram. The strict separation between the stateless part, the traditional non-computational state, and the computational state is the key feature of this architecture.
Terminology
Note that the relationship between sessions and users is many-to-many: a computation session may have many users participating in it and a user may be participating in multiple computation sessions at once (through multiple browser windows).
Architecture Details
Users talk to a different, random web server every time — it doesn't matter which one. The user and session metadata in memcache and the databases is used to map them to the same julia session server every time. This design means that if anything goes wrong with a web server or a load balancer, you just depool it and carry on. If a julia session server goes down, we're kind of screwed, but we need to come up with a fault tolerance story on that end anyway. At least with this design, if something in the web stack goes wrong, it doesn't affect anything else and is simple to fix (depool, reboot, spawn new web servers and/or load balancers).
The load balancers and the web servers talk HTTP/HTTPS to the outside world and talk memcache protocol and appropriate database protocols to the user and session state servers. There should probably be a very simple query response protocol between the web servers and the julia session servers. The julia session servers talk to the julia compute nodes using julia serialization and communication protocols that already exist and work. We should push everything but actual julia computation itself into the webs: if someone joins late and needs to know what's going on in the session, that should be cached in the databases and memcached servers and be served to them by the web stack — that kind of thing never hits the julia system. Only when someone actually requests that a new computation be done do we have to go to the julia session servers.
The bottom half is where the computation happens. The julia session servers have one process per session — and multiple users can get mapped to the same session — that's how the collaboration happens. Each server can host multiple sessions by having multiple processes, each corresponding to a single session. The session servers in turn farm data and work out to the compute nodes. Compute node processes belong to at most one session: if a compute node is going to do work for more than one session, which is entirely possible, then it will have at least one process for each of them; it may be beneficial to have multiple processes working for a single session on the same server in order to better utilize multicore machines.
Keep in mind that initially we'll have one/zero of a lot of things in this diagram: load balancers (can be zero initially), web servers (can be one), database nodes (one) and memcached nodes (one). Moreover, several of these things can live on the same machine easily. However, this design would let us easily scale up to serve arbitrary traffic and it's not much harder to build this design than something else that won't scale well. I don't much care what the web servers run, but something standard like ruby on rails + apache seems like a reasonable way to go (please no PHP). The web app itself should be fairly simple. The database should probably be mysql since that's kind of the industry standard and it supports simple, reliable master-master replication so you can get fault tolerance just by having two database nodes that mirror each other. No need to worry about anything like sharding at this point — it's way too early for any of that. For the memcahce nodes, writes should go in parallel to all of them and reads come from one at random. That's a simple scheme and works well in practice.
The text was updated successfully, but these errors were encountered: