collaborative cloud computing architecture (backend) #273

StefanKarpinski · 2011-12-02T21:06:54Z

Here's a possible architecture that's simple and will scale well:

The top half of the diagram is a pretty standard scalable web stack design, with shared-nothing stateless web servers fronted by load balancers that just map HTTP requests randomly to the web servers. The databases and memcached instances are for persistent and transient shared non-computational state: user names, what sessions to map users to, which julia session servers are hosting those sessions, and what the state of a session is (including input and output history) in case someone wants to join midway through. Chat and other non-computational add-on services are also implemented entirely in the upper half of the diagram. The strict separation between the stateless part, the traditional non-computational state, and the computational state is the key feature of this architecture.

Terminology

A session is a single distributed Julia computation in which multiple users may participate.
A user is a single browser/person connected to one or many ongoing computation sessions.

Note that the relationship between sessions and users is many-to-many: a computation session may have many users participating in it and a user may be participating in multiple computation sessions at once (through multiple browser windows).
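To make the many-to-many relationship concrete, here is a minimal Python sketch of the bookkeeping. The `Membership` class, its method names, and the user and session ids are all made up for illustration; nothing like this exists in the codebase yet.

```python
from collections import defaultdict


class Membership:
    """Tracks which users are in which sessions (many-to-many)."""

    def __init__(self):
        self.sessions_of = defaultdict(set)  # user id -> session ids
        self.users_of = defaultdict(set)     # session id -> user ids

    def join(self, user, session):
        self.sessions_of[user].add(session)
        self.users_of[session].add(user)

    def leave(self, user, session):
        self.sessions_of[user].discard(session)
        self.users_of[session].discard(user)


m = Membership()
m.join("alice", "s1")
m.join("bob", "s1")    # two users collaborating in one session
m.join("alice", "s2")  # one user in two sessions (two browser windows)
```

In the real system this mapping would presumably live in the databases and memcached rather than in process memory.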

Architecture Details

Users talk to a different, random web server every time; it doesn't matter which one. The user and session metadata in memcached and the databases is used to map each user to the same julia session server every time. This design means that if anything goes wrong with a web server or a load balancer, you just depool it and carry on. If a julia session server goes down, we're kind of screwed, but we need to come up with a fault tolerance story on that end anyway. At least with this design, if something in the web stack goes wrong, it doesn't affect anything else and is simple to fix (depool, reboot, spawn new web servers and/or load balancers).
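The routing idea can be sketched roughly as follows. This is only an illustration: the server names are invented, and the `SESSION_HOST` dict stands in for the metadata that would really be kept in memcached and the databases.

```python
import random

# Shared metadata, standing in for memcached + the database (assumption).
SESSION_HOST = {"s1": "julia-session-1", "s2": "julia-session-2"}

WEB_SERVERS = ["web-1", "web-2", "web-3"]


def load_balance(pool):
    """Pick any pooled web server at random; they are interchangeable."""
    return random.choice(pool)


def route(session_id, pool=WEB_SERVERS, hosts=SESSION_HOST):
    """Return (web server, julia session server) for one request."""
    web = load_balance(pool)
    # The web server keeps no state of its own: it looks the host up
    # in the shared metadata on every request.
    return web, hosts[session_id]


def depool(pool, bad):
    """Drop a failed web server; nothing else needs to change."""
    return [w for w in pool if w != bad]
```

Whichever web server handles the request, the session always resolves to the same julia session server, and removing a web server from the pool leaves that mapping untouched.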

The load balancers and the web servers talk HTTP/HTTPS to the outside world and talk memcache protocol and appropriate database protocols to the user and session state servers. There should probably be a very simple query-response protocol between the web servers and the julia session servers. The julia session servers talk to the julia compute nodes using julia serialization and communication protocols that already exist and work. We should push everything but actual julia computation itself into the web stack: if someone joins late and needs to know what's going on in the session, that should be cached in the databases and memcached servers and be served to them by the web stack. That kind of thing never hits the julia system. Only when someone actually requests that a new computation be done do we have to go to the julia session servers.
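Here is a rough sketch of that split between cached history and actual computation. The `SessionFrontend` class and the `fake_session_server` function are hypothetical; the real query-response protocol is not specified anywhere yet, so a plain callable stands in for it.

```python
class SessionFrontend:
    """Web-stack view of one session: history is cached, compute is remote."""

    def __init__(self, evaluate):
        self.evaluate = evaluate  # callable that forwards to the session server
        self.history = []         # cached (input, output) pairs

    def catch_up(self):
        """Served to late joiners from the cache; never touches Julia."""
        return list(self.history)

    def submit(self, code):
        """Only a new computation goes to the julia session server."""
        output = self.evaluate(code)
        self.history.append((code, output))
        return output


calls = []


def fake_session_server(code):
    # Stand-in for the real query-response protocol, which is unspecified.
    calls.append(code)
    return f"result of {code}"


frontend = SessionFrontend(fake_session_server)
frontend.submit("1 + 1")
late_joiner_view = frontend.catch_up()
```

A late joiner calling `catch_up` gets the full history without generating any traffic to the session server; only `submit` does.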

The bottom half is where the computation happens. The julia session servers have one process per session, and multiple users can get mapped to the same session: that's how the collaboration happens. Each server can host multiple sessions by having multiple processes, each corresponding to a single session. The session servers in turn farm data and work out to the compute nodes. Compute node processes belong to at most one session: if a compute node is going to do work for more than one session, which is entirely possible, then it will have at least one process for each of them. It may also be beneficial to have multiple processes working for a single session on the same server in order to better utilize multicore machines.
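The ownership rule for compute node processes could be modeled something like this. The `WorkerProcess` and `ComputeNode` names are invented for the sketch, and it only models the bookkeeping, not real process spawning.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WorkerProcess:
    node: str
    session: Optional[str] = None  # at most one session per process


@dataclass
class ComputeNode:
    name: str
    processes: list = field(default_factory=list)

    def spawn_for(self, session, count=1):
        """Start `count` new processes on this node, all owned by `session`."""
        new = [WorkerProcess(self.name, session) for _ in range(count)]
        self.processes.extend(new)
        return new

    def sessions(self):
        """The set of sessions this node is currently doing work for."""
        return {p.session for p in self.processes}


node = ComputeNode("compute-1")
node.spawn_for("s1", count=2)  # two processes for one session (multicore)
node.spawn_for("s2")           # same node, different session, separate process
```

A node can serve several sessions, but no single process is ever shared between them.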

Keep in mind that initially we'll have zero or one of a lot of things in this diagram: load balancers (can be zero initially), web servers (can be one), database nodes (one) and memcached nodes (one). Moreover, several of these things can easily live on the same machine. However, this design would let us easily scale up to serve arbitrary traffic, and it's not much harder to build this design than something else that won't scale well. I don't much care what the web servers run, but something standard like Ruby on Rails + Apache seems like a reasonable way to go (please no PHP). The web app itself should be fairly simple. The database should probably be MySQL since that's kind of the industry standard and it supports simple, reliable master-master replication, so you can get fault tolerance just by having two database nodes that mirror each other. No need to worry about anything like sharding at this point; it's way too early for any of that. For the memcached nodes, writes should go in parallel to all of them and reads come from one at random. That's a simple scheme and works well in practice.
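The write-to-all, read-from-one scheme for the memcached nodes might look roughly like this. Plain dicts stand in for memcached nodes, the `ReplicatedCache` class and the key name are made up, and the writes are sequential here only to keep the sketch short.

```python
import random


class ReplicatedCache:
    """Write to every node, read from one picked at random."""

    def __init__(self, nodes):
        self.nodes = nodes  # each node is a dict standing in for memcached

    def set(self, key, value):
        for node in self.nodes:  # sequential here; parallel in practice
            node[key] = value

    def get(self, key, default=None):
        return random.choice(self.nodes).get(key, default)


cache = ReplicatedCache([{}, {}, {}])
cache.set("session:s1:host", "julia-session-1")
```

One thing the sketch leaves out: a node that missed a write would serve stale or missing data, so this presumably relies on cached values being rebuildable from the database.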

StefanKarpinski · 2011-12-02T21:07:48Z

Assigned to @boyers to be cheeky. Obviously, this is a big thing that we'll all have to work on.

stepchowfun · 2011-12-02T21:21:15Z

Sweet, I have a ticket!

JeffBezanson · 2011-12-02T21:52:32Z

It's like an entire business plan inside an issue :P

StefanKarpinski · 2011-12-02T21:59:43Z

Hey! This isn't at all like a business plan — it has actual meaningful content!

ViralBShah · 2011-12-10T05:16:54Z

We can treat this as an umbrella issue. A few months from now, an epic checkin will close it!

tautologico · 2012-05-14T15:13:44Z

The image link is broken.

StefanKarpinski · 2012-05-14T21:06:33Z

@tautologico: fixed. Thanks for the heads up.

xianyi · 2012-08-24T15:22:54Z

Hi,

What's the status of this feature? I'm very interested in Julia and cloud computing.

Thanks

Xianyi

StefanKarpinski · 2012-08-24T15:30:14Z

Non-existent. This was pretty much a design document.

xianyi · 2012-08-24T15:38:52Z

Hi @StefanKarpinski,

We may obtain funding to do some work in cloud computing. If we get the funding, we will work on improving this feature in Julia.

Xianyi

StefanKarpinski · 2012-08-24T15:42:01Z

Excellent, let's keep in touch about that. We're working on / discussing similar things.

ViralBShah · 2013-03-16T09:05:40Z

The IPython work addresses this, and the scope of this issue is too wide.

ghost assigned stepchowfun Dec 2, 2011

ViralBShah closed this as completed Dec 10, 2011

ViralBShah reopened this Dec 10, 2011

ViralBShah mentioned this issue Jan 8, 2012

Web REPL plus EC2 #216

Closed

Keno mentioned this issue Mar 14, 2012

web interface: add possibility to save/restore session? #579

Closed

JeffBezanson closed this as completed Mar 16, 2013

StefanKarpinski pushed a commit that referenced this issue Feb 8, 2018

Add join, escape_string, and unescape_string (#273)

188de7a

KristofferC added a commit that referenced this issue May 9, 2018

no longer collect dependencies from the registry for fixed packages (#…

945a852

…273)

KristofferC added a commit that referenced this issue May 9, 2018

no longer collect dependencies from the registry for fixed packages (#…

e0d5ea6

…273)
