Re-architecture Notes

Key Documents

The Motivation

The master branch of Auto-GPT is an organically grown amalgamation of many thoughts and ideas about agent-driven autonomous systems. It lacks clear abstraction boundaries, suffers from global and poorly encapsulated state, and is generally hard to make effective changes to.
And research in the field is moving fast, so we want to be able to try new ideas quickly.

Initial Planning

A large group of maintainers and contributors met to discuss the architectural challenges associated with the existing codebase. Many much-desired features (building new user interfaces, enabling project-specific agents, enabling multi-agent systems) are bottlenecked by the global state in the system. We discussed the tradeoffs between an incremental system transition and a big breaking version change and decided to go for the breaking version change. We justified this by saying:

  • We can maintain, in essence, the same user experience as now even with a radical restructuring of the codebase
  • Our developer audience is struggling to use the existing codebase to build applications and libraries of their own, so this breaking change will largely be welcome.

Primary Goals

  • Separate the AutoGPT application code from the library code.
  • Remove global state from the system
  • Allow for multiple agents per user (with facilities for running simultaneously)
  • Create a serializable representation of an Agent
  • Encapsulate the core systems in abstractions with clear boundaries.

Secondary Goals

  • Use existing tools to ditch any unnecessary cruft in the codebase (document loading, JSON parsing, anything easier to replace than to port).
  • Bring in the core agent loop updates being developed simultaneously by @Pwuts

The Agent Subsystems

Configuration

We want a lot of things from a configuration system. We lean heavily on it in the master branch to allow several parts of the system to communicate with each other.
Recent work has made it so that the config is no longer a singleton object that is materialized from the import state, but it's still treated as a god object containing all information about the system and critically allowing any system to reference configuration information about other parts of the system.

What we want

  • It should still be reasonable to collate the entire system configuration in a sensible way.
  • The configuration should be validatable and validated.
  • The system configuration should be a serializable representation of an Agent.
  • The configuration system should provide a clear (albeit very low-level) contract about user-configurable aspects of the system.
  • The configuration should reasonably manage default values and user-provided overrides.
  • The configuration system needs to handle credentials in a reasonable way.
  • The configuration should be the representation of some amount of system state, like API budgets and resource usage. These aspects are recorded in the configuration and updated by the system itself.
  • Agent systems should have encapsulated views of the configuration. E.g. the memory system should know about memory configuration but nothing about command configuration.
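The encapsulated-views requirement can be sketched with plain dataclasses (the real system would use pydantic for validation; `MemoryConfiguration`, `CommandConfiguration`, and `MemorySystem` here are hypothetical names for illustration):

```python
from dataclasses import asdict, dataclass, field

@dataclass
class MemoryConfiguration:
    backend: str = "json_file"
    index_name: str = "agent-memory"

@dataclass
class CommandConfiguration:
    enabled_commands: list = field(default_factory=lambda: ["browse_web", "write_file"])

@dataclass
class SystemConfiguration:
    # The whole-system view: still collatable and serializable in one place.
    memory: MemoryConfiguration = field(default_factory=MemoryConfiguration)
    commands: CommandConfiguration = field(default_factory=CommandConfiguration)

class MemorySystem:
    # Receives only its own slice of the configuration: it can see memory
    # settings but knows nothing about command configuration.
    def __init__(self, config: MemoryConfiguration):
        self.config = config

config = SystemConfiguration()
memory = MemorySystem(config.memory)
serialized = asdict(config)  # full system config as basic Python types
```

The point of the sketch is the constructor signature: the memory system is handed a `MemoryConfiguration`, not the god object.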

Workspace

There are two ways to think about the workspace:

  • The workspace is a scratch space for an agent where it can store files, write code, and do pretty much whatever else it likes.
  • The workspace is, at any given point in time, the single source of truth for what an agent is. It contains the serializable state (the configuration) as well as all other working state (stored files, databases, memories, custom code).

In the existing system there is one workspace. And because the workspace holds so much agent state, that means a user can only work with one agent at a time.
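Giving each agent its own workspace directory is what unlocks multiple agents per user. A minimal sketch (the layout and `agent_settings.json` filename are assumptions, not the final design):

```python
import json
import tempfile
from pathlib import Path

def make_workspace(root: Path, agent_name: str) -> Path:
    """Create a per-agent workspace so multiple agents can coexist."""
    workspace = root / "agents" / agent_name / "workspace"
    workspace.mkdir(parents=True, exist_ok=True)
    # The serializable agent state lives alongside the scratch space,
    # so the directory is the single source of truth for the agent.
    settings_file = workspace.parent / "agent_settings.json"
    settings_file.write_text(json.dumps({"agent_name": agent_name}))
    return workspace

root = Path(tempfile.mkdtemp())
ws_a = make_workspace(root, "agent_a")
ws_b = make_workspace(root, "agent_b")  # a second agent, fully isolated
```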

Memory

The memory system has been under extremely active development. See #3536 and #4208 for discussion and work in the master branch. The TL;DR is that we noticed a couple of months ago that the Agent performed worse with permanent memory than without it. Since then the knowledge storage and retrieval system has been redesigned and partially implemented in the master branch.

Planning/Prompt-Engineering

The planning system is the system that translates user desires/agent intentions into language model prompts. In the course of development, it has become pretty clear that Planning is the wrong name for this system.

What we want

  • It should be incredibly obvious what's being passed to a language model, when it's being passed, and what the language model response is. The landscape of language model research is developing very rapidly, so building complex abstractions between users/contributors and the language model interactions is going to make it very difficult for us to nimbly respond to new research developments.
  • Prompt-engineering should ideally be exposed in a parameterizable way to users.
  • We should, where possible, leverage OpenAI's new function-calling API to get outputs in a standard machine-readable format and avoid the deep pit of parsing JSON (and fixing unparsable JSON).
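To make the JSON-parsing point concrete: with function calling, an ability is described to the model as a JSON-schema function spec, and the model replies with structured arguments rather than free-form text. A sketch (the `write_file` function is a hypothetical example, and `raw_arguments` stands in for an actual API response):

```python
import json

# A hypothetical ability described in the function-calling schema format.
write_file_function = {
    "name": "write_file",
    "description": "Write text content to a file in the agent workspace.",
    "parameters": {
        "type": "object",
        "properties": {
            "filename": {"type": "string"},
            "contents": {"type": "string"},
        },
        "required": ["filename", "contents"],
    },
}

# The model's reply arrives as function-call arguments that parse directly,
# instead of JSON buried somewhere inside free-form prose.
raw_arguments = '{"filename": "notes.txt", "contents": "hello"}'
arguments = json.loads(raw_arguments)
```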

Planning Strategies

The new agent workflow has many, many interaction points for language models. We really would like to not distribute prompt templates and raw strings all through the system. The re-arch solution is to encapsulate language model interactions into planning strategies. These strategies are defined by

  • The LanguageModelClassification they use (FAST or SMART)
  • A function build_prompt that takes strategy specific arguments and constructs a LanguageModelPrompt (a simple container for lists of messages and functions to pass to the language model)
  • A function parse_content that parses the response content (a dict) into a better formatted dict. Contracts here are intentionally loose and will tighten once we have at least one other language model provider.
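The three parts of a strategy can be sketched as an abstract base class plus one concrete strategy (the `NameAgent` strategy and its prompt text are hypothetical; only `LanguageModelClassification`, `LanguageModelPrompt`, `build_prompt`, and `parse_content` come from the description above):

```python
import abc
import enum
from dataclasses import dataclass, field

class LanguageModelClassification(enum.Enum):
    FAST = "fast_model"
    SMART = "smart_model"

@dataclass
class LanguageModelPrompt:
    # Simple container for the messages and functions passed to the model.
    messages: list = field(default_factory=list)
    functions: list = field(default_factory=list)

class PromptStrategy(abc.ABC):
    @property
    @abc.abstractmethod
    def model_classification(self) -> LanguageModelClassification: ...

    @abc.abstractmethod
    def build_prompt(self, **kwargs) -> LanguageModelPrompt: ...

    @abc.abstractmethod
    def parse_content(self, response_content: dict) -> dict: ...

class NameAgent(PromptStrategy):
    """Hypothetical strategy: ask the model to name a new agent."""
    model_classification = LanguageModelClassification.FAST

    def build_prompt(self, user_objective: str) -> LanguageModelPrompt:
        return LanguageModelPrompt(
            messages=[{"role": "user",
                       "content": f"Name an agent for: {user_objective}"}]
        )

    def parse_content(self, response_content: dict) -> dict:
        # Loose contract: reshape the raw response dict into a tidier one.
        return {"agent_name": response_content.get("content", "").strip()}
```

Encapsulating the interaction this way keeps the prompt text in exactly one place per interaction point, rather than scattered through the system.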

Resources

Resources are kinds of services we consume from external APIs. They may have associated credentials and costs we need to manage. Management of those credentials is implemented as manipulation of the resource configuration. We have two categories of resources currently

  • AI/ML model providers (including language model providers and embedding model providers, e.g. OpenAI)
  • Memory providers (e.g. Pinecone, Weaviate, ChromaDB, etc.)

What we want

  • Resource abstractions should provide a common interface to different service providers for a particular kind of service.
  • Resource abstractions should manipulate the configuration to manage their credentials and budget/accounting.
  • Resource abstractions should be composable over an API (e.g. I should be able to make an OpenAI provider that is both a LanguageModelProvider and an EmbeddingModelProvider and use it wherever I need those services).
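Composability here falls out naturally from abstract base classes and multiple inheritance. A sketch (`FakeOpenAIProvider` and the method signatures are illustrative assumptions, not the real interfaces):

```python
import abc

class LanguageModelProvider(abc.ABC):
    @abc.abstractmethod
    def create_completion(self, messages: list) -> str: ...

class EmbeddingModelProvider(abc.ABC):
    @abc.abstractmethod
    def create_embedding(self, text: str) -> list: ...

class FakeOpenAIProvider(LanguageModelProvider, EmbeddingModelProvider):
    """Hypothetical stand-in: a single provider object fills both roles,
    so it can be passed anywhere either service is needed."""

    def create_completion(self, messages: list) -> str:
        return "ok"  # canned response in place of a real API call

    def create_embedding(self, text: str) -> list:
        return [0.0] * 4  # fixed-size dummy embedding

provider = FakeOpenAIProvider()
```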

Abilities

Along with planning and memory usage, abilities are one of the major augmentations of augmented language models. They allow us to expand the scope of what language models can do by hooking them up to code they can execute to obtain new knowledge or influence the world.

What we want

  • Abilities should have an extremely clear interface that users can write to.
  • Abilities should have an extremely clear interface that a language model can understand.
  • Abilities should be declarative about their dependencies so the system can inject them.
  • Abilities should be executable (where sensible) in an async run loop.
  • Abilities should not have side effects unless those side effects are clear in their representation to an agent (e.g. the BrowseWeb ability shouldn't write a file, but the WriteFile ability can).
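Those requirements suggest an ability shape like the following sketch: declared dependencies the system can inject, and an async call interface. (`Ability`, `required_resources`, and the dict-as-workspace are assumptions for illustration.)

```python
import abc
import asyncio

class Ability(abc.ABC):
    name: str
    # Declarative dependencies: the system reads this list and injects
    # the named resources at construction time.
    required_resources: list = []

    @abc.abstractmethod
    async def __call__(self, **kwargs) -> str: ...

class WriteFile(Ability):
    """Hypothetical ability whose side effect (writing a file) is explicit
    in its name and interface, per the requirement above."""
    name = "write_file"
    required_resources = ["workspace"]

    def __init__(self, workspace: dict):
        self._workspace = workspace  # dict stands in for a real workspace

    async def __call__(self, filename: str, contents: str) -> str:
        self._workspace[filename] = contents
        return f"wrote {filename}"

workspace = {}
# Abilities run inside an async loop; asyncio.run drives one here.
result = asyncio.run(WriteFile(workspace)(filename="a.txt", contents="hi"))
```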

Plugins

Users want to add lots of features that we don't want to support as first-party. Our solution to this is a plugin system to allow users to plug in their functionality or to construct their agent from a public plugin marketplace. Our primary concern in the re-arch is to build a stateless plugin service interface and a simple implementation that can load plugins from installed packages or from zip files. Future efforts will expand this system to allow plugins to load from a marketplace or some other kind of service.

What is a Plugin

Plugins are a kind of garbage term. They refer to a number of things.

  • New commands for the agent to execute. This is the most common usage.
  • Replacements for entire subsystems like memory or language model providers
  • Application plugins that do things like send emails or communicate via whatsapp
  • The repositories contributors create that may themselves have multiple plugins in them.

Usage in the existing system

The current plugin system is hook-based. This means plugins don't correspond to kinds of objects in the system, but rather to times in the system at which we defer execution to them. The main advantage of this setup is that user code can hijack pretty much any behavior of the agent by injecting code that supersedes the normal agent execution. The disadvantages to this approach are numerous:

  • We have absolutely no mechanisms to enforce any security measures because the threat surface is everything.
  • We cannot reason about agent behavior in a cohesive way because control flow can be ceded to user code at pretty much any point and arbitrarily change or break the agent behavior
  • The interface for designing a plugin is kind of terrible and difficult to standardize
  • The hook based implementation means we couple ourselves to a particular flow of control (or otherwise risk breaking plugin behavior). E.g. many of the hook targets in the old workflow are not present or mean something entirely different in the new workflow.
  • Etc.

What we want

  • A concrete definition of a plugin that is narrow enough in scope that we can define it well and reason about how it will work in the system.
  • A set of abstractions that let us define a plugin by its storage format and location.
  • A service interface that knows how to parse the plugin abstractions and turn them into concrete classes and objects.

Some Notes on how and why we'll use OO in this project

First and foremost, Python itself is an object-oriented language. Its underlying data model is built with object-oriented programming in mind. It offers useful tools like abstract base classes to communicate interfaces to developers who want to, e.g., write plugins or help work on implementations. If we were working in a different language that offered different tools, we'd use a different paradigm.

While many things are classes in the re-arch, they are not classes in the same way. There are three kinds of things (roughly) that are written as classes in the re-arch:

  1. Configuration: Auto-GPT has a lot of configuration. This configuration is data, and we use Pydantic to manage it, as pydantic is basically the industry standard for this stuff. It provides runtime validation for all the configuration and allows us to easily serialize configuration both to basic Python types (dicts, lists, and primitives) and to JSON, which is important for being able to put representations of agents on the wire for web applications and agent-to-agent communication. These are essentially structs rather than traditional classes.
  2. Internal Data: Very similar to configuration, Auto-GPT passes around boatloads of internal data. We are interacting with language models and language model APIs, which means we are handling lots of structured but raw text. Here we also leverage pydantic to parse and validate the internal data, and to give us concrete types that static type checkers can validate against, discovering problems before they show up as bugs at runtime. These are essentially structs rather than traditional classes.
  3. System Interfaces: This is our primary traditional use of classes in the re-arch. We have a bunch of systems. We want many of those systems to have alternative implementations (e.g. via plugins). We use abstract base classes to define interfaces to communicate with people who might want to provide those plugins. We provide a single concrete implementation of most of those systems as a subclass of the interface. This should not be controversial.
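The config-as-data point (category 1) can be sketched in a few lines. The real system uses pydantic; this stand-in uses a stdlib dataclass, and `AgentSettings` with its fields is a hypothetical example:

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class AgentSettings:
    """Hypothetical config-as-data struct (pydantic in the real system)."""
    agent_name: str
    openai_budget: float

settings = AgentSettings(agent_name="demo", openai_budget=3.5)

# Serialize to basic types, then to JSON: the agent "on the wire".
wire = json.dumps(asdict(settings))

# The round trip back gives an equal struct, which is what makes
# agents resumable and transferable between processes.
restored = AgentSettings(**json.loads(wire))
```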

The approach is consistent with prior work done by other maintainers in this direction.

From an organization standpoint, OO programming is by far the most popular programming paradigm (especially for Python). It's the one most often taught in programming classes and the one with the most available online training for people interested in contributing.

Finally, and importantly, we scoped the plan and initial design of the re-arch as a large group of maintainers and collaborators early on. The design we chose is consistent with that scoping, and no one offered alternatives.