Recursive Self-improvement Suite

A suite of open-ended, non-imitative tasks involving generalizable skills for large language model chatbots and agents to enable bootstrapped recursive self-improvement and an unambiguous AGI.

The current generation of LLMs is trained in an imitative fashion: the main task is auto-regressive prediction of text written by humans. In this task, the model is effectively penalized if it behaves more intelligently than the behavior present in the training data. The hypothesis is that current large language models use only a small part of their capacity for intelligent behavior, because human-level performance cannot be significantly surpassed with imitative objectives. This is why most quantitative benchmarks show the current generation of LLMs asymptotically approaching the human level, but not significantly exceeding it.

Swapping out the imitative objective has been done before in narrow deep learning models, most famously in AlphaGo, but also in countless others. AlphaGo was first trained imitatively on grandmaster games, and only after the objective was swapped to a self-competitive one did it significantly surpass the human level.

What sorts of tasks do we need?

Any task which involves a large volume of generalizable skills, and for which solutions can be evaluated as better or worse than other reference solutions. Programming is such a task. So is playing chess.

As we now have LLM chatbots which are able to evaluate solutions to very complex natural language tasks from different perspectives, as a panel of LLM judges, the pool of available tasks is vast. We can in effect bootstrap recursive self-improvement by closing the loop and treating the act of evaluation itself as just another task to be evaluated (see the sketch after the list below).

The tasks can be roughly categorized into groups:

  • Procedurally evaluated (e.g. chess) / LLM test case evaluated (e.g. programming) / LLM judge evaluated (e.g. negotiation)
  • LLM assistant tasks (e.g. question answering, technical design) / LLM agent tasks (e.g. social interaction, multi-step and open world tasks)
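
As a minimal sketch of the LLM-judge-evaluated category and of ranking the rankings, assuming hypothetical judge callables that wrap an LLM API call and reply with comma-separated candidate indices:

    import collections
    from typing import Callable

    def panel_rank(task: str, solutions: list[str], judges: list[Callable[[str], str]]) -> list[int]:
        """Aggregate a panel of LLM judges into one ranking (best first).
        Each judge takes a prompt and returns candidate indices, best first;
        the panel result is the mean-rank (Borda-style) aggregate."""
        total_rank = collections.defaultdict(float)
        for judge in judges:
            prompt = (
                f"Task:\n{task}\n\nCandidate solutions:\n"
                + "\n".join(f"{i}: {s}" for i, s in enumerate(solutions))
                + "\n\nReply with the candidate indices from best to worst, comma-separated."
            )
            for rank, index in enumerate(int(x) for x in judge(prompt).split(",")):
                total_rank[index] += rank
        return sorted(range(len(solutions)), key=lambda i: total_rank[i])

    def rank_the_rankings(task: str, rankings: list[list[int]], judge: Callable[[str], str]) -> list[int]:
        """Close the loop: the act of evaluation is itself evaluated."""
        prompt = (
            f"Task:\n{task}\n\nProposed rankings of the candidates (best first):\n"
            + "\n".join(f"{i}: {r}" for i, r in enumerate(rankings))
            + "\n\nReply with the indices of the rankings from most to least sound, comma-separated."
        )
        return [int(x) for x in judge(prompt).split(",")]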

These tasks should be used to fine-tune a pre-trained LLM chatbot which has been instruct-tuned.

Tasks to be Implemented

  • Programming
    • Generate programming challenges and related validators in various languages and simulated deployment environments and integrations.
    • Make the LLM also rank the challenges and the validators.
    • Make the LLM also rank the rankings.
    • See also: Code Llama
    • Train the LLM to produce better programming challenges with better validators, and better rankings.
  • Social games
    • Generate multi-agent social games.
    • Make the LLM rank the player performances, or generate procedural rules to determine the winner.
    • One important aspect to judge is ethical conduct in an agentic setting, which is missing from all current-generation alignment procedures.
    • Make the LLM also rank the games based on how rich and challenging they are, and how many generalist skills they require.
    • Make the LLM also rank the rankings.
    • See also: AgentBench
    • Train the LLM to produce better performances, and better rankings.
  • Predict what a Python program outputs
    • Generate questions and short Python programs that answer them.
    • Make the LLM also rank the questions and the Python programs based on suitable criteria.
    • Make the LLM predict the output.
    • Rank the output predictions against the real outputs (see the sketch after this list).
    • Make the LLM also rank the rankings.
    • Train the LLM to produce better predictions, and better rankings of questions and answers.
  • Trivia (for maintaining knowledge of facts and question answering)
    • Generate questions and answers conditioned on a random Wikipedia page.
    • Make the LLM also rank the questions and answers based on suitable criteria.
    • Make the LLM also rank the rankings.
    • Train the LLM to produce better answers to better questions, and better rankings of questions and answers.
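
For the Python output prediction task above, the procedural part of the evaluation can be very small. A minimal sketch; running untrusted, LLM-generated code should of course happen in a sandbox, so this is only illustrative:

    import subprocess
    import sys

    def real_output(program: str, timeout_s: float = 5.0) -> str:
        """Execute the generated Python program in a subprocess and return its stdout."""
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return result.stdout

    def score_prediction(program: str, predicted_output: str) -> float:
        """1.0 if the LLM's predicted output matches the real output exactly, else 0.0.
        Exact matching is the simplest procedural criterion; an LLM judge can grade
        near-misses more gracefully, and that grading can then be ranked in turn."""
        return 1.0 if predicted_output.strip() == real_output(program).strip() else 0.0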

What Kind of Data We Want Out

The prompting should generate synthetic data which is useful for recursively fine-tuning an LLM.

That means a large volume of paired good and better performances of a task, where the better performance is labelled. This is useful for Direct Preference Optimization. According to Self-Rewarding Language Models, such data are more useful for fine-tuning the models than good performances in isolation.
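
Concretely, each evaluated round can be flattened into preference pairs. A minimal sketch of one such record; the field names here are illustrative assumptions, not a schema defined by this repository:

    import json

    # One preference record per compared pair: the prompt, the performance the
    # evaluation loop ranked higher ("chosen"), and a lower-ranked one ("rejected").
    record = {
        "prompt": "UNLEASHED: Write a function that merges two sorted lists.",
        "chosen": "def merge(a, b): ...",    # higher-ranked performance
        "rejected": "def merge(a, b): ...",  # lower-ranked performance
        "task": "programming",
        "judge_agreement": 0.8,              # optional: how unanimous the judge panel was
    }

    with open("preference_pairs.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")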

How to Fine-tune

There are many methods, but models served behind APIs, such as the OpenAI models, generally only allow plain supervised fine-tuning.

We can use LoRA or similar adapters, but it is best if our fine-tuning process allows contrastive fine-tuning in the style of DPO, where we benefit not only from an example of a good performance but also from a direction, which gives a better gradient towards even better performances.
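
For reference, the DPO objective itself is compact. A minimal PyTorch sketch, assuming the summed per-sequence log-probabilities from the trained policy and a frozen reference model are computed elsewhere:

    import torch
    import torch.nn.functional as F

    def dpo_loss(
        policy_chosen_logps: torch.Tensor,    # log p_theta(chosen | prompt), summed over tokens
        policy_rejected_logps: torch.Tensor,  # log p_theta(rejected | prompt)
        ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
        ref_rejected_logps: torch.Tensor,
        beta: float = 0.1,
    ) -> torch.Tensor:
        """Direct Preference Optimization loss: increase the likelihood of the
        preferred generation relative to the rejected one, regularized towards
        the reference model by beta."""
        logits = (policy_chosen_logps - ref_chosen_logps) - (
            policy_rejected_logps - ref_rejected_logps
        )
        return -F.logsigmoid(beta * logits).mean()

The beta coefficient controls how far the fine-tuned policy is allowed to drift from the reference model while chasing the preference signal.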

Some notes about the fine-tuning process:

  • Fine-tuning with these open-ended "unleashed" tasks needs to be interleaved with traditional LLM tasks and other tasks of all kinds to prevent catastrophic forgetting of baseline knowledge and skills.
  • "Unleashed" tasks need to be prefixed with tokens forming the word "UNLEASHED:" so that the LLM understands that the task is evaluated in an open-ended fashion and that it should not try to emulate human-level behavior. The same prefix should be used at inference time in use cases where superhuman performance is desired.
  • In most tasks, a set of LLMs, or a single LLM with a non-zero temperature, needs to produce multiple candidate solutions, answers or trajectories. Regardless of how these candidates are ranked, a contrastive method should be used to fine-tune the model so that the relative generation likelihood of the best sequence increases compared to the worse ones. For example, Direct Preference Optimization can be used, or any contrastive reinforcement learning algorithm.
  • Most tasks are based on generating a large pool of heterogeneous challenges, problems or questions to answer.
  • We also need to combat mode collapse by making the system evaluate creativity and variability in sets of generations (a simple procedural proxy is sketched below).
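
One simple procedural proxy for the creativity and variability mentioned in the last point is a distinct-n-gram ratio over a batch of generations; a minimal sketch:

    def distinct_n(generations: list[str], n: int = 3) -> float:
        """Fraction of unique n-grams across a batch of generations.
        Values drifting towards 0 over training suggest mode collapse."""
        ngrams = []
        for text in generations:
            tokens = text.split()
            ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

This is only a complement to LLM-judged creativity, not a replacement for it.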

Usage

It's not yet implemented to the point where it does much, but you'll need to add your own OpenAI API key to the file python/apikey.json. See python/apikey.json.example for an example.

Then you can run some initial functionality with this command in the python directory:

python -m recursive_self_improvement_suite.recursive_self_improvement_suite

Citing

Recursive Self-improvement Suite

@article{keskival2023recursive,
  title={Recursive Self-improvement Suite},
  author={Keski-Valkama, Tero},
  year={2023},
  doi={10.5281/zenodo.13207300}
}


References

Related Posts

How to Contribute

Just make a PR. Making a PR is an acknowledgement that the contribution can be added to the codebase as-is or in a modified form. There is no transfer of copyright, but by making a PR you grant a general MIT licence to the contributed code. Add yourself to the LICENCE.
