The recent results of Alpaca are impressive: they show that it is realistically possible to train competitive ChatGPT-style large language models even without a supercomputer or a multi-million budget. Unfortunately, the existing instruction-following models are not open source and do not support multiple languages.
In the long run, we strive to provide such instruction-following large language models, aka "ChatGPT-style" models, for the European languages.
Our goal is to provide models, code, training data and documentation that is:
- Free, open-source and permissive (MIT license)
- Transparent
- Open for agile contribution and participation (see Contribution & Joining our Community)
- State of the art
- Up-to-date
The project is organized into the following resources and repositories (a short data-loading sketch follows the list):
- Documentation Repository
- Cleaned German Alpaca Dataset
- EuroInstructProject: Instruction datasets derived from existing German, English and other European datasets.
- Hugging Face Organization
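The instruction data follows the usual Alpaca record layout with `instruction`, optional `input` and `output` fields. Below is a minimal sketch of loading such records with the Hugging Face `datasets` library; the file name is a placeholder, not the final published format.

```python
# Minimal sketch: loading Alpaca-style instruction records with the
# Hugging Face `datasets` library. The file name is a placeholder.
from datasets import load_dataset

dataset = load_dataset("json", data_files="german_alpaca_cleaned.json")["train"]

# Alpaca-style records typically contain "instruction", "input" and "output" fields.
example = dataset[0]
print(example["instruction"])
print(example["output"])
```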
This section outlines the first steps and the ventures planned for the future.
In the first step, we want to train, evaluate and publish a German and English generative pre-trained transformer (GPT) model of relatively small size. This model will not yet have instruction-following capabilities. A rough pre-training sketch follows the step list below.
The concrete steps towards this goal are:
- Provide a clean and appropriate English and German text corpus.
- Identify training method, training code, model type and hyperparameters.
- Find a way to get the necessary computation power and storage.
- Start, monitor and maintain the training.
- Evaluate the results on reference tasks.
- Publish the final model and results.
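As a very rough illustration, the following sketch shows what such a pre-training run could look like with the Hugging Face Transformers `Trainer` and a small GPT-2-style configuration. The corpus file, tokenizer, model size and hyperparameters are placeholder assumptions, not final decisions.

```python
# Sketch of pre-training a small GPT-style model on a German/English text corpus.
# Tokenizer, corpus file, model size and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, GPT2Config,
                          GPT2LMHeadModel, GPT2TokenizerFast, Trainer,
                          TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # placeholder tokenizer
tokenizer.pad_token = tokenizer.eos_token

# Relatively small model configuration; the real size is still to be determined.
config = GPT2Config(vocab_size=tokenizer.vocab_size, n_positions=512,
                    n_embd=512, n_layer=8, n_head=8)
model = GPT2LMHeadModel(config)

raw = load_dataset("text", data_files={"train": "corpus_de_en.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt-de-en",
                           per_device_train_batch_size=8,
                           num_train_epochs=1),
    train_dataset=tokenized["train"],
    # Causal language modeling: labels are the (shifted) input tokens.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```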
In the second step, we want to add instruction-following capabilities by fine-tuning the GPT model from the first step. This could be done in the same style as Alpaca; access to the OpenAI API might be necessary for this, which would add costs (a rough fine-tuning sketch follows the list below). The concrete steps towards this goal are:
- Identify training method, training code and hyperparameters.
- Find a way to get the necessary computation power and storage.
- Start, monitor and maintain the training.
- Evaluate the results on reference tasks.
- Publish the final model and results.
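As a sketch of what Alpaca-style fine-tuning could look like: instruction/response pairs are rendered into a prompt template and the pre-trained model from the first step is fine-tuned with the ordinary causal language-modeling objective. The model name, data file and prompt template below are illustrative assumptions, not the final setup.

```python
# Sketch of Alpaca-style instruction fine-tuning of the pre-trained GPT model.
# Model directory, data file and prompt template are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

PROMPT = ("Below is an instruction that describes a task. "
          "Write a response that appropriately completes the request.\n\n"
          "### Instruction:\n{instruction}\n\n### Response:\n{output}")

tokenizer = AutoTokenizer.from_pretrained("gpt-de-en")  # model from the first step
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt-de-en")

data = load_dataset("json", data_files="instructions_de_en.json")["train"]

def format_and_tokenize(example):
    # Render the instruction/response pair into the prompt template and tokenize.
    text = PROMPT.format(**example) + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=512)

tokenized = data.map(format_and_tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt-de-en-instruct",
                           per_device_train_batch_size=4,
                           num_train_epochs=3),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```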
Beyond these first steps, we want to:
- add more European languages
- add different programming languages
- use more training data in general
- improve quality of training data
- determine and monitor bias, take countermeasures if necessary
- train larger models
- retrain existing models to keep them current
- update training method, training code, model type and hyperparameters when new research is published
- add multimodal capabilities
Known problems and open questions:
- We do not have sufficient computing power: neither the hardware itself nor the money to rent it (see #5).
- The licensing implications of using the OpenAI API (self-instruct / Alpaca-style training) are not entirely clear (see #6).
- The LLaMA training code is not open-sourced (see #4).
Useful links and resources:
- LLaMA
- GitHub: ChatLLaMA
- Alpaca
- Training Data
Our commitment to open source means that we enable, and indeed encourage, all interested parties to contribute and become part of our community. Contributions and feedback are always welcome.
We communicate via a Slack channel. To get access, please reach out to omar at huggingface.co or philipp at huggingface.co.
Copyright (c) 2023 by the LEL-A team
Licensed under the MIT License (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License by reviewing the file LICENSE in the repository.