The recent results of Alpaca are impressive: they show that it is realistically possible to train competitive ChatGPT-style large language models even without a supercomputer or a multi-million budget. Unfortunately, the existing instruction-following models are not open source and do not support multiple languages.
In the long run, we strive to provide such instruction-following large language models, aka "ChatGPT-style" models, for the European languages.
Our goal is to provide models, code, training data and documentation that is:
- Free, open-source and permissive (MIT license)
- Transparent
- Open for agile contribution and participation (see Contribution & Joining our Community)
- State of the art
- Up-to-date
The project is organized into the following resources and repositories (a short data-loading sketch follows the list):
- Documentation Repository
- Cleaned German Alpaca Dataset
- EuroInstructProject: Instruction datasets derived from existing German, English and other European datasets.
- Hugging Face Organization
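The instruction data follows the usual Alpaca record layout with `instruction`, optional `input` and `output` fields. Below is a minimal sketch of loading such records with the Hugging Face `datasets` library; the file name is a placeholder, not the final published format.

```python
# Minimal sketch: loading Alpaca-style instruction records with the
# Hugging Face `datasets` library. The file name is a placeholder.
from datasets import load_dataset

dataset = load_dataset("json", data_files="german_alpaca_cleaned.json")["train"]

# Alpaca-style records typically contain "instruction", "input" and "output" fields.
example = dataset[0]
print(example["instruction"])
print(example["output"])
```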
This section outlines the first steps and the ventures planned for the future.
In the first step, we want to train, evaluate and publish a German and English generative pre-trained transformer (GPT) model of relatively small size. This model will not yet have instruction-following capabilities. A rough pre-training sketch follows the step list below.
The concrete steps towards this goal are:
- Provide a clean and appropriate English and German text corpus.
- Identify training method, training code, model type and hyperparameters.
- Find a way to get the necessary computation power and storage.
- Start, monitor and maintain the training.
- Evaluate the results on reference tasks.
- Publish the final model and results.
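As a very rough illustration, the following sketch shows what such a pre-training run could look like with the Hugging Face Transformers `Trainer` and a small GPT-2-style configuration. The corpus file, tokenizer, model size and hyperparameters are placeholder assumptions, not final decisions.

```python
# Sketch of pre-training a small GPT-style model on a German/English text corpus.
# Tokenizer, corpus file, model size and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, GPT2Config,
                          GPT2LMHeadModel, GPT2TokenizerFast, Trainer,
                          TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # placeholder tokenizer
tokenizer.pad_token = tokenizer.eos_token

# Relatively small model configuration; the real size is still to be determined.
config = GPT2Config(vocab_size=tokenizer.vocab_size, n_positions=512,
                    n_embd=512, n_layer=8, n_head=8)
model = GPT2LMHeadModel(config)

raw = load_dataset("text", data_files={"train": "corpus_de_en.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt-de-en",
                           per_device_train_batch_size=8,
                           num_train_epochs=1),
    train_dataset=tokenized["train"],
    # Causal language modeling: labels are the (shifted) input tokens.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```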
In the second step, we want to add instruction-following capabilities by fine-tuning the GPT model from the first step. This could be done in the same style as Alpaca; access to the OpenAI API might be necessary for this, which would add costs (a rough fine-tuning sketch follows the list below). The concrete steps towards this goal are:
- Identify training method, training code and hyperparameters.
- Find a way to get the necessary computation power and storage.
- Start, monitor and maintain the training.
- Evaluate the results on reference tasks.
- Publish the final model and results.
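As a sketch of what Alpaca-style fine-tuning could look like: instruction/response pairs are rendered into a prompt template and the pre-trained model from the first step is fine-tuned with the ordinary causal language-modeling objective. The model name, data file and prompt template below are illustrative assumptions, not the final setup.

```python
# Sketch of Alpaca-style instruction fine-tuning of the pre-trained GPT model.
# Model directory, data file and prompt template are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

PROMPT = ("Below is an instruction that describes a task. "
          "Write a response that appropriately completes the request.\n\n"
          "### Instruction:\n{instruction}\n\n### Response:\n{output}")

tokenizer = AutoTokenizer.from_pretrained("gpt-de-en")  # model from the first step
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt-de-en")

data = load_dataset("json", data_files="instructions_de_en.json")["train"]

def format_and_tokenize(example):
    # Render the instruction/response pair into the prompt template and tokenize.
    text = PROMPT.format(**example) + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=512)

tokenized = data.map(format_and_tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt-de-en-instruct",
                           per_device_train_batch_size=4,
                           num_train_epochs=3),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```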
Beyond these first steps, we want to:
- add more European languages
- add different programming languages
- use more training data in general
- improve quality of training data
- determine and monitor bias, take countermeasures if necessary
- train larger models
- retrain existing models to keep them current
- update training method, training code, model type and hyperparameters when new research is published
- add multimodal capabilities
Known problems and open questions:
- We do not have sufficient computing power: neither the hardware itself nor the money to rent it (see #5).
- The licensing implications of using the OpenAI API (self-instruct / Alpaca-style training) are not entirely clear (see #6).
- The LLaMA training code is not open-sourced (see #4).
Useful links and resources:
- LLaMA
- GitHub: ChatLLaMA
- Alpaca
- Training Data
Our commitment to open source means that we enable, and indeed encourage, all interested parties to contribute and become part of our community. Contributions and feedback are always welcome.
We communicate via a Slack channel. To get access, please reach out to omar at huggingface.co or philipp at huggingface.co.
Copyright (c) 2023 by the LEL-A team
Licensed under the MIT License (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License by reviewing the file LICENSE in the repository.