Skip to content

FreedomIntelligence/InstructionZoo

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

63 Commits
 
 

Repository files navigation

InstructionZoo

A collection of open-source Instruction-tuning dataset to train chat-based LLMs (ChatGPT,LLaMA,Alpaca).

This is an on-going project. We will soon add tags to classify the following datasets and continuously update our collection.

Table of Contents

The template

## [owner/project-name](https://github.com/link/to/project)

* Size:
* Language:
* Summary:
* Generation Method:
* Paper:
* HuggingFace: (if applicable)
* Demo: (if applicable)
* License:

The English Instruction Datasets

  • Size: 51,713 instructions
  • Language: EN
  • Summary: Cleaned Alpaca Dataset helps solve the folowing issues: Hallucinations, Merged Instructions, Empty outputs, Empty code examples, Instructions to generate images, N/A outputs, Inconsistent input field, Wrong answers, Non-Sensical/Unclear instructions, and Extraneous escape and control characters.
  • HuggingFace: https://huggingface.co/datasets/yahma/alpaca-cleaned
  • License: CC BY NC 4.0
  • Language: EN
  • Summary: Alpaca-COT is a datset for Chain-of-Thoughts reasoning based on LLaMA and Alpaca.
  • Generateion Method: Use the template provided by FLAN to change the original dataset into various Chain-of-Thoughts forms, and then convert them to the instruction-input-output triplets.
  • HuggingFace: https://huggingface.co/datasets/QingyiSi/Alpaca-CoT
  • License: Apache License
  • Empty for now. Soon to update.
  • Size: 240,000 instructions
  • Language: EN
  • Summary: Unnatural Instructions consist of a core dataset of 68,478 instruction-input-output triplets, and a full dataset.
  • Generateion Method:
    • Step 1 (Core Dataset Generation): Collect 64,000 examples by prompting a language model with three seed examples of instructions and eliciting a fourth, following a strict instruction-input-output format.
    • Step 2 (Template Expansion): Prompt a language model to reformulate the tasks in the core dataset, and collect two alternative formulations for each generated task
  • Paper: Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
  • License:
  • Size: 61 tasks, 61 instructions
  • Language: EN
  • Summary: Natural Instruct v1 is a dataset of 61 distinct tasks, their human-authored instructions, and 193k task instances.
  • Generateion Method:
    • Map exist datasets into Instruction Schema.
    • Instruction Schema:
      • Part I - Title + Definition + Things-to-Avoid + Emphasis-and-Caution
      • Part II - Positive Example: Input + Output + Reason
      • Part III - Negative Example: Input + Output + Reason + Suggestions to be modified to be positive
      • Part IV - Prompt
  • Paper: Cross-Task Generalization via Natural Language Crowdsourcing Instructions
  • Demo: https://instructions.apps.allenai.org/
  • License:
  • Size: 62 tasks
  • Language: EN
  • Summary: FLAN 2021 aggregates 62 text datasets on Tensorflow Datasets into a single mixture. It is currently not public.
  • Generateion Method: Map exist datasets into Instruction Schema.
  • Paper: Finetuned Language Models Are Zero-Shot Learners
  • License:
  • Size: 479 seed instructions, 52,191 Chinese instructions, 52,191 English instructions
  • Language: CH, EN
  • Summary: InstructionWild use the same format as Alpaca for fast and easy usage. Its instructions have no input field.
  • Generateion Method:
    • Pick 429 instructions over 700 noisy instructions from Twitter
    • Use a similar method as Alpaca for generating the resulting instructions.
  • License:

ExMix

  • Size: 1,667 tasks, 3,128 instructions
  • Language: EN
  • Summary: OPT-IML dataset expands the Super-Natural-Instructions benchmark with the task collections from multiple existing work on instruction-tuning, cross-task transfer studies, and area-specific task consolidation.
  • Generation Method:
    • Benchmarks included in OPT-IML are Super-Natural-Instructions, PromptSource, CrossFit, FLAN, ExMix, T5, UnifiedSKG, and Reasoning. Authors only kept partial tasks from CrossFit, ExMix and T5 due to the significant overlap.
    • To organize the Instruction schema, authors broadly classify the instructions in these benchmarks into two categories, dataset-level and instance-level.
  • Paper: OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization
  • License:
  • Size: 30 tasks, 43M instructions
  • Language: EN
  • Summary: OIG contains instructions that are created using data augmentation from a diverse collection of data sources, and formatted in a dialogue style (… … pairs).
  • Generation Method:
    • OIG is created by various LAION community members, consisting of 30 datasets and 43M instructions, with the goal of reaching 1 trillion tokens.
    • OIG dataset can be divided roughly into 75% academic datasets, such as P3, Natural instructions and FLAN, and 25% datasets composed of various tasks, such as high school math, python coding and peoty generation.
  • HuggingFace: https://huggingface.co/datasets/laion/OIG
  • Demo: https://github.com/LAION-AI/Open-Assistant
  • License:
  • Size: 115K instructions
  • Language: EN
  • Summary: Camel dataset introduces a novel communicative agent framework named role-playing.
  • Generation Method:
    • The prompt engineering in Camel consists of three prompts, the task specifier prompt, the assistant system prompt, and the user system prompt. The scenarios in Camel include AI Society and Code.
    • Authors also create Data Generation Prompts to generate meta data by LLMs. 50 assistant roles and 50 user roles are generated for AI Society. 20 programming languages and 50 domains are generated for Code.
  • Paper: CAMEL: Communicative Agents for "Mind" Exploration of Large Scale Language Model Society
  • HuggingFace: https://huggingface.co/camel-ai
  • Demo: https://www.camel-ai.org/
  • License:
  • Size: 657K instructions
  • Language: EN
  • Summary: UltraChat is a multi-round dialogue dataset powered by Turbo APIs, composed of three sectors, namely Questions about the World, Writing and Creation, and Assistance on Existent Materials.
  • Generation Method:
    • Two separate ChatGPT Turbo APIs are adopted in generation, where one plays the role of the user to generate queries and the other generates the response.
    • We instruct the user model with carefully designed prompts to mimic human user behavior and call the two APIs iteratively.
  • HuggingFace: https://huggingface.co/datasets/stingning/ultrachat
  • License:
  • Size: 7 tasks, 15,000 instructions
  • Language: EN
  • Summary: Dolly is a human-generated corpus, whose categories are Creative Writing, Closed QA, Open QA, Summarization, Information Extraction, Classification and Brainstorming.
  • Generation Method:
    • Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories.
    • For instruction categories that require an annotator to consult a reference text, contributors selected passages from Wikipedia for particular subsets of instruction categories.
  • HuggingFace: https://huggingface.co/datasets/databricks/databricks-dolly-15k
  • License:
  • Summary: ShareGPT is an open-source Chrome Extension for you to share your wildest ChatGPT conversations with one click.
  • Generation Method: Collect chats with ChatGPT from its users.
  • Demo: https://sharegpt.com/
  • Size: 18 tasks, 385K instructions
  • Language: EN
  • Summary: SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice. It is used to train RLHF reward models and NLG evaluation models.
  • Generation Method:
    • The data is sourced from Reddit, which is a public forum organized into topic-specific fora called subreddits.
    • Each example is a Reddit post with a question/instruction and a pair of top-level comments for that post.
  • Paper: Understanding Dataset Difficulty with V -Usable Information
  • HuggingFace: https://huggingface.co/datasets/stanfordnlp/SHP
  • License:
  • Size: 12 tasks, 37,175 instructions
  • Language: EN, CH
  • Summary: HC3 is a comparison corpus that consists of both human and ChatGPT answers to the same questions.
  • Generation Method:
    • Human Answers Collection: The first part is publicly available question-answering datasets, whose answers are given by experts or high-voted. The second part is built by constructing question-answer pairs from wiki sources.
    • ChatGPT Answers Collection: use ChatGPT to generate answers to the questions in Human Answers Collection
  • Paper: How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection
  • HuggingFace: https://huggingface.co/datasets/Hello-SimpleAI/HC3
  • License: CC-BY-SA
  • Empty for now. Soon to update.

The Chinese Instruction Datasets

  • Size: 2K tasks, 191,191 instructions in total
  • Language: CH
  • Summary: Chinese Open Instruction Generalist (COIG) is a Chinese instruction dataset consisting of 4 sub-tasks.
  • Generateion Method:
    • Task 1: Translated Instructions (67,798)
      • Translate the following datasets into Chinese: 1,616 task descriptions in Super-Natural-Instruct v2 along with a single instance for each of them; 175 seed tasks in Self-instruct; 66,007 instructions from Unnatural Instructions.
    • Task 2: Exam Instructions (63,532)
      • Exams include The Chinese National College Entrance Examination (高考), Middle School Entrance Examinations (中考), and Civil Servant Examination (公务员考试).
      • Turn them into Chain-of-Thought (CoT) corpus by extracting six informative elements from original exam questions, including instruction, question context, question, answer, answer analysis, and coarse-grained subject.
    • Task 3: Human Value Alignment Instructions (34,471)
      • Select a set of samples that present shared human values in the Chinese-speaking world, and get 50 seed instructions and 3k resulting instructions.
      • Some additional sets of samples that present regional-culture or country-specific human values are also added.
    • Task 4: Counterfactural Correction Multi-round Chat (13,653)
      • The aim is to alleviate and resolve the pain points of hallucination and factual inconsistency in current LLMs.
      • Based on CN-DBpedia knowledge graph dataset, CCMC has ~13,000 dialogues with an average of 5 rounds per dialogue, resulting in ~65,000 rounds of chat.
    • Leetcode Instructions (11,737)
      • 2,589 programming questions from Leetcode.
  • Paper: Chinese Open Instruction Generalist: A Preliminary Release
  • HuggingFace: https://huggingface.co/datasets/BAAI/COIG
  • License: MIT License
  • Size: 4 tasks, 396,209 instructions
  • Language: CH
  • Summary: CSL is a large-scale Chinese scientific literature dataset.
  • Generation Method:
    • Obtain the paper’s meta-information from the National Engineering Research Center for Science and Technology Resources Sharing Service (NSTR) dated from 2010 to 2020.
    • Label papers with categories and disciplines, with the assistance of volunteers.
    • The data format in CSL is <T,A,K,c,d>, where T is the title, A is the abstract, K is a list of keywords, c is the category label and d is the discipline label.
  • Paper: CSL: A Large-scale Chinese Scientific Literature Dataset
  • License:
  • Size: 23 tasks, 1.1M instructions
  • Language: CH
  • Summary: Firefly dataset is a high-quality Chinese instruction-tuning dataset.
  • Generation Method: For each task, human experts write many templates to ensure the quality and diversity of Firefly dataset.
  • HuggingFace: https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M
  • License:
  • Language: Multilingual
  • License:

ZeroPrompt

  • Empty for now. Soon to update.

Chinese Alpaca

  • Size: 20,456 instructions
  • Language: CH
  • Generateion Method: Translate Alpaca into Chinese by machine and then clean.
  • Size: 19,442 instructions
  • Language: CH
  • Generateion Method: Translate Alpaca into Chinese by ChatGPT, and check them by humans
  • Size: 51,458 instructions
  • Language: CH
  • Generateion Method: Translate Alpaca into Chinese by ChatGPT, and discard some of them.
  • Size: 51,672 instructions
  • Language: CH
  • Generateion Method: Translate Stanford Alpaca dataset into Chinese by ChatGPT.
  • Size: 20,465 instructions
  • Language: TC
  • Generateion Method: Translate Stanford Alpaca dataset into traditional Chinese using OpenCC.
  • Size: 124,469 instructions
  • Language: EN, TC
  • Generateion Method: Combine the English instruction/input and traditional Chinese output by ChatGPT.
  • Size: 52,002 instructions
  • Language: EN, TC
  • Generateion Method: A Traditional-Chinese version of the Alpaca dataset, whose instruction part is left as English.
  • Size: 52,002 instructions
  • Language: EN, TC
  • Generateion Method: An Traditional-Chinese version of the Alpaca dataset, where there are English and traditional Chinese versions of one single instruction.

The Miltilingual Instruction Datasets

  • Size: 83 tasks
  • Language: Multilingual (46 languages)
  • Summary:
    • xP3 is a mixture of 13 training tasks in 46 languages with English prompts.
    • Moreover, there is a xP3 Dataset Family, including the following two datasets:
      • xP3mt is a mixture of 13 training tasks in 46 languages with prompts in 20 languages;
      • xP3all consists of xP3 itself and evaluation datasets adding an additional 3 tasks.
  • Generateion Method: Build on the P3 task taxonomy and add 28 new multilingual datasets.
  • Paper: Crosslingual Generalization through Multitask Finetuning
  • HuggingFace: https://huggingface.co/datasets/bigscience/xP3
  • License:
  • Size: 380,835 instructions in total
  • Language: CH, DE, EN, JA, TC
  • Summary: Guanaco dataset builds upon the 175 tasks from Alpaca, containing 3 versions with different sizes and methods.
  • Generateion Method:
    • Original Version (48967): Rewrite 175 Alpaca seed tasks in different languages, and add new tasks specifically designed for English grammar analysis, natural language understanding, cross-lingual self-awareness, and explicit content recognition.
    • Mixed Version (279644): The original 175 tasks were translated into 4 versions and regenerated independently, excluding Deutsch.
    • MIni Version (52224): 52K instrucrion dataset, which is included in the Mixed Version.
  • HuggingFace: https://huggingface.co/datasets/JosephusCheung/GuanacoDataset/tree/main
  • License:
  • Size: 205,999 instructions in total
  • Language: CH, DE, EN, JA
  • Summary: The Paper/General-QA dataset is a collection of questions and answers constructed for AI-generated papers or general texts in 4 languages. The purpose of this dataset is to generate paragraph-level answers to questions posed about lengthy documents such as PDFs.
  • Generateion Method:
    • The question dataset contains 106,707 questions, and the answer dataset contains 99,292 answers.
    • Similar questions are combined to form a tree-like structure, and graph theory algorithms are used to process user questions, content summaries, and contextual logic.
  • HuggingFace: https://huggingface.co/datasets/JosephusCheung/GuanacoDataset/tree/main/additional
  • License:

The Code Instruction Datasets

  • Size: 20,023 instructions
  • Language: EN
  • Summary:
  • Generateion Method: Self-instuct with prompts to focus on code generation/edting/optimization tasks, using text-davinci-003.
  • HuggingFace:
  • License: