VISTEC Launches Inaugural Thai LLM Fine-Tuning Dataset with 5,014 Entries, Aims for Expansion to 40,000 Entries

The Vidyasirimedhi Institute of Science and Technology (VISTEC) has released WangchanThaiInstruct, a dataset for fine-tuning large language models (LLMs). It comprises 5,014 entries covering medical, financial, commercial, and legal topics. The dataset is entirely human-annotated and is freely available under the CC-BY-SA 4.0 license.

The dataset is divided into 7 task types: summarizing text, answering questions based on provided information, answering questions from prior knowledge, categorizing data, creative writing, ideation, and selecting answers from options. The tasks were prepared with subject matter experts from InnovestX, SCB10X, the Faculty of Law at Thammasat University, and Mahidol University.

VISTEC expects to expand the dataset monthly until it reaches 40,000 entries.

Source: VISTEC Facebook, HuggingFace

TLDR: VISTEC released the WangchanThaiInstruct dataset for fine-tuning LLMs, with 5,014 human-annotated entries covering medical, financial, commercial, and legal topics. Subject matter experts from several institutions contributed, and the goal is to reach 40,000 entries.
