The Vidyasirimedhi Institute of Science and Technology (VISTEC) has released WangchanThaiInstruct, a dataset for fine-tuning large language models (LLMs). It comprises 5,014 entries covering medical, financial, commercial, and legal topics. The dataset is entirely human-annotated and freely available under the CC-BY-SA 4.0 license.
The dataset is divided into 7 task types: summarization, answering questions from provided information, answering questions from prior knowledge, classification, creative writing, ideation, and multiple-choice answer selection. Subject-matter experts from InnovestX, SCB10X, the Faculty of Law, Thammasat University, and Mahidol University were involved in these tasks.
The dataset is expected to grow monthly until it reaches 40,000 entries.
Source: VISTEC Facebook, HuggingFace
TLDR: VISTEC released the WangchanThaiInstruct dataset for fine-tuning LLMs, with human-annotated data covering medical, financial, commercial, and legal topics. Experts from several fields were involved, and the dataset is expected to grow to 40,000 entries.