VISTEC Launches Its First Thai LLM Fine-Tuning Dataset with 5,014 Entries, Aims to Expand to 40,000

The Vidyasirimedhi Institute of Science and Technology (VISTEC) has released WangchanThaiInstruct, a dataset for fine-tuning large language models (LLMs). It contains 5,014 entries covering medical, financial, commercial, and legal topics. The dataset is entirely human-annotated and is freely available under the CC-BY-SA 4.0 license.

The dataset is divided into seven task types: summarizing text, answering questions based on provided information, answering questions from prior knowledge, categorizing data, creative writing, ideation, and selecting an answer from a set of options. Subject matter experts from InnovestX, SCB10X, the Faculty of Law at Thammasat University, and Mahidol University were involved in these tasks.

The dataset is expected to be expanded monthly until it reaches 40,000 entries.

Source: VISTEC Facebook, HuggingFace

TLDR: VISTEC has released WangchanThaiInstruct, a human-annotated dataset of 5,014 entries for fine-tuning LLMs on medical, financial, commercial, and legal topics. Experts from several fields were involved, and the goal is to reach 40,000 entries.
