Shared Task: Low-Resource Indic Language Translation

UPDATES

March 22, 2023 - Website released!

TASK DESCRIPTION

In the past few years, machine translation (MT) performance has improved significantly. With the development of new techniques such as multilingual translation and transfer learning, the use of MT is no longer a privilege for users of popular languages. Consequently, there has been increasing interest in the community in expanding coverage to more languages with different geographical presences, degrees of diffusion, and digitalization. However, MT coverage for speakers of diverse languages remains limited because MT methods demand vast amounts of parallel data to train quality systems, which poses a significant obstacle for low-resource translation. Therefore, developing MT systems with relatively small parallel datasets is still highly desirable.

In this shared task, we consider four distinct low-resource Indic languages that belong to different language families, namely Assamese (Indo-Aryan), Mizo (Sino-Tibetan), Khasi (Austroasiatic), and Manipuri (Sino-Tibetan). The main challenge is how to efficiently utilize monolingual data, or techniques such as multilingual training, transfer learning, or language models, to improve translation performance for English-to-Assamese/Mizo/Khasi/Manipuri and Assamese/Mizo/Khasi/Manipuri-to-English. Evaluation will be carried out using automatic metrics (BLEU, TER, RIBES) and human evaluation.

Language Pairs

We focus on the following language pairs (both directions for each): English-Assamese, English-Mizo, English-Khasi, and English-Manipuri.

There will be four subtasks:

Utilizing parallel data

No additional parallel data is allowed for training. Constrained submissions only.

Utilizing monolingual data

You are encouraged to develop novel solutions to utilize monolingual corpora to improve translation quality.

DATA

Train/Dev/Test data shall be available soon!

To participate, you need to complete the registration form. The dataset password is displayed once you submit the form.

SUBMISSIONS

The test data is available in the same repository as the training data and can be accessed using the same password sent via e-mail. You may submit 1 CONSTRAINED, 1 PRIMARY, and up to 2 CONTRASTIVE systems for each language pair and translation direction.

You should submit your results by TBA, 2023 (Anywhere on Earth).

EVALUATION

The evaluation will be carried out automatically using BLEU (Papineni et al., 2002), TER (Snover et al., 2006), and RIBES (Isozaki et al., 2010).
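
To give a rough sense of what the BLEU score measures, here is a minimal Python sketch of corpus-level BLEU. It is an illustration only, not the official scoring script: the function names `ngrams` and `corpus_bleu` are hypothetical, and it assumes whitespace tokenization, a single reference per sentence, n-grams up to order 4, and no smoothing. The organizers' actual tool and tokenization may differ, so scores from this sketch should not be compared with official results.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale.

    Assumptions (not taken from the task description): whitespace
    tokenization, one reference per hypothesis, no smoothing.
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Counter intersection clips each hypothesis n-gram count
            # by its count in the reference (modified precision).
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    system_output = ["the cat sat on the mat today"]
    reference = ["the cat sat on the mat"]
    print(round(corpus_bleu(system_output, reference), 2))
```

Because the score is aggregated over the whole test set before the precisions are computed, it is not the average of per-sentence scores. TER and RIBES capture different aspects (edit operations and word order, respectively) and are not covered by this sketch.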

Paper Submission

Your system paper submission should be prepared according to the WMT instructions and uploaded to START before TBA, 2023.

IMPORTANT DATES

Release of training/dev data TBA, 2023
Test data release TBA, 2023
Submission deadline TBA, 2023
System description paper deadline TBA, 2023
Camera-ready TBA, 2023
Conference December TBA, 2023

ORGANIZERS

CONTACT

lrilt.wat23@gmail.com