WMT 2023 Low-Resource Indic Language Translation Shared Task - EMNLP 2023 Eighth Conference on Machine Translation

Shared Task: Low-Resource Indic Language Translation

UPDATES

March 22, 2023 - Website released!

TASK DESCRIPTION

In the past few years, machine translation (MT) performance has improved significantly. With the development of new techniques such as multilingual translation and transfer learning, the use of MT is no longer a privilege for users of popular languages. Consequently, there has been increasing interest in the community in expanding coverage to more languages with different geographical presences, degrees of diffusion and digitalization. However, MT coverage for speakers of diverse languages remains limited because MT methods demand vast amounts of parallel data to train quality systems, which poses a significant obstacle for low-resource translation. Developing MT systems from relatively small parallel datasets is therefore still highly desirable.

In this shared task, we consider four low-resource Indic languages from three language families: Assamese (Indo-Aryan), Mizo (Sino-Tibetan), Khasi (Austroasiatic) and Manipuri (Sino-Tibetan). The main challenge is how to efficiently use monolingual data, or techniques such as multilingual training, transfer learning, or language models, to improve translation performance for English-to-Assamese/Mizo/Khasi/Manipuri and Assamese/Mizo/Khasi/Manipuri-to-English. The evaluation will be carried out using automatic evaluation metrics (BLEU, TER, RIBES) and human evaluation.

Language Pairs

We focus on the following language pairs (both directions for each):

en-as: English ⇔ Assamese
en-mz: English ⇔ Mizo
en-kha: English ⇔ Khasi
en-mn: English ⇔ Manipuri

There will be four subtasks:

Subtask-1: English ⇔ Assamese (English-to-Assamese and Assamese-to-English Machine Translation)
Subtask-2: English ⇔ Mizo (English-to-Mizo and Mizo-to-English Machine Translation)
Subtask-3: English ⇔ Khasi (English-to-Khasi and Khasi-to-English Machine Translation)
Subtask-4: English ⇔ Manipuri (English-to-Manipuri and Manipuri-to-English Machine Translation)

Utilizing parallel data

No additional parallel data is allowed for training. Constrained submissions only.

Utilizing monolingual data

You are encouraged to develop novel solutions to utilize monolingual corpora to improve translation quality.

DATA

Train/Dev/Test data will be available soon!

To participate, you need to complete the registration form. The dataset password is displayed once you submit the form.

SUBMISSIONS

The test data is available in the same repository as the training data and can be accessed using the same password sent via e-mail. You may submit 1 CONSTRAINED, 1 PRIMARY and up to 2 CONTRASTIVE systems for each language pair and translation direction.

You should submit your results by TBA, 2023 (Anywhere on Earth).

EVALUATION

The evaluation will be carried out automatically using BLEU (Papineni et al., 2002), TER (Snover et al., 2006) and RIBES (Isozaki et al., 2010).
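
As a rough illustration of the first metric, the following is a minimal sketch of corpus-level BLEU in plain Python. It assumes whitespace tokenization, a single reference per segment and no smoothing. The function names are illustrative only, and this is not the scorer the organizers will use, so scores may differ from those of official tooling.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis.

    Assumptions: whitespace tokenization, no smoothing.
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Counter intersection clips each hypothesis n-gram count
            # by its count in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any zero n-gram precision makes BLEU zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty only penalizes output shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    system_output = ["the cat sat on the mat"]
    reference = ["the cat sat on a mat"]
    print(f"BLEU = {corpus_bleu(system_output, reference):.2f}")
```

Because the score is accumulated over the whole test set before the precisions are combined, it should be computed once per submitted system and translation direction rather than averaged over sentences.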

Paper Submission

Your system paper submission should be prepared according to the WMT instructions and uploaded to START before TBA, 2023.

IMPORTANT DATES

Release of training/dev data	TBA, 2023
Test data release	TBA, 2023
Submission deadline	TBA, 2023
System description paper deadline	TBA, 2023
Camera-ready	TBA, 2023
Conference	December TBA, 2023

ORGANIZERS

Santanu Pal, Wipro AI Lab, London, UK
Partha Pakray, National Institute of Technology, Silchar, India
Sahinur Rahman Laskar, University of Petroleum and Energy Studies, Dehradun, India
Sandeep Kumar Dash, National Institute of Technology, Mizoram, India
Lenin Laitonjam, National Institute of Technology, Mizoram, India
Vanlalmuansangi Khenglawt, Mizoram University, India
Sunita Warji, Gandhi Institute of Technology and Management, India
Pankaj Kundan Dadure, University of Petroleum and Energy Studies, Dehradun, India

CONTACT

lrilt.wat23@gmail.com