This project has retired. For details please refer to its Attic page.
Joshua decoder | Indian Languages Parallel Corpora


Indian Languages Parallel Corpora


This page describes a set of six parallel corpora obtained by translating popular Wikipedia documents written in six languages of the Indian subcontinent into English. The languages are:
  • Bengali
  • Hindi
  • Malayalam
  • Tamil
  • Telugu
  • Urdu

The collection and release of this data are described in the following paper:

Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing
Matt Post, Chris Callison-Burch, and Miles Osborne
WMT 2012

Download & License

The Indian parallel corpora dataset is hosted on GitHub. You can clone the repository or download a release tarball. The corpus is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA 3.0).


Below are the best translation scores (case-insensitive BLEU-4) that have been reported on the provided test sets. The Google results were recorded in the fall of 2011 and are described in Post et al. (2012). Google did not have a Malayalam system at the time, so no Google score is reported for Malayalam.

Citation             BN     HI     ML     TA     TE     UR
Google               20.01  25.21  -      13.51  16.03  23.09
Post et al. (2012)   13.53  17.29  13.72  9.81   12.46  19.53
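
For readers unfamiliar with the metric, the following is a minimal sketch of how a corpus-level, case-insensitive BLEU-4 score can be computed. It is not the scoring script used to produce the numbers above. It assumes plain whitespace tokenization and no smoothing, and the function names (ngrams, bleu4) are illustrative only. Scores from other tools may differ because of tokenization and other implementation details.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(hypotheses, references_per_segment):
    """Corpus-level, case-insensitive BLEU-4 on a 0-100 scale.

    hypotheses: list of strings, one system output per segment.
    references_per_segment: list of lists of strings, with one or
    more reference translations per segment.
    """
    matches = [0] * 4  # clipped n-gram matches for n = 1..4
    totals = [0] * 4   # hypothesis n-gram counts for n = 1..4
    hyp_len = 0
    ref_len = 0

    for hyp, refs in zip(hypotheses, references_per_segment):
        # Lowercasing both sides is what makes the score case-insensitive.
        hyp_tokens = hyp.lower().split()
        ref_tokens = [r.lower().split() for r in refs]

        hyp_len += len(hyp_tokens)
        # Effective reference length: the reference closest in length
        # to the hypothesis (ties go to the shorter reference).
        ref_len += min(
            (abs(len(r) - len(hyp_tokens)), len(r)) for r in ref_tokens
        )[1]

        for n in range(1, 5):
            hyp_counts = ngrams(hyp_tokens, n)
            # For each n-gram, the maximum count seen in any one reference.
            max_ref_counts = Counter()
            for r in ref_tokens:
                for gram, count in ngrams(r, n).items():
                    max_ref_counts[gram] = max(max_ref_counts[gram], count)
            # Clip each hypothesis count by the reference maximum.
            matches[n - 1] += sum(
                min(count, max_ref_counts[gram])
                for gram, count in hyp_counts.items()
            )
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, a zero match count at any order gives BLEU = 0.
    if min(matches) == 0:
        return 0.0

    # Geometric mean of the four modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / 4
    # Brevity penalty for output shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)
```

To score a system, pass its outputs as a list of strings together with a parallel list holding the reference translations for each segment, for example bleu4(outputs, [[ref_a, ref_b], ...]).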