Dataset: Comparable Corpus


Dumps Date: 20-09-2017

Arabic: https://dumps.wikimedia.org/arwiki/20170920/
French: https://dumps.wikimedia.org/frwiki/20170920/
English: https://dumps.wikimedia.org/enwiki/20170920/

For each language, I downloaded the following files:
  • [LAN]wiki-[DATE]-pages-articles.xml.bz2 
  • [LAN]wiki-[DATE]-pagelinks.sql.gz 
  • [LAN]wiki-[DATE]-categorylinks.sql.gz 
  • [LAN]wiki-[DATE]-langlinks.sql.gz 
  • [LAN]wiki-[DATE]-page.sql.gz 
  • [LAN]wiki-[DATE]-redirect.sql.gz
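
The file name pattern above can be expanded into full URLs. The following is a minimal sketch, assuming that [LAN] stands for the language prefix of each wiki (ar, fr, en), that [DATE] is the dump date written as 20170920, and that each file sits directly under the dump directory listed for its language. The names dump_urls and all_dump_urls are hypothetical helpers introduced only for illustration. The sketch only builds the URLs and does not download anything, since old dumps may no longer be hosted at these addresses.

```python
from typing import Dict, List

# Values taken from the post: dump host, dump date and the three languages.
BASE_URL = "https://dumps.wikimedia.org"
DUMP_DATE = "20170920"
LANGUAGES = ["ar", "fr", "en"]

# The six files listed above, without the "[LAN]wiki-[DATE]-" prefix.
FILE_SUFFIXES = [
    "pages-articles.xml.bz2",
    "pagelinks.sql.gz",
    "categorylinks.sql.gz",
    "langlinks.sql.gz",
    "page.sql.gz",
    "redirect.sql.gz",
]


def dump_urls(lang: str, date: str = DUMP_DATE) -> List[str]:
    """Return the URLs of the six dump files for one language (hypothetical helper)."""
    wiki = f"{lang}wiki"
    # Assumption: files live directly under <host>/<lang>wiki/<date>/.
    return [
        f"{BASE_URL}/{wiki}/{date}/{wiki}-{date}-{suffix}"
        for suffix in FILE_SUFFIXES
    ]


def all_dump_urls() -> Dict[str, List[str]]:
    """Map each language prefix to its list of dump file URLs."""
    return {lang: dump_urls(lang) for lang in LANGUAGES}


if __name__ == "__main__":
    for lang, urls in all_dump_urls().items():
        print(lang)
        for url in urls:
            print("  " + url)
```

Keeping the suffix list separate from the language list means a further language or dump date only requires changing one constant.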


ISO 639-2 language code
Wikipedia Main Category Name
Wikipedia Disambiguation Category Name
