A large, 5 billion-word corpus of Persian texts.

HmBlogs is a large corpus of Persian texts built by crawling Persian blog posts. The corpus has two main versions, version one and version three. Version one includes only Blogfa blog posts, while version three includes posts from both the Blogfa and Bayan servers. Version three contains more than 5 billion tokens, and an attempt has been made to remove duplicate posts from it.
Contributors:
Hamzeh Motahari
Reference information:

HM Khansari, M Shamsfard. HmBlogs: A big general Persian corpus. arXiv preprint arXiv:2111.02362, 2021.