Colloquial-formal corpus – Natural Language Processing Laboratory

The corpus contains 50,000 colloquial-formal sentence pairs together with their word alignments.

Colloquial Persian, which is widely used on social networks and in fiction, films, and everyday conversation, differs lexically and structurally from formal Persian, which is used mainly in textbooks, scientific writing, news, and official meetings. Lexical differences are differences in vocabulary (e.g., colloquial هندونه versus formal هندوانه, "watermelon"), while structural differences are changes in the syntactic structure of colloquial sentences relative to their formal equivalents (e.g., "I went to school and got it myself" versus "I went to school instead of myself and got it").

The “Aami” colloquial-formal corpus was prepared with the support of the Vice President for Science and Technology. It contains 50,000 colloquial-formal sentence pairs together with their word alignments. The sentences cover both lexical and syntactic transformations, and about half of the colloquial sentences differ structurally from their formal equivalents. In addition to the sentence pairs, for each colloquial sentence the formal equivalent of every word or phrase is specified (alignment).

Colloquial sentences were collected from social networks such as Instagram and Twitter, messengers such as Telegram and WhatsApp, web pages, blogs, books, and films, or were written by the data scientists themselves. Candidate sentences were gathered automatically or semi-automatically by crawling the Internet and, after initial filtering, were passed to the data scientists for selection and entry into the system. The data scientists chose suitably varied sentences and entered each colloquial sentence, its formal form, and the word alignment. Because the corpus itself was built manually, its accuracy is high. Finally, a dictionary of colloquial-formal words and phrases was derived from the set of alignments, and each entry records its frequency of occurrence in the corpus.
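The dictionary-building step can be sketched as a frequency count over alignment pairs. This is a minimal illustration, not the authors' pipeline: the record layout, the field names (`colloquial`, `formal`, `alignments`), and the two toy sentences are all assumptions made up for the example, since the corpus's actual file format is not described here. Skipping words that are identical in both registers is also an assumption.

```python
from collections import Counter

# Hypothetical in-memory layout. The field names and the toy sentences are
# invented for illustration and are not taken from the corpus.
corpus = [
    {
        "colloquial": "هندونه خریدم",
        "formal": "هندوانه خریدم",
        "alignments": [("هندونه", "هندوانه"), ("خریدم", "خریدم")],
    },
    {
        "colloquial": "هندونه میخوام",
        "formal": "هندوانه می‌خواهم",
        "alignments": [("هندونه", "هندوانه"), ("میخوام", "می‌خواهم")],
    },
]


def build_dictionary(records):
    """Count how often each (colloquial, formal) alignment pair occurs."""
    counts = Counter()
    for record in records:
        for colloquial, formal in record["alignments"]:
            # Assumption: pairs whose two sides are identical are not
            # dictionary entries, so they are skipped.
            if colloquial != formal:
                counts[(colloquial, formal)] += 1
    return counts


dictionary = build_dictionary(corpus)
for (colloquial, formal), frequency in dictionary.most_common():
    print(colloquial, formal, frequency)
```

A `Counter` keyed by the pair keeps each entry and its frequency together, which matches the description of a dictionary whose entries carry their number of occurrences in the corpus.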

Colloquial-formal data content:

50,000 pairs of colloquial sentences and their formal equivalents, along with the word alignments for each pair
A colloquial-formal dictionary of 49,316 pairs of colloquial and formal words and phrases, along with the frequency of each pair in the corpus

Characteristics of the colloquial-formal data:

Almost half of the colloquial sentences require a change in syntactic structure to be converted to the formal form.
Each colloquial sentence contains at least one colloquial word.
The corpus draws on numerous and diverse sources and gives adequate coverage of colloquial linguistic phenomena and of the varieties of colloquial language and slang.
The corpus covers both short and long sentences. The average length is 11.36 words for the colloquial sentences and 12.32 words for the formal sentences.
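An average sentence length like the ones reported above can be computed as the mean number of tokens per sentence. The sketch below assumes whitespace tokenization, which may differ from how the words were actually counted for the corpus, and it uses invented toy sentences rather than corpus data. The function name `average_length` is hypothetical.

```python
def average_length(sentences):
    """Return the mean number of whitespace-separated tokens per sentence."""
    if not sentences:
        return 0.0
    return sum(len(sentence.split()) for sentence in sentences) / len(sentences)


# Toy sentences invented for illustration; they are not from the corpus.
colloquial_sentences = ["هندونه خریدم", "من هندونه میخوام"]
formal_sentences = ["هندوانه خریدم", "من هندوانه می‌خواهم"]

print(round(average_length(colloquial_sentences), 2))
print(round(average_length(formal_sentences), 2))
```

The zero-width non-joiner used inside Persian words such as می‌خواهم is not whitespace, so `str.split()` keeps such words as single tokens.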

Contributors to the construction:
–
Reference information:

Shamsfard, Mehrnoush. Persian Language Data and Resources: From Text to Word. In Persian Text and Speech Processing (eds. Mehrnoush Shamsfard and Mahmoud Bijankhan), Chapter 1, pp. 1-25. Samt Publications, 1401 (Solar Hijri).
Takalli, Vahideh; Kalantari, Fateme; Shamsfard, Mehrnoush. Developing an Informal-Formal Persian Corpus. 2022.
Falakaflaki, Parastoo; Shamsfard, Mehrnoush. Formality Style Transfer in Persian. 2022.


Natural Language Processing Laboratory

Shahid Chamran Highway, Velenjak

021-29904171

nlp@sbu.ac.ir
