Aim
Code-mixing, the alternation of two or more languages within a single utterance or sentence, presents unique challenges and opportunities for linguistic analysis and Natural Language Processing (NLP). Our project aims to address existing constraints in code-mixing research by curating and annotating extensive Hindi-English code-mixed material. Additionally, for the benefit of the NLP community interested in code-mixing research, we hope to provide a number of high-quality low-level NLP tagging tools for named entity recognition, token-level language identification, and POS tagging. Code-mixed translation has received comparable attention. However, apart from a few recent efforts, we are not aware of any study that curates large-scale parallel triplets of code-mixed sentences and their accompanying monolingual translations. In addition to the token-level language tags, we intend to manually translate each code-mixed text into its English and Hindi equivalents. Finally, since our proposed dataset also records the topic and the matrix language (the language whose grammatical structure the sentence follows) for each instance, we seek to build topic and matrix language predictors.
Token-level language annotation involves breaking down text into individual units (tokens) and assigning labels or categories to those tokens. For example: ...
Matrix language identification refers to the dominant language in a situation where two languages are mixed (code-switching). It provides the grammatical framework and overall structure ...
In this annotation scheme, we annotate named entities in the given sentence. We aim to annotate the following standard English named entities: ...
Token-level POS tagging involves annotating each word/token in a sentence with its respective Part-of-Speech (POS) tag. In a scenario where the sentence is in a code-mixed language like Hinglish (a blend of Hindi and English), ...
Spelling correction and normalization involve identifying spelling mistakes, correcting them, and standardizing variations of tokens to a common form. For instance, tokens like "hain," "hai," and "hayn" can all be normalized to the token "hain." This process ...
Translations involve creating corresponding sentences in two different languages that convey the same meaning. This process facilitates understanding for researchers who may not be proficient in one of the languages and ...
You are about to enter the COMMENTATOR Demo Server.
Several code-mixed instances have been uploaded, which you can use to explore the various tasks of COMMENTATOR.
Login Credentials
User Name: commentator
Password: commentator
Feel free to use these credentials to log in and try out the different functionalities of the tool.
Demonstration of Annotation Framework
We intend to create a web-based annotation framework to replace the conventional spreadsheet-based annotation method. By offering helpful options in dropdown menus and presenting only the fields relevant to the previous selection, it will speed up the annotation process and make sentences easier to annotate overall.
COMMENTATOR is the name of the COde-Mixed Multilingual tExt aNnoTATion framewORk we present, built specifically for code-mixed text. The framework has two primary types of users: (i) the annotators and (ii) the admins. The annotators perform the annotation task, whereas the admins design the annotation task, employ annotators, administer the annotation task, and process the annotations. Both admins and annotators need to sign up and log in to access the various functionalities of the tool. Given these roles, we describe the COMMENTATOR functionalities by introducing the two user panels:
The Annotator Panel: The annotator panel contains two pages:
1. Annotation page: The annotation page offers three annotation tasks for a given sentence: (i) token-level language identification, (ii) token-level POS tagging, and (iii) matrix language identification.
When a task is selected, users are directed to a dedicated annotation page specific to that task. Annotators can update tags by clicking the corresponding button, and they can provide textual feedback in the "Enter Your Feedback Here" section. The tags are displayed in different colors. If an error is made, annotations can be revised by using the "Edit Annotations" button, which redirects to the history and edit page. The figure illustrates a Hinglish sentence being annotated by the user for token-level language identification.
2. History and edit page: The figure below shows the history and edit page for token-level language identification. It contains a list of previously annotated sentences along with their timestamps. The annotator can click on any sentence to edit its annotations; doing so opens the annotation page with the previously chosen annotation tags. Token-level POS tagging and matrix language identification have analogous Annotation and History & Edit pages.
The Admin Panel: The figure below shows the admin panel. The admin panel supports three major tasks:
1. Data upload: The administrator can upload the source sentences using a CSV file (see point 1).
2. Annotation analysis: The administrator can (i) analyze the quality of annotations using Cohen’s Kappa score for inter-annotator agreement (IAA) (see point 3) and (ii) analyze the degree of code-mixing in the annotated text using the code-mixing index (CMI) (see point 2).
3. Data download: The admin can download annotations of single/multiple annotators in a CSV file for different NLP tasks. The data download functionality also supports the conditional filtering of data based on IAA and CMI.
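The two measures used in the annotation analysis can be illustrated with a small sketch. The code below is not COMMENTATOR's implementation; it is a minimal illustration under stated assumptions. Cohen's Kappa is computed from its standard definition, (p_o - p_e) / (1 - p_e). For CMI, we assume the commonly used sentence-level formulation 100 * (1 - max_w / (n - u)), where n is the number of tokens, u the number of language-independent tokens, and max_w the token count of the most frequent language. The tag names ("hi", "en", "univ") and the function names are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators who labelled the same tokens."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of tokens given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[t] * count_b[t] for t in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


def cmi(lang_tags, neutral=("univ",)):
    """Code-mixing index of one sentence (assumed formulation, see lead-in).

    `neutral` lists the tags treated as language-independent; the tag
    name "univ" is a hypothetical placeholder.
    """
    n = len(lang_tags)
    counts = Counter(t for t in lang_tags if t not in neutral)
    u = n - sum(counts.values())  # language-independent tokens
    if n == u:
        return 0.0  # no language-specific tokens, so no mixing
    return 100.0 * (1 - max(counts.values()) / (n - u))


# Hypothetical token-level language tags from two annotators.
annotator_1 = ["hi", "hi", "en", "en", "univ", "hi"]
annotator_2 = ["hi", "en", "en", "en", "univ", "hi"]

print(f"kappa = {cohen_kappa(annotator_1, annotator_2):.3f}")
print(f"CMI   = {cmi(annotator_1):.1f}")
```

A conditional download filter of the kind described in point 3 could then keep only those sentences whose Kappa and CMI values exceed thresholds chosen by the admin.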
Rajvee Sheth, JRF
Shubh Nisar, Student
Heenaben Prajapati, JRF
Himanshu Beniwal, PhD
Sponsoring Agency details
Title: Curating and constructing benchmarks and development of ML models for low-level NLP tasks in Hindi-English code-mixing
Agency: Science & Engineering Research Board (SERB)
Duration: February 2023 - February 2026
Sanctioned Amount: INR 47,67,400 (~USD 58,190)
PI: Prof. Mayank Singh