
WELCOME TO THE WORLD OF CODE-MIXING

Aim

Code-mixing, the alternation of two or more languages within a single utterance or sentence, presents unique challenges and opportunities for linguistic analysis and Natural Language Processing (NLP). Our project addresses these challenges by curating and annotating a large corpus of Hindi-English code-mixed text. For the benefit of the NLP community interested in code-mixing research, we also hope to provide a set of high-quality low-level NLP tagging tools for named entity recognition, token-level language identification, and POS tagging. Code-mixed translation has received comparable attention. However, apart from a few recent efforts, we are not aware of any study that curates large-scale parallel triplets of code-mixed sentences and their monolingual translations. In addition to providing token-level language tags, we intend to manually translate each code-mixed text into English and Hindi. Finally, we seek to create topic and matrix language predictors, since our proposed dataset also includes the topic and the matrix language (the language whose grammatical structure the sentence follows) for each instance.

NLP TASKS

Token-level language annotation

Token-level language annotation involves breaking down text into individual units (tokens) and assigning labels or categories to those tokens. For example: ...

Matrix language identification

Matrix language identification determines the dominant language of a sentence in which two languages are mixed (code-switching). The matrix language provides the grammatical framework and overall structure ...

Token-level entity labeling

In this annotation scheme, we annotate named entities in the given sentence. We aim to annotate the following standard English named entities: ...

Token-level POS tagging

Token-level POS tagging involves annotating each word/token in a sentence with its respective Part-of-Speech (POS) tag. In a scenario where the sentence is in a code-mixed language like Hinglish (a blend of Hindi and English), ...

Spelling correction and normalization

Spelling correction and normalization involve identifying spelling mistakes, correcting them, and standardizing variations of tokens to a common form. For instance, tokens like "hain," "hai," and "hayn" can all be normalized to the token "hain." This process ...
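
The normalization step described above can be sketched as a simple lookup from spelling variants to a canonical form. This is only an illustrative sketch: the `CANONICAL_FORMS` table and the `normalize_tokens` function are hypothetical names, and the only mappings included are the ones from the example above.

```python
# Hypothetical variant table built from the example in the text:
# "hai" and "hayn" are both mapped to the canonical token "hain".
CANONICAL_FORMS = {
    "hai": "hain",
    "hayn": "hain",
}


def normalize_tokens(tokens):
    """Replace known spelling variants with their canonical form.

    Lookup is case-insensitive; tokens without an entry in the table
    are returned unchanged.
    """
    return [CANONICAL_FORMS.get(token.lower(), token) for token in tokens]


# Usage: only the last token is a known variant, so only it changes.
sentence = ["wo", "log", "office", "ja", "rahe", "hayn"]
print(normalize_tokens(sentence))
```

A real normalizer would need a much larger variant table or a learned model; the dictionary lookup only shows the shape of the input and output.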

Translations

Translations involve creating corresponding sentences in two different languages that convey the same meaning. This process facilitates understanding for researchers who may not be proficient in one of the languages and ...

You are about to enter COMMENTATOR Demo Server.

Several code-mixed instances have been uploaded, which you can use to explore the various tasks of COMMENTATOR.

Login Credentials

User Name: commentator
Password: commentator

Feel free to use these credentials to log in and try out the different functionalities of the tool.




COMMENTATOR

Demonstration of Annotation Framework


We intend to create a web-based annotation framework to replace the conventional spreadsheet-based annotation method. By offering helpful options in dropdown menus and presenting only the fields relevant to the previous selection, it will speed up the annotation process and make sentences easier to annotate overall.

COMMENTATOR is the COde-Mixed Multilingual tExt aNnoTATion framewORk we present, built specifically for code-mixed text. The framework has two primary kinds of users: (i) the annotators and (ii) the admins. The annotators perform the annotation task, whereas the admins design the annotation task, employ annotators, administer the annotation task, and process the annotations. Both admins and annotators need to sign up and log in to access the functionalities of the tool. Given these roles, we describe COMMENTATOR's functionalities by introducing the two user panels:

The Annotator Panel: The annotator panel contains two pages:

1. Annotation page: The annotation page offers three annotation tasks for a given sentence: (1) token-level language identification, (2) token-level POS tagging, and (3) matrix language identification. When a task is selected, users are directed to a dedicated annotation page for that task. Annotators can update tags by clicking the corresponding button and can provide textual feedback in the "Enter Your Feedback Here" section. The tags are displayed in different colors. If an error is made, annotations can be revised using the "Edit Annotations" button, which redirects to the history and edit page. The figure illustrates a Hinglish sentence being annotated for token-level language identification.

2. History and edit page: The figure below shows the history and edit page for token-level language identification. It lists previously annotated sentences along with their timestamps. Clicking on a sentence opens the annotation page with the previously chosen tags so that the annotator can edit them. Token-level POS tagging and matrix language identification have their own annotation and history and edit pages.

The Admin Panel: The figure below shows the admin panel, which supports three major tasks:

1. Data upload: The administrator can upload the source sentences using a CSV file (see point 1).

2. Annotation analysis: The administrator can (i) analyze the quality of annotations using Cohen’s Kappa score for inter-annotator agreement (IAA) (see point 3) and (ii) analyze the degree of code-mixing in the annotated text using the code-mixing index (CMI) (see point 2).

3. Data download: The administrator can download the annotations of one or more annotators as a CSV file for the different NLP tasks. The download functionality also supports conditional filtering of the data based on IAA and CMI.
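
The two analysis scores mentioned in the list above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the tool's actual implementation: the page does not say which CMI variant is used, so the code assumes the commonly cited utterance-level definition, CMI = 100 × (1 − max_lang / (n − u)), where n is the number of tokens, u the number of language-independent tokens, and max_lang the token count of the most frequent language. The tag names `"hi"`, `"en"`, and `"univ"` and both function names are hypothetical.

```python
from collections import Counter

# Hypothetical tag for tokens that belong to no language
# (punctuation, numbers, and similar); not taken from the source.
LANGUAGE_INDEPENDENT = {"univ"}


def code_mixing_index(tags):
    """Utterance-level CMI in [0, 100] for a list of token-level language tags.

    Assumed definition: 100 * (1 - max_lang / (n - u)); returns 0.0 when the
    utterance has no language-specific tokens.
    """
    lang_counts = Counter(t for t in tags if t not in LANGUAGE_INDEPENDENT)
    lang_tokens = sum(lang_counts.values())  # this is n - u
    if lang_tokens == 0:
        return 0.0
    return 100.0 * (1 - max(lang_counts.values()) / lang_tokens)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined here, and this sketch reports perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Usage: two annotators tag the same six-token sentence.
annotator_a = ["hi", "hi", "en", "en", "hi", "univ"]
annotator_b = ["hi", "en", "en", "en", "hi", "univ"]
print("CMI (annotator A):", code_mixing_index(annotator_a))
print("Cohen's kappa:", cohens_kappa(annotator_a, annotator_b))
```

Scores like these could then serve as thresholds when filtering a download, for example keeping only sentences whose kappa and CMI exceed chosen cut-offs.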

TEAM

Prof. Mayank Singh

Rajvee Sheth

JRF

Shubh Nisar

Student

Heenaben Prajapati

JRF

Himanshu Beniwal

PhD

PROJECT FUNDING

Sponsoring Agency details


Title: Curating and constructing benchmarks and development of ML models for low-level NLP tasks in Hindi-English code-mixing
Agency: Science & Engineering Research Board (SERB)
Duration: February 2023 - February 2026
Sanctioned Amount: INR 47,67,400 (~USD 58,190)
PI: Prof. Mayank Singh