Browsing Applied Linguistics and Translation Studies - Publications by Title

Item

A transfer-rule based verb phrase translation from English to Tamil

(2018-01-01) Parameswari, K. ; Nagaraju, V. ; Angeline Linda, K.

Building a machine translation (MT) system between non-cognate languages always poses a number of issues, as many translation divergences are involved. In transfer-based MT, a systematic way of formulating transfer rules is required to handle linguistic differences between languages. This paper explains the three stages in which a transfer-based MT system is built for translating verb phrases from English to Tamil.

Item

Development of Telugu-Tamil transfer-based machine translation system: An improvization using divergence index

(2019-07-01) Krishnamurthy, Parameswari

Building an automatic, high-quality, robust machine translation (MT) system is a fascinating yet arduous task, as one of the major difficulties lies in cross-linguistic differences or divergences between languages at various levels. The existence of translation divergence precludes straightforward mapping in the MT system. An increase in the number of divergences also increases the complexity, especially in linguistically motivated transfer-based MT systems. This paper discusses the development of Telugu-Tamil transfer-based MT and how a divergence index (DI) is built to quantify the number of parametric variations between languages in order to improve the success rate of MT. The DI helps indicate where to put effort for a given language pair to attain better and faster results. In addition, strategies for handling different types of divergences in a transfer-based approach to MT are discussed. The paper also includes the evaluation method and shows how the application of the DI improves MT.

Item

“Do You See and Hear More? A Study on Telugu Perception Verbs”

(2022-01-01) Krishna, P. Phani ; Arulmozi, S. ; Mishra, Ramesh Kumar

Verbs of perception describe the actual perception of some entity, and earlier researchers have emphasized that the lexicon of a language is conceptually oriented and necessary for our daily communicative needs. In this paper, we demonstrate and explain which perception verbs across the five senses (vision, hearing, smell, taste, touch) have the highest frequencies, using a Telugu corpus and a self-rating task. This study shows greater lexical differentiation than studies using English and other-language corpora. Based on our analysis, verbs of vision, followed by hearing, are the most commonly used by Telugu speakers in daily communication, compared with touch, taste, and smell. The usage of the other senses is less consistent across studies than that of vision and hearing, which may be due to sampling and methodological variations in the corpora of different languages; what the studies have in common is that these two senses play a key role in perception verbs. The study of Telugu perception verbs may offer further facts and insights for the cognitive linguistics paradigm.

Item

Grammar extraction from treebanks for Hindi and Telugu

(2010-01-01) Kolachina, Prasanth ; Kolachina, Sudheer ; Singh, Anil Kumar ; Husain, Samar ; Naidu, Viswanatha ; Sangal, Rajeev ; Bharati, Akshar

Grammars play an important role in many Natural Language Processing (NLP) applications. The traditional approach of creating grammars manually, besides being labor-intensive, has several limitations. With the availability of large-scale syntactically annotated treebanks, it is now possible to automatically extract an approximate grammar of a language in any of the existing formalisms from a corresponding treebank. In this paper, we present a basic approach to extracting grammars from dependency treebanks of two Indian languages, Hindi and Telugu. The process of grammar extraction requires a generalization mechanism. To this end, we explore an approach that relies on generalizing argument structure over verbs based on their syntactic similarity. Such a generalization counters the effect of data sparseness in the treebanks. A grammar extracted using this system can not only expand existing knowledge bases for NLP tasks such as parsing, but also aid in the creation of grammars for languages where none exist. Further, we show that the grammar extraction process can help identify annotation errors and thus aid in the task of treebank validation.

Item

History, features, and typology of language corpora

(2018-03-05) Dash, Niladri Sekhar ; Arulmozi, S.

This book discusses key issues of corpus linguistics like the definition of the corpus, primary features of a corpus, and utilization and limitations of corpora. It presents a unique classification scheme of language corpora to show how they can be studied from the perspective of genre, nature, text type, purpose, and application. A reference to parallel translation corpus is mandatory in the discussion of corpus generation, which the authors thoroughly address here, with a focus on Indian language corpora and English. Web-text corpus, a new development in corpus linguistics, is also discussed with elaborate reference to Indian web text corpora. The book also presents a short history of corpus generation and provides scenarios before and after the advent of computer-generated digital corpora. This book has several important features: it discusses many technical issues of the field in a lucid manner; contains extensive new diagrams and charts for easy comprehension; and presents discussions in simplified English to cater to the needs of non-native English readers. This is an important resource authored by academics who have many years of experience teaching and researching corpus linguistics. Its focus on Indian languages and on English corpora makes it applicable to students of graduate and postgraduate courses in applied linguistics, computational linguistics and language processing in South Asia and across countries where English is spoken as a first or second language.

Item

Holistic spatial semantics and post-Talmian motion event typology: A case study of Thai and Telugu

(2018-11-01) Naidu, Viswanatha ; Zlatev, Jordan ; Duggirala, Vasanta ; Van De Weijer, Joost ; Devylder, Simon ; Blomberg, Johan

Leonard Talmy's influential binary motion event typology has encountered four main challenges: (a) additional language types; (b) extensive "type-internal" variation; (c) the role of other relevant form classes than verbs and "satellites"; and (d) alternative definitions of key semantic concepts like Motion, Path and Manner. After reviewing these issues, we show that the theory of Holistic Spatial Semantics provides analytical tools for their resolution. In support, we present an analysis of motion event descriptions by speakers of two languages that are troublesome for the original typology: Thai (Tai-Kadai) and Telugu (Dravidian), based on the Frog-story elicitation procedure. Despite some apparently similar typological features, the motion event descriptions in the two languages were found to be significantly different. The Telugu participants used very few verbs in contrast to extensive case marking to express Path and nominals to express Region and Landmark, while the Thai speakers relied largely on serial verbs for expressing Path and on prepositions for expressing Region. Combined with previous research in the field, our findings imply (at least) four different clusters of languages in motion event typology with Telugu and Thai as representative of two such clusters, languages like French and Spanish representing a third cluster, and Swedish and English a fourth. This also implies that many other languages like Italian, Bulgarian, and Basque will appear as "mixed languages," positioned between two or three of these clusters.

Item

IIITK@DravidianLangTech-EACL2021: Offensive Language Identification and Meme Classification in Tamil, Malayalam and Kannada

(2021-01-01) Ghanghor, Nikhil Kumar ; Krishnamurthy, Prameshwari ; Thavareesan, Sajeetha ; Priyadarshini, Ruba ; Chakravarthi, Bharathi Raja

This paper describes the IIITK team’s submissions to the offensive language identification and troll meme classification shared tasks for Dravidian languages at the DravidianLangTech 2021 workshop at EACL 2021. We used transformer-based pretrained models along with customized versions that use custom loss functions. State-of-the-art pretrained CNN models were also used for image-related tasks. Our best configuration for Tamil troll meme classification achieved a weighted average F1 score of 0.55, and for offensive language identification, our system achieved weighted F1 scores of 0.75 for Tamil, 0.95 for Malayalam, and 0.71 for Kannada. We ranked 2nd on Tamil troll meme classification, and 3rd, 3rd, and 4th on offensive language identification in Tamil, Malayalam, and Kannada, respectively. We have open-sourced our code implementations for all the models across both tasks on GitHub.
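The weighted average F1 score reported in this abstract is the mean of per-class F1 scores, weighted by how many true examples each class has. The following is a minimal Python sketch of that metric only, not the authors' code; the function name `weighted_f1` and the toy labels are hypothetical and chosen for illustration.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (hypothetical helper)."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1  # weight each class by its support
    return score


# Invented toy labels, not data from the shared task.
gold = ["offensive", "offensive", "not", "not"]
pred = ["offensive", "not", "not", "not"]
print(round(weighted_f1(gold, pred), 4))
```

Weighting by support means that frequent classes dominate the score, which matters for imbalanced datasets.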

Item

Improving the performance of the link parser

(2009-12-01) Naidu, Y. Viswanatha ; Singh, Anil Kumar ; Sharma, Dipti Misra ; Bharati, Akshar

The paper describes an approach to extend the coverage of a Link Grammar-based parser to constructions that are not currently handled by the grammar. We have identified about thirty types of such constructions so far. In order to make Link Grammar handle these constructions, we introduce a preprocessor and a postprocessor. The idea is to handle such constructions via some analysis and transformations in a preprocessing phase before the sentence is given to the Link Parser, and then by adding the missing links in the postprocessing phase. The main part of the paper discusses the constructions not handled by the parser and introduces a rule-based preprocessor and postprocessor. This simple and flexible approach is able to increase the coverage of the parser significantly and allows even a relatively naive user to improve the performance of the parser without disturbing the core grammar. © 2009 IEEE.

Item

Issues in analyzing Telugu sentences towards building a Telugu Treebank

(2010-12-29) Vempaty, Chaitanya ; Naidu, Viswanatha ; Husain, Samar ; Kiran, Ravi ; Bai, Lakshmi ; Sharma, Dipti M. ; Sangal, Rajeev

This paper describes an effort towards building a Telugu Dependency Treebank. We discuss the basic framework and the issues we encountered while annotating. A total of 1487 sentences have been annotated in the Paninian framework. We also discuss how some of the annotation decisions would affect the development of a parser for Telugu. © Springer-Verlag 2010.

Item

Motion event descriptions in Swedish, French, Thai and Telugu: a study in post-Talmian motion event typology

(2021-01-01) Zlatev, Jordan ; Blomberg, Johan ; Devylder, Simon ; Naidu, Viswanatha ; van de Weijer, Joost

Motion-event typology has moved into a “post-Talmian” terrain of approaches focusing on an open-ended number of patterns across languages and constructions. Following a proposal to distinguish between four typological clusters, we systematically compared the motion event descriptions in four languages suggested to exemplify these clusters: Swedish, French, Thai and Telugu, with the help of an elicitation-based study. 20 adult native speakers of each language were asked to describe 52 motion events, 38 of which were translocative. The stimuli varied with respect to the parameters caused/uncaused, bounded/unbounded motion as well as the viewpoint from which they were filmed. The descriptions were analyzed following Holistic Spatial Semantics and compared with respect to the categories Path, Direction, Region, Landmark, Manner and Cause, as well as the means of expressing these. The four languages patterned differently in significant ways. In terms of Path expression, French lagged behind the other languages, but with respect to Direction, it patterned together with Swedish. We demonstrate a number of such criss-crossing patterns, showing that there is no way to group the languages, thus implying at least four distinct typological prototypes. Further, we show that different kinds of motion situations, corresponding to different constructions, need to be compared separately.

Item

On polysemy in Tamil and other Indian languages

(2010-01-01) Mohanty, Panchanan ; Arulmozi, S.

Scholars (e.g. Burrow 1968:300) have expressed surprise at the very small number of words borrowed from Sanskrit in Tamil as opposed to the other three major literary Dravidian languages, i.e. Telugu, Kannada, and Malayalam. But there is no detailed discussion of why this has happened in Tamil when other Dravidian languages possess a lot of Sanskrit borrowings. We want to argue here that the small number of consonant letters in the Tamil alphabet is responsible for it. Its natural outcome is that other Dravidian languages have borrowed from Sanskrit whenever necessary, whereas Tamil has managed its situation by developing polysemy. In other words, Tamil is more polysemous than its sister languages. In fact, we want to propose that if a language has a smaller alphabet than others, it has to be more polysemous than they are. In this paper, we demonstrate this with examples from Tamil vis-a-vis their cognates in Telugu.

Item

Parameswari_faith_nagaraju@Dravidian-CodeMixFIRE: A machine-learning approach using n-grams in sentiment analysis for code-mixed texts: A case study in Tamil and Malayalam

(2020-01-01) Krishnamurthy, Parameswari ; Varghese, Faith ; Vuppala, Nagaraju

Sentiment analysis is a fast-growing research area that aims to uncover the underlying meaning of a text by categorizing it into different levels. This paper attempts to decode deeply entangled code-mixed Malayalam and Tamil datasets and classify their underlying meaning at five levels. Along with the corpus creation, [1] propose a five-level classification for Malayalam and Tamil code-mixed datasets. In this paper, we follow the five-level annotated datasets and aim to solve the classification problem by implementing unigram and bigram knowledge with a Multinomial Naive Bayes model. Our model achieves an F1-score of 0.55 for Tamil and 0.48 for Malayalam.
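The general technique named in this abstract, a Multinomial Naive Bayes classifier over unigram and bigram features, can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the class and function names, whitespace tokenization, add-one smoothing, and the two-class English toy data are all invented here (the paper uses five classes and code-mixed Tamil and Malayalam text).

```python
import math
from collections import Counter, defaultdict


def ngram_features(text):
    """Return the unigrams and bigrams of a whitespace-tokenized text."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


class MultinomialNB:
    """Multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength; 1.0 is an assumption

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            feats = ngram_features(text)
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, text):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.feature_counts[label]
            total = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n / n_docs)  # log prior
            for feat in ngram_features(text):
                if feat in self.vocab:  # skip n-grams never seen in training
                    score += math.log((counts[feat] + self.alpha) / total)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data with two classes, for illustration only.
train_texts = ["super movie", "very good songs", "worst movie", "very bad acting"]
train_labels = ["positive", "positive", "negative", "negative"]
model = MultinomialNB().fit(train_texts, train_labels)
print(model.predict("good movie"))
```

Adding bigrams lets the model pick up short word sequences in addition to single-word cues, at the cost of a larger and sparser vocabulary.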

Item

Sensory Perception in Blind Bilinguals and Monolinguals

(2020-08-01) Phani Krishna, P. ; Arulmozi, S. ; Shiva Ram, Male ; Mishra, Ramesh Kumar

For blind people, tactile sensation plays a crucial role in various daily activities; among all the sense modalities, it is considered the major sense of perception. This study investigates tactile sensation in bilingual and monolingual blind participants using an experimental comparative study design with two groups. A self-paced reading task of a Braille-scripted passage was used as the stimulus. The findings indicate that blind bilingual participants differ in language processing and that tactile sensation is better in bilinguals than in monolinguals.

Item

Speech Perception Performance of Native Speakers of Marathi: Effect of Filtered Speech Stimulus and Degree of Hearing Impairment

(2022-02-01) Rathna Kumar, S. B. ; Dash, Niharika ; Bapuji, Mendem ; Arulmozi, S. ; Chandanshive, Chandrahas

The study investigated the effect of filtered speech stimulus on speech perception performance of native speakers of Marathi as a function of degree of hearing impairment. Speech identification score (SIS) testing was performed to measure speech perception in three groups (Group I, Group II, and Group III consisted of participants with moderate, moderately severe, and severe sensorineural hearing impairment, respectively). Speech stimuli comprised eight word lists, each consisting of 25 words in Marathi. The first seven word lists (first to seventh) were filtered at 500 Hz, 1000 Hz, 1500 Hz, 2000 Hz, 2500 Hz, 3000 Hz, and 3500 Hz cut-off frequencies, respectively, while word list 8 was left unfiltered. Although the SIS improved with an increase in cut-off frequency, the improvement was noticed only up to 3000 Hz, 2500 Hz, and 2000 Hz for participants of Group I, Group II, and Group III, respectively. In addition, the improvement in speech perception performance did not correspond to what would be anticipated with an increase in the cut-off frequency of speech stimulus for participants of Group II and Group III compared to participants of Group I. Although there was a significant reduction in SIS as a function of the degree of hearing impairment for speech stimulus filtered at 1500 Hz, 2000 Hz, 2500 Hz, and 3000 Hz cut-off frequencies, there was no significant effect of degree of hearing impairment on SIS for speech stimulus filtered at 500 Hz and 1000 Hz cut-off frequencies.

Item

Telugu WordNet

(2010-01-01) Arulmozi, S.

This paper describes an attempt to develop a Telugu WordNet, particularly the construction of synsets in the Telugu language along the lines of Hindi synsets using the expansion approach. Based on the Hindi WordNet synsets, we assign Telugu synsets manually using the Offline Tool Interface. We share the challenges faced in constructing core synsets from Hindi into Telugu. A brief account of the Telugu language and its notable features is also provided.
