By
Haverinen, Katri; Kanerva, Jenna; Kohonen, Samuel; Missilä, Anna; Ojala, Stina; Viljanen, Timo; Laippala, Veronika; Ginter, Filip
2 Citations
We present the Finnish PropBank, a resource for semantic role labeling (SRL) of Finnish based on the Turku Dependency Treebank, whose syntax is annotated in the well-known Stanford Dependency (SD) scheme. The contribution of this paper consists of the lexicon of the verbs and their arguments present in the treebank, as well as the predicate-argument annotation of all verb occurrences in the treebank text. We demonstrate that the annotation is of high quality, that the SD scheme is highly compatible with PropBank annotation, and further that the additional dependencies present in the Turku Dependency Treebank are clearly beneficial for PropBank annotation. Further, we also use the PropBank to provide a strong baseline for automated Finnish SRL using a machine learning SRL system developed for the SemEval’14 shared task on broad-coverage semantic dependency parsing. The PropBank as well as the SRL system are available under a free license at http://bionlp.utu.fi/.
By
Maillette de Buy Wenniger, Gideon; Sima’an, Khalil
1 Citation
Long-range word order differences are a well-known problem for machine translation. Unlike the standard phrase-based models which work with sequential and local phrase reordering, the hierarchical phrase-based model (Hiero) embeds the reordering of phrases within pairs of lexicalized context-free rules. This allows the model to handle long-range reordering recursively. However, the Hiero grammar works with a single nonterminal label, which means that the rules are combined together into derivations independently and without reference to context outside the rules themselves. Follow-up work explored remedies involving nonterminal labels obtained from monolingual parsers and taggers. As of yet, no labeling mechanisms exist for the many languages for which there are no good quality parsers or taggers. In this paper we contribute a novel approach for acquiring reordering labels for Hiero grammars directly from the word-aligned parallel training corpus, without use of any taggers or parsers. The new labels represent types of alignment patterns in which a phrase pair is embedded within larger phrase pairs. In order to obtain alignment patterns that generalize well, we propose to decompose word alignments into trees over phrase pairs. Besides this labeling approach, we contribute coarse and sparse features for learning soft, weighted label-substitution as opposed to standard substitution. We report extensive experiments comparing our model to two baselines: Hiero and the known syntax-augmented machine translation (SAMT) variant, which labels Hiero rules with nonterminals extracted from monolingual syntactic parses. We also test a simplified labeling scheme based on inversion transduction grammar (ITG). For the Chinese–English task we obtain performance improvement up to 1 BLEU point, whereas for the German–English task, where morphology is an issue, a minor (but statistically significant) improvement of 0.2 BLEU points is reported over SAMT. While ITG labeling does give a performance improvement, it remains sometimes suboptimal relative to our proposed labeling scheme.
By
Eetemadi, Sauleh; Lewis, William; Toutanova, Kristina; Radha, Hayder
3 Citations
Statistical machine translation has seen significant improvements in quality over the past several years. The single biggest factor in this improvement has been the accumulation of ever larger stores of data. We now find ourselves, however, the victims of our own success, in that it has become increasingly difficult to train on such large sets of data, due to limitations in memory, processing power, and ultimately, speed (i.e. data-to-models takes an inordinate amount of time). Moreover, the training data has a wide quality spectrum. A variety of methods for data cleaning and data selection have been developed to address these issues. Each of these methods employs a search or filtering algorithm to select a subset of the data, given a defined set of feature functions. In this paper we provide a comparative overview of research in this area based on application scenario, feature functions and search method.
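The general pattern described in this abstract (score each item with a set of feature functions, then search or filter for a subset) can be illustrated with a minimal sketch. The feature `length_ratio_score`, the weighted-sum scoring and the top-`keep` cut-off are all illustrative assumptions, not a method taken from the surveyed work.

```python
def length_ratio_score(src, tgt):
    """Hypothetical feature: 1.0 for equal token counts, lower as lengths diverge."""
    a, b = len(src.split()), len(tgt.split())
    return min(a, b) / max(a, b) if max(a, b) else 0.0


def select_pairs(pairs, features, weights, keep):
    """Keep the `keep` highest-scoring sentence pairs.

    Each pair is scored by a weighted sum of feature functions; the
    "search" here is a plain sort, the simplest possible filter.
    """
    def score(pair):
        return sum(w * f(*pair) for f, w in zip(features, weights))

    return sorted(pairs, key=score, reverse=True)[:keep]


corpus = [("a b c", "x y z"), ("a", "x y z w v")]
selected = select_pairs(corpus, [length_ratio_score], [1.0], keep=1)
```

In this toy corpus the first pair has matching lengths and is the one retained; real systems would combine several features and a more careful search.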
By
Wiegand, Michael; Klakow, Dietrich
In this article, we explore the feasibility of extracting suitable and unsuitable food items for particular health conditions from natural language text. We refer to this task as conditional healthiness classification. For that purpose, we annotate a corpus extracted from forum entries of a food-related website. We identify different relation types that hold between food items and health conditions going beyond a binary distinction of suitability and unsuitability and devise various supervised classifiers using different types of features. We examine the impact of different task-specific resources, such as a healthiness lexicon that lists the healthiness status of a food item and a sentiment lexicon. Moreover, we also consider task-specific linguistic features that disambiguate a context in which mentions of a food item and a health condition co-occur and compare them with standard features using bag of words, part-of-speech information and syntactic parses. We also investigate to what extent individual food items and health conditions correlate with specific relation types and try to harness this information for classification.
By
Pretorius, Laurette; Viljoen, Biffie; Berg, Ansu; Pretorius, Rigardt
Tswana, a Bantu language in the Sotho group, is characterised by an agglutinative morphology and a disjunctive orthography, which mainly affects the verb category. In particular, verbal prefixes are usually written disjunctively, while suffixes follow a conjunctive writing style. Therefore, Tswana tokenisation cannot be based solely on whitespace, as is the case in many alphabetic, segmented languages, including the conjunctively written Nguni group of South African Bantu languages. This paper shows how two finite-state tokeniser transducers and a finite-state morphological analyser are combined to solve the Tswana (verb) tokenisation problem. The approach has the important advantage of bringing the processing of Tswana, beyond the morphological analysis level, in line with what is appropriate for the Nguni languages. This means that the challenge of the disjunctive orthography is met at the tokenisation/morphological analysis level and does not in principle propagate to subsequent levels of analysis such as POS tagging and shallow parsing. The tokenisation approach is novel and, when implemented and evaluated, yields an F_{1}-score of 95 % with respect to a hand-tokenised gold standard.
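As an illustration of how an F_{1}-score against a hand-tokenised gold standard can be computed, here is a minimal sketch. It assumes tokens are compared as character spans with whitespace ignored; the abstract does not state the matching criterion actually used, and the function names and toy strings are hypothetical.

```python
def token_spans(tokens):
    """Map a token list to a set of (start, end) character offsets."""
    spans, pos = set(), 0
    for tok in tokens:
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def tokenisation_f1(predicted, gold):
    """Span-level precision, recall and F1 of a tokenisation against gold."""
    pred_spans, gold_spans = token_spans(predicted), token_spans(gold)
    correct = len(pred_spans & gold_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Toy example: the system wrongly merges the last two gold tokens.
precision, recall, f1 = tokenisation_f1(["ab", "cde"], ["ab", "c", "de"])
```

Here one of two predicted tokens matches one of three gold tokens, so precision is 1/2 and recall is 1/3.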
By
Alegria, Iñaki; Aranberri, Nora; Comas, Pere R.; Fresno, Víctor; Gamallo, Pablo; Padró, Lluis; San Vicente, Iñaki; Turmo, Jordi; Zubiaga, Arkaitz
5 Citations
The language used in social media is often characterized by the abundance of informal and nonstandard writing. The normalization of this nonstandard language can be crucial to facilitate the subsequent textual processing and to consequently help boost the performance of natural language processing tools applied to social media text. In this paper we present a benchmark for lexical normalization of social media posts, specifically for tweets in Spanish. We describe the tweet normalization challenge we organized recently, analyze the performance achieved by the different systems submitted to the challenge, and delve into the characteristics of systems to identify the features that were useful. The organization of this challenge has led to the production of a benchmark for lexical normalization of social media, including an evaluation framework, as well as an annotated corpus of Spanish tweets, TweetNorm_es, which we make publicly available. The creation of this benchmark and the evaluation have brought to light the types of words that submitted systems did best with, and point to the main shortcomings to be addressed in future work.
By
Fleming, Noah; Kolokolova, Antonina; Nizamee, Renesa
We study the computational complexity of the Viterbi alignment and relaxed decoding problems for IBM model 3, focusing on the problem of finding a solution which has significant overlap with an optimal one. That is, an approximate solution is considered good if it looks like some optimal solution with a few mistakes, where mistakes can be wrong values (such as a word aligned incorrectly or a wrong word in decoding), as well as insertions and deletions (spurious/missing words in decoding). In this setting, we show that it is computationally hard to find a solution which is correct on more than half (plus an inverse polynomial fraction) of the words. More precisely, if there is a polynomial-time algorithm computing an alignment for IBM model 3 which agrees with some Viterbi alignment on $l/2+l^\epsilon$ words, where $l$ is the length of the English sentence, or producing a decoding with $l/2+l^\epsilon$ correct words, then $\mathrm{P}=\mathrm{NP}$. We also present a similar structure inapproximability result for phrase-based alignment. As these strong lower bounds are for the general definitions of the Viterbi alignment and decoding problems, we also consider, from a parameterized complexity perspective, which properties of the input make these problems intractable. As a first step in this direction, we show that Viterbi alignment has a fixed-parameter tractable algorithm with respect to limiting the range of words in the target sentence to which a source word can be aligned. We note that by comparison, limiting maximal fertility, even to three, does not affect the NP-hardness result.
By
Ferreira, Fernando; Ferreira, Gilda
3 Citations
It is known that there is a sound and faithful translation of the full intuitionistic propositional calculus into the atomic polymorphic system F_{at}, a predicative calculus with only two connectives: the conditional and the second-order universal quantifier. The faithfulness of the embedding was established quite recently via a model-theoretic argument based on Kripke structures. In this paper we present a purely proof-theoretic proof of faithfulness. As an application, we give a purely proof-theoretic proof of the disjunction property of the intuitionistic propositional logic in which commuting conversions are not needed.
By
Glenszczyk, Anna
We investigate properties of the monadic purely negational fragment of Intuitionistic Control Logic ($\mathsf{ICL}$). This logic arises from Intuitionistic Propositional Logic ($\mathsf{IPL}$) by extending the language of $\mathsf{IPL}$ with an additional constant for falsum. Having two different falsum constants makes it possible to define two forms of negation. We analyse implicational relations between negational monadic formulae and present a poset of non-equivalent formulae of this fragment of $\mathsf{ICL}$.
By
Zhang, Yan; Li, Kai
This paper presents two general results of decidability concerning logics based on an indeterministic metric tense logic, which can be applied to, among others, logics combining knowledge, time and agency. We provide a general Kripke semantics based on a variation of the notion of synchronized Ockhamist frames. Our proof of the decidability is by way of the finite frame property, applying subframe transformations and a variant of the filtration technique.
By
Bianchi, Matteo; Montagna, Franco
In 1950, B.A. Trakhtenbrot showed that the set of first-order tautologies associated to finite models is not recursively enumerable. In 1999, P. Hájek generalized this result to the first-order versions of Łukasiewicz, Gödel and Product logics, w.r.t. their standard algebras. In this paper we extend the analysis to the first-order versions of axiomatic extensions of MTL. Our main result is the following. Let $\mathbb{K}$ be a class of MTL-chains. Then the set of all first-order tautologies associated to the finite models over chains in $\mathbb{K}$, $\mathrm{fTAUT}_{\forall}^{\mathbb{K}}$, is $\Pi_{1}^{0}$-hard. Let $\mathrm{TAUT}_{\mathbb{K}}$ be the set of propositional tautologies of $\mathbb{K}$. If $\mathrm{TAUT}_{\mathbb{K}}$ is decidable, we have that $\mathrm{fTAUT}_{\forall}^{\mathbb{K}}$ is in $\Pi_{1}^{0}$. We have similar results also if we expand the language with the Δ operator.
By
Kuyper, Rutger
Kolmogorov introduced an informal calculus of problems in an attempt to provide a classical semantics for intuitionistic logic. This was later formalised by Medvedev and Muchnik as what has come to be called the Medvedev and Muchnik lattices. However, they only formalised this for propositional logic, while Kolmogorov also discussed the universal quantifier. We extend the work of Medvedev to first-order logic, using the notion of a first-order hyperdoctrine from categorical logic, to a structure which we will call the hyperdoctrine of mass problems. We study the intermediate logic that the hyperdoctrine of mass problems gives us, and we study the theories of subintervals of the hyperdoctrine of mass problems in an attempt to obtain an analogue of Skvortsova’s result that there is a factor of the Medvedev lattice characterising intuitionistic propositional logic. Finally, we consider Heyting arithmetic in the hyperdoctrine of mass problems and prove an analogue of Tennenbaum’s theorem on computable models of arithmetic.
By
Fujita, Kenetsu; Kashima, Ryo; Komori, Yuichi; Matsuda, Naosuke
The third author gave a natural deduction style proof system called the $\lambda\rho$-calculus for the implicational fragment of classical logic in (Komori, Tsukuba J Math 37:307–320, 2013). In (Matsuda, Intuitionistic fragment of the $\lambda\mu$-calculus, 2015, Postproceedings of the RIMS Workshop “Proof Theory, Computability Theory and Related Issues”, to appear), the fourth author gave a natural subsystem “intuitionistic $\lambda\rho$-calculus” of the $\lambda\rho$-calculus, and showed the system corresponds to intuitionistic logic. The proof is given with tree sequent calculus (Kripke models), but is complicated. In this paper, we introduce some reduction rules for the $\lambda\rho$-calculus, and give a simple and purely syntactical proof of the theorem by use of the reduction. In addition, we show that we can give a computation model with rich expressive power with our system.
By
Chlebowski, Szymon; Leszczyńska-Jasion, Dorota
5 Citations
An erotetic calculus for a given logic constitutes a sequent-style proof-theoretical formalization of the logic grounded in Inferential Erotetic Logic ($\mathsf{IEL}$). In this paper, a new erotetic calculus for Classical Propositional Logic ($\mathsf{CPL}$), dual with respect to the existing ones, is given. We modify the calculus to obtain complete proof systems for the propositional part of paraconsistent logic $\mathsf{CLuN}$ and its extensions $\mathsf{CLuNs}$ and $\mathsf{mbC}$. The method is based on dual resolution. Moreover, the resolution rule is non-clausal. To the authors’ knowledge, this is the first account of resolution for $\mathsf{mbC}$. Last but not least, as the method is grounded in $\mathsf{IEL}$, it constitutes an important tool for so-called question-processing.
By
Rivello, Edoardo
2 Citations
Revision sequences were introduced in 1982 by Herzberger and Gupta (independently) as a mathematical tool in formalising their respective theories of truth. Since then, revision has developed into a method of analysis of theoretical concepts with several applications in other areas of logic and philosophy. Revision sequences are usually formalised as ordinal-length sequences of objects of some sort. A common idea of the revision process is shared by all revision theories, but specific proposals can differ in the so-called limit rule, namely the way they handle the limit stages of the process. The limit rules proposed by Herzberger and by Belnap show different mathematical properties, called periodicity and reflexivity, respectively. In this paper we isolate a notion of cofinally dependent limit rule, encompassing both Herzberger’s and Belnap’s ones, to study periodicity and reflexivity in a common framework and to contrast them both from a philosophical and from a mathematical point of view. We establish the equivalence of weak versions of these properties with the revision-theoretic notion of recurring hypothesis and draw from this fact some observations about the problem of choosing the “right” limit rule when performing a revision-theoretic analysis.
By
Moorkens, Joss; O’Brien, Sharon; Silva, Igor A. L.; Lima Fonseca, Norma B.; Alves, Fabio
9 Citations
Human rating of predicted post-editing effort is a common activity and has been used to train confidence estimation models. However, the correlation between human ratings and actual post-editing effort is under-measured. Moreover, the impact of presenting effort indicators in a post-editing user interface on actual post-editing effort has hardly been researched. In this study, ratings of perceived post-editing effort are tested for correlations with actual temporal, technical and cognitive post-editing effort. In addition, the impact on post-editing effort of the presentation of post-editing effort indicators in the user interface is also tested. The language pair involved in this study is English-Brazilian Portuguese. Our findings, based on a small sample, suggest that there is little agreement between raters for predicted post-editing effort and that the correlations between actual post-editing effort and predicted effort are only moderate, and thus an inefficient basis for MT confidence estimation. Moreover, the presentation of post-editing effort indicators in the user interface appears not to impact on actual post-editing effort.
By
Yamamoto, Seiichi; Taguchi, Keiko; Ijuin, Koki; Umata, Ichiro; Nishida, Masafumi
4 Citations
To investigate the differences in communicative activities by the same interlocutors in Japanese (their L1) and in English (their L2), an 8-h multimodal corpus of multiparty conversations was collected. Three subjects participated in each conversational group, and they had conversations on free-flowing and goal-oriented topics in Japanese and in English. Their utterances, eye gazes, and gestures were recorded with microphones, eye trackers, and video cameras. The utterances and eye gazes were manually annotated. Their utterances were transcribed, and the transcriptions of each participant were aligned with those of the others along the time axis. Quantitative analyses were made to compare the communicative activities caused by the differences in conversational languages, the conversation types, and the levels of language expertise in L2. The results reveal different utterance characteristics and gaze patterns that reflect the differences in difficulty felt by the participants in each conversational condition. Both total and average durations of utterances were shorter in their L2 than in their L1 conversations. Differences in eye gazes were mainly found in those toward the information senders: Speakers were gazed at more in their second-language than in their native-language conversations. Our findings on the characteristics of conversations in the second language suggest possible directions for future research in psychology, cognitive science, and human–computer interaction technologies.
By
Anglberger, Albert J. J.; Lukic, Jonathan
This paper deals with the axiomatizability problem for the matrix-based logics RMQ^{−} and RMQ^{*}. We present a Hilbert-style axiom system for RMQ^{−}, and a quasi-axiomatization based on it for RMQ^{*}. We further compare these logics to different well-known modal logics, and assess their status as relevance logics.
By
Roberts, David Michael
The set-theoretic axiom WISC states that for every set there is a set of surjections to it cofinal in all such surjections. By constructing an unbounded topos over the category of sets and using an extension of the internal logic of a topos due to Shulman, we show that WISC is independent of the rest of the axioms of the set theory given by a well-pointed topos. This also gives an example of a topos that is not a predicative topos as defined by van den Berg.
By
Hansson, Sven Ove
5 Citations
A new equivalent presentation of AGM revision is introduced, in which a preference-based choice function directly selects one among the potential outcomes of the operation. This model differs from the usual presentations of AGM revision in which the choice function instead delivers a collection of sets whose intersection is the outcome. The new presentation confirms the versatility of AGM revision, but it also lends credibility to the more general model of direct choice among outcomes (descriptor revision) of which AGM revision is shown here to be a special case.
By
Aucher, Guillaume
2 Citations
In epistemic logic, some axioms dealing with the notion of knowledge are rather convoluted and difficult to interpret intuitively, even though some of them, such as the axioms .2 and .3, are considered to be key axioms by some epistemic logicians. We show that they can be characterized in terms of understandable interaction axioms relating knowledge and belief or knowledge and conditional belief. In order to show it, we first sketch a theory dealing with the characterization of axioms in terms of interaction axioms in modal logic. We then apply the main results and methods of this theory to obtain specific results related to epistemic and doxastic logics.
By
Kikot, Stanislav
1 Citation
In this paper we consider the normal modal logics of elementary classes defined by first-order formulas of the form $\forall x_0 \exists x_1 \dots \exists x_n \bigwedge x_i R_\lambda x_j$. We prove that many properties of these logics, such as finite axiomatisability, elementarity, axiomatisability by a set of canonical formulas or by a single generalised Sahlqvist formula, together with modal definability of the initial formula, either simultaneously hold or simultaneously do not hold.
By
Omori, Hitoshi; Sano, Katsuhiko
7 Citations
One of the problems we face in many-valued logic is the difficulty of capturing the intuitive meaning of the connectives introduced through truth tables. At the same time, however, some logics have nice ways to capture the intended meaning of connectives easily, such as four-valued logic studied by Belnap and Dunn. Inspired by Dunn’s discovery, we first describe a mechanical procedure, in expansions of Belnap-Dunn logic, to obtain truth conditions in terms of the behavior of the Truth and the False, which gives us intuitive readings of connectives, out of truth tables. Then, we revisit the notion of functional completeness, which is one of the key notions in many-valued logic, in view of Dunn’s idea. More concretely, we introduce a generalized notion of functional completeness which naturally arises in the spirit of Dunn’s idea, and prove some fundamental results corresponding to the classical results proved by Post and Słupecki.
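For readers unfamiliar with the Belnap-Dunn setting, the following minimal sketch shows the standard reading in which each of the four values records whether a sentence is told true and whether it is told false, and the basic connectives are given by truth and falsity conditions. It runs in the opposite direction to the procedure of the abstract (from conditions to tables, not from tables to conditions), and the names `VALUES`, `neg`, `conj`, `disj` and `truth_table` are illustrative choices, not the authors' notation.

```python
from itertools import product

# Each value is a pair (told_true, told_false).
VALUES = {
    "T": (True, False),   # true only
    "F": (False, True),   # false only
    "B": (True, True),    # both
    "N": (False, False),  # neither
}
NAMES = {pair: name for name, pair in VALUES.items()}


def neg(a):
    """not-A is true iff A is false, and false iff A is true."""
    t, f = VALUES[a]
    return NAMES[(f, t)]


def conj(a, b):
    """A-and-B is true iff both are true; false iff at least one is false."""
    (ta, fa), (tb, fb) = VALUES[a], VALUES[b]
    return NAMES[(ta and tb, fa or fb)]


def disj(a, b):
    """A-or-B is true iff at least one is true; false iff both are false."""
    (ta, fa), (tb, fb) = VALUES[a], VALUES[b]
    return NAMES[(ta or tb, fa and fb)]


def truth_table(op):
    """Full 4x4 table of a binary connective, keyed by argument names."""
    return {(a, b): op(a, b) for a, b in product(VALUES, repeat=2)}
```

For instance, `conj("B", "N")` evaluates to `"F"` and `disj("B", "N")` to `"T"`, the familiar behaviour of the two intermediate values.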
By
Verdée, Peter; Bal, Inge De
In this paper we present a logic that determines when implications in a classical logic context express a relevant connection between antecedent and consequent. In contrast with logics in the relevance logic literature, we leave classical negation intact—in the sense that the law of noncontradiction can be used to obtain relevant implications, as long as there is a connection between antecedent and consequent. On the other hand, we give up the requirement that our theory of relevance should be able to define a new standard of deduction. We present and argue for a list of requirements such a logical theory of classical relevance needs to meet and go on to formulate a system that respects each of these requirements. The presented system is a Tarski (i.e. monotonic, reflexive and transitive) logic that extends the relevance logic R with a new relevant implication which allows for Disjunctive Syllogism and similar rules. This is achieved by interpreting the logical symbols in the antecedents in a stronger way than the logical symbols in consequents. A proof theory and an algebraic semantics are formulated and interesting metatheorems (soundness, completeness and the fact that it satisfies the requirements for classical relevance) are proven. Finally we give a philosophical motivation for our nonstandard relevant implication and the asymmetric interpretation of antecedents and consequents.
By
Schippers, Michael
6 Citations
One of the integral parts of Bayesian coherentism is the view that the relation of ‘being no less coherent than’ is fully determined by the probabilistic features of the sets of propositions to be ordered. In the last one and a half decades, a variety of probabilistic measures of coherence have been put forward. However, there is large disagreement as to which of these measures best captures the pretheoretic notion of coherence. This paper contributes to the debate on coherence measures by considering three classes of adequacy constraints. Various independence and dependence relations between the members of each class will be taken into account in order to reveal the ‘grammar’ of probabilistic coherence measures. Afterwards, existing proposals are examined with respect to this list of desiderata. Given that for purely mathematical reasons there can be no measure that satisfies all constraints, the grammar allows the coherentist to articulate an informed pluralist stance as regards probabilistic measures of coherence.
By
Gibert, Guillaume; Olsen, Kirk N.; Leung, Yvonne; Stevens, Catherine J.
1 Citation
Background
Virtual humans have become part of our everyday life (movies, internet, and computer games). Even though they are becoming more and more realistic, their speech capabilities are, most of the time, limited and not coherent and/or not synchronous with the corresponding acoustic signal.
Methods
We describe a method to convert a virtual human avatar (animated through key frames and interpolation) into a more naturalistic talking head. In fact, speech articulation cannot be accurately replicated using interpolation between key frames and talking heads with good speech capabilities are derived from real speech production data. Motion capture data are commonly used to provide accurate facial motion for visible speech articulators (jaw and lips) synchronous with acoustics. To access tongue trajectories (partially occluded speech articulator), electromagnetic articulography (EMA) is often used. We recorded a large database of phonetically balanced English sentences with synchronous EMA, motion capture data, and acoustics. An articulatory model was computed on this database to recover missing data and to provide ‘normalized’ animation (i.e., articulatory) parameters. In addition, semiautomatic segmentation was performed on the acoustic stream. A dictionary of multimodal Australian English diphones was created. It is composed of the variation of the articulatory parameters between all the successive stable allophones.
Results
The avatar’s facial key frames were converted into articulatory parameters steering its speech articulators (jaw, lips and tongue). The speech production database was used to drive the Embodied Conversational Agent (ECA) and to enhance its speech capabilities. A Text-To-Auditory Visual Speech synthesizer was created based on the MaryTTS software and on the diphone dictionary derived from the speech production database.
Conclusions
We describe a method to transform an ECA with generic tongue model and animation by key frames into a talking head that displays naturalistic tongue, jaw and lip motions. Thanks to a multimodal speech production database, a Text-To-Auditory Visual Speech synthesizer drives the ECA’s facial movements enhancing its speech capabilities.
By
El-Haj, Mahmoud; Kruschwitz, Udo; Fox, Chris
7 Citations
Language resources are important for those working on computational methods to analyse and study languages. These resources are needed to help advance research in fields such as natural language processing, machine learning, information retrieval and text analysis in general. We describe the creation of useful resources for languages that currently lack them, taking resources for Arabic summarisation as a case study. We illustrate three different paradigms for creating language resources, namely: (1) using crowdsourcing to produce a small resource rapidly and relatively cheaply; (2) translating an existing gold-standard dataset, which is relatively easy but potentially of lower quality; and (3) using manual effort with appropriately skilled human participants to create a resource that is more expensive but of high quality. The last of these was used as a test collection for TAC2011. An evaluation of the resources is also presented.
By
Kauter, Marjan; Desmet, Bart; Hoste, Véronique
4 Citations
We present a fine-grained scheme for the annotation of polar sentiment in text that accounts for explicit sentiment (so-called private states), as well as implicit expressions of sentiment (polar facts). Polar expressions are annotated below sentence level and classified according to their subjectivity status. Additionally, they are linked to one or more targets with a specific polar orientation and intensity. Other components of the annotation scheme include source attribution and the identification and classification of expressions that modify polarity. In previous research, little attention has been given to implicit sentiment, which represents a substantial amount of the polar expressions encountered in our data. An English and Dutch corpus of financial newswire text, consisting of over 45,000 words each, was annotated using our scheme. A subset of this corpus was used to conduct an inter-annotator agreement study, which demonstrated that the proposed scheme can be used to reliably annotate explicit and implicit sentiment in real-world textual data, making the created corpora a useful resource for sentiment analysis.
By
Lee, John; Yeung, Chak Yan; Zeldes, Amir; Reznicek, Marc; Lüdeling, Anke; Webster, Jonathan
1 Citation
Learner corpora consist of texts produced by nonnative speakers. In addition to these texts, some learner corpora also contain error annotations, which can reveal common errors made by language learners, and provide training material for automatic error correction. We present a novel type of error-annotated learner corpus containing sequences of revised essay drafts written by nonnative speakers of English. Sentences in these drafts are annotated with comments by language tutors, and are aligned to sentences in subsequent drafts. We describe the compilation process of our corpus, present its encoding in TEI XML, and report agreement levels on the error annotations. Further, we demonstrate the potential of the corpus to facilitate research on textual revision in L2 writing, by conducting a case study on verb tenses using ANNIS, a corpus search and visualization platform.
By
Doukhan, David; Rosset, Sophie; Rilliard, Albert; d’Alessandro, Christophe; Adda-Decker, Martine
A corpus of French tales is presented. Its two parts, a text corpus and a speech corpus, were designed for studying the relationships between the textual structures of tales and speech prosody, with the targeted application of an expressive text-to-speech synthesis system embedded in a humanoid robot. The 89-tale text corpus and the 12-tale speech corpus were annotated using a common tale description framework. Lexical level annotations include extended definitions of enumerations, time, place and person named entities, as well as part of speech tags. Supralexical level annotations include the segmentation of tales into a sequence of episodes, the localization and attribution of direct quotations, together with tale protagonists’ coreferences. Annotation distributions and inter-annotator agreement were analyzed. The largest coverage and strongest agreement were observed for person named entities, characters’ direct quotations, and their associated coreference chains. Speech corpus annotations were extended to allow the analysis of the relations between tale linguistic information and prosodic properties observed in associated speech. Word and phoneme boundaries were inferred through semiautomatic procedures, resulting in linguistic annotations aligned with the speech signal. Intonation stylization models were used to ease the visual and statistical analysis of the tales’ prosody. Additional metainformation is provided with the speech corpus, allowing tale characters to be described according to their gender, age, size, valence and kind. The corpora described in this article are publicly available through the European Language Resources Association catalog.
By
Nguyen, Phuong-Thai; Le, Anh-Cuong; Ho, Tu-Bao; Nguyen, Van-Hiep
2 Citations
Treebanks, especially the Penn treebank for natural language processing (NLP) in English, play an essential role in both research into and the application of NLP. However, many languages still lack treebanks, and building a treebank can be very complicated and difficult. This work has a twofold objective. The first is to share our results in constructing a large Vietnamese treebank (VTB) with three levels of annotation: word segmentation, part-of-speech tagging, and syntactic analysis. Major steps in the treebank construction process are described with particular regard to specific Vietnamese properties such as the lack of word delimiters and isolation. Those properties make sentences highly syntactically ambiguous, and therefore it is difficult to ensure a high level of agreement among annotators. Various studies of Vietnamese syntax were employed not only to define annotations but also to systematically deal with ambiguities. Annotators were supported by automatic labelling tools, which are based on statistical machine learning methods, for sentence preprocessing, and by a tree editor for manual annotation. As a result, an annotation agreement of around 90 % was achieved. Our second objective is to present our method for automatically finding errors and inconsistencies in treebank corpora and its application to the construction of the VTB. This method employs the Shannon entropy measure on the premise that the more the entropy is reduced, the more errors in the treebank have been corrected. The method ranks error candidates by using a scoring function based on conditional entropy. Our experiments showed that this method detected high-error-density subsets of the original error candidate sets, and that the corpus entropy was significantly reduced after error correction. The size of these subsets was only about one third of the whole set, while these subsets contained 80–90 % of the total errors. This method can also be applied to languages similar to Vietnamese.
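The abstract does not spell out the scoring function, so the following is only a minimal sketch of the underlying idea, assuming annotations are reduced to (word, tag) pairs: it computes the conditional entropy H(tag | word) and shows that correcting an inconsistent label lowers it. The function name `conditional_entropy`, the tag set and the toy data are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter, defaultdict

def conditional_entropy(pairs):
    """Return H(Y | X) in bits for a list of (x, y) observations."""
    total = len(pairs)
    labels_by_x = defaultdict(Counter)
    for x, y in pairs:
        labels_by_x[x][y] += 1
    entropy = 0.0
    for label_counts in labels_by_x.values():
        n_x = sum(label_counts.values())
        for count in label_counts.values():
            # p(x, y) * log2 p(y | x), accumulated with a minus sign
            entropy -= (count / total) * math.log2(count / n_x)
    return entropy

# Hypothetical toy annotations: one occurrence of the first word is mis-tagged.
noisy = [("chó", "N"), ("chó", "N"), ("chó", "V"), ("chạy", "V"), ("chạy", "V")]
# The same data after the inconsistent tag has been corrected.
fixed = [("chó", "N"), ("chó", "N"), ("chó", "N"), ("chạy", "V"), ("chạy", "V")]

entropy_before = conditional_entropy(noisy)
entropy_after = conditional_entropy(fixed)
```

In this toy case the corrected corpus has zero conditional entropy, because every word then carries a single tag. A ranking function in this spirit could order candidate corrections by how much each one reduces the entropy.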
By
Fišer, Darja; Sagot, Benoît
2 Citations
In this paper we present a language-independent, fully modular and automatic approach to bootstrap a wordnet for a new language by recycling different types of already existing language resources, such as machine-readable dictionaries, parallel corpora, and Wikipedia. The approach, which we apply here to Slovene, takes into account monosemous and polysemous words, general and specialised vocabulary as well as simple and multiword lexemes. The extracted words are then assigned one or several synset ids, based on a classifier that relies on several features including distributional similarity. Finally, we identify and remove highly dubious (literal, synset) pairs, based on simple distributional information extracted from a large corpus in an unsupervised way. Automatic, manual and task-based evaluations show that the resulting resource, the latest version of the Slovene wordnet, is already a valuable source of lexico-semantic information.
By
Erjavec, Tomaž
2 Citations
The paper describes the combined results of several projects which constitute a basic language resource infrastructure for printed historical Slovene. The IMP language resources consist of a digital library, an annotated corpus and a lexicon, which are interlinked and uniformly encoded following the Text Encoding Initiative Guidelines. The library holds about 650 units (mostly complete books) consisting of facsimiles with 45,000 pages as well as hand-corrected and structured transcriptions. The hand-annotated corpus has 300,000 tokens, where each word is tagged with its modernised word form, lemma, part-of-speech and, in cases of archaic words, its nearest contemporary equivalents. This information was extracted into the lexicon, which also covers an extended target-annotated corpus, resulting in 20,000 lemmas (of these 4,000 archaic) with 50,000 modern word forms and 70,000 attested forms. We have also developed a program to modernise, tag and lemmatise historical Slovene, and annotated the digital library with it, producing an automatically annotated corpus of 15 million words. To serve the humanities, the digital library and lexicon are available for reading and browsing on the web and the corpora via a concordancer. For language technology research and development the resources are available in source TEI XML under the Creative Commons Attribution licence. The paper presents the IMP resources, available from http://nl.ijs.si/imp/, the process of their compilation, encoding and dissemination, and concludes with directions for future research.
By
Lyu, Dau-Cheng; Tan, Tien-Ping; Chng, Eng-Siong; Li, Haizhou
5 Citations
This paper introduces the South East Asia Mandarin–English corpus, a 63-hour spontaneous Mandarin–English code-switching transcribed speech corpus suitable for LVCSR and language change detection/identification research. The corpus was recorded under unscripted interview and conversational settings from 157 Singaporean and Malaysian speakers who spoke a mixture of Mandarin and English within a single sentence. About 82 % of the transcribed utterances are intrasentential code-switching speech, and the corpus will be released by LDC in 2015. This paper presents an analysis of the code-switching statistics of the corpus, such as the duration of monolingual segments and the frequency of language turns in code-switch utterances. We also summarize the development effort, including details such as the processing time for transcription, validation and language boundary labelling. Lastly, we present textual analyses of code-switch segments, examining the word length of monolingual segments in code-switch utterances and the most common single-word and two-word phrases of such segments.
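The statistics mentioned above (word length of monolingual segments, number of language turns) can be illustrated with a small sketch. It assumes, hypothetically, that each token of an utterance carries a language tag such as "zh" or "en"; the corpus's actual transcription format is not described in the abstract, and the function names are illustrative.

```python
from itertools import groupby

def monolingual_segments(tagged_tokens):
    """Collapse (token, language) pairs into (language, word_count) runs."""
    return [
        (language, sum(1 for _ in run))
        for language, run in groupby(tagged_tokens, key=lambda pair: pair[1])
    ]

def language_turns(tagged_tokens):
    """Count the switches between languages inside one utterance."""
    return max(len(monolingual_segments(tagged_tokens)) - 1, 0)

# Hypothetical code-switch utterance with per-token language tags.
utterance = [
    ("我", "zh"), ("觉得", "zh"),
    ("this", "en"), ("one", "en"),
    ("很", "zh"), ("好", "zh"),
]

segments = monolingual_segments(utterance)
turns = language_turns(utterance)
```

Here `segments` holds three monolingual runs of two words each and `turns` is 2. Aggregating such runs over all utterances would give the segment-length and turn-frequency distributions.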
By
Vázquez, Glòria; Fernández-Montraveta, Ana
In this paper we present the annotation scheme of constructions at the argument-structure level in the Spanish and Catalan Corpora SenSem. Constructions are accounted for as form-meaning pairs following the theoretical underpinning of Construction Grammar. Regarding meaning, we propose a hierarchy of constructions taking into account, at the highest level, the prominence of the logical subject in the sentence. Thus, we differentiate between topicalized and detopicalized sentences, which is an innovative proposal to solve some terminological issues related to pronominal constructions in Spanish. We further develop this classification taking into account the semantic relation of the logical subject with the verb and its coindexation, if any, with other participants. As regards form, the basic features we consider are syntagmatic categories and syntactic functions. Furthermore, we annotate the form the verb requires, that is, whether it requires a pronoun in order to convey a particular meaning. Other relevant contributions are the annotation of some linguistic phenomena not taken into account in other similar resources, such as reciprocal, dative or impersonal constructions. Finally, we present the frequencies of all these constructions in Spanish.
By
Al-Thubaity, Abdulmohsen O.
11 Citations
Compared with English, Arabic is a poorly-resourced language within the field of corpus linguistics. A lack of sufficient data and research has negatively affected Arabic corpus-based researchers and natural language processing practitioners. Although a number of Arabic corpora have been developed in recent years, the overall situation has improved little. The aim of this paper is twofold. First, it reviews 14 Arabic corpora categorized by their designated purpose, target language, mode of text, size, text date, location, text type/medium, text domain, representativeness, and balance. The review also describes the availability of the reviewed corpora, the presence of tokenization, lemmatization and tagging, and whether there are any tools available to search and explore them. Second, it introduces the King Abdulaziz City for Science and Technology (KACST) Arabic corpus, which was designed and created to overcome the limitations of existing Arabic corpora. The KACST Arabic corpus is a large and diverse Arabic corpus with clearly defined design criteria. It is carefully sampled, and its contents are classified based on time, region, medium, domain, and topic, and it can be searched and explored using these classifications. The KACST Arabic corpus comprises more than 700 million words from the pre-Islamic era to the present day (a period covering more than 1,500 years), collected from 10 diverse mediums. Each text has been further classified more specifically into domains and topics. The KACST Arabic corpus is freely available to explore on the Internet (http://www.kacstac.org.sa) using a variety of tools.
By
Weiss, Benjamin; Wechsung, Ina; Kühnel, Christine; Möller, Sebastian
3 Citations
Based on cross-disciplinary approaches to Embodied Conversational Agents, evaluation methods for such human-computer interfaces are structured and presented. An introductory systematisation of evaluation topics from a conversational perspective is followed by an explanation of social-psychological phenomena studied in interaction with Embodied Conversational Agents, and how these can be used for evaluation purposes. Major evaluation concepts and appropriate assessment instruments (established and new ones) are presented, including questionnaires, annotations and log files. An exemplary evaluation and guidelines provide hands-on information on planning and preparing such endeavours.
By
Wang, Xue-Ping; Wang, Lei-Bo
2 Citations
In this note, it is shown that the set of kernel ideals of a K_{n,0}-algebra L is a complete Heyting algebra; the largest congruence on L having a given kernel ideal as a congruence class is derived; and, finally, necessary and sufficient conditions for such a congruence to be pro-boolean are given.
By
Takemura, Ryo
3 Citations
One of the traditional applications of Euler diagrams is as a representation or counterpart of the usual set-theoretical models of given sentences. However, Euler diagrams have recently been investigated as the counterparts of logical formulas, which constitute formal proofs. Euler diagrams are rigorously defined as syntactic objects, and their inference systems, which are equivalent to some symbolic logical systems, are formalized. Based on this observation, we investigate both counter-model construction and proof construction in the framework of Euler diagrams. We introduce the notion of “counter-diagrammatic proof”, which shows the invalidity of a given inference, and which is defined as a syntactic manipulation of diagrams of the same sort as inference rules to construct proofs. Thus, in our Euler diagrammatic framework, the completeness theorem can be formalized in terms of the existence of a diagrammatic proof or a counter-diagrammatic proof.
By
Carrara, Massimiliano; Martino, Enrico
1 Citations
In “Mathematics is megethology” (Lewis, Philos Math 1:3–23, 1993) Lewis reconstructs set theory combining mereology with plural quantification. He introduces megethology, a powerful framework in which one can formulate strong assumptions about the size of the universe of individuals. Within this framework, Lewis develops a structuralist class theory, in which the role of classes is played by individuals. Thus, if mereology and plural quantification are ontologically innocent, as Lewis maintains, he achieves an ontological reduction of classes to individuals. Lewis’s work is very attractive. However, the alleged innocence of mereology and plural quantification is highly controversial and has been criticized by several authors. In the present paper we propose a new approach to megethology based on the theory of plural reference developed in “To be is to be the object of a possible act of choice” (Carrara, Stud Log 96: 289–313, 2010). Our approach shows how megethology can be grounded on plural reference without the help of mereology.
By
Celani, Sergio A.
In this paper we shall discuss properties of saturation in monotonic neighbourhood models and study some applications, like a characterization of compact and modally saturated monotonic models and a characterization of the maximal Hennessy-Milner classes. We shall also show that our notion of modal saturation for monotonic models naturally extends the notion of modal saturation for Kripke models.
By
Cocchiarella, Nino B.
1 Citations
There are different views of the logic of plurals that are now in circulation, two of which we will compare in this paper. One of these is based on a two-place relation of being among, as in ‘Peter is among the juveniles arrested’. This approach seems to be the one that is discussed the most in philosophical journals today. The other is based on Bertrand Russell’s early notion of a class as many, by which is meant not a class as one, i.e., as a single entity, but merely a plurality of things. It was this notion that Russell used to explain plurals in his 1903 Principles of Mathematics; and it was this notion that I was able to develop as a consistent system that contains not only a logic of plurals but also a logic of mass nouns. We compare these two logics here and then show that the logic of the Among relation is reducible to the logic of classes as many.
By
Bergfeld, Jort M.; Kishida, Kohei; Sack, Joshua; Zhong, Shengyang
3 Citations
In this paper we show a duality between two approaches to represent quantum structures abstractly and to model the logic and dynamics therein. One approach puts forward a “quantum dynamic frame” (Baltag et al. in Int J Theor Phys, 44(12):2267–2282, 2005), a labelled transition system whose transition relations are intended to represent projections and unitaries on a (generalized) Hilbert space. The other approach considers a “Piron lattice” (Piron in Foundations of Quantum Physics, 1976), which characterizes the algebra of closed linear subspaces of a (generalized) Hilbert space. We define categories of these two sorts of structures and show a duality between them. This result establishes, in one direction of the duality, that quantum dynamic frames represent quantum structures correctly; in the other direction, it gives rise to a representation of dynamics on a Piron lattice.
By
Dubuc, Eduardo J.; Poveda, Y. A.
2 Citations
In “A new proof of the completeness of the Lukasiewicz axioms” (Trans Am Math Soc 88, 1959) Chang proved that any totally ordered MV-algebra $A$ is isomorphic to the segment $A \cong \Gamma(A^{*}, u)$ of a totally ordered l-group with strong unit $A^{*}$. This was done by the simple intuitive idea of putting denumerable copies of $A$ on top of each other (indexed by the integers). Moreover, he also showed that any such group $G$ can be recovered from its segment, since $G \cong \Gamma(G, u)^{*}$, establishing an equivalence of categories. In “Interpretation of AF $C^{*}$-algebras in Lukasiewicz sentential calculus” (J Funct Anal 65, 1986) Mundici extended this result to arbitrary MV-algebras and l-groups with strong unit. He takes the representation of $A$ as a subdirect product of chains $A_i$, and observes that $A \hookrightarrow \prod_i G_i$, where $G_i = A_i^{*}$. Then he lets $A^{*}$ be the l-subgroup generated by $A$ inside $\prod_i G_i$. He proves that this idea works, and establishes an equivalence of categories in a rather elaborate way by means of his concept of good sequences and its complicated arithmetics. In this note, essentially self-contained except for Chang’s result, we give a simple proof of this equivalence taking advantage directly of the arithmetics of the product l-group $\prod_i G_i$, avoiding entirely the notion of good sequence.
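As a small illustration of the segment construction, the sketch below uses the standard reading of Γ(G, u) as the interval [0, u] with truncated addition x ⊕ y = min(u, x + y) and negation ¬x = u − x. It takes G to be the ordered group of integers, an assumption chosen only to keep the example finite; the class name `GammaSegment` is likewise illustrative. It then checks the characteristic MV-algebra identity ¬(¬x ⊕ y) ⊕ y = ¬(¬y ⊕ x) ⊕ x by brute force.

```python
class GammaSegment:
    """Interval [0, u] of the ordered group (Z, +), viewed as an MV-algebra."""

    def __init__(self, unit):
        if unit < 1:
            raise ValueError("the strong unit must be a positive integer")
        self.unit = unit
        self.elements = range(unit + 1)

    def oplus(self, x, y):
        """Truncated addition: the group sum, capped at the unit."""
        return min(self.unit, x + y)

    def neg(self, x):
        """MV-negation: the difference between the unit and x."""
        return self.unit - x

def satisfies_mv_identity(segment):
    """Check not(not x + y) + y == not(not y + x) + x for all x, y."""
    return all(
        segment.oplus(segment.neg(segment.oplus(segment.neg(x), y)), y)
        == segment.oplus(segment.neg(segment.oplus(segment.neg(y), x)), x)
        for x in segment.elements
        for y in segment.elements
    )

chain = GammaSegment(5)
identity_holds = satisfies_mv_identity(chain)
```

With unit 5 this yields the six-element chain, and both sides of the identity evaluate to max(x, y).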
By
Cintula, Petr; Noguera, Carles
3 Citations
Transfer theorems are central results in abstract algebraic logic that allow one to generalize properties of the lattice of theories of a logic to any algebraic model and its lattice of filters. Their proofs sometimes require the existence of a natural extension of the logic to a bigger set of variables. Constructions of such extensions have been proposed in particular settings in the literature. In this paper we show that these constructions need not always work and propose a wider setting (including all finitary logics and those with countable language) in which they can still be used.
By
De, Michael; Omori, Hitoshi
10 Citations
We investigate the notion of classical negation from a nonclassical perspective. In particular, one aim is to determine what classical negation amounts to in a paracomplete and paraconsistent four-valued setting. We first give a general semantic characterization of classical negation and then consider an axiomatic expansion BD+ of four-valued Belnap–Dunn logic by classical negation. We show the expansion complete and maximal. Finally, we compare BD+ to some related systems found in the literature, specifically a four-valued modal logic of Béziau and the logic of classical implication and a paraconsistent de Morgan negation of Zaitsev.
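The abstract does not state the clauses for classical negation, so the sketch below adopts one candidate purely for illustration: Boolean complementation on the four Belnap–Dunn values, each encoded as a pair (told true, told false). It contrasts this with the usual de Morgan negation, which swaps the two components. The encoding and all function names are assumptions of this example, not the paper's definitions.

```python
# Four values as (told_true, told_false) pairs: True, False, Both, Neither.
VALUES = {
    "T": (True, False),
    "F": (False, True),
    "B": (True, True),
    "N": (False, False),
}

def de_morgan_neg(a):
    """Swap truth and falsity; B and N are fixed points."""
    return (a[1], a[0])

def boolean_neg(a):
    """Complement both components; swaps T with F and B with N."""
    return (not a[0], not a[1])

def conj(a, b):
    return (a[0] and b[0], a[1] or b[1])

def disj(a, b):
    return (a[0] or b[0], a[1] and b[1])

def designated(a):
    """A value is designated when it is at least told true (T or B)."""
    return a[0]

# Excluded middle holds for the Boolean negation but fails at N otherwise.
lem_boolean = all(designated(disj(a, boolean_neg(a))) for a in VALUES.values())
lem_de_morgan = all(designated(disj(a, de_morgan_neg(a))) for a in VALUES.values())

# A contradiction is never designated with the Boolean negation,
# whereas with de Morgan negation it is designated at B.
contradiction_boolean = any(designated(conj(a, boolean_neg(a))) for a in VALUES.values())
contradiction_de_morgan = any(designated(conj(a, de_morgan_neg(a))) for a in VALUES.values())
```

Under this reading the complemented negation behaves classically (excluded middle always designated, contradictions never designated), while the de Morgan negation is both paracomplete and paraconsistent.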
By
Cornejo, Juan Manuel
2 Citations
In this paper we introduce a logic that we name semi-Heyting–Brouwer logic, $\mathcal{SHB}$, in such a way that the variety of double semi-Heyting algebras is its algebraic counterpart. We prove that, up to equivalences by translations, the Heyting–Brouwer logic $\mathcal{HB}$ is an axiomatic extension of $\mathcal{SHB}$ and that the propositional calculi of intuitionistic logic $\mathcal{I}$ and semi-intuitionistic logic $\mathcal{SI}$ turn out to be fragments of $\mathcal{SHB}$.
By
Lindh-Knuutila, Tiina; Honkela, Timo
1 Citations
Background
In this article, automatically generated and manually crafted semantic representations are compared. The comparison takes place under the assumption that neither of these has a primary status over the other. While linguistic resources can be used to evaluate the results of automated processes, data-driven methods are useful in assessing the quality or improving the coverage of hand-created semantic resources.
Methods
We apply two unsupervised learning methods, Independent Component Analysis (ICA) and a word-level probabilistic topic model using Latent Dirichlet Allocation (LDA), to create semantic representations from a large text corpus. We further compare the obtained results to two semantically labeled dictionaries. In addition, we use the Self-Organizing Map to visualize the obtained representations.
Results
We show that both methods find a considerable amount of category information in an unsupervised way. Rather than only finding groups of similar words, they can automatically find a number of features that characterize words. The unsupervised methods are also used in exploration. They provide findings which go beyond the manually predefined label sets. In addition, we demonstrate how the Self-Organizing Map visualization can be used in exploration and further analysis.
Conclusion
This article compares unsupervised learning methods and semantically labeled dictionaries. We show that these methods are able to find categorical information. In addition, they can further be used in an exploratory analysis. In general, information-theoretically motivated and probabilistic methods provide results that are at a comparable level. Moreover, the automatic methods and human classifications give access to semantic categorizations that complement each other. Data-driven methods can furthermore be cost-effective and adapt to a particular domain through appropriate choice of data sets.
By
Powers, David M W
2 Citations
Understanding how people tick is an endeavour that has challenged us for millennia, both in informal settings and in increasingly formalized and scientific disciplines. Some are interested in the biology and others the behaviour. Some are focussed on language and others on culture or emotion. Some approach this as pure science worthy of understanding in its own right. Some approach it as applied science with value born of practical applications that improve our lifestyle and characterize our modern technological society.
Cognitive Science was born as Psychologists and Linguists found that the assumptions and predictions of their theories and models were spilling outside their discipline, as Neuroscience provided a neural substrate for models of perception and cognition that obviated the need for the postulation of hypothetical daemons, as Computer Science and Mathematics provided computational tools for testing models and theories that couldn't be tested empirically in the real world. Artificial Intelligence was also born in such an environment, with the early researchers exploring models of intelligence as much as developing intelligent programs. But Computational Intelligence and Cognitive Science increasingly reflect a contrast between an applied aim of building intelligent applications and entities, and a pure aim of understanding existing intelligent entities and functions, and we seem to lack a bridge between them.
Computational Cognitive Science aims to provide this bridge.
By
Peterson, James K
2 Citations
We present an introduction to the modeling of networks of nodes which parse the information presented to them into an output. One example is a set of excitable neurons collected into the nervous system of an animal, whether invertebrate or vertebrate. We will focus on the development of the ideas and tools that might help us understand how to build a model of such a system, being careful to explain the many approximations or model errors we make along the way. We start with a discussion of low-level biophysical concepts such as the cable equation and the Hodgkin-Huxley model and end with graph-based models of computation. We also include motivational arguments that show hardware and software issues in neural models are intertwined.
By
Peterson, James K
3 Citations
Background
We are interested in an asynchronous graph-based model, $\mathcal{G}(N,E)$, of cognition or cognitive dysfunction, where the nodes $N$ provide computation at the neuron level and the edges $E_{i \to j}$ between nodes $N_i$ and $N_j$ specify internode calculation.
Methods
We discuss how to improve update and evaluation needs for fast calculation using approximations of neural processing for first and second messenger systems as well as the axonal pulse of a neuron.
Results
These approximations give rise to a low memory footprint profile for implementation on multicore platforms using functional programming languages such as Erlang, Clojure and Haskell when we have no shared memory and all states are immutable.
Conclusions
The implementation of cognitive models using these tools on such platforms will allow the possibility of fully realizable lesion and longitudinal studies.
By
Komosinski, Maciej; Kups, Adam
3 Citations
Background
This work introduces a computational model of the human temporal discrimination mechanism: the Clock-Counter Timing Network. It is an artificial neural network implementation of a timing mechanism based on the informational architecture of the popular Scalar Timing Model.
Methods
The model has been simulated in a virtual environment enabling computational experiments which imitate a temporal discrimination task, the two-alternative forced choice task. The influence of key parameters of the model (including the internal pacemaker speed and the variability of memory translation) on the network accuracy and the time-order error phenomenon has been evaluated.
Results
The results of simulations reveal how activities of different modules contribute to the overall performance of the model. While the number of significant effects is quite large, the article focuses on the relevant observations concerning the influence of the pacemaker speed and the scalar source of variance on the measured indicators of network performance.
Conclusions
The results of the performed experiments demonstrate the consequences of the fundamental assumptions of the clock-counter model for the results in a temporal discrimination task. The results can be compared and verified in empirical experiments with human participants, especially when the modes of activity of the internal timing mechanism are changed because of some external conditions, or are impaired due to some kind of neural degradation process.
By
Iruskieta, Mikel; Cunha, Iria; Taboada, Maite
5 Citations
Explaining why the same passage may have different rhetorical structures when conveyed in different languages remains an open question. Starting from a trilingual translation corpus, this paper aims to provide a new qualitative method for the comparison of rhetorical structures in different languages and to specify why translated texts may differ in their rhetorical structures. To achieve these aims we have carried out a contrastive analysis, comparing a corpus of parallel English, Spanish and Basque texts, using Rhetorical Structure Theory. We propose a method to describe the main linguistic differences among the rhetorical structures of the three languages in the two annotation stages (segmentation and rhetorical analysis). We show a new type of comparison that has important advantages with regard to the quantitative method usually employed: it provides an accurate measurement of inter-annotator agreement, and it pinpoints sources of disagreement among annotators. With the use of this new method, we show how translation strategies affect discourse structure.
By
Liebeskind, Chaya; Kotlerman, Lili; Dagan, Ido
2 Citations
In this work we suggest a novel Text Categorization (TC) scenario, motivated by an ad hoc industrial need to assign documents to a set of predefined categories, while labeled training data for the categories is not available. The scenario is applicable in many industrial settings and is interesting from the academic perspective. We present a new dataset geared for the main characteristics of the scenario, and utilize it to investigate the name-based TC approach, which uses the category names as its only input and does not require training data. We evaluate and analyze the performance of state-of-the-art methods for this dataset to identify the shortcomings of these methods for our scenario, and suggest ways for overcoming these shortcomings. We utilize statistical correlation measured over a target corpus for improving the state of the art, and offer a different classification scheme based on the characteristics of the setting. We evaluate our improvements and adaptations and show superior performance of our suggested method.
By
Zhou, Yuping; Xue, Nianwen
16 Citations
The paper presents the Chinese Discourse TreeBank, a corpus annotated with Penn Discourse TreeBank style discourse relations that take the form of a predicate taking two arguments. We first characterize the syntactic and statistical distributions of Chinese discourse connectives as well as the role of Chinese punctuation marks in discourse annotation, and then describe how we design our annotation strategy procedure based on this characterization. The Chinese-specific features of our annotation strategy include annotating explicit and implicit discourse relations in one single pass, defining the argument labels on semantic, rather than syntactic, grounds, as well as annotating the semantic type of implicit discourse relations directly. We also introduce a flat, 11-valued semantic type classification scheme for discourse relations. We finally demonstrate the feasibility of our approach with evaluation results.
By
Christodoulopoulos, Christos; Steedman, Mark
3 Citations
We describe the creation of a massively parallel corpus based on 100 translations of the Bible. We discuss some of the difficulties in acquiring and processing the raw material as well as the potential of the Bible as a corpus for natural language processing. Finally we present a statistical analysis of the corpora collected and a detailed comparison between the English translation and other English corpora.
By
Shah, Kashif; Cohn, Trevor; Specia, Lucia
4 Citations
We perform a systematic analysis of the effectiveness of features for the problem of predicting the quality of machine translation (MT) at the sentence level. Starting from a comprehensive feature set, we apply a technique based on Gaussian processes, a Bayesian nonlinear learning method, to automatically identify features leading to accurate model performance. We consider application to several datasets across different language pairs and text domains, with translations produced by various MT systems and scored for quality according to different evaluation criteria. We show that selecting features with this technique leads to significantly better performance in most datasets, as compared to using the complete feature sets or a state-of-the-art feature selection approach. In addition, we identify a small set of features which seem to perform well across most datasets.
By
Koo, Hahn
1 Citations
This paper presents an unsupervised method for developing a character-based n-gram classifier that identifies loanwords or transliterated foreign words in Korean text. The classifier is trained on an unlabeled corpus using the Expectation Maximization algorithm, building on seed words extracted from the corpus. Words with high token frequency serve as native seed words. Words with seeming traces of vowel insertion to repair consonant clusters serve as foreign seed words. What counts as a trace of insertion is determined using phoneme co-occurrence statistics in conjunction with ideas and findings in phonology. Experiments show that the method can produce an unsupervised classifier that performs at a level comparable to that of a supervised classifier. In a cross-validation experiment using a corpus of about 9.2 million words and a lexicon of about 71,000 words, mean F-scores of the best unsupervised classifier and the corresponding supervised classifier were 94.77 % and 96.67 %, respectively. Experiments also suggest that the method can be readily applied to other languages with similar phonotactics such as Japanese.
By
Marimon, Montserrat; Bel, Núria
This paper presents the IULA Spanish LSP Treebank, an open-source treebank of over 40,000 sentences, developed in the framework of the European project METANET4U. The IULA Spanish LSP Treebank is the first technical corpus of Spanish annotated at surface syntactic level, following the dependency grammar theory. We present the method we used to create the resource and the linguistic annotations that the treebank provides, using examples and comparing with similar resources. We also provide the statistics of the treebank and the evaluation results.
By
Goodman, Michael Wayne; Crowgey, Joshua; Xia, Fei; Bender, Emily M.
1 Citations
This paper presents Xigt, an extensible storage format for interlinear glossed text (IGT). We review design desiderata for such a format based on our own use cases as well as general best practices, and then explore existing representations of IGT through the lens of those desiderata. We give an overview of the data model and XML serialization of Xigt, and then describe its application to the use case of representing a large, noisy, heterogeneous set of IGT.
By
Banerjee, Pratyush; Rubino, Raphael; Roturier, Johann; Genabith, Josef
1 Citations
The problem of domain adaptation in statistical machine translation systems emanates from the fundamental assumption that test and training data are drawn independently from the same distribution (topic, domain, genre, style etc.). In real-life translation tasks, the sparseness of in-domain parallel training data often leads to poor model estimation, and consequently poor translation quality. Domain adaptation by supplementary data selection aims at addressing this specific issue by selecting relevant parallel training data from out-of-domain or general-domain bitext to enhance the quality of a poor baseline system. State-of-the-art research in data selection focuses on the development of novel similarity measures to improve the relevance of selected data. However, in this paper we approach the problem from a different perspective. In contrast to the conventional approach of using the entire available target-domain data as a reference for supplementary data selection, we restrict the reference set to only those sentences that are expected to be poorly translated by the baseline MT system, as predicted by a Quality Estimation model. Our rationale is to focus help (i.e. supplementary training material) where it is needed most. Automatic quality estimation techniques are used to identify such poorly translated sentences in the target domain. The experiments reported in this paper show that (i) this technique provides statistically significant improvements over the unadapted baseline translation and (ii) using significantly smaller amounts of supplementary data our approach achieves results comparable to state-of-the-art approaches using conventional reference sets.
By
Cai, Mingzhong
1 Citations
We investigate the “unprovability of unprovability”. Given a sentence P and a fixed base theory T, the unprovability of P is the sentence “
$${T\nvdash P}$$
”. We show that the unprovability of an unprovable true sentence can be “hard to prove”.
By
Costa, Ângela; Ling, Wang; Luís, Tiago; Correia, Rui; Coheur, Luísa
2 Citations
A detailed error analysis is a fundamental step in every natural language processing task, since diagnosing what went wrong provides cues for deciding which research directions to follow. In this paper we focus on error analysis in Machine Translation (MT). We significantly extend previous error taxonomies so that translation errors associated with Romance language specificities can be accommodated. Furthermore, based on the proposed taxonomy, we carry out an extensive analysis of the errors generated by four different systems: two mainstream online translation systems, Google Translate (Statistical) and Systran (Hybrid Machine Translation), and two in-house MT systems, in three scenarios representing different challenges in the translation from English to European Portuguese. Additionally, we comment on how distinct error types differently impact translation quality.
By
Mikulás, Szabolcs
We look at lower semilattice-ordered residuated semigroups and, in particular, the representable ones, i.e., those that are isomorphic to algebras of binary relations. We will evaluate expressions (terms, sequents, equations, quasi-equations) in representable algebras and give finite axiomatizations for several notions of validity. These results will be applied in the context of substructural logics.
By
Cīrulis, Jānis
1 Citations
Formally, a description of weak BCK-algebras can be obtained by replacing (in the standard axiom set by K. Iseki and S. Tanaka) the first BCK axiom
$${(x - y) - (x - z) \le z - y}$$
by its weakening
$${z \le y \Rightarrow x - y \le x - z}$$
. It is known that every weak BCK-algebra is completely determined by the structure of its initial segments (sections). We consider weak BCK-algebras with De Morgan complemented, orthocomplemented and orthomodular sections, as well as those where sections satisfy a certain compatibility condition, and characterize each of these classes of algebras by an equation or quasi-equation. For instance, those weak BCK-algebras in which all initial segments are De Morgan complemented are just commutative weak BCK-algebras.
By
Blyth, T. S.; Fang, Jie; Wang, Leibo
1 Citations
We identify the
$${{}^\star}$$
-ideals of a distributive demi-pseudocomplemented algebra L as the kernels of the boolean congruences on L, and show that they form a complete Heyting algebra which is isomorphic to the interval
$${[G,\iota]}$$
of the congruence lattice of L where G is the Glivenko congruence. We also show that the notions of maximal
$${{}^\star}$$
-ideal, prime
$${{}^\star}$$
-ideal, and falsity ideal coincide.
By
Lewitzka, Steffen
3 Citations
There are logics where necessity is defined by means of a given identity connective:
$${\square\varphi := \varphi\equiv\top}$$
(
$${\top}$$
is a tautology). On the other hand, in many standard modal logics the concept of propositional identity (PI)
$${\varphi\equiv\psi}$$
can be defined by strict equivalence (SE)
$${\square(\varphi\leftrightarrow\psi)}$$
. All these approaches to modality involve a principle that we call the Collapse Axiom (CA): “There is only one necessary proposition.” In this paper, we consider a notion of PI which relies on the identity axioms of Suszko’s non-Fregean logic SCI. Then S3 proves to be the smallest Lewis modal system where PI can be defined as SE. We extend S3 to a non-Fregean logic with propositional quantifiers such that necessity and PI are integrated as non-interdefinable concepts. CA is not valid and PI refines SE. Models are expansions of SCI-models. We show that SCI-models are Boolean prealgebras, and vice versa. This associates non-Fregean logic with research on Hyperintensional Semantics. PI equals SE iff models are Boolean algebras and CA holds. A representation result establishes a connection to Fine’s approach to propositional quantifiers and shows that our theories are conservative extensions of S3–S5, respectively. If we exclude the Barcan formula and a related axiom, then the resulting systems are still complete w.r.t. a simpler denotational semantics.
By
Wintein, Stefan; Muskens, Reinhard A.
In their recent paper “Bifacial truth: a case for generalized truth values”, Zaitsev and Shramko [7] distinguish between an ontological and an epistemic interpretation of classical truth values. By taking the Cartesian product of the two disjoint sets of values thus obtained, they arrive at four generalized truth values and consider two “semi-classical negations” on them. The resulting semantics is used to define three novel logics which are closely related to Belnap’s well-known four-valued logic. A syntactic characterization of these logics is left for further work. In this paper, based on our previous work on a functionally complete extension of Belnap’s logic, we present a sound and complete tableau calculus for these logics. It crucially exploits the Cartesian nature of the four values, which is reflected in the fact that each proof consists of two tableaux. The bifacial notion of truth of Z&S is thus augmented with a bifacial notion of proof. We also provide translations between the logics for semi-classical negation and classical logic and show that an argument is valid in a logic for semi-classical negation just in case its translation is valid in classical logic.
By
Pailos, Federico; Rosenblatt, Lucas
Theories where truth is a naive concept fall under the following dilemma: either the theory is subject to Curry’s Paradox, which engenders triviality, or the theory is not trivial but the resulting conditional is too weak. In this paper we explore a number of theories which arguably do not fall under this dilemma. In these theories the conditional is characterized in terms of (infinitely-valued) non-deterministic matrices. These non-deterministic theories are similar to infinitely-valued Łukasiewicz logic in that they are consistent and their conditionals are quite strong. The difference is the following: while Łukasiewicz logic is
$${\omega}$$
-inconsistent, the non-deterministic theories might turn out to be
$${\omega}$$
-consistent.
By
Rivello, Edoardo
1 Citations
Revision sequences are a kind of transfinite sequences which were introduced by Herzberger and Gupta in 1982 (independently) as the main mathematical tool for developing their respective revision theories of truth. We generalise revision sequences to the notion of cofinally invariant sequences, showing that several known facts about Herzberger’s and Gupta’s theories also hold for this more abstract kind of sequences and providing new and more informative proofs of the old results.
By
Krivtsov, Victor N.
Within a weak system
$${{{\sf WKVS}}}$$
of intuitionistic analysis one may prove, using the Weak Fan Theorem as an additional axiom, a completeness theorem for intuitionistic first-order predicate logic relative to validity in generalized Beth models as well as a completeness theorem for classical first-order predicate logic relative to validity in intuitionistic structures. Conversely, each of these theorems implies over
$${{{\sf WKVS}}}$$
the Weak Fan Theorem.
By
Celani, Sergio A.; Montangie, Daniela
A Hilbert algebra with supremum is a Hilbert algebra where the associated order is a join-semilattice. This class of algebras is a variety and was studied in Celani and Montangie (2012). In this paper we shall introduce and study the variety of
$${H_{\Diamond}^{\vee}}$$
-algebras, which are Hilbert algebras with supremum endowed with a modal operator
$${\Diamond}$$
. We give a topological representation for these algebras using the topological spectral-like representation for Hilbert algebras with supremum given in Celani and Montangie (2012). We will consider some particular varieties of
$${H_{\Diamond}^{\vee}}$$
-algebras. These varieties are the algebraic counterpart of extensions of the implicative fragment of the intuitionistic modal logic
$${\mathbf{IntK}_{\Diamond}}$$
. We also determine the congruences of
$${H_{\Diamond}^{\vee}}$$
-algebras in terms of certain closed subsets of the associated space, and in terms of a particular class of deductive systems. These results enable us to characterize the simple and subdirectly irreducible
$${H_{\Diamond}^{\vee}}$$
-algebras.
By
Domokos, József; Buza, Ovidiu; Toderean, Gavril
2 Citations
This paper presents a machine-readable Romanian language pronunciation dictionary called NaviRo. The dictionary contains 138,500 unique words from the DexOnline dictionary together with their phonetic transcriptions in the Speech Assessment Methods Phonetic Alphabet (SAMPA). The development of the pronunciation dictionary and the validation tests performed are also described in the paper. The NaviRo pronunciation dictionary is freely available on the project website (
http://users.utcluj.ro/~jdomokos/naviro
) in plain text, Hidden Markov Model Toolkit and Festival speech synthesis system dictionary formats. The grapheme and phoneme sets used, and audio samples for the phonemes, are also available for download. The use of these resources is completely unrestricted for any research purposes in order to speed up Romanian language speech technology research.
By
Kleynhans, Neil Taylor; Barnard, Etienne
2 Citations
Automatic speech recognition (ASR) technology has matured over the past few decades and has made significant impacts in a variety of fields, from assistive technologies to commercial products. However, ASR system development is a resource-intensive activity and requires language resources in the form of text-annotated audio recordings and pronunciation dictionaries. Unfortunately, many languages found in the developing world fall into the resource-scarce category, and due to this resource scarcity the deployment of ASR systems in the developing world is severely inhibited. One approach to assist with resource-scarce ASR system development is to select “useful” training samples, which could reduce the resources needed to collect new corpora. In this work, we propose a new data selection framework which can be used to design a speech recognition corpus. We show that for limited data sets, independent of language and bandwidth, the most effective strategy for data selection is frequency-matched selection and that the widely used maximum entropy methods generally produced the least promising results. In our model, the frequency-matched selection method corresponds to a logarithmic relationship between accuracy and corpus size; we also investigated other model relationships, and found that a hyperbolic relationship (as suggested from simple asymptotic arguments in learning theory) may lead to somewhat better performance under certain conditions.
By
French, Rohan
2 Citations
Our concern here is with the extent to which the expressive equivalence of Wehmeier’s Subjunctive Modal Language (SML) and the Actuality Modal Language (AML) is sensitive to the choice of background modal logic. In particular we will show that, when we are enriching quantified modal logics weaker than S5, AML is strictly expressively stronger than SML, this result following from general considerations regarding the relationship between operators and predicate markers. This would seem to complicate arguments given in favour of SML which rely upon its being expressively equivalent to AML.
By
Young, William
1 Citations
Much work has been done on specific instances of residuated lattices with modal operators (either nuclei or conuclei). In this paper, we develop a general framework that subsumes three important classes of modal residuated lattices: interior algebras, Abelian ℓ-groups with conuclei, and negative cones of ℓ-groups with nuclei. We then use this framework to obtain results about these three cases simultaneously. In particular, we show that a categorical equivalence exists in each of these cases. The approach used here emphasizes the role played by reducts in the proofs of these categorical equivalences. Lastly, we develop a connection between translations of logics and images of modal operators.
By
Nowak, Marek
Two examples of Galois connections and their dual forms are considered. One of them is applied to formulate a criterion for when a given subset of a complete lattice forms a complete lattice. The second, closely related to the first, is used to give a short proof of the Knaster–Tarski fixed point theorem.
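As a small illustration of what the fixed point theorem guarantees (not of the Galois-connection proof the abstract refers to), the sketch below computes the least fixed point of a monotone map on a finite powerset lattice by iterating from the bottom element. The graph `EDGES`, the start node and the function names are assumptions chosen for the example.

```python
from typing import Callable, Dict, FrozenSet, Set

Subset = FrozenSet[int]


def least_fixed_point(f: Callable[[Subset], Subset],
                      bottom: Subset = frozenset()) -> Subset:
    """Iterate a monotone f from the bottom element until it stabilises.

    On a finite lattice this ascending chain must stop, and it stops at
    the least fixed point whose existence the theorem guarantees.
    """
    current = bottom
    while True:
        nxt = f(current)
        if nxt == current:
            return current
        current = nxt


# Hypothetical directed graph: node -> set of successor nodes.
EDGES: Dict[int, Set[int]] = {1: {2}, 2: {3}, 3: set(), 4: {1}}


def step(reached: Subset) -> Subset:
    """Monotone map on subsets of nodes: add node 1 and all successors."""
    successors = {m for n in reached for m in EDGES[n]}
    return frozenset({1}) | reached | frozenset(successors)


reachable = least_fixed_point(step)
print(sorted(reachable))
```

Here the least fixed point of `step` is the set of nodes reachable from node 1; node 4 is excluded because nothing leads to it.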
By
Cornejo, Juan M.; Viglizzo, Ignacio D.
4 Citations
Semi-intuitionistic logic is the logic counterpart to semi-Heyting algebras, which were defined by H. P. Sankappanavar as a generalization of Heyting algebras. We present a new, more streamlined set of axioms for semi-intuitionistic logic, which we prove translationally equivalent to the original one. We then study some formulas that define a semi-Heyting implication, and specialize this study to the case in which the formulas use only the lattice operators and the intuitionistic implication. We prove then that all the logics thus obtained are equivalent to intuitionistic logic, and give their Kripke semantics.
By
Montagna, Franco; Ugolini, Sara
6 Citations
In this paper we provide a categorical equivalence for the category
$${\mathcal{P}}$$
of product algebras, with morphisms the homomorphisms. The equivalence is shown with respect to a category whose objects are triplets consisting of a Boolean algebra B, a cancellative hoop C and a map
$${\vee_e}$$
from B × C into C satisfying suitable properties. To every product algebra P, the equivalence associates the triplet consisting of the maximum boolean subalgebra B(P), the maximum cancellative subhoop C(P), of P, and the restriction of the join operation to B × C. Although several equivalences are known for special subcategories of
$${\mathcal{P}}$$
, to our knowledge, this is the first equivalence theorem which involves the whole category of product algebras. The syntactic counterpart of this equivalence is a syntactic reduction of classical logic CL and of cancellative hoop logic CHL to product logic, and vice versa.
By
Nakazawa, Koji; Naya, Hiroto
This paper gives the strong reduction of the combinatory calculus SCL, which was introduced as a combinatory calculus corresponding to the untyped Lambda-mu calculus. It proves the confluence of the strong reduction. By the confluence, it also proves the conservativity of the extensional equality of SCL over the combinatory calculus CL, and the consistency of SCL.
By
Trypuz, Robert; Kulicki, Piotr
The paper tackles two problems. The first one is to grasp the real meaning of Jerzy Kalinowski’s theory of normative sentences. His formal system K_{1} is a simple logic formulated in a very limited language (negation is the only operator defined on actions). While presenting it Kalinowski formulated a few interesting philosophical remarks on norms and actions. He did not, however, possess the tools to formalise them fully. We propose a formulation of Kalinowski’s ideas with the use of a set-theoretical frame similar to the one presented by Krister Segerberg in his “A Deontic Logic of Action”. At the same time we enrich the language used by Kalinowski with more operators on actions (parallel execution and free choice) and present an adequate axiomatisation of the resulting system. That allows us to disclose some unrevealed aspects of Kalinowski’s theory. The most important one is a relation between acts which we call moral indiscernibility. Our second problem is a proper understanding of moral indiscernibility. We show how a repertoire of agent’s actions, defined with the use of simple observable elements of actions, can be filtrated by the relation of moral indiscernibility. That allows us to understand the consequences of Kalinowski’s claim that not doing something good is always bad.
By
Boričić, Branislav; Ilić, Mirjana
A normalizable natural deduction formulation, with subformula property, of the implicative fragment of classical logic is presented. A traditional notion of normal deduction is adapted and the corresponding weak normalization theorem is proved. An embedding of classical logic into intuitionistic logic, restricted to the propositional implicational language, is described as well. We believe that this multiple-conclusion approach places classical logic on the same plane as intuitionistic logic, from the proof-theoretical viewpoint.
By
Ananthakrishnan, Sankaranarayanan; Mehay, Dennis N.; Hewavitharana, Sanjika; Kumar, Rohit; Roy, Matt; Kan, Enoch
2 Citations
Lexical ambiguity can cause critical failure in conversational spoken language translation (CSLT) systems that rely on statistical machine translation (SMT) if the wrong sense is presented in the target language. Interactive CSLT systems offer the capability to detect and pre-empt such word-sense translation errors (WSTEs) by engaging the human operators in a precise clarification dialogue aimed at resolving the problem. This paper presents an end-to-end framework for accurate detection and interactive resolution of WSTEs to minimize communication errors due to ambiguous source words. We propose (a) a novel, extensible, two-level classification architecture for identifying potential WSTEs in SMT hypotheses; (b) a constrained phrase-pair clustering mechanism for identifying the translated sense of ambiguous source words in SMT hypotheses; and (c) an interactive strategy that integrates this information to request specific clarifying information from the operator. By leveraging unsupervised and lightly supervised learning techniques, our approach minimizes the need for expensive human annotation in developing each component of this framework. Each component, as well as the overall framework, was evaluated in the context of an interactive English-to-Iraqi Arabic CSLT system.
By
Gretz, Shai; Itai, Alon; MacWhinney, Brian; Nir, Bracha; Wintner, Shuly
We present a syntactic parser of (transcripts of) spoken Hebrew: a dependency parser of the Hebrew CHILDES database. CHILDES is a corpus of child–adult linguistic interactions. Its Hebrew section has recently been morphologically analyzed and disambiguated, paving the way for syntactic annotation. This paper describes a novel annotation scheme of dependency relations reflecting constructions of child and child-directed Hebrew utterances. A subset of the corpus was annotated with dependency relations according to this scheme, and was used to train two parsers (MaltParser and MEGRASP) with which the rest of the data were parsed. The adequacy of the annotation scheme to the CHILDES data is established through numerous evaluation scenarios. The paper also discusses different annotation approaches to several linguistic phenomena, as well as the contribution of morphological features to the accuracy of parsing.
By
Vila, Marta; Bertran, Manuel; Martí, M. Antònia; Rodríguez, Horacio
Paraphrase corpora annotated with the types of paraphrases they contain constitute an essential resource for the understanding of the phenomenon of paraphrasing and the improvement of paraphrase-related systems in natural language processing. In this article, a new annotation scheme for paraphrase-type annotation is set out, together with newly created measures for the computation of inter-annotator agreement. Three corpora different in nature and in two languages have been annotated using this infrastructure. The annotation results and the inter-annotator agreement scores for these corpora are proof of the adequacy and robustness of our proposal.
By
Pecina, Pavel; Toral, Antonio; Papavassiliou, Vassilis; Prokopidis, Prokopis; Tamchyna, Aleš; Way, Andy; Genabith, Josef
3 Citations
In this paper, we tackle the problem of domain adaptation of statistical machine translation (SMT) by exploiting domain-specific data acquired by domain-focused crawling of text from the World Wide Web. We design and empirically evaluate a procedure for automatic acquisition of monolingual and parallel text and their exploitation for system training, tuning, and testing in a phrase-based SMT framework. We present a strategy for using such resources depending on their availability and quantity supported by results of a large-scale evaluation carried out for the domains of environment and labour legislation, two language pairs (English–French and English–Greek) and in both directions: into and from English. In general, machine translation systems trained and tuned on a general domain perform poorly on specific domains and we show that such systems can be adapted successfully by retuning model parameters using small amounts of parallel in-domain data, and may be further improved by using additional monolingual and parallel training data for adaptation of language and translation models. The average observed improvement in BLEU achieved is substantial at 15.30 points absolute.
By
Letcher, Ned; Dridan, Rebecca; Baldwin, Timothy
The development of precision grammars is an inherently resource-intensive process; their complexity means that changes made to one area of a grammar often introduce unexpected flow-on effects elsewhere in the grammar which may only be discovered after some time has been invested in updating numerous test suite items. In this paper, we present the browser-based gDelta tool, which aims to provide grammar engineers with more immediate feedback on the impact of changes made to a grammar by comparing parser output from two different grammar versions. We describe an attribute weighting algorithm for highlighting components of the grammar that have been strongly impacted by a modification to the grammar, as well as a technique for clustering test suite items whose parsability has changed, in order to locate related groups of effects. These two techniques are used to present the grammar engineer with different views on the grammar to inform them of different aspects of change in a data-driven manner.
By
Jesus Martins, Débora Beatriz; Medeiros Caseli, Helena
Although machine translation (MT) has been an object of study for decades now, the texts generated by the state-of-the-art MT systems still present several errors for many language pairs. Aiming at coping with this drawback, lots of efforts have been made to post-edit those errors either manually or automatically. Manual post-editing is more accurate but can be prohibitive when too many changes have to be made. Automatic post-editing demands less effort but can also be less effective and give rise to new errors. A way to avoid unnecessary automatic post-editing and new errors is by previously selecting only the machine-translated segments that really need to be post-edited. Thus, this paper describes the experiments carried out to automatically identify MT errors generated by a state-of-the-art phrase-based statistical MT system. Despite the fact that our experiments have been carried out using a statistical MT engine, we believe the approach can also be applied to other types of MT systems. The experiments investigated the well-known machine-learning algorithms Naive Bayes, Decision Trees and Support Vector Machines. Using the decision tree algorithm it was possible to identify wrong segments with around 77% precision and recall when a small training corpus of only 2,147 error instances was used. Our experiments were performed on English-to-Brazilian Portuguese MT, and although some of the features are language-dependent, the proposed approach is language-independent and can be easily generalized to other language pairs.

