Finally, a set of conclusions is drawn from the obtained results.

The task of identifying restricted lexical combinations, as we will show in this article, is not new. It is a relevant procedure for different tasks in Natural Language Processing (NLP), such as Machine Translation (MT), where idiomatic expressions cannot be translated literally. Even collocations need to be translated with caution.
The easiest way to detect sequences of words likely to be considered a collocation or, at least, a compound term, is to use the Mutual Information (MI) or Pointwise Mutual Information (PMI) metrics.
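As a minimal sketch, PMI for a bigram can be estimated from corpus counts; the counts below are invented for illustration and do not come from any corpus used in this work:

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise Mutual Information of a bigram (x, y).

    PMI = log2( P(x, y) / (P(x) * P(y)) ), with probabilities
    estimated by maximum likelihood from raw counts.
    """
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

# Invented counts: a pair that co-occurs far more often than chance
# gets a high PMI score.
print(round(pmi(count_xy=30, count_x=150, count_y=60, total=100_000), 2))  # -> 8.38
```

A high score only signals strong association; as noted next, this alone does not separate collocations from frequent free combinations.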
However, by themselves, these two measures are not enough for the extraction of collocations. A prior study (Pavel) presents a vast number of measures that can be used to detect collocations. Nevertheless, most of them perform badly on their own and, as presented below, new approaches have been used.
Probably the biggest challenge is detecting idiomatic expressions, especially when they can also have a literal meaning, like "break the ice", which can be read literally or not. The properties used are very diverse, from the usage of prepositions before or after the expression to graphs of cohesion between the different sentence components.
These properties are then fed to a Support Vector Machine algorithm. Another approach trained a binary perceptron based on two types of features: lexical features, like the usage of capital letters, and graph features, using relation information obtained from WordNet and Wiktionary.
The perceptron was trained on Wiktionary-labeled data, and the non-labeled data was used for testing. The extraction method itself does not take any real advantage of parallel corpora. The hypothesis we are testing is: if a sequence of two words, an adjective and a noun, is translated by two other words, and only one of them is a translation of the original words found in a translation dictionary, then we have a candidate collocation.
This can be better explained using mathematical notation. Let us define the function T, which translates Spanish words into Portuguese, and the concatenation operator (a dot), which joins two words. The translation of two words w_a and w_b is considered to be compositional if:

    T(w_a . w_b) = T(w_a) . T(w_b)    (1)

Therefore, we are looking for a pair of words (w_a, w_b) in which one of them is an adjective and the other a noun, and whose translation does not follow equation (1).
That is, we want to find w_a and w_b where:

    T(w_a . w_b) ≠ T(w_a) . T(w_b)

The extraction algorithm used is very simple, and its main purpose is to test the hypothesis that collocation extraction based on translation non-compositionality is possible. The algorithm starts by iterating over each translation unit in the parallel corpus. Then, each possible bigram from the segment S_SP is analyzed using the FreeLing morphological analyzer, looking for a sequence in which one of the words is a noun and the other an adjective.
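The bigram scan can be sketched as follows. A toy lookup table stands in for the FreeLing morphological analyzer (which in reality returns possible analyses per word); the words and tags are illustrative only:

```python
# Stand-in for the morphological analyzer: one coarse tag per word.
POS = {"la": "det", "manta": "noun", "eléctrica": "adj", "compró": "verb"}

def noun_adj_bigrams(segment):
    """Return adjacent word pairs where one word is a noun and the
    other an adjective, in either order."""
    pairs = []
    for w1, w2 in zip(segment, segment[1:]):
        if {POS.get(w1), POS.get(w2)} == {"noun", "adj"}:
            pairs.append((w1, w2))
    return pairs

print(noun_adj_bigrams(["la", "manta", "eléctrica"]))  # -> [('manta', 'eléctrica')]
```

Since, as noted below, no part-of-speech disambiguation is performed, a real scan must consider every possible analysis of each word, not a single tag.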
Note that, although FreeLing has modules to do part-of-speech tagging, they were not used. Nevertheless, we are aware of the problems this approach raises, and we will discuss them later. When such a pair of words is found, their possible translation sets are computed. Note that each word can have more than one translation and, therefore, we need to construct a set of translations.
This translation was done using the Apertium translation dictionary. Then, these translation sets are searched for in the target-language segment S_PT. If words from both translation sets occur together, the word pair is discarded. On the other hand, if one of the words has a translation in the target segment but the other does not, the Spanish word pair is saved for manual assessment.
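The filtering decision just described can be sketched as follows; the function name and the toy dictionary entries are our own illustrations, standing in for the Apertium data:

```python
def classify_pair(w_a, w_b, target_segment, dictionary):
    """Decide what to do with a noun-adjective pair:

    - both words have a translation in the target segment -> the
      translation is compositional, so discard the pair;
    - exactly one word has a translation there -> save the pair as a
      candidate restricted combination for manual assessment;
    - neither word is found -> no evidence either way, skip.
    """
    target = set(target_segment)
    found_a = bool(dictionary.get(w_a, set()) & target)
    found_b = bool(dictionary.get(w_b, set()) & target)
    if found_a and found_b:
        return "discard"
    if found_a or found_b:
        return "candidate"
    return "skip"

# Toy Spanish->Portuguese entries (illustrative only).
T = {"manta": {"manta", "cobertor"}, "eléctrica": {"elétrica"}}
print(classify_pair("manta", "eléctrica", ["um", "cobertor", "aquecido"], T))  # -> candidate
```

Note that the lookup matches surface forms against dictionary entries; as discussed later, inflected target words missing from the dictionary are one source of false candidates.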
Together with the word pair, a segment of Portuguese words in the vicinity of the found translation is also stored. This list was then assessed manually: a Linguistics MSc student classified each word pair into one of the following classes. This happens mainly because the application was not able to find the sequence of words that includes the translation of the selected pair of words, or because the original corpus had alignment errors;
This happens mostly when a possible word translation is not included in the translation dictionary used. When in doubt about whether a combination is restricted or free, we decided to consider it a free combination. This means that our evaluation is less favorable to our hypothesis. The errors found are of very different kinds, such as alignment problems, minor bugs in the algorithm implementation, or missing translations in the translation dictionary:
Nevertheless, considering that our hypothesis requires a sequence with a noun and an adjective, the examples classified under this problem are irrelevant for proving it. Table 1 shows some of these situations. Table 1: Examples of extractions where a verbal form was mistakenly interpreted as a noun.
This made the assessment impossible. This was a problem inherited from bad segmentation performed by other tools, like the segmenter, the tokenizer, and the sentence aligner. Given the missing word (marked by the asterisk), this segment could not be classified correctly and, therefore, fell into the error class. Just like the case above, these give us no information on the validity of the hypothesis.
Table 2 shows further examples of this segmentation problem; the missing words are shown between parentheses. All these cases can be safely ignored for the hypothesis test. This seemed to be a problem in the corpus segmentation and alignment process. This is, indeed, a bug introduced by our implementation, but when it was detected it was too late to perform a complete new extraction and restart the manual evaluation.
Therefore, they were ignored for our hypothesis test. Table 3 shows some of these examples; in italics, on the right, is the aligned segment. See Table 4 for some examples. In some cases the Portuguese version included the text in Spanish, and in others, in English, as shown in Table 5. Some others, as shown in Table 6, include typos that, not being in the dictionary, triggered our hypothesis by mistake. The main interference with the algorithm, which could make it extract free combinations, is missing translations in the translation dictionary used.
A similar problem occurred with words that were not correctly lemmatized and, therefore, not found in the translation dictionary. Finally, there is yet another problem, related to textual deixis, where a reference to a different position in the text is expressed in different ways by different languages.
As anterior and acima are not direct translations, the algorithm extracted them as a restricted combination although they form a free combination. Besides the correctly identified restricted combinations, there are two special kinds that should be mentioned. The first, named reduction, happens a few times: two words are correctly translated by only one word. Examples are shown in Table 9; these were considered restricted combinations. The best examples in Table 9 are the first and the last: in Portuguese, although the concept of meio ambiente exists, usually only ambiente is used. Table 9: Examples of reductions: situations where two words are correctly translated by only one word.
These were extracted because of the way the nouns are translated. The table shows three columns: the first two are Spanish and Portuguese, and the third, a direct Portuguese translation of the Spanish term. This is usually easy to detect given the specific domain of the corpora used, and given that the Portuguese segment includes more words than the two existing in Spanish. Table 11 shows examples. When the algorithm returned interesting results, that is, restricted lexical combinations, we were unable to distinguish between collocations and other types of restricted combinations, like quasi-phrasemes and idioms (this last type was not found, probably due to the type of corpus used).
When applied to the European Central Bank corpus, our approach extracted a large set of candidates. They were evaluated by exhaustion: instead of evaluating a sample of random entries, the evaluator tagged each one of the extracted candidates. This means the evaluation is not affected by sampling bias. This, together with the fact that the evaluator gave preference to free combinations over restricted combinations, means that this evaluation is a baseline for the algorithm.
Table 13 presents the number of cases found, classified according to each of the previously mentioned classes. If we ignore the cases of errors, nouns, and reductions, we can note that restricted combinations are one quarter of the total number of combinations found. The first reaction to the results was one of discontent, as many free combinations were found. As soon as the examples were analyzed, two causes were identified: first, the translation resource lacks coverage, and second, the algorithm misses the correct lemmatization of some words.
Nevertheless, most of the situations found are easy to correct and, therefore, further experiments should be performed before considering the method inadequate. In fact, it will be interesting to see how this approach performs on a less noisy corpus, with better dictionaries, and with other languages. This data, being manually classified, can be used to train machine learning algorithms: for the extraction of further collocations from other corpora, it can be used to train a supervised machine learning algorithm, or serve as a gold standard for this kind of system.
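As a sketch of that reuse, a simple binary perceptron (the model type mentioned earlier for related work) could be trained on the manually classified pairs. The feature choice and the data points below are our own illustrative assumptions, not part of the described evaluation:

```python
def train_perceptron(samples, labels, epochs=20, lr=0.1):
    """Train a binary perceptron.

    samples: feature vectors, one per candidate pair;
    labels: +1 (restricted combination) or -1 (free combination).
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified: move the boundary
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# Illustrative features per candidate:
# [PMI score, fraction of the pair's words found in the dictionary].
X = [[8.2, 0.5], [1.1, 1.0], [7.5, 0.5], [0.4, 1.0]]
y = [1, -1, 1, -1]  # invented manual labels

w, b = train_perceptron(X, y)
score = sum(wi * xi for wi, xi in zip(w, [6.9, 0.5])) + b
print("restricted" if score > 0 else "free")  # -> restricted
```

With real data, the manually assessed classes from the evaluation would supply the labels, and the extractor's own signals (association scores, dictionary coverage) the features.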
Analyzing the results obtained, the initial hypothesis should be reformulated: using this approach, restricted combinations in general are detectable, not just a specific type of restricted combination such as collocations. The problem is the lack of a clear distinction between them: some lexical combinations will be classified differently according to the way the linguist semantically decomposes the expression.
There is another problem with our hypothesis: when a restricted combination coincides in the two languages being analyzed, it can be mistakenly considered a free lexical combination.