Stemming and N-gram matching in Turkish texts

Vick Robitaille 


One of the main problems involved in using free text for indexing and retrieval is the variation in word forms that is likely to be encountered (Lennon et al., 1981). The most common types of variation are spelling errors, alternative spellings, multi-word concepts, transliteration, affixes and abbreviations. One way to alleviate this problem is to use a conflation algorithm, a computational procedure designed to bring together semantically related words and reduce them to a single form for retrieval purposes. This paper discusses the use of conflation techniques for Turkish text databases.

Turkish is a member of the south-western or Oghuz group of the Turkic languages, which also includes Turkmen, Azerbaijani or Azeri, Ghasghai and Gagauz (Crystal, 1987; Lewis, 1991). The Turkish language uses a Latin alphabet consisting of twenty-nine letters, of which eight are vowels and twenty-one are consonants. Unlike the main Indo-European languages (such as French and German), Turkish is an example of an agglutinative language, where words are a combination of several morphemes and suffixes. Here, words contain a root and suffixes are combined with this root in order to extend its meaning or to create other classes of words. In Turkish the process of adding one suffix to another can result in relatively long words, which often contain an amount of semantic information equivalent to a whole English phrase, clause or sentence. Due to this complex morphological structure, a single Turkish word can give rise to a very large number of variants, with a consequent need for effective conflation procedures if high recall is to be achieved in searches of Turkish text databases. Here, we summarise the principal results thus far of a project to develop and to evaluate such procedures; full details of the work are presented by Ekmekçioglu et al. (1995, 1996).
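As a rough illustration of why suffixation makes conflation necessary, the sketch below scores the similarity of word variants using character bigrams and the Dice coefficient, one common form of n-gram matching. It is a minimal sketch under stated assumptions and not the procedure evaluated in the project: the function names, the choice of bigrams and the example words (the root "ev", "house", with successive suffixes) are all introduced here for illustration.

```python
def bigrams(word: str) -> set:
    """Return the set of adjacent character pairs in a word.

    Case folding is deliberately omitted: Turkish dotted and dotless I
    need locale-aware handling, which plain str.lower() does not provide.
    """
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice(a: str, b: str) -> float:
    """Dice coefficient over character bigrams: 2|A & B| / (|A| + |B|)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0  # neither word is long enough to contain a bigram
    return 2 * len(x & y) / (len(x) + len(y))


# Hypothetical example: one root carrying progressively more suffixes.
variants = ["ev", "evler", "evlerde", "evlerimizde"]
for variant in variants:
    print(variant, round(dice("evlerimizde", variant), 3))
```

The score falls as the variants diverge in length, which suggests why a fixed similarity threshold may miss the shortest forms of a heavily suffixed word.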

Processing of morphological variants in Latin text searches

Aubrey S. Nielsen 


In this paper we describe the main features of a stemming algorithm that has been developed to make Latin text databases easier to search. The algorithm has two main features that distinguish it from the other stemmers described in the literature. Firstly, it generates two stem dictionaries. It does this by using two sets of stemming rules that by default separate noun and adjective forms from verb forms, without requiring the part of speech of the words to be stemmed to be encoded. Secondly, a policy of deliberately understemming a large number of words means that the resulting stems retain sufficient grammatical information to make it easy to distinguish between different words with similar roots. This feature also allows very specific searching for single grammatical forms of certain words, which is an important requirement for the intended users. So far, only the stemming of individual words has been considered. We are currently developing a retrieval system that allows a user to submit a single query term to a database and be presented with a list of all its morphological variants in the database, which can then be added to the query.
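The two-dictionary idea can be sketched as follows: every word is passed through both rule sets, so no decision about its grammatical class is needed beforehand. This is only an illustrative sketch. The suffix lists are hypothetical and heavily abbreviated, the names `strip_longest` and `stem_both` are introduced here, and none of it reproduces the actual rules of the algorithm described above.

```python
# Hypothetical, heavily abbreviated suffix lists for illustration only;
# they are NOT the rule sets of the algorithm described in the abstract.
NOUN_SUFFIXES = ["ibus", "orum", "arum", "ae", "am", "as", "is", "os",
                 "um", "us", "a", "e", "i", "o"]
VERB_SUFFIXES = ["amus", "atis", "ant", "mus", "tis", "nt", "at",
                 "t", "s", "o", "m"]


def strip_longest(word, suffixes, min_stem=2):
    """Remove the longest matching suffix, keeping at least min_stem letters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word  # no rule applied; the word is its own stem


def stem_both(word):
    """Return (noun/adjective stem, verb stem) for a single word."""
    w = word.lower()
    return strip_longest(w, NOUN_SUFFIXES), strip_longest(w, VERB_SUFFIXES)


# Build the two stem dictionaries: stem -> set of word forms.
noun_dict, verb_dict = {}, {}
for token in ["puella", "puellam", "puellarum", "amat", "amant", "amamus"]:
    noun_stem, verb_stem = stem_both(token)
    noun_dict.setdefault(noun_stem, set()).add(token)
    verb_dict.setdefault(verb_stem, set()).add(token)

print(noun_dict)
print(verb_dict)
```

In this toy run the noun forms group under one stem in the first dictionary and the verb forms under one stem in the second, while each word's entry in the other dictionary is simply less useful. A search would consult whichever dictionary suits the query.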

Hypermedia as an experiential tool of learning

Preston R. Garcia


Ever since the 1980s, researchers and educators have claimed that hypermedia, side by side with the microcomputer boom, is about to revolutionise education. However, like so many other promising information technologies, the use of hypermedia for educational purposes did not avoid what Maddux et al. (1994) refer to as the Everest Syndrome.

Developing effective practice requires a sound theoretical foundation, one that offers an articulation of the underlying complexity that can be understood, discussed and implemented (Kuhlthau, 1993). The EDLM model aims to provide this articulation for the design and development of hypermedia applications using a constructivist approach.

General Medical Practitioners' impact on clinical decision-making

Robert L. Huston 


This paper summarizes some of the main findings of a recent study of information use by general practitioners (GPs) (Wood et al., 1995a). The work was based on previous studies of the value and impact of information undertaken in the Canadian corporate sector (Marshall, 1993) and in both the United States (King, 1987; Marshall, 1992) and the United Kingdom (Hepworth and Urquhart, 1995; Wood et al., 1995b, 1996). The study used a technique similar to that of the Canadian and U.S. studies. Twenty-seven in-depth interviews were conducted with GPs in the Trent Health Region (only one from each practice). The sample, selected from two health districts, included large, medium and small practices, fundholding and non-fundholding practices, training and non-training practices, and practices representing (socio-economically) deprived areas.

Training information specialists in the European Union’s Less Favored Regions (TRAIN-ISS)

John R. Acosta


Europe’s expanding information infrastructure will increase demand for high-quality, accessible and usable information services. At the same time, there is increasing demand in the Information Society for qualified professionals who can improve their potential through lifelong learning and who can access and use world-wide information sources. Highly qualified information professionals are essential in every society if the information market is to take full advantage of its benefits and opportunities. Information professionals in the Less Favored Regions (LFR) of the European Union lack the education and training opportunities that would enable them to cope with the rapidly changing environment. The TRAIN-ISS project was funded under the IMPACT 2 program by the Commission of the European Communities in order to address this requirement for well-trained and qualified information professionals.