Text retrieval using self-organized document maps
Krista Lagus.
A map of text documents arranged using the Self-Organizing Map (SOM) algorithm (1) is organized in a meaningful manner so that items with similar content appear at nearby locations of the 2-dimensional map display, and (2) clusters the data, resulting in an approximate model of the data distribution in the high-dimensional document space. This report describes how a document map that is automatically organized for browsing and visualization can also be used to speed up document retrieval. Furthermore, experiments on the well-known CISI collection indicate improved performance compared to Salton's vector space model and to Latent Semantic Indexing, measured by average precision when retrieving a small, fixed number of best documents.
Keywords: information retrieval, SOM, text mining, document maps, LSI
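As a rough illustration of how a document map can speed up retrieval, the sketch below first compares a query with one model vector per map unit and then ranks only the documents attached to the best-matching units. It is a minimal sketch under assumptions, not the procedure evaluated in the report: the function name map_assisted_search, the cosine measure, the parameters n_units and n_results, and the toy data are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def map_assisted_search(query, unit_models, unit_docs, doc_vectors,
                        n_units=2, n_results=3):
    """Rank only the documents stored in the map units closest to the query."""
    # Step 1: compare the query with one model vector per map unit.
    best_units = sorted(range(len(unit_models)),
                        key=lambda u: cosine(query, unit_models[u]),
                        reverse=True)[:n_units]
    # Step 2: exhaustive comparison, restricted to those units' documents.
    candidates = [d for u in best_units for d in unit_docs[u]]
    ranked = sorted(candidates,
                    key=lambda d: cosine(query, doc_vectors[d]),
                    reverse=True)
    return ranked[:n_results]


# Hypothetical toy map: three units, five documents, three-term vocabulary.
unit_models = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
unit_docs = [["a", "b"], ["c"], ["d", "e"]]
doc_vectors = {
    "a": [0.9, 0.1, 0.0],
    "b": [0.8, 0.0, 0.2],
    "c": [0.1, 0.9, 0.0],
    "d": [0.0, 0.1, 0.9],
    "e": [0.2, 0.0, 0.8],
}
query = [1.0, 0.2, 0.0]

print(map_assisted_search(query, unit_models, unit_docs, doc_vectors,
                          n_units=1, n_results=2))
```

The saving comes from step 1: the number of map units is typically far smaller than the number of documents, so most documents are never compared with the query. The trade-off is that relevant documents lying outside the selected units are missed, which is why n_units is left adjustable here.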
Self Organization of a Massive Document Collection.
T. Kohonen, S. Kaski, K. K. Lagus, J. Salojärvi, V. Paatero, and A. Saarela.
This article describes the implementation of a system that is able to organize vast document collections according to textual similarities. It is based on the Self-Organizing Map (SOM) algorithm. As the feature vectors for the documents we use statistical representations of their vocabularies. The main goal in our work has been to scale up the SOM algorithm to be able to deal with large amounts of high-dimensional data. In a practical experiment we mapped 6,840,568 patent abstracts onto a 1,002,240-node SOM. As the feature vectors we used 500-dimensional vectors of stochastic figures obtained as random projections of weighted word histograms.
Self organization of a massive text document collection
Teuvo Kohonen, Samuel Kaski, Krista Lagus, Jarkko Salojärvi, Jukka Honkela, Vesa Paatero, and Antti Saarela
When the SOM is applied to the mapping of documents, one can represent them statistically by their weighted word frequency histograms or some reduced representations of the histograms that can be regarded as data vectors. We have made such a SOM of about seven million documents, viz. of all of the patent abstracts in the world that have been written in English and are available in electronic form. The map consists of about one million models (nodes). Keywords or key texts can be used to search for the most relevant documents first. New effective coding and computational schemes of the mapping are described.
Keywords: large self-organizing maps, text exploration, knowledge discovery, patent abstracts, content-addressable search.
Keyword selection method for characterizing text document maps
Krista Lagus and Samuel Kaski
Characterization of subsets of data is a recurring problem in data mining. We propose a keyword selection method that can be used for obtaining characterizations of clusters of data whenever textual descriptions can be associated with the data. Several methods that cluster data sets or form projections of data provide an order or distance measure of the clusters. If such an ordering of the clusters exists or can be deduced, the method utilizes the order to improve the characterizations. The proposed method may be applied, for example, to characterizing graphical displays of collections of data ordered e.g. with the SOM algorithm. The method is validated using a collection of 10,000 scientific abstracts from the INSPEC database organized on a WEBSOM document map.
WEBSOM for textual data mining
Krista Lagus, Timo Honkela, Samuel Kaski and Teuvo Kohonen
New methods that are user-friendly and efficient are needed for guidance among the masses of textual information available on the Internet and the World Wide Web. We have developed a method and a tool called the WEBSOM which utilizes the self-organizing map algorithm (SOM) for organizing large collections of text documents onto visual document maps. The approach to processing text is statistically oriented, computationally feasible, and scalable: over a million text documents have been ordered on a single map. In the article we consider different kinds of information needs and tasks regarding organizing, visualizing, searching, categorizing and filtering textual data. Furthermore, we discuss and illustrate with examples how document maps can aid in these situations. An example is presented where a document map is utilized as a tool for visualizing and filtering a stream of incoming electronic mail messages.
WEBSOM - Self-Organizing Maps of Document Collections
Samuel Kaski, Timo Honkela, Krista Lagus, Teuvo Kohonen
With the WEBSOM method a textual document collection may be organized onto a graphical map display that provides an overview of the collection and facilitates interactive browsing. Interesting documents can be located on the map using a content-directed search. Each document is encoded as a histogram of word categories which are formed by the Self-Organizing Map (SOM) algorithm based on the similarities in the contexts of the words. The encoded documents are organized on another Self-Organizing Map, a document map, on which nearby locations contain similar documents. Special consideration is given to the computation of very large document maps which is possible with general-purpose computers if the dimensionality of the word category histograms is first reduced with a random mapping method and if computationally efficient algorithms are used in computing the SOMs.
Keywords: Data mining; Information retrieval; Self-Organising Map; SOM; WEBSOM
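The encoding step described above, where each document becomes a histogram of word categories, can be sketched as follows. This is a minimal sketch under assumptions: in WEBSOM the categories are formed by a SOM from word contexts, whereas the word_to_category table here is a small hand-made stand-in, and the function name encode_document is hypothetical.

```python
from collections import Counter


def encode_document(tokens, word_to_category, n_categories):
    """Count how many tokens of a document fall into each word category."""
    counts = Counter(word_to_category[t] for t in tokens
                     if t in word_to_category)
    # Words without a category are ignored; absent categories count as zero.
    return [counts.get(c, 0) for c in range(n_categories)]


# Hypothetical category table standing in for a trained word category map.
word_to_category = {
    "map": 0, "maps": 0, "display": 0,
    "search": 1, "retrieval": 1, "query": 1,
    "word": 2, "words": 2,
}

tokens = ("interesting documents can be located on the map display "
          "using a search").split()
print(encode_document(tokens, word_to_category, 3))  # [2, 1, 0]
```

Histograms of this kind would then serve as the input vectors for the second SOM, the document map. Any weighting or dimensionality reduction applied to them before that step is omitted from this sketch.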
Dimensionality reduction by random mapping: Fast similarity computation for clustering
Samuel Kaski
When the data vectors are high-dimensional it is computationally infeasible to use data analysis or pattern recognition algorithms which repeatedly compute similarities or distances in the original data space. It is therefore necessary to reduce the dimensionality before, for example, clustering the data. If the dimensionality is very high, as in the WEBSOM method which organizes textual document collections on a Self-Organizing Map, then even commonly used dimensionality reduction methods such as principal component analysis may be too costly. It will be demonstrated that the document classification accuracy obtained after the dimensionality has been reduced using a random mapping method will be almost as good as the original accuracy if the final dimensionality is sufficiently large (about 100 out of 6000). In fact, it can be shown that the inner product (similarity) between the mapped vectors follows closely the inner product of the original vectors.
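The claim about inner products can be tried out numerically with a small experiment such as the one below. It is a minimal sketch under assumptions, not the paper's experimental setup: the random matrix is taken to have Gaussian entries with columns normalized to unit length, the dimensions are scaled down to 1000 and 100 to keep the pure-Python run short, and all names (random_columns, random_map, sparse_vector) are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def random_columns(d_in, d_out, rng):
    """One random unit-length column of size d_out per original dimension."""
    columns = []
    for _ in range(d_in):
        col = [rng.gauss(0.0, 1.0) for _ in range(d_out)]
        norm = math.sqrt(dot(col, col))
        columns.append([c / norm for c in col])
    return columns


def random_map(x, columns):
    """Map x to the lower-dimensional space as a weighted sum of columns."""
    d_out = len(columns[0])
    y = [0.0] * d_out
    for x_j, col in zip(x, columns):
        if x_j != 0.0:  # sparse inputs only touch a few columns
            for i in range(d_out):
                y[i] += x_j * col[i]
    return y


D_IN, D_OUT = 1000, 100  # assumed sizes, smaller than those in the abstract
rng = random.Random(0)
columns = random_columns(D_IN, D_OUT, rng)


def sparse_vector(active):
    """Unit-length vector with equal weight on the given dimensions."""
    v = [0.0] * D_IN
    for j in active:
        v[j] = 1.0
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


x = sparse_vector(range(0, 40))
y = sparse_vector(range(20, 60))  # shares half of its dimensions with x

print("original inner product:", dot(x, y))
print("mapped inner product:  ", dot(random_map(x, columns),
                                     random_map(y, columns)))
```

The two printed values should be close but not identical, since the random columns are only approximately orthogonal. Raising D_OUT tightens the agreement at the cost of more computation per comparison.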
Generalizability of the WEBSOM Method to Document Collections of Various Types
Krista Lagus
WEBSOM is a method in which the self-organizing map algorithm is used to automatically organize collections of documents on a map to enable easy exploration of the collection. This article illustrates with case studies how collections of various types of text can be successfully organized using the WEBSOM. The emphasis is on describing the particular challenges that each type of material poses, as well as on identifying properties of a text collection that affect the choices made at each processing stage. Properties such as the size of the document collection, the size of the vocabulary, the domain, the style of writing, and the language are considered.
Statistical aspects of the WEBSOM system in organizing document collections
S. Kaski, K. Lagus, T. Honkela, and T. Kohonen
WEBSOM is a novel method for organizing document collections onto map displays to enhance the interactive browsing and retrieval of the documents. The map is organized automatically according to the contents of the full-text documents by the Self-Organizing Map algorithm. The map display provides a visual overview of the whole document collection. The overview, the map display, aids in the exploration since similar documents are located close to each other. In this paper we describe the WEBSOM system in a statistically oriented fashion and discuss its relations to other methods. Particular emphasis is put on how effective the methods are in treating large document collections. The two-phase architecture of the WEBSOM system makes it possible to build contextual information about the relations of words off-line into a word category representation, which can then be utilized rapidly on-line, when the documents are being encoded. The construction of large map displays from the encoded document representations is a computationally intensive operation when done in a straightforward manner. There exist, however, several effective computational shortcuts.
WEBSOM - self-organizing maps of document collections
Timo Honkela, Samuel Kaski, Krista Lagus, and Teuvo Kohonen
Searching for relevant text documents has traditionally been based on keywords and Boolean expressions of them. Often the search results show high recall and low precision, or vice versa. Considerable efforts have been made to develop alternative methods, but their practical applicability has been low. Powerful methods are needed for the exploration of miscellaneous document collections. The WEBSOM method organizes a document collection on a map display that provides an overview of the collection and facilitates interactive browsing. Interesting documents can be retrieved by a content addressable search of interesting map locations. The interesting locations could also be marked as filters for collecting interesting new documents.
Creating an order in digital libraries with self-organizing maps
Samuel Kaski, Timo Honkela, Krista Lagus, and Teuvo Kohonen
Formulation of suitable search expressions for information retrieval from large full-text databases may currently require considerable efforts. Changing the scope of the search when, e.g., too many or too few hits have been obtained, requires re-formulation of the search expression. For an alternative scheme we suggest an explorative full-text information retrieval method, where the Self-Organizing Map (SOM) algorithm is used to order documents based on their full textual contents. The visualized order can then be utilized for an explorative search or exploration of novel knowledge areas, whereby the scope can be changed interactively. The ordering of the documents is achieved by a two-level analysis: First, word categories are extracted from the text by a "semantic" SOM. Second, the textual context of the documents is encoded on the basis of the histograms of words formed on the word category map.
Self-organizing maps of document collections: A new approach to interactive exploration
Krista Lagus, Timo Honkela, Samuel Kaski, and Teuvo Kohonen
Powerful methods for interactive exploration and search from collections of free-form textual documents are needed to manage the ever-increasing flood of digital information. In this article we present a method, WEBSOM, for automatic organization of full-text document collections using the self-organizing map (SOM) algorithm. The document collection is ordered onto a map in an unsupervised manner utilizing statistical information of short word contexts. The resulting ordered map where similar documents lie near each other thus presents a general view of the document space. With the aid of a suitable (WWW-based) interface, documents in interesting areas of the map can be browsed. The browsing can also be interactively extended to related topics, which appear in nearby areas on the map. Along with the method we present a case study of its use.
Very large two-level SOM for the browsing of newsgroups
Teuvo Kohonen, Samuel Kaski, Krista Lagus, and Timo Honkela
On January 19, 1996 we published on the Internet a demo of how to use Self-Organizing Maps (SOMs) for the organization of large collections of full-text files. Later we added other newsgroups to the demo. It can be found at the address http://websom.hut.fi/websom/. In the present paper we describe the main features of this system, called the WEBSOM, as well as some newer developments of it.
Exploration of full-text databases with self-organizing maps
Timo Honkela, Samuel Kaski, Krista Lagus, and Teuvo Kohonen
Availability of large full-text document collections in electronic form has created a need for intelligent information retrieval techniques. Especially the expanding World Wide Web presupposes methods for systematic exploration of miscellaneous document collections. In this paper we introduce a new method, the WEBSOM, for this task. Self-Organizing Maps (SOMs) are used to represent documents on a map that provides an insightful view of the text collection. This view visualizes similarity relations between the documents, and the display can be utilized for orderly exploration of the material rather than having to rely on traditional search expressions. The complete WEBSOM method involves a two-level SOM architecture comprising a word category map and a document map, and means for interactive exploration of the database.