Plenary Talks
Language-enabled Virtual Actors
Terminology Mining:
Machine learning for Web information extraction

The Web contains numerous sources of valuable data, from weather reports and stock quotes to retail product catalogs and newspaper archives. Researchers and companies worldwide are devising a wide variety of next-generation information services that will capitalize on these resources. Unfortunately, information extraction is a challenging bottleneck in deploying such applications and services. Information extraction is the process of automatically retrieving relevant content from such resources and translating it into a suitable internal representation. Information extraction in the context of the Web poses many difficult challenges, but the general problem is also constrained in various ways. I will survey recent research on Web information extraction, focusing on the use of machine learning techniques to automatically construct Web information-extraction components ("wrappers").

By Dan Moldovan (University of Texas, Dallas, USA)

The problem of open-domain Question Answering (QA) is to answer questions expressed in natural language by searching large collections of documents. The complexity of the question answering process varies from simple extraction of facts by matching question keywords to advanced text understanding and reasoning. Because of the strong need for natural language processing in QA, this technology cannot be implemented merely as a simple extension of Information Retrieval (IR) or Information Extraction (IE). However, IR is used in QA to identify documents or paragraphs where potential answers may be found. Similarly, techniques from IE, such as named entity recognition, are indispensable in QA. This presentation discusses the role of IR and IE in QA and shows that trade-offs between IR and IE are possible when implementing state-of-the-art QA systems.
Agents based ontological mediation in IE systems

Building more adaptive SW applications is a crucial issue in scaling up IE technology to the Web, where information is organized according to different underlying knowledge and/or presentation models. Information agents are increasingly being adopted to extract relevant information from semi-structured web sources. To manage heterogeneous information sources efficiently, they must be able to cooperate, share their knowledge, and agree upon appropriate terminology to be used during interaction. Because the internal knowledge representation can vary among participants, it is infeasible to communicate objective concepts directly; agent autonomy promotes abstraction from details about the internal structuring of other agents. We will discuss the main issues involved in using natural language to achieve semantic agreement in communication, and we will present a novel architecture based on a pool of intelligent agents. This is done by defining a communication model that enforces a strong separation between terms and concepts, a distinction often undervalued in the literature, where terms play the ambiguous roles of both concept labels and communication lexicon. For agents communicating through language, lexical information instead embodies the possibility of "expressing" the underlying conceptualisations and thus agreeing on a shared representation. To make the resulting architecture adaptive to the application domain, we will introduce three different agent types: resource agents, owning the target knowledge; service agents, providing basic skills to support complex activities; and control agents, supplying the structural knowledge of the task, with coordination and control capabilities.
We will focus on two dedicated service agents: a mediator, which handles understanding the information an agent wants to express as well as the way to present it to others, and a translator, which deals with lexical misalignment due to different languages. The resulting agent community dynamically assumes the most appropriate configuration, in a way that is transparent to the participants involved.

By Jun-ichi Tsujii (University of Tokyo, JP) and Toru Hisamitsu (Hitachi, JP)

Automatic term recognition (ATR) has been recognized as an important research area for information extraction, information retrieval, etc. The ATR methods proposed so far can be classified into two classes. One is based on the unithood of a word sequence, and the other is based on the representativeness of terms. While several measures have been proposed for the former, such as MI, the log-likelihood, NC-value, etc., not many measures for the latter have been proposed and tested. In this talk, we will discuss new methods of measuring the representativeness of terms, which have clear mathematical interpretations, and compare their performance with conventional tf/idf-like measures.

Acquisition of Domain Knowledge

Information Extraction systems typically operate in a specific domain, and need to be adapted for every new domain and scenario of interest. Adaptation for a particular domain entails the collection of knowledge that is needed to operate within that domain. Experience indicates that such collection cannot be undertaken by manual means only (i.e., by enlisting domain experts to provide expertise, and/or computational linguists to induct the expertise into the system), as the costs compromise the enterprise. This is known as the acquisition bottleneck. Much attention has focused on automating the process of knowledge acquisition in general, and for IE in particular. Linguistic knowledge in NL understanding systems is commonly stratified across several levels.
This is true of Information Extraction as well. Typical state-of-the-art IE systems require, among other kinds of knowledge, a specialized lexicon of terms not found in general-purpose dictionaries; domain-specific word or concept classes for semantic generalization; and syntactic-semantic patterns for locating facts or events in text. We describe an approach to unsupervised, or minimally supervised, knowledge acquisition. The approach is based on bootstrapping a comprehensive knowledge base from a small set of seed elements. It is embodied in algorithms for the discovery of lexicon, concept classes, and patterns from raw, un-annotated text. We present the results of knowledge acquisition and examine them in the context of other relevant work. We also address problems in evaluating the quality of the acquired knowledge and present methodologies for evaluation.

Talks Timeschedule:
© SCIE 2002