UNIVERSITY OF TOR VERGATA
3rd Summer Convention on Information Extraction - SCIE 2002
 

 

 

Plenary Talks

  • Language-enabled Virtual Actors
    By Marc Cavazza (University of Teesside, UK)
     
    An increasing number of computer applications incorporate intelligent characters, either in next-generation user interfaces (conversational characters) or in interactive digital media (computer games, interactive storytelling). The development of intelligent virtual characters raises expectations about their ability to understand natural language in context, and building language-enabled characters poses several technical challenges. The embedded NLP technologies must be sufficiently robust, particularly when the input is speech. However, the key aspect lies in the interface between the NLP component, the agent’s environment and its internal decision mechanisms, which are most often based on AI planning techniques. This requires a specific investigation into the modalities of human-agent communication, which are mainly human-agent dialogue and natural language instructions. Both can be studied through a speech acts approach, though the practical implications differ significantly between the two cases. Speech act interpretation can also be related to several paradigms of interactivity that define the relationships between the user and the virtual actor.
    We will discuss some state-of-the-art approaches and illustrate them by describing fully implemented systems in interactive storytelling and conversational characters. Beyond the core NLP techniques used, we will emphasise the interpretation of NLP output in terms of agents’ behaviour, which, more than a simple integration issue, should be an active research area that develops another view of NL semantics.

  • Terminology Mining
    By Béatrice Daille (University of Nantes, FR)

    Terminology mining is a major step forward in terminology extraction. It covers the acquisition of candidate terms as well as their structuring, by adding relations between terms and/or grouping terms into semantic classes. After a review of the various approaches to terminology mining, we present a combination of linguistic and statistical approaches that performs the acquisition of complex terms, the discovery of new terms, and the structuring of the acquired candidate terms. The terminology mining is based on the linguistic properties of term variations.
    To conclude, a review of evaluation methods for terminology extraction is presented, and results on the effectiveness of such methods in evaluation campaigns are discussed.

  • Machine learning for Web information extraction
    By Nicholas Kushmerick (University College Dublin, IE)

    The Web contains numerous sources of valuable data, from weather reports and stock quotes to retail product catalogs and newspaper archives. Researchers and companies worldwide are devising a wide variety of next-generation information services that will capitalize on these resources. Unfortunately, information extraction is a challenging bottleneck in deploying such applications and services. Information extraction is the process of automatically retrieving relevant content from such resources and translating it into some suitable internal representation. Information extraction in the context of the Web poses many difficult challenges, but the general problem is also constrained in various ways. I will survey recent research on Web information extraction, focusing on the use of machine learning techniques to automatically construct Web information-extraction components ("wrappers").
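
    To make the idea of a learned "wrapper" concrete, the following is a minimal sketch and not the method surveyed in the talk. It assumes that the target value on each page is surrounded by fixed left and right delimiter strings, and it learns those delimiters from a few labelled example pages. The function names (`learn_wrapper`, `extract`) and the sample pages are hypothetical.

```python
import os

def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]

def learn_wrapper(examples):
    """Learn (left, right) delimiters from (page, value) training pairs.

    Assumption: each value occurs in its page, and its first occurrence
    is the one that was labelled.
    """
    lefts, rights = [], []
    for page, value in examples:
        start = page.index(value)
        lefts.append(page[:start])                # text before the value
        rights.append(page[start + len(value):])  # text after the value
    return common_suffix(lefts), os.path.commonprefix(rights)

def extract(page, wrapper):
    """Apply a learned wrapper to an unseen page."""
    left, right = wrapper
    start = page.index(left) + len(left)
    end = page.index(right, start)
    return page[start:end]

# Hypothetical training pages with the temperature labelled as the target.
examples = [
    ("<html><b>Rome</b> <i>31C</i></html>", "31C"),
    ("<html><b>Dublin</b> <i>17C</i></html>", "17C"),
]
wrapper = learn_wrapper(examples)
result = extract("<html><b>Nantes</b> <i>22C</i></html>", wrapper)
```

    With these sample pages the learned delimiters are `</b> <i>` and `</i></html>`. Real pages vary far more than this, which is why the sketch should be read only as an illustration of learning an extractor from examples.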

  • Interrelation between IR and IE in Question Answering
    By Dan Moldovan (University of Texas, Dallas, USA)

    The problem of open-domain Question Answering is to answer questions expressed in natural language by searching large collections of documents. The complexity of the question answering process varies from simple extraction of facts by matching question keywords to advanced text understanding and reasoning. Because of the strong need for natural language processing in QA, this technology cannot be implemented merely as a simple extension of Information Retrieval or Information Extraction. However, IR is used in QA to identify documents or paragraphs where potential answers may be found. Similarly, techniques from IE, such as named entity recognition, are indispensable in QA.
    In this presentation the role of IR and IE in QA is discussed and it is shown that trade-offs between IR and IE are possible when implementing state-of-the-art QA systems.

  • Agent-based ontological mediation in IE systems
    By Maria Teresa Pazienza (University of Rome "Tor Vergata", IT)

    Building more adaptive SW applications is crucial for scaling IE technology up to the Web, where information is organized according to different underlying knowledge and/or presentation models. Information agents are increasingly being adopted to extract relevant information from semi-structured Web sources. To manage heterogeneous information sources efficiently, they must be able to cooperate, share their knowledge, and agree upon appropriate terminology to be used during interaction. As the internal knowledge representation may vary among participants, it is unfeasible to communicate objective concepts directly; agent autonomy promotes abstraction from details about the internal structuring of other agents.
    We will discuss the main issues involved in using natural language to achieve semantic agreement in communication, and we will present a novel architecture based on a pool of intelligent agents. This will be done by defining a communication model that foresees a strong separation between terms and concepts, a distinction often undervalued in the literature, where terms play the ambiguous roles of both concept labels and communication lexicon. For agents communicating through language, lexical information instead embodies the possibility of "expressing" the underlying conceptualisations, and thus of agreeing on a shared representation.
    To make the resulting architecture adaptive to the application domain, we will introduce three agent types: resource agents, which own the target knowledge; service agents, which provide basic skills to support complex activities; and control agents, which supply the structural knowledge of the task along with coordination and control capabilities. We will focus our attention on two dedicated service agents: a mediator, which takes care of understanding the information an agent wants to express as well as the way to present it to others, and a translator, which deals with lexical misalignment due to different languages.
    The resulting agent community dynamically assumes the most appropriate configuration, in a transparent way with respect to the involved participants.

  • Measuring Term Representativeness
    By Jun-ichi Tsujii (University of Tokyo, JP) and Toru Hisamitsu (Hitachi, JP)

    Automatic term recognition (ATR) has been recognized as an important research area for information extraction, information retrieval, etc. The ATR methods proposed so far can be classified into two classes. One is based on the unithood of a word sequence, and the other is based on the representativeness of terms. While several measures have been proposed for the former, such as MI, the log-likelihood, NC-value, etc., not many measures for the latter have been proposed and tested. In this talk, we will discuss new methods of measuring the representativeness of terms, which have clear mathematical interpretations, and compare their performance with conventional tf/idf-like measures.
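
    As an illustration of a unithood measure of the MI kind mentioned above, here is a minimal sketch that scores adjacent word pairs by pointwise mutual information. It is not one of the representativeness measures presented in the talk; the function name `pmi_scores`, the maximum-likelihood probability estimates, and the toy token sequence are assumptions made for the example.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Pointwise mutual information, in bits, for each adjacent word pair.

    Probabilities are plain relative frequencies (an assumption; real
    systems usually smooth them or apply a frequency threshold).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI compares the observed pair probability with the
        # probability expected if the two words were independent.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Toy corpus, hypothetical.
tokens = ("information extraction needs term recognition and "
          "information extraction needs data").split()
scores = pmi_scores(tokens)
```

    In this toy corpus the pair ("term", "recognition") scores higher than ("information", "extraction") because both of its words occur only once. This shows the known bias of MI towards rare events, one reason why it is usually combined with frequency information.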
     
  • Acquisition of Domain Knowledge
    By Roman Yangarber (New York University, USA)

    Information Extraction systems typically operate in a specific domain, and need to be adapted for every new domain and scenario of interest.
    Adaptation for a particular domain entails the collection of knowledge that is needed to operate within that domain. Experience indicates that such collection cannot be undertaken by manual means only (i.e., by enlisting domain experts to provide expertise, and/or computational linguists to encode that expertise in the system), as the costs compromise the enterprise. This is known as the acquisition bottleneck.
    Much attention has focused on automating the process of knowledge acquisition in general, and for IE in particular.
    Linguistic knowledge in NL understanding systems is commonly stratified across several levels. This is true of Information Extraction as well.
    Typical state-of-the-art IE systems require a specialized lexicon of terms not found in general-purpose dictionaries; domain-specific word or concept classes for semantic generalization; and syntactic-semantic patterns for locating facts or events in text, among other kinds of knowledge.
    We describe an approach to unsupervised, or minimally supervised, knowledge acquisition. The approach is based on bootstrapping a comprehensive knowledge base from a small set of seed elements. Our approach is embodied in algorithms for discovery of lexicon, concept classes, and patterns, from raw, un-annotated text.
    We present the results of knowledge acquisition, and examine them in the context of other relevant work. We address problems in evaluating the quality of the acquired knowledge, and present methodologies for evaluation.
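
    The bootstrapping idea described above can be sketched in a few lines. This is a toy illustration under strong assumptions, not the algorithms presented in the talk: a "pattern" is simply the pair of words immediately before and after a known term, every matching pattern is accepted without scoring, and the corpus, seed, and function name `bootstrap` are hypothetical.

```python
def bootstrap(sentences, seeds, rounds=2):
    """Grow a term set and a pattern set from a few seed terms.

    Each round alternates two steps:
      1. contexts of known terms become patterns;
      2. words that fill a known pattern become terms.
    """
    terms = set(seeds)
    patterns = set()
    for _ in range(rounds):
        # Step 1: collect (previous word, next word) contexts of known terms.
        for words in sentences:
            for i in range(1, len(words) - 1):
                if words[i] in terms:
                    patterns.add((words[i - 1], words[i + 1]))
        # Step 2: accept any word that appears inside a known context.
        for words in sentences:
            for i in range(1, len(words) - 1):
                if (words[i - 1], words[i + 1]) in patterns:
                    terms.add(words[i])
    return terms, patterns

# Raw, un-annotated toy corpus and a single seed term.
corpus = [s.split() for s in [
    "the company acquired Acme last year",
    "the company acquired Globex last year",
    "shares of Globex rose sharply",
    "shares of Initech rose sharply",
]]
terms, patterns = bootstrap(corpus, seeds={"Acme"})
```

    Starting from the seed "Acme", the first round finds "Globex" through the shared context, and the second round uses a new context of "Globex" to reach "Initech". Because nothing is scored or filtered here, a single overly general pattern would pull in unrelated words, so a practical system needs some form of pattern ranking.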

    Talks Time Schedule:

    Mon July 15 | Tue July 16 | Wed July 17 | Thu July 18 | Fri July 19
    9.15 - 10.45 Yangarber   Tsujii/Hisamitsu Moldovan Daille
    Coffee-Break 
    11.00 - 12.45 Moldovan Cavazza Yangarber Kushmerick Tsujii/Hisamitsu
    Lunch
    13.45 - 15.15 Cavazza Daille Pazienza  Pazienza
    Coffee-Break
    15.30 - 17.00   Kushmerick       
    Bus Departure 

    © SCIE 2002
