Overview

   The KnowledgeSite Abstracter is a technology that prepares document summaries for use in information retrieval systems. KnowledgeSite Abstracter is an integrated solution consisting of two main components:

  • A configurable batch processor that produces suggested abstracts for documents of different types.
  • An abstracter's workbench that facilitates rapid review and modification of the suggested abstracts.

Benefits

   High-quality abstracts are known to enhance the effectiveness of knowledge workers who rely on the information contained in document retrieval systems to do their jobs. KnowledgeSite Abstracter enables you to:

  • reduce the cost and production time required to generate quality abstracts
  • expand the range of situations in which document summarization is economically viable

Who is the KnowledgeSite Abstracter designed for?

   The KnowledgeSite Abstracter is designed for electronic publishing operations that supply large document retrieval systems. The technology makes it easy to generate low-cost abstracts while maintaining high editorial standards.

   The abstracts produced by the KnowledgeSite Abstracter can be adapted to a wide variety of abstracting applications. The batch processor runs in an unattended mode that analyzes each document and produces suggested abstracts without human intervention. When quick turnaround and low cost are your most important criteria, these abstracts can be saved directly into the target repository.

   When your primary objective is editorial quality, you can use the Abstracter's Workbench to review and edit the suggested abstracts. A browser-based interface presents the summary and other concise information about each original document to the abstracter, who reviews each suggested abstract, edits it if necessary, and then saves the final result in the target repository.

The Batch Processor

   The batch processor operates on individual electronic documents or entire repositories of electronic documents. It can handle most text formats including PDF and PDF Image+Text documents.

   The KnowledgeSite Abstracter addresses three critical problems common to high-volume metadata creation:

  1. Dealing with unstructured data- In high-volume publishing operations, the origin of the documents typically ranges from OCR'ed versions of printed material to the output of myriad independent desktop publishing operations. The KnowledgeSite Abstracter can be configured to correct many of the problems inherent to this unstructured and variable input.
  2. Isolating the key document sentences- Unlike simple sentence selectors, the KnowledgeSite Abstracter can be configured to detect genre-specific document structure, emphasizing important sections of the document and de-emphasizing unimportant ones.
  3. Sentence post-processing- The KnowledgeSite Abstracter goes several steps further than extractors based on sentence selection. It can perform synonym replacement, convert first person references to third person and perform other modifications to the chosen sentences. The result is a summary that is closer to a human-written abstract than what is possible with extractors limited to sentence selection.

   The batch processor solves these problems using the configurable pipeline architecture shown below.

The Batch Processor Pipeline

   The stages in the pipeline perform the following tasks:

  1. Text Extract- When the origin document is PDF or PDF Image+Text, this stage isolates the text component
  2. Break Detector- Determines the section, heading, paragraph and sentence boundaries in the document and converts the text to XML. From this stage forward, all processing is done in XML
  3. Genre Detection- Determines the type of document
  4. Section Zoning- Based on the type of document, chooses sections for emphasis or hides sections to de-emphasize
  5. Sentence Selection- Ranks the top sentences based on linguistic analysis of the structured document
  6. Part of Speech- For each ranked sentence, determines the part of speech and other linguistic data for each word
  7. Anaphor Resolution- Identifies the antecedent of any pronoun in a chosen sentence and replaces the pronoun with the antecedent
  8. Spell Correct- Replaces misspelled words with the correct spelling
  9. Synonym Edit- Replaces target key phrases with a randomized selection from a knowledge base of allowable replacement phrases
  10. To Third Person- Converts first person pronouns to third person pronouns, adjusting the verbs as necessary

   Each stage in the pipeline can be turned on or off depending on the type of documents being processed, the tasks that need to be performed and the type of results desired. Each stage in the pipeline is also configurable by other installation-specific settings. These configuration settings influence the behavior of steps like synonym replacement, spelling correction, etc.
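
   As a rough illustration of how stages can be switched on or off per installation, here is a minimal Python sketch. It is not the KnowledgeSite implementation: the `Pipeline` class, the `enabled` settings dictionary and the two toy stages are hypothetical stand-ins, and the real stages operate on XML rather than a plain dictionary.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A stage receives the working document (a plain dict here) and returns it.
Stage = Callable[[dict], dict]


@dataclass
class Pipeline:
    """Runs named stages in order, skipping any that are switched off."""
    stages: List[Tuple[str, Stage]]
    enabled: Dict[str, bool] = field(default_factory=dict)  # hypothetical settings

    def run(self, doc: dict) -> dict:
        for name, stage in self.stages:
            # A stage runs unless the configuration explicitly disables it.
            if self.enabled.get(name, True):
                doc = stage(doc)
                doc.setdefault("trace", []).append(name)
        return doc


def break_detector(doc: dict) -> dict:
    # Toy stand-in: splits on periods only. The stage described above also
    # finds section, heading and paragraph boundaries.
    parts = doc["text"].split(".")
    doc["sentences"] = [p.strip() + "." for p in parts if p.strip()]
    return doc


def spell_correct(doc: dict) -> dict:
    # Toy stand-in: a one-entry correction table (an assumption for the demo).
    fixes = {"teh": "the"}
    doc["sentences"] = [
        " ".join(fixes.get(word, word) for word in sentence.split())
        for sentence in doc["sentences"]
    ]
    return doc


STAGES = [("break_detector", break_detector), ("spell_correct", spell_correct)]

# Same stages, two configurations: spelling correction on and off.
with_spelling = Pipeline(STAGES).run({"text": "We tested teh system. It works."})
without_spelling = Pipeline(STAGES, enabled={"spell_correct": False}).run(
    {"text": "We tested teh system. It works."}
)
```

   Keeping the stage list fixed and moving the on/off decision into settings means the same installation can process different document types by changing configuration only.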

   When the pipeline is finished, it saves the abstract as an XML, HTML or plain text document.

The Abstracter's Workbench

   The Abstracter's Workbench permits rapid review and editing of the suggested abstracts.

The Abstracter's Workbench

How it Works

   The Abstracter's Workbench is designed for rapid review and editing of suggested abstracts. In high-volume document processing, the cost of abstracting is often directly dependent on

  1. the number of minutes spent producing each abstract
  2. the percentage of documents that are suitable for informative abstracts (abstracts using the author's voice based on sentences the author wrote) or indicative abstracts (interpretive abstracts based on the reader's perspective)

   Since not every document can be effectively summarized by choosing and operating on key sentences, a small percentage of the suggested abstracts may not meet the editorial objectives. In these instances, the Abstracter’s Workbench gives the editor the option to review and edit the suggested abstract or to set it aside for further work.

Abstracter's Workbench - Features

Article View

   The Article view pane is shown above on the left side of the screen. The article pane is used to view different layouts of the original article. The Abstracter’s Workbench displays the source document in its original form or in an interactive XML view that highlights the sections of the document used in the suggested abstract.

Abstract Edit

   The Abstract Edit pane is shown above on the right side of the screen. The abstract pane is used to list or edit the suggested abstracts that need review. It has menu options for selecting, editing and saving finished abstracts. It also has an options screen that allows the user to adjust the size and display of the suggested abstract.

   Within the Abstract Edit pane, each sentence shows the rank assigned during the selection process. There are checkboxes next to each selected sentence that facilitate inclusion or exclusion of a selected sentence. There is also a radio button for choosing the 'hook' sentence. The hook sentence will be made the first sentence in the saved abstract. Finally, the sentence appears within an edit box that the abstracter can use to change the sentence text.

   The Abstract Edit pane also makes it easy to adjust the abstract in other ways. Other sentences can be easily cut and pasted from the source document by clicking on them. Sentence order can be quickly adjusted using up and down arrow icons.
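
   The effect of the checkboxes and the hook radio button on the saved abstract can be pictured with a short Python sketch. This is an illustration only, under stated assumptions: the `ReviewedSentence` record and the `assemble_abstract` function are hypothetical names, and the workbench itself is a JSP web application, not Python.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReviewedSentence:
    """One row of the edit pane (hypothetical structure)."""
    text: str
    rank: int               # rank assigned during sentence selection
    included: bool = True   # state of the checkbox
    hook: bool = False      # state of the 'hook' radio button


def assemble_abstract(sentences: List[ReviewedSentence]) -> str:
    """Join the checked sentences in pane order, moving the hook to the front."""
    kept = [s for s in sentences if s.included]
    # A radio button allows one choice, so only the first hook is honored.
    hook = next((s for s in kept if s.hook), None)
    ordered = ([hook] if hook else []) + [s for s in kept if s is not hook]
    return " ".join(s.text for s in ordered)


pane = [
    ReviewedSentence("The study covers three regions.", rank=1),
    ReviewedSentence("Funding came from several sources.", rank=2, included=False),
    ReviewedSentence("Sales doubled in one year.", rank=3, hook=True),
]
abstract = assemble_abstract(pane)
```

   In this sketch the list order stands in for the order set with the up and down arrow icons, so reordering the list before calling `assemble_abstract` changes the saved text accordingly.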

Fits With Your Editorial Policies

   The number of sentences that are presented in the suggested abstract depends on configuration parameters set during the review. These are based on editorial policy for that type or length of document. The editorial policy can set a word minimum or maximum, or a sentence minimum or maximum. These choices, in combination with the configuration options established in the Batch Processor, allow you to closely tailor the final abstracts to the product or application in which they will be used.
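
   To make the word and sentence limits concrete, the following Python sketch shows one plausible way such a policy could be applied to a ranked list of sentences. The function names and parameters (`apply_policy`, `meets_minimums`, `max_sentences`, `max_words`, `min_sentences`, `min_words`) are assumptions made for this example, not product settings.

```python
from typing import List, Optional


def apply_policy(
    ranked: List[str],
    max_sentences: Optional[int] = None,
    max_words: Optional[int] = None,
) -> List[str]:
    """Take sentences in rank order until a maximum would be exceeded."""
    chosen: List[str] = []
    words = 0
    for sentence in ranked:
        if max_sentences is not None and len(chosen) >= max_sentences:
            break
        count = len(sentence.split())  # simple whitespace word count
        if max_words is not None and words + count > max_words:
            break
        chosen.append(sentence)
        words += count
    return chosen


def meets_minimums(chosen: List[str], min_sentences: int = 0, min_words: int = 0) -> bool:
    """Report whether the selection satisfies the policy minimums."""
    total_words = sum(len(s.split()) for s in chosen)
    return len(chosen) >= min_sentences and total_words >= min_words


ranked = ["One two three.", "Four five.", "Six seven eight nine."]
by_sentences = apply_policy(ranked, max_sentences=2)
by_words = apply_policy(ranked, max_words=4)
```

   A selection that fails `meets_minimums` is the kind of case an abstracter might set aside for further work instead of saving directly.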

Integration

Integration With Your Editorial System

   No two publishing operations use the same workflow and editorial systems. KnowledgeSite performs on-site integration and custom engineering to adapt the batch processor and abstracter's workbench to the editorial system you currently use. We work with you to understand the unique requirements of your publishing operation and prepare quotations for custom features. Each integration project typically involves a pilot development, on-site training and ongoing technology support.

Technology Requirements

   The batch processor runs on Windows NT.

   The Abstracter's Workbench is a web application requiring a web server to run the application and a web browser on each abstracter's workstation. The server application is built with JSP templates that will run on any JSP engine such as Tomcat or any J2EE-compliant application server. The browser used by the abstracters needs to be JavaScript compliant.

Browser Info

   Some features of the Abstracter's Workbench, such as the sentence cut-and-paste capability, use features that are available in Internet Explorer 5.5 or later. For the best results, try the demo with a recent version of Internet Explorer. We also recommend pressing the 'F11' function key after starting the demo; this shrinks the IE menus and maximizes the amount of screen space available for viewing and editing. Press 'F11' again when done to restore the IE menus to their original state.

Note about PDF documents

   If you choose to send a PDF document to the KnowledgeSite Abstracter, it must be either a 'real' PDF document (that is, one produced by Adobe Acrobat from a desktop publishing system) or a PDF Image+Text document produced by an OCR package. PDF Image-only documents are not accepted as input, since the KnowledgeSite Abstracter operates on the text component of the document.

Custom Evaluations

   The KnowledgeSite Abstracter generates abstracts from text-based documents. A variety of results are possible by using different configuration settings for the pipeline. The settings depend primarily on the editorial objectives for the environment in which the final abstracts will be used. If you'd like to see how a customized setup of the KnowledgeSite Abstracter performs on a collection of your documents, contact us about preparing a custom evaluation.

Pricing

   The pricing model is based on usage levels and the amount of integration effort required. Due to the variety of installation scenarios possible, it's best to contact us