The Extractor API may be used in many different ways, depending on the intended application. Here are a few examples of what you might implement:
  • process one document
        example: a Summarize button in a word processor
  • process many documents, using the same stop words for all of them
        example: an index for a web site
  • process many documents, using different stop words for each one
        example: summarizing e-mail for a mail server
  • process many small sections (pages, chapters) from one large document (a book)
        example: a book index
        example: a book table of contents

The API is designed to allow maximum flexibility for a wide variety of applications.

    ===============================

    One Document, One Set of Stop Words

    This is a sketch of how to use the API to process a single text file. This example assumes that there is no need to customize the stop words.

    /* initialize */
    
    call ExtrCreateStopMemory();
    call ExtrCreateDocumentMemory();
    
    /* process text file */
    
    open the text file;
    while the end of the text file has not yet been reached {
        read a block of the text file into a buffer;
        call ExtrReadDocumentBuffer();
    }
    call ExtrSignalDocumentEnd();
    close the text file;
    
    /* print out keyphrases */
    
    call ExtrGetPhraseListSize();
    for i = 0 to (PhraseListSize - 1) do {
        call ExtrGetPhraseByIndex();
        display the i-th keyphrase to the user;
    }
    
    /* free memory */
    
    call ExtrClearStopMemory();
    call ExtrClearDocumentMemory();

    Note that Extractor does not manage the text buffer. Extractor reads the text buffer, but does not change the state of the text buffer in any way. The text buffer must be allocated and freed outside of Extractor.

    This sketch is essentially what is implemented in the API test wrapper, test_api.c.

    ===============================

    Many Documents, One Set of Stop Words

    This is a sketch of how to use the API to process many documents. This example assumes that there is no need to customize the stop words.

    /* initialize the stop words */
    
    call ExtrCreateStopMemory();
    
    /* process the text files */
    
    for each document in the list of documents {
    
        /* initialize the document memory */
    
        call ExtrCreateDocumentMemory();
    
        /* process the current document */
    
        open the text file for the current document;
        while the end of the text file has not yet been reached {
            read a block of the text file into a buffer;
            call ExtrReadDocumentBuffer();
        }
        call ExtrSignalDocumentEnd();
        close the text file for the current document;
    
        /* print out keyphrases */
    
        call ExtrGetPhraseListSize();
        for i = 0 to (PhraseListSize - 1) do {
            call ExtrGetPhraseByIndex();
            display the i-th keyphrase to the user;
        }
    
        /* free the document memory */
    
        call ExtrClearDocumentMemory();
    }
    
    /* free stop word memory */
    
    call ExtrClearStopMemory();

In this example, all of the documents share the same set of stop words. Therefore, the stop word memory is created only once. This is more efficient than calling ExtrCreateStopMemory inside the per-document loop.

    ===============================

Many Documents, Many Sets of Stop Words

    This is a sketch of how to use the API to process many documents. In this example, each document is processed with its own set of stop words.

    /* process the text files */
    
    for each document in the list of documents {
    
        /* initialize */
    
        call ExtrCreateDocumentMemory();
        call ExtrCreateStopMemory();
    
        /* load custom stop words */
    
        open the text file for the custom stop words for the current document;
        while the end of the text file has not yet been reached {
            read a stop word from the file;
            call ExtrAddStopWord();
        }
        close the text file for the custom stop words;
    
        /* process the current document */
    
        open the text file for the current document;
        while the end of the text file has not yet been reached {
            read a block of the text file into a buffer;
            call ExtrReadDocumentBuffer();
        }
        call ExtrSignalDocumentEnd();
        close the text file for the current document;
    
        /* print out keyphrases */
    
        call ExtrGetPhraseListSize();
        for i = 0 to (PhraseListSize - 1) do {
            call ExtrGetPhraseByIndex();
            display the i-th keyphrase to the user;
        }
    
        /* free memory */
    
        call ExtrClearDocumentMemory();
        call ExtrClearStopMemory();
    }

    If the application is a server with many different users, then the users could each have their own personal list of stop words. For example, if the server processes e-mail, then the users might want their own names to be stop words.

    ===============================

    Process a Document in Sections

    This is a sketch of how to use the API to process a large document, one section at a time. This example assumes that the same stop words are used for all sections.

    This could be useful for producing an annotated table of contents for a book. Each section in the book could be annotated by a list of keyphrases, where the keyphrases are extracted from that section alone.

This could also be useful for producing an index. Extractor generates a list of three to thirty keyphrases for each document that it processes (depending on ExtrSetNumberPhrases). Thirty keyphrases is not enough to make an index for a book. However, if the book is processed in blocks of about one to five pages per block, then Extractor will generate up to thirty keyphrases for each block. A two-hundred-page book, processed one page per block, could then yield up to six thousand keyphrases. This is more than enough to make a good index.

    /* initialize stop words */
    
    call ExtrCreateStopMemory();
    
    /* process document */
    
    open the text file for the document;
    while the end of the text file has not yet been reached {
    
        /* process sections */
    
        for each section of the document {
    
            /* initialize memory for current section */
    
            call ExtrCreateDocumentMemory();
    
            /* process current section */
    
            while the end of the section has not yet been reached {
                read a block of the current section into a buffer;
                call ExtrReadDocumentBuffer();
            }
            call ExtrSignalDocumentEnd();
    
            /* print out keyphrases */
    
            call ExtrGetPhraseListSize();
            for i = 0 to (PhraseListSize - 1) do {
                call ExtrGetPhraseByIndex();
                display the i-th keyphrase to the user;
            }
    
            /* free memory for current section */
    
            call ExtrClearDocumentMemory();
        }
    }
    close the text file;
    
    /* free stop words */
    
    call ExtrClearStopMemory();

    Note that Extractor can efficiently handle very large documents without requiring the documents to be split into smaller chunks. Splitting a document into sections is not necessary to increase the speed or capacity of Extractor.
