The Extractor API may be used in many different ways, depending on the intended application. The API is designed to allow maximum flexibility for a wide variety of applications. Here are a few examples of how you might use it.

One Document, One Set of Stop Words

This is a sketch of how to use the API to process a single text file. This example assumes that there is no need to customize the stop words.

    /* initialize */
    call ExtrCreateStopMemory();
    call ExtrCreateDocumentMemory();

    /* process the text file */
    open the text file;
    while the end of the text file has not yet been reached
    {
        read a block of the text file into a buffer;
        call ExtrReadDocumentBuffer();
    }
    call ExtrSignalDocumentEnd();
    close the text file;

    /* print out keyphrases */
    call ExtrGetPhraseListSize();
    for i = 0 to (PhraseListSize - 1) do
    {
        call ExtrGetPhraseByIndex();
        display the i-th keyphrase to the user;
    }

    /* free memory */
    call ExtrClearStopMemory();
    call ExtrClearDocumentMemory();

Note that Extractor does not manage the text buffer. Extractor reads the text buffer, but does not change its state in any way. The text buffer must be allocated and freed outside of Extractor. This sketch is essentially what is implemented in the API test wrapper, test_api.c.

===============================

Many Documents, One Set of Stop Words

This is a sketch of how to use the API to process many documents. This example assumes that there is no need to customize the stop words.
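In real C, the block-reading loop that appears in each of these sketches might look roughly like the following. For simplicity, the document here is an in-memory string rather than a file, and the Extr* functions are stand-in stubs; the real Extractor signatures may differ, so consult the API reference before adapting this.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in stubs -- the real Extractor API signatures may differ. */
static void ExtrReadDocumentBuffer(const char *buf, size_t len) { (void)buf; (void)len; }
static void ExtrSignalDocumentEnd(void) {}

/* Feed an in-memory document to Extractor in fixed-size blocks,
   mirroring the "read a block ... call ExtrReadDocumentBuffer"
   loop in the sketches. Returns the number of blocks fed. */
static int feed_document(const char *text, size_t block_size)
{
    size_t total = strlen(text);
    size_t offset = 0;
    int blocks = 0;

    while (offset < total) {
        size_t len = total - offset;
        if (len > block_size)
            len = block_size;
        /* Extractor only reads the buffer; the caller owns it. */
        ExtrReadDocumentBuffer(text + offset, len);
        offset += len;
        blocks++;
    }
    ExtrSignalDocumentEnd();   /* signal that the document is complete */
    return blocks;
}
```

The same shape works for a file: replace the string offset arithmetic with an fread loop, passing each filled buffer to ExtrReadDocumentBuffer until end-of-file.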
    /* initialize the stop words */
    call ExtrCreateStopMemory();

    /* process the text files */
    for each document in the list of documents
    {
        /* initialize the document memory */
        call ExtrCreateDocumentMemory();

        /* process the current document */
        open the text file for the current document;
        while the end of the text file has not yet been reached
        {
            read a block of the text file into a buffer;
            call ExtrReadDocumentBuffer();
        }
        call ExtrSignalDocumentEnd();
        close the text file for the current document;

        /* print out keyphrases */
        call ExtrGetPhraseListSize();
        for i = 0 to (PhraseListSize - 1) do
        {
            call ExtrGetPhraseByIndex();
            display the i-th keyphrase to the user;
        }

        /* free the document memory */
        call ExtrClearDocumentMemory();
    }

    /* free the stop word memory */
    call ExtrClearStopMemory();

In this example, all of the documents share the same set of stop words. Therefore the stop word memory is created only once. This is more efficient than putting ExtrCreateStopMemory inside the for-each-document loop.

===============================

Many Documents, Many Sets of Stop Words

This is a sketch of how to use the API to process many documents. In this example, each document is processed with its own set of stop words.
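The stop-word loading step in this pattern might look roughly like the following in C. Here the stop words come from an in-memory, whitespace-separated string rather than a file, and ExtrAddStopWord is a stand-in stub whose real signature may differ.

```c
#include <string.h>

/* Stand-in stub -- the real ExtrAddStopWord signature may differ. */
static void ExtrAddStopWord(const char *word) { (void)word; }

/* Register each whitespace-separated stop word found in `text`,
   mirroring the "read a stop word ... call ExtrAddStopWord" loop.
   Returns the number of stop words registered. */
static int load_stop_words(const char *text)
{
    char buf[1024];
    int count = 0;
    char *word;

    /* strtok modifies its argument, so work on a private copy. */
    strncpy(buf, text, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    for (word = strtok(buf, " \t\r\n"); word != NULL;
         word = strtok(NULL, " \t\r\n")) {
        ExtrAddStopWord(word);
        count++;
    }
    return count;
}
```

Reading from a per-user stop-word file instead is a matter of replacing the string copy with a read loop that pulls one token at a time from the file.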
    /* process the text files */
    for each document in the list of documents
    {
        /* initialize */
        call ExtrCreateDocumentMemory();
        call ExtrCreateStopMemory();

        /* load the custom stop words */
        open the text file of custom stop words for the current document;
        while the end of the text file has not yet been reached
        {
            read a stop word from the file;
            call ExtrAddStopWord();
        }
        close the text file of custom stop words;

        /* process the current document */
        open the text file for the current document;
        while the end of the text file has not yet been reached
        {
            read a block of the text file into a buffer;
            call ExtrReadDocumentBuffer();
        }
        call ExtrSignalDocumentEnd();
        close the text file for the current document;

        /* print out keyphrases */
        call ExtrGetPhraseListSize();
        for i = 0 to (PhraseListSize - 1) do
        {
            call ExtrGetPhraseByIndex();
            display the i-th keyphrase to the user;
        }

        /* free memory */
        call ExtrClearDocumentMemory();
        call ExtrClearStopMemory();
    }

If the application is a server with many different users, then the users could each have their own personal list of stop words. For example, if the server processes e-mail, then the users might want their own names to be stop words.

===============================

Process a Document in Sections

This is a sketch of how to use the API to process a large document, one section at a time. This example assumes that the same stop words are used for all sections. This could be useful for producing an annotated table of contents for a book: each section in the book could be annotated by a list of keyphrases, where the keyphrases are extracted from that section alone. It could also be useful for producing an index. Extractor generates a list of three to thirty keyphrases for each document that it processes (depending on ExtrSetNumberPhrases). Thirty keyphrases is not enough to make an index for a book.
However, if the book is processed in blocks of about one to five pages per block, then Extractor will generate up to thirty keyphrases for each block. A two-hundred-page book could then yield up to six thousand keyphrases, which is more than enough to make a good index.

    /* initialize the stop words */
    call ExtrCreateStopMemory();

    /* process the document */
    open the text file for the document;
    while the end of the text file has not yet been reached
    {
        /* process sections */
        for each section of the document
        {
            /* initialize memory for the current section */
            call ExtrCreateDocumentMemory();

            /* process the current section */
            while the end of the section has not yet been reached
            {
                read a block of the current section into a buffer;
                call ExtrReadDocumentBuffer();
            }
            call ExtrSignalDocumentEnd();

            /* print out keyphrases */
            call ExtrGetPhraseListSize();
            for i = 0 to (PhraseListSize - 1) do
            {
                call ExtrGetPhraseByIndex();
                display the i-th keyphrase to the user;
            }

            /* free memory for the current section */
            call ExtrClearDocumentMemory();
        }
    }
    close the text file;

    /* free the stop words */
    call ExtrClearStopMemory();

Note that Extractor can efficiently handle very large documents without requiring them to be split into smaller chunks. Splitting a document into sections is not necessary to increase the speed or capacity of Extractor.
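The sizing estimate above can be captured in a small helper. The function and parameter names here are illustrative only, not part of the Extractor API; the 30-phrases-per-block ceiling comes from ExtrSetNumberPhrases as described above.

```c
/* Back-of-envelope estimate: a book of `pages` pages, processed in
   blocks of `pages_per_block` pages, with up to `phrases_per_block`
   keyphrases extracted per block. Returns the maximum number of
   keyphrases the whole book could yield. */
static int max_keyphrases(int pages, int pages_per_block, int phrases_per_block)
{
    int blocks = (pages + pages_per_block - 1) / pages_per_block; /* round up */
    return blocks * phrases_per_block;
}
```

For a two-hundred-page book at one page per block and thirty phrases per block, this gives the six thousand keyphrases mentioned above; at five pages per block it gives twelve hundred.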