The Extractor API allows
several documents to be processed simultaneously,
using a separate thread for each document. This is useful, for
example, when processing web pages. A major bottleneck when
downloading web pages is waiting for web servers to respond to
requests for pages. One way around this bottleneck is to download
several pages simultaneously, using a separate thread to process
each page.

ExtrCreateDocumentMemory

This function creates a block of memory for storing data about a single document. It returns a pointer value that is a unique identifier for this block of memory. This pointer is later passed to any other functions that process the given document.

A document is processed as a sequence of memory blocks, by calling ExtrReadDocumentBuffer. A typical document will involve multiple calls to ExtrReadDocumentBuffer. Each call updates the state of the memory that is reserved for processing the given document, DocumentMemory.

In a typical application with multiple threads, there will be a one-to-one relationship between threads and DocumentMemory values, and also between DocumentMemory values and individual documents. On the other hand, threads may share StopMemory values, depending on whether it makes sense to use the same stop words and stop phrases for all of the documents that are currently being processed.

ExtrCreateStopMemory

This function creates a block of memory for storing stop words and stop phrases. It returns a pointer value in StopMemory that is a unique identifier for this block of memory. This pointer is later passed to any other functions that use the stop words or stop phrases.

The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.

A stop word is a word that is not allowed in a keyphrase. For example, "the" is a stop word. A stop phrase is a phrase that is not allowed as a keyphrase. The distinction between a stop word and a single-word stop phrase is that a keyphrase will be rejected if it contains a given stop word, but it will only be rejected if it exactly matches a given stop phrase. For example, if "access" is a stop word, then the phrase "information access" will be rejected.
If "access" is a stop phrase, then the phrase "information access" is acceptable, although the single-word phrase "access" will be rejected.

Calling ExtrCreateStopMemory will initialize the stop word list with some standard stop words (including "the", for example). The standard list may be extended by calling ExtrAddStopWord or ExtrAddStopPhrase.

ExtrActivateHighlights

A highlight is a key sentence. This function activates the highlight extraction feature for DocumentMemory. By default, it is assumed that the user does not want highlight extraction. ExtrActivateHighlights should be called before any calls to ExtrReadDocumentBuffer, since it will affect how the document is read.

The main result of calling ExtrActivateHighlights is that the functions ExtrGetHighlightListSize and ExtrGetHighlightByIndex will return some highlights selected by Extractor. Extractor attempts to find one key sentence for each keyphrase that it finds. For a given keyphrase, it is possible that Extractor may not be able to find a good example of a sentence that contains the keyphrase. The function ExtrGetHighlightListSize will return the number of highlights that were generated. This number is always less than or equal to the number of keyphrases that were generated, as given by ExtrGetPhraseListSize.

ExtrActivateHTMLFilter

This function signals that the document DocumentMemory contains HTML tags. By default, it is assumed that the document does not contain HTML tags. ExtrActivateHTMLFilter should be called before any calls to ExtrReadDocumentBuffer, since it will affect how the document is read.

The main result of calling ExtrActivateHTMLFilter is that HTML tags will be parsed. Most tags are ignored, but some tags are used to identify sentence boundaries. The HTML filter will also convert special symbol codes to the symbols that they represent. For example, "&eacute;" will be converted to "é".

ExtrActivateEmailFilter

This function signals that the document DocumentMemory contains an e-mail header.
By default, it is assumed that the document does not contain an e-mail header. ExtrActivateEmailFilter should be called before any calls to ExtrReadDocumentBuffer, since it will affect how the document is read.

The main result of calling ExtrActivateEmailFilter is that the e-mail header will be ignored, except for the "Subject" field.

Many e-mail gateways cannot handle 8 bit character codes. Often 8 bit character codes will be converted to 7 bit codes, for safe mailing. The e-mail filter will convert MIME quoted-printable 7 bit character codes back to 8 bit codes.

The e-mail filter understands MIME types. E-mail attachments will be treated according to their MIME types. Keyphrases will be extracted from plain text and HTML attachments. Other types of attachments will be ignored. The HTML filter will be automatically activated if the MIME type indicates that the attachment is HTML. Therefore ExtrActivateHTMLFilter should not be called by the user when processing e-mail.

Note: Activating the e-mail filter with Japanese or Korean text will have no effect.

ExtrDeactivateTextFilter

This function deactivates the plain text filter for DocumentMemory. By default, when the following conditions are met, the input document is assumed to be plain text:

When these conditions are met, the plain text filter is activated. The plain text filter will attempt to remove non-textual items from the input document, such as tables and addresses. It will also attempt to use white space to determine the boundaries between titles, section headings, and regular paragraphs. If you do not want the plain text filter to process the input document in these ways, then call ExtrDeactivateTextFilter. Since calling ExtrDeactivateTextFilter will affect how the document is read, it should be called before any calls to ExtrReadDocumentBuffer.

If the input document contains tabs, the text filter may interpret the lines with tabs as table rows. These lines may be skipped.
If you suspect that the text filter is skipping lines that should be processed, then try calling ExtrDeactivateTextFilter.

Internally, Extractor uses the characters 1D (hex) to mark a phrase boundary and 1E (hex) to mark a sentence boundary. The text filter automatically inserts these characters in a plain text document, by analyzing the white space in the document (i.e., line feeds, blanks, tabs, and carriage returns). For example, if two lines are separated by several line feeds (significant vertical white space), then the text filter will remove the white space and insert a sentence boundary marker.

This automatic process works well for most plain text documents, but you may wish to write your own filter for a certain type of input document (e.g., a certain type of word processor file). You can run the document through your own filter program, and then send the resulting plain text to Extractor. In this case, you should call ExtrDeactivateTextFilter, but do not call ExtrActivateHTMLFilter or ExtrActivateEmailFilter. Your filter program can help Extractor by inserting markers for phrase boundaries (1D) and sentence boundaries (1E) in the appropriate places.

ExtrSetInputCode

A call to ExtrSetInputCode sets the document character code that Extractor uses to process the input text buffer. The character code is given by CharCodeID.
The supported Japanese character sets for all the Japanese encodings are:
The supported Korean character sets for all the Korean encodings are:
ISO-8859-1 and MS-DOS Code Page 437 agree on the coding of non-accented alphabetical characters. If there are no accents in the input text, and the text is in single-byte characters, then the choice between the two should not matter.

Unicode UCS2 uses double-byte characters. UCS2 is sensitive to the byte ordering of the hardware platform (big endian versus little endian). Extractor handles UCS2 characters using the byte ordering of the hardware for which it is compiled (native byte ordering).

ExtrSetOutputCode

A call to ExtrSetOutputCode sets the document character code that Extractor uses for the output list of keyphrases. The character code is given by CharCodeID.
The supported Japanese character sets for all the Japanese encodings are:
The supported Korean character sets for all the Korean encodings are:
ISO-8859-1 and MS-DOS Code Page 437 agree on the coding of non-accented alphabetical characters. If there are no accents in the input text, and the text is in single-byte characters, then the choice between the two should not matter.

Unicode UCS2 uses double-byte characters. UCS2 is sensitive to the byte ordering of the hardware platform (big endian versus little endian). Extractor handles UCS2 characters using the byte ordering of the hardware for which it is compiled (native byte ordering).

ExtrSetDocumentLanguage

A call to ExtrSetDocumentLanguage sets the language that Extractor uses to process the input text buffer. The language is given by LanguageID.
ExtrSetNumberPhrases

This function sets the desired number of output phrases. The default number is seven. This is the number that will be generated on average; the actual number of phrases that are output for a given document may be slightly less or slightly more than the number specified by DesiredNumber. Note that DesiredNumber is only set for the given document DocumentMemory. This is so that several documents may be processed simultaneously, each with a different desired number of keyphrases.

The DesiredNumber must be between 3 and 30. Values outside of this range will be converted to the closest value inside the range. No error message will be generated when values are out of range.

This function is optional. There is no need to call it unless you wish to override the default value of seven phrases.

ExtrSetHighlightType

A highlight is a key sentence. If ExtrActivateHighlights has been called, then Extractor attempts to find one key sentence for each keyphrase that it finds. The ExtrSetHighlightType function sets the type (i.e., style) of highlight that is generated.

ExtrAddStopWord

This function adds the string Word to the list of stop words stored in the memory at StopMemory. The stop words are stored in a hash table. It does no harm to try to store the same word twice. It is assumed that Word is in lower case and that Word is a single word (containing no white space).

Stop words are stored separately for each language. The language is given by LanguageID. ExtrAddStopWord will return a non-zero error code if LanguageID is invalid or if Word contains anything other than lower case characters.
The character code is given by CharCodeID. Word is of type void * so that either single-byte or double-byte character strings can be passed to this function.
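
The distinction this reference draws between a stop word and a stop phrase can be illustrated with a small sketch. The code below is not part of the Extractor API and does not reproduce its internals; the helper names `contains_word`, `rejected_by_stop_word`, and `rejected_by_stop_phrase` are hypothetical, and the sketch assumes single-byte, lower-case text with words separated by single spaces.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (not an Extractor function): returns true if `word`
   occurs in `phrase` as a whole, space-delimited word. */
static bool contains_word(const char *phrase, const char *word)
{
    size_t wlen = strlen(word);
    const char *p = phrase;

    if (wlen == 0) {
        return false;
    }
    while ((p = strstr(p, word)) != NULL) {
        bool starts_word = (p == phrase) || (p[-1] == ' ');
        bool ends_word = (p[wlen] == '\0') || (p[wlen] == ' ');
        if (starts_word && ends_word) {
            return true;
        }
        p++; /* keep searching after a partial-word hit */
    }
    return false;
}

/* A stop word rejects any candidate keyphrase that contains it. */
static bool rejected_by_stop_word(const char *candidate, const char *stop_word)
{
    return contains_word(candidate, stop_word);
}

/* A stop phrase rejects only a candidate that matches it exactly. */
static bool rejected_by_stop_phrase(const char *candidate, const char *stop_phrase)
{
    return strcmp(candidate, stop_phrase) == 0;
}
```

With "access" as a stop word, `rejected_by_stop_word("information access", "access")` is true, which mirrors the example in this reference. With "access" as a stop phrase, `rejected_by_stop_phrase("information access", "access")` is false, and only the single-word candidate "access" is rejected.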
ExtrAddStopWord should be called before any calls to ExtrReadDocumentBuffer, since it will affect how the document is read.

When the stop word list is first created, by ExtrCreateStopMemory, it is initialized with a list of common stop words. It may not be necessary to add any extra stop words. That is, it may not be necessary to call ExtrAddStopWord.

A stop word is a word that is not allowed in a keyphrase. For example, "the" is a stop word. A stop phrase is a phrase that is not allowed as a keyphrase. The distinction between a stop word and a single-word stop phrase is that a keyphrase will be rejected if it contains a given stop word, but it will only be rejected if it exactly matches a given stop phrase. For example, if "access" is a stop word, then the phrase "information access" will be rejected. If "access" is a stop phrase, then the phrase "information access" is acceptable, although the single-word phrase "access" will be rejected.

Note: At this time, you cannot add new stop words for Japanese text. However, you can add new Japanese stop phrases.

The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.

ExtrAddStopPhrase

This function adds the string Phrase to the list of stop phrases stored in the memory at StopMemory. The stop phrases are stored in a hash table. It does no harm to try to store the same phrase twice. It is assumed that Phrase is in lower case. Phrase may be one, two, or three words, separated by a single space.

Stop phrases are stored separately for each language. The language is given by LanguageID. ExtrAddStopPhrase will return a non-zero error code if LanguageID is invalid or if Phrase contains anything other than lower case characters and spaces.
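
An application may want to check a candidate stop phrase against the stated format before passing it to ExtrAddStopPhrase. The following is a minimal sketch of such a pre-check, not Extractor's own validation. The function name `is_valid_stop_phrase_ascii` is hypothetical, and the sketch assumes unaccented ASCII letters only; Extractor itself may accept other lower-case characters, depending on CharCodeID.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pre-check (not an Extractor function): accepts one to three
   lower-case ASCII words separated by exactly one space, with no leading
   or trailing space. */
static bool is_valid_stop_phrase_ascii(const char *phrase)
{
    int words = 0;
    bool in_word = false;
    const char *p;

    if (phrase == NULL || *phrase == '\0') {
        return false;
    }
    for (p = phrase; *p != '\0'; p++) {
        if (*p >= 'a' && *p <= 'z') {
            if (!in_word) {
                in_word = true;
                words++;
            }
        } else if (*p == ' ') {
            if (!in_word) {
                return false; /* leading space or two spaces in a row */
            }
            in_word = false;
        } else {
            return false; /* upper case, digit, punctuation, tab, ... */
        }
    }
    /* in_word is false here only when the phrase ends with a space */
    return in_word && words <= 3;
}
```

Under these assumptions, "information access" passes, while "Access" (upper case) and a four-word phrase are turned away before any call into the library is made.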
The character code is given by CharCodeID. Phrase is of type void * so that either single-byte or double-byte character strings can be passed to this function.
The supported Japanese character sets for all the Japanese encodings are:
The supported Korean character sets for all the Korean encodings are:
When the stop phrase list is first created, by ExtrCreateStopMemory, it is initialized with a list of common stop phrases. It may not be necessary to add any extra stop phrases. That is, it may not be necessary to call ExtrAddStopPhrase.

A stop word is a word that is not allowed in a keyphrase. For example, "the" is a stop word. A stop phrase is a phrase that is not allowed as a keyphrase. The distinction between a stop word and a single-word stop phrase is that a keyphrase will be rejected if it contains a given stop word, but it will only be rejected if it exactly matches a given stop phrase. For example, if "access" is a stop word, then the phrase "information access" will be rejected. If "access" is a stop phrase, then the phrase "information access" is acceptable, although the single-word phrase "access" will be rejected.

The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.

ExtrAddGoPhrase

If the input document was found by issuing a query to a search engine, the user may have a special interest in whether the query terms appear in the document, and the context in which the query terms appear. This can be achieved by calling the function ExtrAddGoPhrase with each of the terms in the query.

This function adds the string Phrase to the list of go phrases stored in the memory at StopMemory. A go phrase is a phrase that will be treated as if it were a key phrase, if it appears in the input document.

Go phrases are stored in a list and each sentence in the input document is scanned for each go phrase in the list. This has two important implications: (1) A large list of go phrases may slow the execution of Extractor. (2) A go phrase in the input document will not be detected if it spans a sentence boundary.

A go phrase may consist of one or more words or fragments of words.
Any character sequence is permitted, except for an empty string. The letters may be in upper or lower case. A go phrase may range from a single character to a full sentence. A go phrase may contain punctuation.

Go phrases are stored separately for each language. The language is given by LanguageID. ExtrAddGoPhrase will return a non-zero error code if LanguageID is invalid or if CharCodeID is not compatible with LanguageID.

The following types of matches are supported:

When go phrases are found in the input document, they will be inserted at the top of the keyphrase list. They will take priority over the regular keyphrases. The length of the keyphrase list will be kept at the value set by ExtrSetNumberPhrases. For each go phrase that is added to the top of the keyphrase list, a regular keyphrase will be deleted from the bottom of the keyphrase list. (Note that Extractor ranks the keyphrases in order of decreasing estimated importance.)

A go phrase can be distinguished from a regular keyphrase (a keyphrase generated automatically by Extractor) by its score. All go phrases are given a score of zero, but a regular keyphrase never has a score of zero.

When a go phrase is found, it is inserted into the keyphrase list in exactly the same form as it was given to ExtrAddGoPhrase. This may be different from the form it has in the input document, depending on MatchType.

If highlights have been activated (by ExtrActivateHighlights), then each go phrase that is found in the input document will have a corresponding highlight. Extractor attempts to find a good sentence to illustrate each go phrase. If bold markup is set (by ExtrSetHighlightType), then the go phrases will be marked in bold within the corresponding highlights. Neighbouring words and characters may also be marked in bold, if they appear to be closely connected to the go phrase.

A go phrase might appear in the document, and yet not be found by Extractor.
If the go phrase spans a sentence boundary, it will not be detected. For example, "home cooking" will not be found in the text "Pasta is popular in our home. Cooking pasta is easy." Also, if the input document is very long, Extractor may not read the full document, since it should be possible to make a good summary without reading the full text. Therefore, if the go phrase only appears at the end of a very long document, it might not be detected by Extractor. Finally, the number of go phrases that will be found is limited by the desired number of keyphrases, set by ExtrSetNumberPhrases. If the number of go phrases in the input document is greater than the desired number of keyphrases, then the go phrases that appear earlier in the text will be given priority. The following languages are supported:
The character code is given by CharCodeID. Phrase is of type void * so that either single-byte or double-byte character strings can be passed to this function.
The supported Japanese character sets for all the Japanese encodings are:
The supported Korean character sets for all the Korean encodings are:
The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.

ExtrReadDocumentBuffer

This function reads the text in the buffer DocumentBuffer and updates the memory at DocumentMemory. The processing of the buffer is affected by StopMemory.

In a typical application, there will be a series of calls to ExtrReadDocumentBuffer for a given document DocumentMemory. The idea is that the document is read in chunks. A call to ExtrSignalDocumentEnd signals that the last chunk has been sent (the end of the given document has been reached).

A call to ExtrReadDocumentBuffer will change the memory at DocumentMemory, but the memory at StopMemory will not be modified. If there are multiple threads, each thread will have a unique value for DocumentMemory, but several threads may share StopMemory.

The buffer DocumentBuffer may contain single-byte or double-byte characters (see ExtrSetInputCode). This is why it is of type void *. The buffer length BufferLength specifies the number of bytes in the buffer, not the number of characters. When the character code (set by ExtrSetInputCode) indicates double-byte characters, BufferLength must be an even number. That is, the end of the buffer is not allowed to divide a double-byte character into two parts.

The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.

ExtrSignalDocumentEnd

A call to ExtrSignalDocumentEnd signals that the end of the document has been reached; there will be no further calls to ExtrReadDocumentBuffer with this particular DocumentMemory. This signal triggers the generation of the final list of keyphrases.
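
Because a document is sent in chunks before the end is signalled, an application has to choose each chunk length so that BufferLength stays even for double-byte input. The following is a minimal sketch of that bookkeeping only; it makes no Extractor calls. The helper names `next_chunk_length` and `count_chunks` are hypothetical, and the use of `size_t` for lengths is an assumption of this sketch rather than the type the API declares.

```c
#include <stddef.h>

/* Hypothetical helper (not an Extractor function): number of bytes to send
   in the next chunk. For double-byte input the result is always even, so a
   two-byte character is never split across chunks. Returns 0 when nothing
   usable remains (for example, a stray trailing byte). */
static size_t next_chunk_length(size_t bytes_remaining, size_t max_chunk,
                                int double_byte)
{
    size_t n = (bytes_remaining < max_chunk) ? bytes_remaining : max_chunk;

    if (double_byte && (n % 2 != 0)) {
        n -= 1; /* round down to a whole number of two-byte characters */
    }
    return n;
}

/* Hypothetical helper: how many chunks a document of `total_bytes` needs.
   In a real application each iteration would pass the current chunk to
   ExtrReadDocumentBuffer, and ExtrSignalDocumentEnd would follow the loop. */
static size_t count_chunks(size_t total_bytes, size_t max_chunk, int double_byte)
{
    size_t chunks = 0;
    size_t offset = 0;

    while (offset < total_bytes) {
        size_t n = next_chunk_length(total_bytes - offset, max_chunk, double_byte);
        if (n == 0) {
            break; /* odd trailing byte or max_chunk too small: stop */
        }
        offset += n;
        chunks++;
    }
    return chunks;
}
```

For a 10-byte double-byte document and a 5-byte limit, the sketch yields chunks of 4, 4, and 2 bytes instead of 5 and 5, which would have cut a character in half.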
The phrases in the final list of keyphrases are compared with the list of stop phrases in StopMemory and any matching phrases are deleted from the final list of keyphrases. Case is ignored for matching, but otherwise an exact match is required.

ExtrSignalDocumentEnd should only be called once for a given document DocumentMemory.

After ExtrSignalDocumentEnd has been called for a given document, that document has no further need for the stop words and stop phrases stored in StopMemory. Unless there are other documents that will need StopMemory, the memory used by StopMemory may be released after ExtrSignalDocumentEnd has been called.

The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.

ExtrGetPhraseListSize

The function ExtrGetPhraseListSize returns an integer value that is the number of keyphrases that were generated. If there is an error, PhraseListSize will be set to zero.

ExtrGetPhraseListSize may be called repeatedly for a given document. It does not modify the memory at DocumentMemory. ExtrGetPhraseListSize should not be called until after ExtrSignalDocumentEnd has been called.

The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.

ExtrGetPhraseByIndex

A call to ExtrGetPhraseByIndex returns a pointer to a string. The string is phrase number PhraseIndex. PhraseIndex ranges from zero to PhraseListSize minus one. Phrases are approximately in order of decreasing quality. ExtrSignalDocumentEnd must be called before ExtrGetPhraseByIndex.

The string Phrase may contain single-byte or double-byte characters (see ExtrSetOutputCode). This is why it is of type void **.

The memory where Phrase is stored will be cleared when ExtrClearDocumentMemory is called.
The application should copy Phrase into a more permanent location.

The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.

ExtrGetScoreByIndex

A call to ExtrGetScoreByIndex copies a number into the location given by the pointer. The number is the score assigned to phrase number PhraseIndex. PhraseIndex ranges from zero to PhraseListSize minus one. The score of a phrase is an estimate of its value as a keyphrase. Keyphrases are ranked in order of descending score. ExtrSignalDocumentEnd must be called before ExtrGetScoreByIndex.

This function is optional. There is no need to call it unless you are curious about the score that is assigned to a phrase.

The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.

ExtrGetDocumentLanguage

A call to ExtrGetDocumentLanguage gets the language of the document. If the language was set by a call to ExtrSetDocumentLanguage, then ExtrGetDocumentLanguage returns the same value that was specified with ExtrSetDocumentLanguage. If Extractor was allowed to guess the language, then ExtrGetDocumentLanguage returns the best guess. LanguageID is passed by reference and is modified in the function.
This function is optional. There is no need to call it unless you wish to know which language Extractor guessed (English, French, German, or Spanish). Note that language guessing is currently not available for Japanese or Korean.

The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.

ExtrGetHighlightListSize

The function ExtrGetHighlightListSize returns an integer value that is the number of highlights that were generated. If there is an error, HighlightListSize will be set to zero.

The number of highlights will be less than or equal to the number of keyphrases. There are two reasons that the number of highlights might be less than the number of keyphrases. First, when HighlightType is an odd number, Extractor removes any duplicate highlights. Second, there may be keyphrases for which no acceptable highlights were found. Therefore, for all values of HighlightType, it cannot be assumed that the highlight list size equals the keyphrase list size.

ExtrGetHighlightListSize may be called repeatedly for a given document. It does not modify the memory at DocumentMemory. ExtrGetHighlightListSize should not be called until after ExtrSignalDocumentEnd has been called.

The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.

ExtrGetHighlightByIndex

A call to ExtrGetHighlightByIndex returns a pointer to a string. The string is highlight number HighlightIndex. HighlightIndex ranges from zero to HighlightListSize minus one. ExtrSignalDocumentEnd must be called before ExtrGetHighlightByIndex.

The string Highlight may contain single-byte or double-byte characters (see ExtrSetOutputCode). This is why it is of type void **.

The memory where Highlight is stored will be cleared when ExtrClearDocumentMemory is called.
The application should copy Highlight into a more permanent location.

The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.

ExtrGetDocumentProperties

A call to ExtrGetDocumentProperties gets various properties of the document. The following properties are currently defined:
The desired property is specified by setting PropID. The property value is returned in PropValue.

The values returned for PropID 1 and 2 depend on the language. For example, a word with an apostrophe counts as two words in French (e.g., "j'ai"), but as one word in English (e.g., "don't"). There are no spaces between words in Japanese, so the values returned for PropID 1 and 2 are rough approximations when the document is in Japanese. If ExtrGetDocumentProperties is called before the language has been determined, the values returned for PropID 1 and 2 will be zero.

If the document is exceptionally long, Extractor will only read as much of the document as it needs to generate a summary. In this case, PropID 3 will return a value of 0 and PropID 1 and 2 will return values that are less than the actual values for the whole document.

This function is optional. There is no need to call it unless you wish to know one or more of the above properties. The function may be called multiple times, in order to get multiple properties.

The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.

ExtrGetErrorMessage

A call to ExtrGetErrorMessage returns a pointer to a character string. The string will contain a short description of the problem, such as, "ERROR: Memory allocation error. Out of RAM."

ExtrClearDocumentMemory

A call to ExtrClearDocumentMemory will free the memory that was allocated for processing a given document.

The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.

ExtrClearStopMemory

A call to ExtrClearStopMemory will free the memory that was allocated for stop words and stop phrases.

The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.
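
The ordering rules scattered through this reference (filters are activated before any read, the end of the document is signalled once, results are fetched only after that signal, and the document memory is cleared last) can be summarized as a small state machine. The sketch below is an illustration only and is not part of Extractor. The names `DocState`, `DocOp`, and `doc_apply` are hypothetical, and treating every "should" in the text as a hard rule is an assumption of this sketch, as is allowing the end signal before any buffer has been read.

```c
/* Hypothetical lifecycle tracker for one DocumentMemory value. */
typedef enum {
    DOC_CONFIGURING, /* created; filters and options may still be set    */
    DOC_READING,     /* at least one buffer has been read                */
    DOC_FINISHED,    /* end of document signalled; results are available */
    DOC_CLEARED      /* document memory has been freed                   */
} DocState;

typedef enum {
    OP_CONFIGURE,   /* e.g. ExtrActivateHighlights, ExtrActivateHTMLFilter */
    OP_READ_BUFFER, /* ExtrReadDocumentBuffer                              */
    OP_SIGNAL_END,  /* ExtrSignalDocumentEnd                               */
    OP_GET_RESULT,  /* e.g. ExtrGetPhraseListSize, ExtrGetPhraseByIndex    */
    OP_CLEAR        /* ExtrClearDocumentMemory                             */
} DocOp;

/* Returns 1 and updates *state if `op` respects the documented ordering;
   returns 0 and leaves *state unchanged otherwise. */
static int doc_apply(DocState *state, DocOp op)
{
    switch (op) {
    case OP_CONFIGURE:
        return *state == DOC_CONFIGURING;
    case OP_READ_BUFFER:
        if (*state == DOC_CONFIGURING || *state == DOC_READING) {
            *state = DOC_READING;
            return 1;
        }
        return 0;
    case OP_SIGNAL_END:
        if (*state == DOC_CONFIGURING || *state == DOC_READING) {
            *state = DOC_FINISHED;
            return 1;
        }
        return 0;
    case OP_GET_RESULT:
        return *state == DOC_FINISHED;
    case OP_CLEAR:
        if (*state != DOC_CLEARED) {
            *state = DOC_CLEARED;
            return 1;
        }
        return 0;
    }
    return 0;
}
```

A wrapper around the real API could consult such a tracker in debug builds to catch mistakes such as activating a filter after the first buffer has been read, or fetching phrases before the end of the document has been signalled.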