Extractor API Descriptions |
|
|
The Extractor API allows several documents to be processed
simultaneously, using a separate thread for each document. This is
useful, for example, when processing web pages. A major bottleneck when
downloading web pages is waiting for web servers to respond to
requests for pages. One way around this bottleneck is to download
several pages simultaneously, using a separate thread to process
each page. |
|
Extractor is also fully reentrant, to allow multithreading without the use of
Win32 services.
|
|
Please see below for a
complete discussion of the Extractor API
functions: |
|
|
ExtrCreateDocumentMemory |
|
This function creates a
block of memory for storing data about a single document. It returns a
pointer value that is a unique identifier for this block of memory. This
pointer is later passed to any other functions that process the given
document.
A document is processed as
a sequence of memory blocks, by calling ExtrReadDocumentBuffer. A
typical document will involve multiple calls to ExtrReadDocumentBuffer.
Each call updates the state of the memory that is reserved for
processing the given document, DocumentMemory.
In a typical application
with
multiple threads, there will be a one-to-one relationship between
threads and DocumentMemory values, and also between DocumentMemory
values and individual documents. On the other hand, threads may share
StopMemory values, depending on whether it makes sense to use the same
stop words and stop phrases for all of the documents that are currently
being processed.
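The relationship between threads, documents, and shared stop lists can be sketched in Python. This is only an illustration of the threading model, not Extractor's API: `DocumentState`, `process_document`, and the word filtering are hypothetical stand-ins for DocumentMemory and the real processing, and the shared frozenset plays the role of a StopMemory that is only read while documents are processed.

```python
import threading
from dataclasses import dataclass, field

# Hypothetical stand-in for StopMemory: shared by all threads and never
# modified while documents are being processed.
SHARED_STOP_WORDS = frozenset({"the", "a", "of"})


@dataclass
class DocumentState:
    """Hypothetical stand-in for DocumentMemory: one instance per document."""
    words: list = field(default_factory=list)


def process_document(chunks, state, stop_words):
    # Each chunk updates only this document's state, mirroring a series of
    # buffer reads for a single document.
    for chunk in chunks:
        state.words.extend(w for w in chunk.lower().split() if w not in stop_words)


documents = {
    "doc1": ["The history", "of keyphrase extraction"],
    "doc2": ["A survey of", "the web"],
}
states = {name: DocumentState() for name in documents}
threads = [
    threading.Thread(target=process_document,
                     args=(chunks, states[name], SHARED_STOP_WORDS))
    for name, chunks in documents.items()
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because each thread writes only to its own state object and the stop list is immutable, no locking is needed in this sketch.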
|
|
|
|
ExtrCreateStopMemory
|
This function creates a
block of memory for storing stop words and stop phrases. It returns a
pointer value in StopMemory that is a unique identifier for this block
of memory. This pointer is later passed to any other functions that use
the stop words or stop phrases.
The function returns an
error code in ErrorCode. If ErrorCode is zero, there are no problems.
Otherwise, a call to ExtrGetErrorMessage will return an explanation for
the given error code.
A stop word is a word that
is not allowed in a keyphrase. For example, "the" is a stop word. A stop
phrase is a phrase that is not allowed as a keyphrase. The distinction
between a stop word and a single-word stop phrase is that a keyphrase
will be rejected if it contains a given stop word, but it will
only be rejected if it exactly matches a given stop phrase. For
example, if "access" is a stop word, then the phrase "information
access" will be rejected. If "access" is a stop phrase, then the phrase
"information access" is acceptable, although the single-word phrase
"access" will be rejected.
Calling
ExtrCreateStopMemory will initialize the stop word list with some
standard stop words (including "the", for example). The standard list
may be extended by calling ExtrAddStopWord or ExtrAddStopPhrase.
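The distinction between stop words and stop phrases can be expressed as a short Python sketch. The function name `is_rejected` and the use of plain sets are assumptions made for illustration; the sketch also assumes that the phrase is already in lower case and that words are separated by single spaces.

```python
def is_rejected(phrase, stop_words, stop_phrases):
    """Return True if a candidate keyphrase should be rejected."""
    words = phrase.split()
    # A stop word rejects any phrase that contains it.
    if any(word in stop_words for word in words):
        return True
    # A stop phrase rejects only an exact match.
    return phrase in stop_phrases
```

With "access" as a stop word, `is_rejected("information access", {"access"}, set())` is true; with "access" as a stop phrase only, the same phrase is accepted and only the single-word phrase "access" is rejected.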
|
|
|
|
ExtrActivateHighlights
|
A highlight is a key
sentence. This function activates the highlight extraction feature
for DocumentMemory. By default, it is assumed that the user does not
want highlight extraction. ExtrActivateHighlights should be called
before any calls to ExtrReadDocumentBuffer, since it will affect how the
document is read. The main result of calling ExtrActivateHighlights is
that the functions ExtrGetHighlightListSize and ExtrGetHighlightByIndex
will return some highlights selected by Extractor.
Extractor attempts to find
one key sentence for each keyphrase that it finds. For a given keyphrase,
it is possible that Extractor may not be able to find a good example of
a sentence that contains the keyphrase. The function
ExtrGetHighlightListSize will return the number of highlights that were
generated. This number is always less than or equal to the number of
keyphrases that were generated, as given by ExtrGetPhraseListSize.
|
|
|
|
ExtrActivateHTMLFilter |
|
This function signals that
the document DocumentMemory contains HTML tags. By default, it is
assumed that the document does not contain HTML tags.
ExtrActivateHTMLFilter should be called before any calls to
ExtrReadDocumentBuffer, since it will affect how the document is read.
The main result of calling ExtrActivateHTMLFilter is that HTML tags will
be parsed. Most tags are ignored, but some tags are used to identify
sentence boundaries.
The HTML filter will also
convert special symbol codes to the symbols that they represent. For
example, "&eacute;" will be converted to "é".
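As an analogy for this kind of conversion, Python's standard library offers `html.unescape`, which turns HTML character references into the characters they represent. This shows the general idea only; it is not Extractor's HTML filter, and the two may differ in which codes they recognise.

```python
import html

# html.unescape is Python's own entity decoder, used here only as an analogy.
decoded = html.unescape("caf&eacute;")
```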
|
|
|
|
ExtrActivateEmailFilter
|
This function signals that
the document DocumentMemory contains an e-mail header. By default, it is
assumed that the document does not contain an e-mail header.
ExtrActivateEmailFilter should be called before any calls to
ExtrReadDocumentBuffer, since it will affect how the document is read.
The main result of calling ExtrActivateEmailFilter is that the e-mail
header will be ignored, except for the "Subject" field.
Many e-mail gateways cannot
handle 8-bit character codes. Often 8-bit character codes will be
converted to 7-bit codes, for safe mailing. The e-mail filter will
convert MIME quoted-printable 7-bit character codes back to 8-bit codes.
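The quoted-printable conversion can be illustrated with Python's standard `quopri` module. This is a sketch of the encoding itself, not of Extractor's e-mail filter; the sample bytes and the choice of ISO-8859-1 for the final decoding step are assumptions made for the example.

```python
import quopri

# "=E9" is the quoted-printable escape for the byte 0xE9 ("é" in ISO-8859-1).
seven_bit = b"caf=E9"
eight_bit = quopri.decodestring(seven_bit)
text = eight_bit.decode("iso-8859-1")
```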
The e-mail filter
understands MIME types. E-mail attachments will be treated according to
their MIME types. Keyphrases will be extracted from plain text and HTML
attachments. Other types of attachments will be ignored. The HTML filter
will be automatically activated if the MIME type indicates that the
attachment is HTML. Therefore ExtrActivateHTMLFilter should not be
called by the user when processing e-mail.
Note: Activating the
e-mail filter with Japanese or Korean text will have no effect.
|
|
|
ExtrDeactivateTextFilter
|
This function deactivates
the plain text filter for DocumentMemory. By default, when the following
conditions are met, the input document is assumed to be plain text:
When these conditions are
met, the plain text filter is activated. The plain text filter will
attempt to remove non-textual items from the input document, such as
tables and addresses. It will also attempt to use white space to
determine the boundaries between titles, section headings, and regular
paragraphs. If you do not want the plain text filter to process the
input document in these ways, then call ExtrDeactivateTextFilter. Since
calling ExtrDeactivateTextFilter will affect how the document is read,
it should be called before any calls to ExtrReadDocumentBuffer.
If the input document
contains tabs, the text filter may interpret the lines with tabs as
table rows. These lines may be skipped. If you suspect that the text
filter is skipping lines that should be processed, then try calling
ExtrDeactivateTextFilter.
Internally, Extractor uses
the characters 1D (hex) to mark a phrase boundary and 1E
(hex) to mark a sentence boundary. The text filter automatically inserts
these characters in a plain text document, by analyzing the white space
in the document (i.e., line feeds, blanks, tabs, and carriage returns).
For example, if two lines are separated by several line feeds
(significant vertical white space), then the text filter will remove the
white space and insert a sentence boundary marker. This automatic
process works well for most plain text documents, but you may wish to
write your own filter for a certain type of input document (e.g., a
certain type of word processor file). You can run the document through
your own filter program, and then send the resulting plain text to
Extractor. In this case, you should call ExtrDeactivateTextFilter, but
do not call ExtrActivateHTMLFilter or ExtrActivateEmailFilter. Your
filter program can help Extractor by inserting markers for phrase
boundaries (1D) and sentence boundaries (1E) in the
appropriate places.
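A custom filter that inserts sentence boundary markers might look like the following Python sketch. The function name and the rule used here (two or more consecutive line breaks count as significant vertical white space) are assumptions; Extractor's own text filter may use different criteria, and the placement of phrase boundary markers is left to the filter author.

```python
import re

PHRASE_BOUNDARY = "\x1d"    # 1D (hex)
SENTENCE_BOUNDARY = "\x1e"  # 1E (hex)


def mark_sentence_boundaries(text):
    """Replace runs of blank lines with a sentence boundary marker.

    Assumption: two or more consecutive line breaks (optionally with blanks
    or tabs between them) separate a title or heading from what follows.
    """
    text = text.replace("\r\n", "\n")
    return re.sub(r"[ \t]*\n(?:[ \t]*\n)+[ \t]*", SENTENCE_BOUNDARY, text)


marked = mark_sentence_boundaries("Annual Report\n\n\nSales rose in 2001.")
```

The resulting text would then be sent to Extractor with the plain text filter deactivated, so that the markers are not disturbed by a second pass over the white space.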
|
|
|
|
ExtrSetInputCode
|
A call to ExtrSetInputCode
sets the document character code that Extractor uses to process the
input text buffer. The character code is given by CharCodeID.
CharCodeID | Character Code | Compatible languages | Description
0 | ISO-8859-1 | English, French, German, Spanish | ISO-8859-1 is also known as ISO Latin-1.
1 | MS-DOS | English, French, German, Spanish | MS-DOS is also known as MS-DOS Code Page 437.
2 | Unicode UCS2 | All | Unicode UCS2 double-byte characters, in native byte order.
3 | Shift-JIS | Japanese only | SJIS, MS-Kanji, Code Page 932.
4 | JIS | Japanese only | New, Old, NEC, ISO-2022-JP.
5 | EUC-JP | Japanese only | Extended UNIX Code, Packed Format for Japanese.
6 | EUC-KR | Korean only | KS C 5601-1987, KSC5601, Extended UNIX Code, Packed Format for Korean, Code Page 949.
7 | Johab | Korean only | Johab, KS X 1001:1992 alternate encoding.
The supported Japanese character sets for all the Japanese encodings are:
- JIS X 0208-1990 (note: JIS X 0212-1990 NOT SUPPORTED)
- ASCII
- Halfwidth katakana
The supported Korean character sets for all the Korean encodings are:
- KS X 1001, KS X 1002, KS X 1005-1, Code page 949 (for Windows 95, NT)
- KS X 2901 (for UNIX), Johab
- ASCII
ISO-8859-1 and MS-DOS Code
Page 437 agree on the coding of non-accented alphabetical characters. If
there are no accents in the input text, and the text is in single-byte
characters, then the choice between the two should not matter.
Unicode UCS2 uses
double-byte characters. UCS2 is sensitive to the byte ordering of the
hardware platform (big endian versus little endian). Extractor handles
UCS2 characters using the byte ordering of the hardware for which it is
compiled (native byte ordering).
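An application that prepares UCS2 input therefore needs to produce bytes in the byte order of the machine on which Extractor runs. The Python sketch below shows one way to do this; the function name is hypothetical, and it assumes the text contains only Basic Multilingual Plane characters, for which UTF-16 without a byte-order mark produces the same bytes as UCS2.

```python
import sys


def encode_ucs2_native(text):
    """Encode text as 2-byte code units in the byte order of this machine.

    Assumption: only Basic Multilingual Plane characters are present, so
    UTF-16 without a byte-order mark yields the same bytes as UCS2.
    """
    codec = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
    return text.encode(codec)


buffer = encode_ucs2_native("é")
```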
|
|
|
|
ExtrSetOutputCode
|
A call to ExtrSetOutputCode
sets the document character code that Extractor uses for the output list
of keyphrases. The character code is given by CharCodeID.
CharCodeID | Character Code | Compatible languages | Description
0 | ISO-8859-1 | English, French, German, Spanish | ISO-8859-1 is also known as ISO Latin-1.
1 | MS-DOS | English, French, German, Spanish | MS-DOS is also known as MS-DOS Code Page 437.
2 | Unicode UCS2 | All | Unicode UCS2 double-byte characters, in native byte order.
3 | Shift-JIS | Japanese only | SJIS, MS-Kanji, Code Page 932.
4 | JIS | Japanese only | New, Old, NEC, ISO-2022-JP.
5 | EUC-JP | Japanese only | Extended UNIX Code, Packed Format for Japanese.
6 | EUC-KR | Korean only | KS C 5601-1987, KSC5601, Extended UNIX Code, Packed Format for Korean, Code Page 949.
7 | Johab | Korean only | Johab, KS X 1001:1992 alternate encoding.
The supported Japanese character sets for all the Japanese encodings are:
- JIS X 0208-1990 (note: JIS X 0212-1990 NOT SUPPORTED)
- ASCII
- Halfwidth katakana
The supported Korean character sets for all the Korean encodings are:
- KS X 1001, KS X 1002, KS X 1005-1, Code page 949 (for Windows 95, NT)
- KS X 2901 (for UNIX), Johab
- ASCII
ISO-8859-1 and MS-DOS Code
Page 437 agree on the coding of non-accented alphabetical characters. If
there are no accents in the input text, and the text is in single-byte
characters, then the choice between the two should not matter.
Unicode UCS2 uses
double-byte characters. UCS2 is sensitive to the byte ordering of the
hardware platform (big endian versus little endian). Extractor handles
UCS2 characters using the byte ordering of the hardware for which it is
compiled (native byte ordering).
|
|
ExtrSetDocumentLanguage
|
A call to
ExtrSetDocumentLanguage sets the language that Extractor uses to process
the input text buffer. The language is given by LanguageID.
LanguageID | Language | Description
0 | Automatic | Let Extractor automatically detect the language (for English, French, German, Spanish).
1 | English | Force Extractor to interpret the document as English.
2 | French | Force Extractor to interpret the document as French.
3 | Japanese | Force Extractor to interpret the document as Japanese.
4 | German | Force Extractor to interpret the document as German.
5 | Spanish | Force Extractor to interpret the document as Spanish.
6 | Korean | Force Extractor to interpret the document as Korean.
|
|
|
|
ExtrSetNumberPhrases
|
This function sets the
desired number of output phrases. The default number is seven. This is
the number that will be generated on average; the actual number of
phrases that are output for a given document may be slightly less or
slightly more than the number specified by DesiredNumber. Note that
DesiredNumber is only set for the given document DocumentMemory. This is
so that several documents may be processed simultaneously, each with a
different desired number of keyphrases.
The DesiredNumber must be
between 3 and 30. Values outside of this range will be converted to the
closest value inside the range. No error message will be generated when
values are out of range.
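The silent range conversion described above amounts to clamping. A minimal Python sketch, with a hypothetical function name:

```python
def clamp_desired_number(desired_number, low=3, high=30):
    """Convert an out-of-range value to the closest value inside the range."""
    return max(low, min(high, desired_number))
```

For example, a request for 1 phrase is treated as 3, and a request for 50 phrases is treated as 30.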
This function is optional.
There is no need to call it unless you wish to override the default
value of seven phrases.
|
|
|
|
ExtrSetHighlightType
|
A highlight is a key
sentence. If ExtrActivateHighlights has been called, then Extractor
attempts to find one key sentence for each keyphrase that it finds. The
ExtrSetHighlightType function sets the type (i.e., style) of highlight
that is generated. |
|
|
|
ExtrAddStopWord
|
This function adds the
string Word to the list of stop words stored in the memory at StopMemory.
The stop words are stored in a hash table. It does no harm to try to
store the same word twice. It is assumed that Word is in lower case and
that Word is a single word (containing no white space).
Stop words are stored
separately for each language. The language is given by LanguageID.
ExtrAddStopWord will return a non-zero error code if LanguageID is
invalid or if Word contains anything other than lower case characters.
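An application may wish to check a word before passing it to ExtrAddStopWord. The Python sketch below is one possible pre-check, with a hypothetical function name. It assumes a cased alphabet such as the Latin one; it does not apply to Korean, which has no letter case, and it is not a description of the validation that Extractor itself performs.

```python
def is_valid_stop_word(word):
    """Check the documented assumptions: non-empty, lower case, no white space."""
    # str.islower() is False for spaces, digits, and upper case letters, so a
    # single test per character covers all three restrictions.
    return bool(word) and all(ch.islower() for ch in word)
```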
LanguageID | Language | Description
1 | English | Add the given stop word to the English stop words.
2 | French | Add the given stop word to the French stop words.
4 | German | Add the given stop word to the German stop words.
5 | Spanish | Add the given stop word to the Spanish stop words.
6 | Korean | Add the given stop word to the Korean stop words.
The character code is given
by CharCodeID. Word is of type void * so that either single-byte or
double-byte character strings can be passed to this function.
CharCodeID | Character Code | Language | Description
0 | ISO-8859-1 | English, French, German, Spanish | ISO-8859-1 is also known as ISO Latin-1.
1 | MS-DOS | English, French, German, Spanish | MS-DOS is also known as MS-DOS Code Page 437.
2 | Unicode UCS2 | All | Unicode UCS2 double-byte characters, in native byte order.
6 | EUC-KR | Korean only | KS C 5601-1987, KSC5601, Extended UNIX Code, Packed Format for Korean, Code Page 949.
7 | Johab | Korean only | Johab, KS X 1001:1992 alternate encoding.
ExtrAddStopWord should be
called before any calls to ExtrReadDocumentBuffer, since it will affect
how the document is read.
When the stop word list is
first created, by ExtrCreateStopMemory, it is initialized with a list of
common stop words. It may not be necessary to add any extra stop words.
That is, it may not be necessary to call ExtrAddStopWord.
A stop word is a word that
is not allowed in a keyphrase. For example, "the" is a stop word. A stop
phrase is a phrase that is not allowed as a keyphrase. The distinction
between a stop word and a single-word stop phrase is that a keyphrase
will be rejected if it contains a given stop word, but it will
only be rejected if it exactly matches a given stop phrase. For
example, if "access" is a stop word, then the phrase "information
access" will be rejected. If "access" is a stop phrase, then the phrase
"information access" is acceptable, although the single-word phrase
"access" will be rejected.
Note: At this time, you
cannot add new stop words for Japanese text. However, you can add new
Japanese stop phrases.
The function returns an
error code in ErrorCode. If ErrorCode is zero, there are no problems.
Otherwise, a call to ExtrGetErrorMessage will return an explanation for
the given error code.
|
|
|
|
ExtrAddStopPhrase
|
This function adds the
string Phrase to the list of stop phrases stored in the memory at
StopMemory. The stop phrases are stored in a hash table. It does no harm
to try to store the same phrase twice. It is assumed that Phrase is in
lower case. Phrase may be one, two, or three words, separated by a
single space.
Stop phrases are stored
separately for each language. The language is given by LanguageID.
ExtrAddStopPhrase will return a non-zero error code if LanguageID is
invalid or if Phrase contains anything other than lower case characters
and spaces.
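The constraints on a stop phrase (one to three lower case words, separated by single spaces) can be checked before the call. The following Python sketch uses a hypothetical function name and assumes a cased alphabet such as the Latin one; it is not Extractor's own validation.

```python
def is_valid_stop_phrase(phrase):
    """One to three lower case words, separated by single spaces."""
    words = phrase.split(" ")
    if not 1 <= len(words) <= 3:
        return False
    # An empty word means a leading, trailing, or doubled space.
    return all(word and all(ch.islower() for ch in word) for word in words)
```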
LanguageID | Language | Description
1 | English | Add the given stop phrase to the English stop phrases.
2 | French | Add the given stop phrase to the French stop phrases.
3 | Japanese | Add the given stop phrase to the Japanese stop phrases.
4 | German | Add the given stop phrase to the German stop phrases.
5 | Spanish | Add the given stop phrase to the Spanish stop phrases.
6 | Korean | Add the given stop phrase to the Korean stop phrases.
The character code is given
by CharCodeID. Phrase is of type void * so that either single-byte or
double-byte character strings can be passed to this function.
CharCodeID | Character Code | Language | Description
0 | ISO-8859-1 | English, French, German, Spanish | ISO-8859-1 is also known as ISO Latin-1.
1 | MS-DOS | English, French, German, Spanish | MS-DOS is also known as MS-DOS Code Page 437.
2 | Unicode UCS2 | All | Unicode UCS2 double-byte characters, in native byte order.
3 | Shift-JIS | Japanese only | SJIS, MS-Kanji, Code Page 932.
4 | JIS | Japanese only | New, Old, NEC, ISO-2022-JP.
5 | EUC-JP | Japanese only | Extended UNIX Code, Packed Format for Japanese.
6 | EUC-KR | Korean only | KS C 5601-1987, KSC5601, Extended UNIX Code, Packed Format for Korean, Code Page 949.
7 | Johab | Korean only | Johab, KS X 1001:1992 alternate encoding.
The supported Japanese character sets for all the Japanese encodings are:
- JIS X 0208-1990 (note: JIS X 0212-1990 NOT SUPPORTED)
- ASCII
- Halfwidth katakana
The supported Korean character sets for all the Korean encodings are:
- KS X 1001, KS X 1002, KS X 1005-1, Code page 949 (for Windows 95, NT)
- KS X 2901 (for UNIX), Johab
- ASCII
When the stop phrase list
is first created, by ExtrCreateStopMemory, it is initialized with a list
of common stop phrases. It may not be necessary to add any extra stop
phrases. That is, it may not be necessary to call ExtrAddStopPhrase.
A stop word is a word that
is not allowed in a keyphrase. For example, "the" is a stop word. A stop
phrase is a phrase that is not allowed as a keyphrase. The distinction
between a stop word and a single-word stop phrase is that a keyphrase
will be rejected if it contains a given stop word, but it will
only be rejected if it exactly matches a given stop phrase. For
example, if "access" is a stop word, then the phrase "information
access" will be rejected. If "access" is a stop phrase, then the phrase
"information access" is acceptable, although the single-word phrase
"access" will be rejected.
The function returns an
error code in ErrorCode. If ErrorCode is zero, there are no problems.
Otherwise, a call to ExtrGetErrorMessage will return an explanation for
the given error code.
|
|
|
|
ExtrAddGoPhrase
|
If the input document was
found by issuing a query to a search engine, the user may have a special
interest in whether the query terms appear in the document, and the
context in which the query terms appear. This can be achieved by calling
the function ExtrAddGoPhrase with each of the terms in the query.
This function adds the
string Phrase to the list of go phrases stored in the memory at
StopMemory. A go phrase is a phrase that will be treated as if it were a
keyphrase, if it appears in the input document. Go phrases are stored
in a list and each sentence in the input document is scanned for each go
phrase in the list. This has two important implications: (1) A large
list of go phrases may slow the execution of Extractor. (2) A go phrase
in the input document will not be detected if it spans a sentence
boundary.
A go phrase may consist of
one or more words or fragments of words. Any character sequence is
permitted, except for an empty string. The letters may be in upper or
lower case. A go phrase may range from a single character to a full
sentence. A go phrase may contain punctuation.
Go phrases are stored
separately for each language. The language is given by LanguageID.
ExtrAddGoPhrase will return a non-zero error code if LanguageID is
invalid or if CharCodeID is not compatible with LanguageID.
The following types of
matches are supported:
When go phrases are found
in the input document, they will be inserted at the top of the keyphrase
list. They will take priority over the regular keyphrases. The length of
the keyphrase list will be kept at the value set by ExtrSetNumberPhrases.
For each go phrase that is added to the top of the keyphrase list, a
regular keyphrase will be deleted from the bottom of the keyphrase list.
(Note that Extractor ranks the keyphrases in order of decreasing
estimated importance.) A go phrase can be distinguished from a regular
keyphrase (a keyphrase generated automatically by Extractor) by its
score. All go phrases are given a score of zero, but a regular keyphrase
never has a score of zero.
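The way go phrases displace regular keyphrases can be sketched in Python. The function name, the `list_length` parameter, and the representation of the keyphrase list as (phrase, score) pairs are assumptions made for illustration only.

```python
def merge_go_phrases(go_phrases_found, keyphrases, list_length):
    """Put go phrases at the top of the list and trim the bottom.

    `keyphrases` is a list of (phrase, score) pairs in order of decreasing
    score; `go_phrases_found` is in order of first appearance in the text.
    """
    # Go phrases always receive a score of zero.
    top = [(phrase, 0.0) for phrase in go_phrases_found[:list_length]]
    # Each go phrase at the top removes one regular keyphrase at the bottom.
    remaining = list_length - len(top)
    return top + keyphrases[:remaining]


merged = merge_go_phrases(
    ["pasta"],
    [("cooking", 0.9), ("recipes", 0.5), ("kitchen", 0.2)],
    3,
)
```

In the merged list, an application can recognise the go phrases by testing for a score of zero.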
When a go phrase is found,
it is inserted into the keyphrase list in exactly the same form as it
was given to ExtrAddGoPhrase. This may be different from the form it has
in the input document, depending on MatchType.
If highlights have been
activated (by ExtrActivateHighlights), then each go phrase that is found
in the input document will have a corresponding highlight. Extractor
attempts to find a good sentence to illustrate each go phrase. If bold
markup is set (by ExtrSetHighlightType), then the go phrases will be
marked in bold within the corresponding highlights. Neighbouring words
and characters may also be marked in bold, if they appear to be closely
connected to the go phrase.
A go phrase might appear in
the document, and yet not be found by Extractor. If the go phrase spans
a sentence boundary, it will not be detected. For example, "home
cooking" will not be found in the text "Pasta is popular in our home.
Cooking pasta is easy." Also, if the input document is very long,
Extractor may not read the full document, since it should be possible to
make a good summary without reading the full text. Therefore, if the go
phrase only appears at the end of a very long document, it might not be
detected by Extractor. Finally, the number of go phrases that will be
found is limited by the desired number of keyphrases, set by
ExtrSetNumberPhrases. If the number of go phrases in the input document
is greater than the desired number of keyphrases, then the go phrases
that appear earlier in the text will be given priority.
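The sentence-boundary limitation can be reproduced with a small Python sketch. The function name is hypothetical, and two simplifications are assumed: sentences end at ".", "!" or "?" followed by white space, and matching ignores case. In Extractor the matching behaviour depends on MatchType.

```python
import re


def find_go_phrases(text, go_phrases):
    """Return the go phrases that occur inside a single sentence of `text`.

    Assumptions: sentences end at ".", "!" or "?" followed by white space,
    and matching ignores case.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text)
    found = []
    for sentence in sentences:
        lowered = sentence.lower()
        for phrase in go_phrases:
            # Each sentence is scanned separately, so a phrase that spans
            # two sentences is never seen as a whole.
            if phrase.lower() in lowered and phrase not in found:
                found.append(phrase)
    return found


text = "Pasta is popular in our home. Cooking pasta is easy."
found = find_go_phrases(text, ["home cooking", "cooking pasta"])
```

Here "cooking pasta" is found in the second sentence, while "home cooking" is missed because it straddles the boundary between the two sentences.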
The following languages are
supported:
LanguageID | Language | Description
1 | English | Add the given go phrase to the English go phrases.
2 | French | Add the given go phrase to the French go phrases.
3 | Japanese | Add the given go phrase to the Japanese go phrases.
4 | German | Add the given go phrase to the German go phrases.
5 | Spanish | Add the given go phrase to the Spanish go phrases.
6 | Korean | Add the given go phrase to the Korean go phrases.
The character code is given
by CharCodeID. Phrase is of type void * so that either single-byte or
double-byte character strings can be passed to this function.
CharCodeID | Character Code | Language | Description
0 | ISO-8859-1 | English, French, German, Spanish | ISO-8859-1 is also known as ISO Latin-1.
1 | MS-DOS | English, French, German, Spanish | MS-DOS is also known as MS-DOS Code Page 437.
2 | Unicode UCS2 | All | Unicode UCS2 double-byte characters, in native byte order.
3 | Shift-JIS | Japanese only | SJIS, MS-Kanji, Code Page 932.
4 | JIS | Japanese only | New, Old, NEC, ISO-2022-JP.
5 | EUC-JP | Japanese only | Extended UNIX Code, Packed Format for Japanese.
6 | EUC-KR | Korean only | KS C 5601-1987, KSC5601, Extended UNIX Code, Packed Format for Korean, Code Page 949.
7 | Johab | Korean only | Johab, KS X 1001:1992 alternate encoding.
The supported Japanese character sets for all the Japanese encodings are:
- JIS X 0208-1990 (note: JIS X 0212-1990 NOT SUPPORTED)
- ASCII
- Halfwidth katakana
The supported Korean character sets for all the Korean encodings are:
- KS X 1001, KS X 1002, KS X 1005-1, Code page 949 (for Windows 95, NT)
- KS X 2901 (for UNIX), Johab
- ASCII
The function returns an
error code in ErrorCode. If ErrorCode is zero, there are no problems.
Otherwise, a call to ExtrGetErrorMessage will return an explanation for
the given error code.
|
|
|
|
ExtrReadDocumentBuffer
|
This function reads the
text in the buffer DocumentBuffer and updates the memory at
DocumentMemory. The processing of the buffer is affected by StopMemory.
In a typical application,
there will be a series of calls to ExtrReadDocumentBuffer for a given
document DocumentMemory. The idea is that the document is read in
chunks. A call to ExtrSignalDocumentEnd signals that the last chunk has
been sent (the end of the given document has been reached).
A call to
ExtrReadDocumentBuffer will change the memory at DocumentMemory, but the
memory at StopMemory will not be modified. If there are multiple
threads, each thread will have a unique value for DocumentMemory, but
several threads may share StopMemory.
The buffer DocumentBuffer
may contain single-byte or double-byte characters (see ExtrSetInputCode).
This is why it is of type void *. The buffer length BufferLength
specifies the number of bytes in the buffer, not the number of
characters. When the character code (set by ExtrSetInputCode) indicates
double-byte characters, BufferLength must be an even number. That is,
the end of the buffer is not allowed to divide a double-byte character
into two parts.
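When a double-byte document is read in chunks, the application has to make sure that every chunk has an even number of bytes. The Python sketch below shows one way to split the data; the function name and parameters are hypothetical and only the splitting logic is illustrated.

```python
def split_into_buffers(data, max_bytes, double_byte):
    """Split `data` (bytes) into chunks of at most `max_bytes` bytes.

    For double-byte codes the chunk size is rounded down to an even number,
    so no chunk boundary falls in the middle of a character.
    """
    if double_byte:
        if len(data) % 2:
            raise ValueError("double-byte data must have an even length")
        max_bytes -= max_bytes % 2
    if max_bytes <= 0:
        raise ValueError("max_bytes is too small")
    return [data[i:i + max_bytes] for i in range(0, len(data), max_bytes)]


chunks = split_into_buffers("keyphrase".encode("utf-16-le"), 7, double_byte=True)
```

Each chunk would then be passed to a separate call, with the buffer length given as the number of bytes in that chunk.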
The function returns an
error code in ErrorCode. If ErrorCode is zero, there are no problems.
Otherwise, a call to ExtrGetErrorMessage will return an explanation for
the given error code.
|
|
|
|
ExtrSignalDocumentEnd
|
A call to
ExtrSignalDocumentEnd signals that the end of the document has been
reached; there will be no further calls to ExtrReadDocumentBuffer with
this particular DocumentMemory. This signal triggers the generation of
the final list of keyphrases.
The phrases in the final
list of keyphrases are compared with the list of stop phrases in
StopMemory and any matching phrases are deleted from the final list of
keyphrases. Case is ignored for matching, but otherwise an exact match
is required.
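The final comparison against the stop phrases (exact match, case ignored) can be sketched in Python as follows. The function name and the use of plain lists are assumptions made for illustration.

```python
def remove_stop_phrases(keyphrases, stop_phrases):
    """Drop keyphrases that match a stop phrase exactly, ignoring case."""
    blocked = {phrase.lower() for phrase in stop_phrases}
    return [phrase for phrase in keyphrases if phrase.lower() not in blocked]


final = remove_stop_phrases(["Access", "Information Access"], ["access"])
```

Only "Access" is removed; "Information Access" contains the stop phrase but does not match it exactly.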
ExtrSignalDocumentEnd
should only be called once for a given document DocumentMemory. After
ExtrSignalDocumentEnd has been called for a given document, that
document has no further need for the stop words and stop phrases stored
in StopMemory. Unless there are other documents that will need
StopMemory, the memory used by StopMemory may be released after
ExtrSignalDocumentEnd has been called.
The function returns an
error code in ErrorCode. If ErrorCode is zero, there are no problems.
Otherwise, a call to ExtrGetErrorMessage will return an explanation for
the given error code.
|
|
|
|
ExtrGetPhraseListSize
|
The function
ExtrGetPhraseListSize returns an integer value that is the number of
keyphrases that were generated. If there is an error, PhraseListSize
will be set to zero.
ExtrGetPhraseListSize may
be called repeatedly for a given document. It does not modify the memory
at DocumentMemory. ExtrGetPhraseListSize should not be called until
after ExtrSignalDocumentEnd has been called.
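The required order of calls for a single document (set options, read buffers, signal the end once, then query the results) can be modelled as a small state machine. The Python class below is a hypothetical model, not a wrapper around Extractor: the method names are invented, the `extract` argument stands in for the real analysis, and raising an exception on an out-of-order call is an assumption about how an application might choose to enforce the order.

```python
class DocumentSession:
    """Hypothetical model of the documented call order for one document."""

    def __init__(self, extract):
        self._extract = extract      # hypothetical stand-in for the analysis
        self._chunks = []
        self._reading_started = False
        self._ended = False
        self._phrases = []
        self.html_filter = False

    def activate_html_filter(self):
        # Options that affect reading must be set before the first buffer.
        if self._reading_started:
            raise RuntimeError("activate filters before reading any buffer")
        self.html_filter = True

    def read_buffer(self, chunk):
        if self._ended:
            raise RuntimeError("the document end was already signalled")
        self._reading_started = True
        self._chunks.append(chunk)

    def signal_end(self):
        # The end of a document may be signalled only once.
        if self._ended:
            raise RuntimeError("signal the document end only once")
        self._ended = True
        self._phrases = self._extract("".join(self._chunks))

    def phrase_list_size(self):
        if not self._ended:
            raise RuntimeError("signal the document end before asking for results")
        return len(self._phrases)


session = DocumentSession(extract=str.split)
session.read_buffer("keyphrase ")
session.read_buffer("extraction")
session.signal_end()
size = session.phrase_list_size()
```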
The function returns an
error code in ErrorCode. If ErrorCode is zero, there are no problems.
Otherwise, a call to ExtrGetErrorMessage will return an explanation for
the given error code.
|
|
|
|
ExtrGetPhraseByIndex
|
A call to
ExtrGetPhraseByIndex returns a pointer to a string. The string is phrase
number PhraseIndex. PhraseIndex ranges from zero to PhraseListSize minus
one. Phrases are approximately in order of decreasing quality.
ExtrSignalDocumentEnd must be called before ExtrGetPhraseByIndex.
The string Phrase may
contain single-byte or double-byte characters (see ExtrSetOutputCode).
This is why it is of type void **.
The memory where Phrase is
stored will be cleared when ExtrClearDocumentMemory is called. The
application should copy Phrase into a more permanent location.
The function returns an
error code in ErrorCode. If ErrorCode is zero, there are no problems.
Otherwise, a call to ExtrGetErrorMessage will return an explanation for
the given error code.
|
|
|
|
ExtrGetScoreByIndex
|
A call to
ExtrGetScoreByIndex copies a number into the location given by the
pointer. The number is the score assigned to phrase number PhraseIndex.
PhraseIndex ranges from zero to PhraseListSize minus one. The score of a
phrase is an estimate of its value as a keyphrase. Keyphrases are ranked
in order of descending score. ExtrSignalDocumentEnd must be called
before ExtrGetScoreByIndex.
This function is optional.
There is no need to call it unless you are curious about the score that
is assigned to a phrase.
The function returns an
error code in ErrorCode. If ErrorCode is zero, there are no problems.
Otherwise, a call to ExtrGetErrorMessage will return an explanation for
the given error code.
|
|
|
|
ExtrGetDocumentLanguage
|
A call to
ExtrGetDocumentLanguage gets the language of the document. If the
language was set by a call to ExtrSetDocumentLanguage, then
ExtrGetDocumentLanguage returns the same value that was specified with
ExtrSetDocumentLanguage. If Extractor was allowed to guess the language,
then ExtrGetDocumentLanguage returns the best guess. LanguageID is
passed by reference and is modified in the function.
LanguageID | Language | Description
0 | Unknown | Extractor was not able to guess, or the language is not English, French, German, or Spanish.
1 | English | Extractor guessed English, or English was specified by ExtrSetDocumentLanguage.
2 | French | Extractor guessed French, or French was specified by ExtrSetDocumentLanguage.
3 | Japanese | Japanese was specified by ExtrSetDocumentLanguage.
4 | German | Extractor guessed German, or German was specified by ExtrSetDocumentLanguage.
5 | Spanish | Extractor guessed Spanish, or Spanish was specified by ExtrSetDocumentLanguage.
6 | Korean | Korean was specified by ExtrSetDocumentLanguage.
This function is optional.
There is no need to call it unless you wish to know which language
Extractor guessed (English, French, German, or Spanish). Note that
language guessing is currently not available for Japanese or Korean.
The function returns an
error code in ErrorCode. If ErrorCode is zero, there are no problems.
Otherwise, a call to ExtrGetErrorMessage will return an explanation for
the given error code.
|
|
|
|
ExtrGetHighlightListSize
|
The function
ExtrGetHighlightListSize returns an integer value that is the number of
highlights that were generated. If there is an error, HighlightListSize
will be set to zero.
The number of highlights
will be less than or equal to the number of keyphrases. There are two
reasons that the number of highlights might be less than the number of
keyphrases. First, when HighlightType is an odd number, Extractor
removes any duplicate highlights. Second, there may be keyphrases for
which no acceptable highlights were found. Therefore, for all values of
HighlightType, it cannot be assumed that the highlight list size equals
the keyphrase list size.
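The two reasons for a shorter highlight list can be illustrated with a Python sketch. The function name is hypothetical, and the `remove_duplicates` flag stands in for the effect of an odd HighlightType; the selection of candidate sentences itself is not modelled.

```python
def build_highlight_list(candidates, remove_duplicates):
    """Build the highlight list from one candidate sentence per keyphrase.

    `candidates` holds a sentence, or None when no acceptable sentence was
    found for that keyphrase.
    """
    highlights = []
    for sentence in candidates:
        if sentence is None:
            continue  # no acceptable highlight for this keyphrase
        if remove_duplicates and sentence in highlights:
            continue  # the same sentence already illustrates another keyphrase
        highlights.append(sentence)
    return highlights


candidates = ["Pasta is easy to cook.", None, "Pasta is easy to cook."]
```

With three keyphrases, this example yields two highlights without duplicate removal and one highlight with it, so an application should always iterate up to the highlight list size rather than the keyphrase list size.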
ExtrGetHighlightListSize
may be called repeatedly for a given document. It does not modify the
memory at DocumentMemory. ExtrGetHighlightListSize should not be called
until after ExtrSignalDocumentEnd has been called.
The function returns an
error code in ErrorCode. If ErrorCode is zero, there are no problems.
Otherwise, a call to ExtrGetErrorMessage will return an explanation for
the given error code.
|
|
|
|
ExtrGetHighlightByIndex
|
A call to
ExtrGetHighlightByIndex returns a pointer to a string. The string is
highlight number HighlightIndex. HighlightIndex ranges from zero to
HighlightListSize minus one. ExtrSignalDocumentEnd must be called before
ExtrGetHighlightByIndex.
The string Highlight may
contain single-byte or double-byte characters (see ExtrSetOutputCode).
This is why it is of type void **.
The memory where Highlight
is stored will be cleared when ExtrClearDocumentMemory is called. The
application should copy Highlight into a more permanent location.
The function returns an
error code in ErrorCode. If ErrorCode is zero, there are no problems.
Otherwise, a call to ExtrGetErrorMessage will return an explanation for
the given error code.
|
|
|
|
ExtrGetDocumentProperties
|
A call to
ExtrGetDocumentProperties gets various properties of the document. The
following properties are currently defined:
PropID | Description
1 | get the number of words that were read
2 | get the number of non-stop words (content words) that were read
3 | see whether the whole document was read (0 = only the beginning of the document was read; 1 = the whole document was read)
The desired property is
specified by setting PropID. The property value is returned in PropValue.
The values returned for
PropID 1 and 2 depend on the language. For example, a word with an
apostrophe counts as two words in French (e.g., "j'ai"), but as one word
in English (e.g., "don't"). There are no spaces between words in
Japanese, so the values returned for PropID 1 and 2 are rough
approximations when the document is in Japanese. If
ExtrGetDocumentProperties is called before the language has been
determined, the values returned for PropID 1 and 2 will be zero.
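The language dependence of the word count can be illustrated with a Python sketch. The function name and the two regular expressions are assumptions chosen to reproduce the apostrophe example above; they are not Extractor's tokenisation rules.

```python
import re


def count_words(text, language):
    """Count words, treating an apostrophe as a word break only in French."""
    if language == "french":
        # Runs of letters only, so "j'ai" is counted as "j" and "ai".
        pattern = r"[^\W\d_]+"
    else:
        # Letters with optional internal apostrophes, so "don't" is one word.
        pattern = r"[^\W\d_]+(?:'[^\W\d_]+)*"
    return len(re.findall(pattern, text))
```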
If the document is
exceptionally long, Extractor will only read as much of the document as
it needs to generate a summary. In this case, PropID 3 will return a
value of 0 and PropID 1 and 2 will return values that are less than the
actual values for the whole document.
This function is optional.
There is no need to call it unless you wish to know one or more of the
above properties. The function may be called multiple times, in order to
get multiple properties.
The function returns an
error code in ErrorCode. If ErrorCode is zero, there are no problems.
Otherwise, a call to ExtrGetErrorMessage will return an explanation for
the given error code.
|
|
|
|
ExtrGetErrorMessage
|
A call to
ExtrGetErrorMessage returns a pointer to a character string. The string
will contain a short description of the problem, such as, "ERROR: Memory
allocation error. Out of RAM." |
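The convention of a zero error code for success, with a separate lookup for the message, lends itself to a small helper. The Python sketch below uses hypothetical names throughout; in particular, the numeric code 1 is invented for the example, and only the message text is quoted from this documentation.

```python
class ExtractorError(Exception):
    """Raised when a call reports a non-zero error code."""


def check(error_code, get_error_message):
    """Raise if `error_code` is non-zero; zero means there is no problem."""
    if error_code != 0:
        raise ExtractorError(get_error_message(error_code))


# Hypothetical table: the numeric code 1 is invented for this example; only
# the message text is quoted from the documentation.
MESSAGES = {1: "ERROR: Memory allocation error. Out of RAM."}

check(0, MESSAGES.get)  # zero: nothing happens
```

Wrapping every call in such a helper keeps the error handling in one place instead of repeating the test after each function.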
|
|
|
ExtrClearDocumentMemory
|
A call to
ExtrClearDocumentMemory will free the memory that was allocated for
processing a given document.
The function returns an
error code in ErrorCode. If ErrorCode is zero, there are no problems.
Otherwise, a call to ExtrGetErrorMessage will return an explanation for
the given error code.
|
|
|
|
ExtrClearStopMemory
|
A call to
ExtrClearStopMemory will free the memory that was allocated for stop
words and stop phrases.
The function returns an
error code in ErrorCode. If ErrorCode is zero, there are no problems.
Otherwise, a call to ExtrGetErrorMessage will return an explanation for
the given error code. |