|
Why does
Extractor / xAIgent ignore the META tag in HTML? |
|
What is the
meaning of the numbers associated with each key
phrase? |
|
How can I
normalize the score? |
|
Can I find the
frequency of each key phrase in the document? |
|
Given a
sentence such as, "I am not skiing today," why
does Extractor select "skiing" as a key phrase
instead of "not skiing"? |
|
I want to use
Extractor for automatic document classification.
Can you help me? |
|
How can I
combine key phrases that were extracted from
many different documents? |
|
Can Extractor
handle language X? |
|
Can Extractor
handle document format X? |
|
Can Extractor
handle character encoding X? |
|
How can I
generate 100 key phrases? |
|
When I give a document
to Extractor and ask for four key phrases and
then take the same document and ask for seven
key phrases, the four key phrases are not always
a subset of the seven key phrases. Why? |
|
I want
Extractor to generate exactly N highlights (key
sentences). I know that I can set the number of
key phrases, but how do I set the number of
highlights? |
|
In my input
document, I repeatedly use the word "X". It is a
very important word, and I use it early and
frequently. Yet Extractor does not recognize it
as a key phrase. Why? |
|
In our
documents, we have phrases with four and more
words. What does Extractor do? |
|
I use
programming language X. Is there a way to call
the Extractor API from within language X? |
|
Why
does Extractor ignore the META tag in HTML?
The META tag in HTML is used to convey
meta-information about the document, for
example: |
<META HTTP-EQUIV="Expires" CONTENT="Tue, 04 Dec
1993 21:29:02 GMT">
<META HTTP-EQUIV="Keywords" CONTENT="Nanotechnology,
Biochemistry">
<META
HTTP-EQUIV="Reply-to" CONTENT="dsr@w3.org (Dave
Raggett)">
Extractor ignores this meta-information. In
particular, it does not use the "Keywords"
meta-information. Extractor ignores the META tag
for two reasons: (1) If you really care about
the META tag, then you can easily write your own
subroutine to parse it. (2) The META tag is
widely abused. It is mainly used as a device for
tricking search engines into giving a page a
higher ranking in a hit list when a user enters
a query. If you search for the word "meta", you
will find many web pages that give web authors
tips on how to fool search engines by using the
META tag. |
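If you do want the META keywords, a routine along the following lines may be enough. This is only a sketch in Python using the standard html.parser module; the names MetaKeywordParser and meta_keywords are illustrative and are not part of Extractor.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collect keywords from META tags labelled "keywords"."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lower-cases tag names and attribute names.
        attr = {name: (value or "") for name, value in attrs}
        label = attr.get("http-equiv") or attr.get("name") or ""
        if label.lower() == "keywords":
            parts = attr.get("content", "").split(",")
            self.keywords.extend(p.strip() for p in parts if p.strip())


def meta_keywords(html_text):
    """Return the list of keywords found in the META tags of html_text."""
    parser = MetaKeywordParser()
    parser.feed(html_text)
    return parser.keywords


page = ('<html><head><META HTTP-EQUIV="Keywords" '
        'CONTENT="Nanotechnology, Biochemistry"></head></html>')
print(meta_keywords(page))  # ['Nanotechnology', 'Biochemistry']
```

The resulting list can be merged with, or kept separate from, the key phrases that Extractor produces, depending on how much you trust the page author.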
|
What is the
meaning of the numbers associated with each key
phrase?
When you run the sample program "test_api.exe"
(or "test_api.bin" for Unix platforms), each key
phrase will be output with a number after it.
These numbers are the scores returned by the API
function
ExtrGetScoreByIndex(). The score of a phrase
is an estimate of its value as a key phrase. Key
phrases are ranked in order of descending score.
A score can be any positive real number. The
scores with long documents as input tend to be
higher than the scores with short documents. The
method of calculating the score is described in
detail in
Learning to Extract Keyphrases from Text.
For some applications, it might be desirable to
normalize the score. |
|
How can
I normalize the score?
For some applications, it might be desirable to
normalize the score, so that the scores of key
phrases from different documents can be
compared. Here are some suggestions for
normalization: |
|
- Ignore the scores produced by Extractor.
Given a large collection of documents (e.g.,
web pages), score each key phrase by the
percentage of documents for which the given
key phrase was suggested by Extractor.
(Example: "The key phrase 'corporate merger'
was generated by Extractor for 45 of the 100
documents. Thus 'corporate merger' has a
score of 45%.")
|
- Ignore the scores produced by Extractor.
Given a large collection of documents (e.g.,
web pages), score each key phrase by the
percentage of documents in which the given
key phrase appears somewhere (even if it was
not suggested by Extractor). (Example: "The
key phrase 'corporate merger' appears
somewhere in the body of 45 of the 100
documents. Thus 'corporate merger' has a
score of 45%.")
|
- Take the score produced by Extractor and
normalize it so that it ranges from 0% to
100%, by dividing the score of each key
phrase by the score of the first key phrase.
(The first key phrase always has the highest
score.) (Example: "Extractor suggests three
phrases: 'corporate merger' with a score of
50, 'stocks' with a score of 30, and 'bonds'
with a score of 10. The normalized scores
are 100%, 60%, and 20%, respectively.")
|
- Longer documents often seem to have
better key phrases than shorter documents.
The problem with the previous suggestion is
that it ignores the document length. One
possibility would be to multiply the
normalized score from the previous suggestion
by (say) the logarithm of the length of
the document (measured in number of words or
in bytes). Another possibility would be to
sort the document collection by length and
increase the scores of key phrases according
to the length percentile of the document in
which they appear.
(Example: "The key phrase 'corporate merger'
appears in document #345. The key phrase has
a normalized score of 60%. However, since
document #345 is in the top 25% of
documents in the collection, according to
length, we will boost the score of
'corporate merger' by 20 percentage points,
for an adjusted score of 80%.")
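The suggestions above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the input is assumed to be a list of (phrase, score) pairs that you have already collected from Extractor, in descending order of score.

```python
import math


def document_frequency(phrases_per_doc):
    """Fraction of documents for which each phrase was suggested."""
    counts = {}
    for phrases in phrases_per_doc:
        for phrase in set(phrases):
            counts[phrase] = counts.get(phrase, 0) + 1
    total = len(phrases_per_doc)
    return {phrase: count / total for phrase, count in counts.items()}


def normalize_by_top(scored):
    """Divide each score by the highest score, giving values in (0, 1]."""
    top = max(score for _, score in scored)
    return [(phrase, score / top) for phrase, score in scored]


def length_weighted(scored, word_count):
    """Weight the top-normalized scores by the log of the document length."""
    weight = math.log(word_count)
    return [(phrase, score * weight)
            for phrase, score in normalize_by_top(scored)]


scored = [("corporate merger", 50.0), ("stocks", 30.0), ("bonds", 10.0)]
print(normalize_by_top(scored))
# [('corporate merger', 1.0), ('stocks', 0.6), ('bonds', 0.2)]
```

Using the natural logarithm of the word count is one arbitrary choice of weight; any other monotonic function of document length could be substituted.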
|
Can I
find the frequency of each key phrase in the
document?
Although Extractor calculates the frequency of
each key phrase in the input document, the API
does not currently enable access to these
numbers. If you are using the frequency as an
indicator of the importance of the key phrase,
then you should consider using the
score instead. |
|
Given a
sentence such as, "I am not skiing today," why
does Extractor select "skiing" as a key phrase
instead of "not skiing"?
The intention of Extractor is to capture the
main topics that are discussed in the input
document. Extractor does not attempt to convey
exactly how these topics are discussed. For
example, if a document discusses legal issues
concerning guns, Extractor might suggest the key
phrase "gun law". This key phrase does not
indicate whether the document supports strict
legal control of guns or opposes any
government involvement in gun control. The
design of Extractor was based on a study of how
authors use key phrases. We have examined
several thousand documents with key phrases
supplied by their authors. None of the key
phrases we have seen so far include the word
"not". |
|
I want
to use Extractor for automatic document
classification. Can you help me?
Automatic document classification is the use of
software to sort documents into various
pre-defined categories. A similar task is
automatic document clustering, in which there
are no pre-defined categories, so the software
must create the categories by itself. If you
want to learn more about automatic document
classification and clustering, there is a
hypertext Bibliography on Machine Learning
Applied to Text. Extractor can be used to
generate features for use in feature vectors for
machine learning algorithms. (If you are not
familiar with this terminology, it should become
clear to you as you read the papers in the
bibliography.) If you wish to use Extractor to
generate feature vectors, we suggest the
following approach: |
|
- Apply Extractor to all of the documents
in your sample collection.
|
- Take the union of all of
the extracted key phrases as the feature
set.
|
- For each document and each feature, let
the value of the feature be the number of
times that the given phrase occurs in the
given document (regardless of whether
Extractor extracted it from the given
document).
|
- Apply your favorite machine learning
algorithm (e.g., decision tree induction,
neural network, genetic algorithm, etc.) to
the resulting feature vectors.
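The first three steps might look like the following Python sketch. It assumes that you already have the key phrases for each document (the `extracted` lists below are made-up sample data, not real Extractor output), and it counts occurrences with a simple case-insensitive whole-word match, which is only one possible definition of "occurs".

```python
import re


def count_occurrences(phrase, text):
    """Count case-insensitive, whole-word occurrences of phrase in text."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def build_feature_vectors(documents, extracted):
    """Return (features, vectors).

    documents: list of document texts.
    extracted: list of key-phrase lists, one list per document.
    """
    # The feature set is the union of all extracted key phrases.
    features = sorted({p.lower() for phrases in extracted for p in phrases})
    # Each feature value is the phrase count in that document, whether or
    # not the phrase was extracted from that particular document.
    vectors = [[count_occurrences(f, doc) for f in features]
               for doc in documents]
    return features, vectors


docs = ["Gun law reform. The gun law debate continues.",
        "Stocks and bonds; no gun law here."]
extracted = [["gun law"], ["stocks", "bonds"]]
features, vectors = build_feature_vectors(docs, extracted)
print(features)  # ['bonds', 'gun law', 'stocks']
print(vectors)   # [[0, 2, 0], [1, 1, 1]]
```

The resulting vectors can be passed to whatever learning algorithm you prefer.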
|
How can
I combine key phrases that were extracted from
many different documents?
For some applications, you may wish to have a
list of key phrases that covers a whole
collection of documents, where each document has
been processed individually by Extractor. If you
have no constraints on the size of the list of
key phrases, you might simply take the union of
all of the phrases as your combined list. To
reduce the size of the list slightly, you might
keep only one of any phrases that share the same
stem (e.g., "automobile" and "automobiles"). If you want to
substantially reduce the size of the list, then
you can assign a normalized score
to each key phrase and select the key phrases
with the highest normalized scores. |
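As a rough illustration, the Python sketch below merges per-document lists of (phrase, normalized score) pairs. The function crude_stem is a deliberately naive stand-in that only strips a trailing "s"; it is not the stemmer used by Extractor, and a real stemming library would be a better choice in practice.

```python
def crude_stem(phrase):
    """Naive stand-in for a stemmer: lower-case and strip a trailing 's'."""
    words = []
    for word in phrase.lower().split():
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]
        words.append(word)
    return " ".join(words)


def combine_key_phrases(per_document, limit=None):
    """Merge phrase lists, keeping the best-scoring phrase for each stem.

    per_document: list of [(phrase, normalized_score), ...], one per document.
    limit: maximum size of the combined list, or None for no limit.
    """
    best = {}  # stem -> (phrase, score) with the highest score seen so far
    for scored in per_document:
        for phrase, score in scored:
            key = crude_stem(phrase)
            if key not in best or score > best[key][1]:
                best[key] = (phrase, score)
    ranked = sorted(best.values(), key=lambda item: item[1], reverse=True)
    return ranked if limit is None else ranked[:limit]


per_document = [
    [("automobile", 1.0), ("engines", 0.5)],
    [("automobiles", 0.8), ("engine", 0.7), ("tires", 0.2)],
]
print(combine_key_phrases(per_document, limit=2))
# [('automobile', 1.0), ('engine', 0.7)]
```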
|
Can
Extractor handle language X?
Extractor currently works with monolingual
documents in English, French, Japanese, German,
Spanish, or Korean. |
|
Can
Extractor handle document format X?
Extractor currently handles plain text, HTML,
and email. The HTML filter handles HTML escape
sequences for accents and ISO Latin-1 HTML
character entities. The email filter handles
MIME quoted-printable accents. If you are
developing software which must handle other
formats, there are several companies that offer
conversion modules that can be embedded in your
software. |
|
Can
Extractor handle character encoding X?
For English, French, German, and Spanish,
Extractor currently handles ISO Latin-1, MS-DOS
Code Page 437, and Unicode UCS-2 double-byte
character codes, using native byte ordering.
There is a choice of four Japanese character
encodings: JIS, Shift-JIS, EUC-JP, and Unicode
UCS-2. There is a choice of three Korean
character encodings: EUC-KR, Johab, and Unicode
UCS-2. |
|
How
can I generate 100 key phrases?
Extractor currently allows the user to specify
from 3 to 30 key phrases. For some applications,
you may wish to have more key phrases. One
solution is to break the document into smaller
sections and pass each section to Extractor.
Suppose we gave you a book and asked you to give
us a list of key phrases that capture the main
topics of the book. When your list approached 30
key phrases, we think you would be struggling to
think of more key phrases. It seems likely that
there are fewer than 30 "main topics" for most
books. Perhaps an average book only has 10 or 15
"main topics", but you could cover each topic
with 2 or 3 synonymous key phrases, to yield a
total of about 30 key phrases.
On the other hand, if we took any single chapter
from the same book, and asked you to give us a
list of key phrases that capture the main topics
of the chapter, we think the list would be
approximately the same size as the list you
would give us for the whole book. A key phrase
that captures the "main topic" of the chapter
might only capture a "minor topic" of the whole
book. So the union of the key phrases for each
chapter would be a superset of the key phrases
for the whole book.
This is why Extractor has a maximum of 30 key
phrases per "chunk". If you want more key
phrases, then you can break the document into
smaller "chunks" and take the union of the key
phrases for each individual "chunk". We believe
that this strategy will produce a superior list
to the strategy of treating the document as a
single, homogeneous whole. |
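A minimal Python sketch of the chunking strategy follows. The `extract` argument is a hypothetical callable, taking (text, count) and returning a list of phrases, that wraps whatever Extractor binding you use; it is not an actual Extractor function. Splitting on a fixed number of words is also just an assumption made to keep the sketch short. Splitting on chapter or section boundaries is closer to the reasoning above.

```python
def split_into_chunks(text, words_per_chunk):
    """Split text into consecutive chunks of at most words_per_chunk words."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]


def key_phrases_by_chunk(text, extract, words_per_chunk=2000, per_chunk=30):
    """Union of the key phrases of each chunk, in first-seen order."""
    seen = set()
    combined = []
    for chunk in split_into_chunks(text, words_per_chunk):
        for phrase in extract(chunk, per_chunk):
            key = phrase.lower()
            if key not in seen:  # skip phrases already found in earlier chunks
                seen.add(key)
                combined.append(phrase)
    return combined


def fake_extract(text, count):
    """Toy stand-in for Extractor: the first `count` distinct words."""
    return list(dict.fromkeys(text.split()))[:count]


text = "alpha beta gamma alpha delta beta"
print(key_phrases_by_chunk(text, fake_extract, words_per_chunk=3, per_chunk=3))
# ['alpha', 'beta', 'gamma', 'delta']
```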
|
When I
give a document to Extractor and ask for four
key phrases and then take the same document and
ask for seven key phrases, the four key phrases
are not always a subset of the seven key
phrases. Why?
This is explained in detail in Learning to
Extract Keyphrases from Text. If it is
important for your application that the four key
phrases that you get when you ask for four key
phrases should be the same as the first four key
phrases that you get when you ask for seven key
phrases, then ask for seven key phrases but only
take the first four. In general, if you
currently want M key phrases but you might
eventually want N key phrases (where N > M),
then ask Extractor for N key phrases, but only
take the first M key phrases. Better yet, store
all N key phrases, so you can later look up the
remaining N - M key phrases instead of running
Extractor twice. |
|
I want
Extractor to generate exactly N highlights (key
sentences). I know that I can set the number of
key phrases, but how do I set the number of
highlights?
Here are some suggestions: |
|
- Ask for K = 2 × N key phrases. On
average, you will get about 0.6 × 2 × N =
1.2 × N highlights.
Set the highlight type to remove
duplicates and to sort the highlights by
order of appearance in the text. Take the
first N highlights as your desired key
sentences. If there are more highlights
available, ignore them. If there are not
enough highlights available, try asking for
K = 2.5 × N key phrases. If K = 2.5 × N is
greater than 30, then break the document
into smaller sections and pass each section
to Extractor.
|
- Alternatively, ask for K = 3 × N key
phrases. Set the highlight type to allow
duplicates. In general, you will get 3 × N
highlights, with duplicates. The i-th
highlight shows the i-th key phrase in the
context of a sentence. Find the score of the
i-th key phrase and use this score as a
measure of the quality of the corresponding
highlight. When there are duplicate
highlights, score the highlight by the
maximum of the scores of each copy of the
highlight. Output the top N scoring
highlights.
|
- Proceed as in the previous suggestion,
but when there are duplicate highlights,
score the highlight by the sum of the scores
of each copy of the highlight. Output the
top N scoring highlights.
|
- Proceed as in the previous suggestion,
but when there are duplicate highlights,
score the highlight by the number of copies
of the highlight. For example, if there are
three copies of a certain sentence, then
that sentence gets a score of three. In
other words, the score of a highlight is the
number of key phrases that it contains.
Output the top N scoring highlights.
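The three duplicate-scoring variants (maximum, sum, and count) can be sketched in Python as follows. The sketch assumes you have already collected two parallel lists from Extractor with duplicates allowed: highlights[i] is the sentence for the i-th key phrase and scores[i] is that key phrase's score. The function name top_highlights is illustrative only.

```python
def top_highlights(highlights, scores, n, method="max"):
    """Return the n best distinct highlights.

    method: "max"   - best score among the copies of a highlight
            "sum"   - sum of the scores of the copies
            "count" - number of copies (key phrases in the sentence)
    """
    combined = {}
    for sentence, score in zip(highlights, scores):
        if method == "max":
            combined[sentence] = max(combined.get(sentence, 0.0), score)
        elif method == "sum":
            combined[sentence] = combined.get(sentence, 0.0) + score
        elif method == "count":
            combined[sentence] = combined.get(sentence, 0) + 1
        else:
            raise ValueError(f"unknown method: {method}")
    # The sort is stable, so ties keep their order of first appearance.
    ranked = sorted(combined, key=combined.get, reverse=True)
    return ranked[:n]


highlights = ["A", "B", "B", "C"]
scores = [50.0, 30.0, 25.0, 40.0]
print(top_highlights(highlights, scores, 2, "max"))    # ['A', 'C']
print(top_highlights(highlights, scores, 2, "sum"))    # ['B', 'A']
print(top_highlights(highlights, scores, 2, "count"))  # ['B', 'A']
```

Note that the three methods can rank the same highlights differently, as the sample output shows.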
|
In my
input document, I repeatedly use the word "X".
It is a very important word, and I use it early
and frequently. Yet Extractor does not recognize
it as a key phrase. Why?
There are several possibilities. First,
Extractor ignores words with fewer than three
letters. Second, your word "X" might be in the
stop word list. Third, your word "X" might be in
the stop phrase list. You cannot remove a word
from the stop word or stop phrase lists through
the API. You will need access to the source code
if you wish to remove words or phrases from the
stop word or stop phrase lists. You also cannot
modify the minimum required word length (three
letters) through the API. However, your issue
might be addressed by adding "X" to the list of
go phrases. A go phrase will be found, when it
appears in the input document, even if it also
appears in the stop word or stop phrase lists. |
|
In our
documents, we have phrases with four and more
words. What does Extractor do? Is there a limit
to the number of words in a key phrase?
Extractor is designed to extract key phrases
with one, two, or three words. We have collected
thousands of documents with key phrases supplied
by the authors, and authors create key
phrases with four or more words only about 5% of
the time. When we try to include phrases with four
or more words, we can cover a few more of the
authors' key phrases, but we also introduce a
few more errors. Since there is a net loss,
Extractor does not attempt to cover these longer
phrases. There are two things you might try, if
you really need to capture these longer phrases: |
|
- If Extractor outputs a phrase of the
form "A B C" and a phrase of the form "B C
D", then you can conjecture that these are
parts of a longer phrase "A B C D", and join
them together. For example, "National
Research Council" and "Research Council
Canada" would be joined to make "National
Research Council Canada".
|
- If you activate the highlights feature
(key sentences) and set the highlight
feature to mark key phrases in bold, the
bold marking will include phrases of four or
more words. You can then extract, from the
highlights, the phrases that are marked in
bold, by writing your own routine to process
the output highlights.
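The first suggestion, joining "A B C" and "B C D", might be sketched in Python like this. The function name join_overlapping is illustrative, and the sketch handles only the three-word case described above, comparing words case-insensitively. Keep in mind that the joined phrase is a conjecture and may not actually appear in the document.

```python
def join_overlapping(phrases):
    """Join three-word phrases "A B C" and "B C D" into "A B C D"."""
    joined = []
    for i, left in enumerate(phrases):
        left_words = left.split()
        if len(left_words) != 3:
            continue
        for j, right in enumerate(phrases):
            right_words = right.split()
            if i == j or len(right_words) != 3:
                continue
            tail = [w.lower() for w in left_words[1:]]
            head = [w.lower() for w in right_words[:2]]
            if tail == head:
                # Keep all of the left phrase and add the final word of
                # the right phrase.
                joined.append(" ".join(left_words + right_words[2:]))
    return joined


phrases = ["National Research Council", "Research Council Canada", "stocks"]
print(join_overlapping(phrases))  # ['National Research Council Canada']
```

To reduce false joins, you could additionally check that the joined phrase occurs somewhere in the input document before accepting it.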
|
I use
programming language X. Is there a way to call
the Extractor API from within language X?
The Extractor API is written in ISO/ANSI C.
Whatever programming language you use, it is
almost certain that there is a way for your
language to call an external C program. If you
are programming in C or C++, you will have no
problems calling Extractor. If you are
programming in Java, Perl, Python, or Visual
Basic, we have some experience with calling
Extractor from these languages. Please contact
us for help. |
|