Why does Extractor ignore the META tag in HTML?
The META tag in HTML is used to convey meta-information about the document,
for example:
<META HTTP-EQUIV="Expires" CONTENT="Tue, 04 Dec 1993 21:29:02 GMT">
<META HTTP-EQUIV="Keywords" CONTENT="Nanotechnology, Biochemistry">
<META HTTP-EQUIV="Reply-to" CONTENT="dsr@w3.org (Dave Raggett)">
Extractor ignores this meta-information. In particular, it does not use
the "Keywords" meta-information.
Extractor ignores the META tag for two reasons: (1) If you really care about
the META tag, then you can easily write your own subroutine to parse it.
(2) The META tag is widely abused. It is mainly used as a device for tricking
search engines into giving a page a higher ranking in a hit list when a user
enters a query. If you search for the word "meta", you will find many
web pages that give web authors tips on how to fool search engines by using the META tag.
What is the meaning of the numbers associated with each
key phrase?
When you run the sample program "test_api.exe" (or "test_api.bin" for
Unix platforms), each key phrase will be output with a number after it. These numbers are the scores
returned by the API function ExtrGetScoreByIndex().
The score of a phrase is an estimate of its value as a key phrase. Key phrases are ranked in order of descending score.
A score can be any positive real number. The scores with long
documents as input tend to be higher than the scores with short documents.
The method of calculating the score is described in detail in
Learning to Extract Keyphrases from Text.
For some applications, it might be desirable to normalize
the score, as discussed under the next question.
How can I normalize the score?
For some applications, it might be desirable to normalize the score,
so that the scores of key phrases from different documents can be
compared. Here are some suggestions for normalization:
- Ignore the scores produced by Extractor. Given a large collection of
documents (e.g., web pages), score each key phrase by the percentage of documents
for which the given key phrase was suggested by Extractor. (Example: "The key phrase
'corporate merger' was generated by Extractor for 45 of the 100
documents. Thus 'corporate merger' has a score of 45%.")
- Ignore the scores produced by Extractor. Given a large collection of
documents (e.g., web pages), score each key phrase by the percentage of documents
in which the given key phrase appears somewhere (even if it was not suggested
by Extractor). (Example: "The key phrase 'corporate merger' appears somewhere in
the body of 45 of the 100 documents. Thus 'corporate merger' has a score of
45%.")
- Take the score produced by Extractor and normalize it so that it ranges
from 0% to 100%, by dividing the score of each key phrase by the score of the
first key phrase. (The first key phrase always has the highest score.)
(Example: "Extractor suggests three phrases: 'corporate merger' with a score
of 50, 'stocks' with a score of 30, and 'bonds' with a score of 10. The
normalized scores are 100%, 60%, and 20%, respectively.")
- Longer documents often seem to have better key phrases than shorter
documents. The problem with the previous suggestion is that it ignores the document
length. One possibility would be to multiply its normalized score
by (say) the logarithm of the length of the document (measured in number
of words or in bytes). Another possibility would be to sort the document
collection by length and increase the scores of a document's key phrases according to
the length percentile in which the document appears. (Example: "The key phrase 'corporate merger'
appears in document #345. The key phrase has a normalized score of 60%.
However, since document #345 is in the top 25% of documents in
the collection, according to length, we will boost the score of 'corporate
merger' by 20 percentage points, for an adjusted score of 80%.")
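As a rough illustration of the first and third suggestions above, here is a minimal Python sketch. The function names are our own and are not part of the Extractor API. The sketch assumes that the key phrases and scores have already been collected from Extractor (for example, through ExtrGetScoreByIndex()).

```python
def normalize_by_top_score(scores):
    """Scale scores to percentages of the highest score (third suggestion)."""
    if not scores:
        return []
    top = max(scores)  # scores are positive, so top > 0
    return [100.0 * s / top for s in scores]


def document_frequency(phrase_lists):
    """Percentage of documents for which each phrase was suggested
    (first suggestion). phrase_lists holds one list of key phrases
    per document."""
    counts = {}
    for phrases in phrase_lists:
        for phrase in set(phrases):  # count each phrase once per document
            counts[phrase] = counts.get(phrase, 0) + 1
    total = len(phrase_lists)
    return {p: 100.0 * c / total for p, c in counts.items()}


# The example from the text: scores of 50, 30, and 10.
print(normalize_by_top_score([50, 30, 10]))  # → [100.0, 60.0, 20.0]
```

The same document_frequency() function could serve for the second suggestion, if each list holds the phrases that appear anywhere in the document rather than the phrases suggested by Extractor.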
Can I find the frequency of each key phrase
in the document?
Although Extractor calculates the frequency of each key phrase in
the input document, the API does not currently enable access to
these numbers. If you are using the frequency as an indicator of the
importance of the key phrase, then you should consider using the score
instead.
Given a sentence such as, "I am not skiing
today," why does Extractor select "skiing" as a key phrase instead
of "not skiing"?
The intention of Extractor is to capture the main topics that
are discussed in the input document. Extractor does not attempt
to convey exactly how these topics are discussed. For example,
if a document discusses legal issues concerning guns, Extractor
might suggest the key phrase "gun law". This key phrase does not
indicate whether the document supports strict legal control of
guns or opposes any government involvement in gun control.
The design of Extractor was based on a study of how authors
use key phrases. We have examined several thousand documents with key phrases supplied by their authors. None of the
key phrases we have seen so far includes the word "not".
I want to use Extractor for automatic
document classification. Can you help me?
Automatic document classification is the use of software to
sort documents into various pre-defined categories. A similar
task is automatic document clustering, in which there are no
pre-defined categories, so the software must create the categories
by itself. If you want to learn more about automatic document classification
and clustering, there is a hypertext
Bibliography on Machine Learning Applied to Text. Extractor
can be used to generate features for use in feature vectors for
machine learning algorithms. (If you are not familiar with this terminology,
it should become clear to you as you read the papers in the bibliography.)
If you wish to use Extractor to generate feature vectors, we suggest
the following approach:
- Apply Extractor to all of the documents in your sample collection.
- Take the union of all of the extracted key phrases as
the feature set.
- For each document and each feature, let the value of the
feature be the number of times that the given phrase occurs
in the given document (regardless of whether Extractor extracted
it from the given document).
- Apply your favorite machine learning algorithm (e.g., decision
tree induction, neural network, genetic algorithm, etc.) to the
resulting feature vectors.
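The first three steps might be sketched in Python as follows. This is only an illustration under our own assumptions: the function names are hypothetical, the key phrases are assumed to have been extracted already, and phrases are counted as case-insensitive whole-word matches, which may not be how you want to count them.

```python
import re


def count_phrase(phrase, text):
    """Count case-insensitive, whole-word occurrences of phrase in text."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def build_feature_vectors(documents, extracted):
    """documents: list of document texts.
    extracted: list of key phrase lists, one list per document.
    Returns the feature set and one count vector per document."""
    # The feature set is the union of all extracted key phrases.
    features = sorted({p.lower() for phrases in extracted for p in phrases})
    # Each feature value is the number of times the phrase occurs in the
    # document, whether or not it was extracted from that document.
    vectors = [[count_phrase(f, doc) for f in features] for doc in documents]
    return features, vectors
```

The resulting vectors can then be passed to whatever machine learning algorithm you prefer.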
How can I combine key phrases that were
extracted from many different documents?
For some applications, you may wish to have a list of key phrases
that covers a whole collection of documents, where each document
has been processed individually by Extractor. If you have no
constraints on the size of the list of key phrases, you might
simply take the union of all of the phrases as your combined
list. To reduce the size of the list slightly, you might
keep only one phrase from each group of phrases that share the
same stem (e.g., "automobile" and "automobiles"). If you want to substantially
reduce the size of the list, then you can assign a normalized score
to each key phrase and select the key phrases with the highest
normalized scores.
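Here is a minimal Python sketch of this idea. The names are our own, and naive_stem() is a crude stand-in (it only strips a trailing "s"); it is not the stemming method used inside Extractor. The scores are assumed to be normalized already, so that they are comparable across documents.

```python
def naive_stem(phrase):
    """Crude stand-in for a real stemmer: lowercase each word and
    strip a trailing 's' from words longer than three letters."""
    words = phrase.lower().split()
    return " ".join(w[:-1] if len(w) > 3 and w.endswith("s") else w
                    for w in words)


def combine_key_phrases(scored_phrases, limit=None):
    """scored_phrases: (phrase, normalized_score) pairs from all documents.
    Keeps the best-scoring phrase for each stem, ranked by score."""
    best = {}
    for phrase, score in scored_phrases:
        stem = naive_stem(phrase)
        if stem not in best or score > best[stem][1]:
            best[stem] = (phrase, score)
    ranked = sorted(best.values(), key=lambda item: item[1], reverse=True)
    if limit is not None:
        ranked = ranked[:limit]  # keep only the highest normalized scores
    return [phrase for phrase, _ in ranked]
```

With no limit, the result is the union of the phrases with same-stem variants merged; with a limit, it is the shorter list of top-scoring phrases.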
Can Extractor handle language X?
Extractor currently works with monolingual documents in English,
French, Japanese, German, Spanish, or Korean.
Can Extractor handle document format X?
Extractor currently handles plain text, HTML, and email.
The HTML filter handles HTML escape sequences for accents and
ISO Latin-1 HTML character entities. The email filter handles
MIME quoted-printable accents.
If you are developing software which must handle other formats,
there are several companies that offer conversion modules that
can be embedded in your software.
Can Extractor handle character encoding X?
For English, French, German, and Spanish, Extractor currently handles ISO Latin-1, MS-DOS
Code Page 437, and Unicode UCS-2 double-byte character codes, using native
byte ordering. There is a choice of four Japanese character encodings: JIS,
Shift-JIS, EUC-JP, and Unicode UCS-2. There is a choice of three Korean
character encodings: EUC-KR, Johab, and Unicode UCS-2.
How can I generate 100 key phrases?
Extractor currently allows the user to
specify from 3 to 30 key phrases. For some applications, you may wish to have more
key phrases. One solution is to
break the document into smaller sections
and pass each section to Extractor.
Suppose we gave you a book and asked you to give us a list of key phrases
that capture the main topics of the book. When your list approached 30
key phrases, we think you would be struggling to think of more key phrases.
It seems likely that there are fewer than 30 "main topics" for most books.
Perhaps an average book only has 10 or 15 "main topics", but you could
cover each topic with 2 or 3 synonymous key phrases, to yield a total of
about 30 key phrases.
On the other hand, if we took any single chapter from the same book, and
asked you to give us a list of key phrases that capture the main topics
of the chapter, we think the list would be approximately the same size as the
list you would give us for the whole book. A key phrase that captures the
"main topic" of the chapter might only capture a "minor topic" of the
whole book. So the union of the key phrases for each chapter would be
a superset of the key phrases for the whole book.
This is why Extractor has a maximum of 30 key phrases per "chunk". If you
want more key phrases, then you can break the document into smaller
"chunks" and take the union of the key phrases for each individual "chunk".
We believe that this strategy will produce a superior list to the strategy
of treating the document as a single, homogeneous whole.
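The chunking strategy might look like the following Python sketch. The extract argument is a hypothetical wrapper that you would write around the Extractor API; it is assumed to take a piece of text and a number n and to return up to n key phrases. The chunk size of 2000 words is an arbitrary choice for illustration. In practice you would probably split on chapter or section boundaries rather than on a fixed word count.

```python
def split_into_chunks(text, max_words):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def key_phrases_by_chunk(text, extract, max_words=2000, per_chunk=30):
    """Take the union of the key phrases for each chunk.
    extract(chunk, n) is a hypothetical wrapper around Extractor."""
    seen = set()
    combined = []
    for chunk in split_into_chunks(text, max_words):
        for phrase in extract(chunk, per_chunk):
            key = phrase.lower()
            if key not in seen:  # keep the first occurrence only
                seen.add(key)
                combined.append(phrase)
    return combined
```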
When I give a document to Extractor and ask
for four key phrases and then take the same document and ask for
seven key phrases, the four key phrases are not always a subset of the
seven key phrases. Why?
This is explained in detail in
Learning to Extract Keyphrases from Text. If it is important
for your application that the four key phrases that you get when
you ask for four key phrases should be the same as the first four
key phrases that you get when you ask for seven key phrases, then
ask for seven key phrases but only take the first four. In general,
if you currently want M key phrases but you might eventually want
N key phrases (where N > M), then ask Extractor for N key phrases,
but only take the first M key phrases. Better yet, store all N key
phrases, so you can later look up the remaining N - M key phrases
instead of running Extractor twice.
I want Extractor to generate exactly N highlights
(key sentences). I know that I can set the number of key phrases, but
how do I set the number of highlights?
Extractor currently allows the user to specify from 3 to 30
key phrases (key concepts). If you have
set the highlight type to allow duplicates, then the number of
highlights (key sentences) will be the same as the specified number of key phrases. For each
key phrase, there will be a matching highlight, showing
the key phrase in context. (There might be fewer highlights than key phrases,
if Extractor was not able to find a good highlight for a certain key phrase.)
However, if you set the highlight type to remove duplicates, then there will
usually be fewer highlights than key phrases (because two or more different key phrases may be best illustrated by the same key sentence).
On average, when duplicate
highlights are removed, if you specify K key phrases, then you will get approximately
N = 0.6 × K highlights. If you require exactly N highlights,
with duplicate highlights removed, here are some options:
- Ask for K = 2 × N key phrases. On average, you will get about
0.6 × 2 × N = 1.2 × N highlights.
Set the highlight type to remove duplicates and to sort the highlights by order
of appearance in the text. Take the first N highlights as your desired key sentences.
If there are more highlights available, ignore them. If there are not enough
highlights available, try asking for K = 2.5 × N key phrases.
If K = 2.5 × N is greater than 30, then break the document into smaller sections
and pass each section to Extractor.
- Alternatively, ask for K = 3 × N key phrases.
Set the highlight type to allow duplicates. In general, you will get
3 × N highlights, with duplicates. The i-th highlight
shows the i-th key phrase in the context of a sentence.
Find the score of the i-th key phrase and use this score as
a measure of the quality of the corresponding highlight. When there are
duplicate highlights, score the highlight by the maximum of the scores of
each copy of the highlight. Output the top N scoring highlights.
- Proceed as in the previous suggestion, but when there are duplicate
highlights, score the highlight by the sum of the scores of each copy
of the highlight. Output the top N scoring highlights.
- Proceed as in the previous suggestion, but when there are duplicate
highlights, score the highlight by the number of copies of the highlight.
For example, if there are three copies of a certain sentence, then that
sentence gets a score of three. In other words, the score of a highlight
is the number of key phrases that it contains. Output the top N scoring highlights.
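The last three options differ only in how the copies of a duplicated highlight are scored, so they can share one routine. The following Python sketch uses our own names and assumes that you have already paired each highlight with the score of its key phrase (the i-th highlight with the i-th score).

```python
def top_highlights(pairs, n, method="max"):
    """pairs: (highlight, key_phrase_score) tuples, one per key phrase,
    with duplicate highlights allowed. Returns the top n distinct
    highlights under the chosen scoring method."""
    combined = {}
    for sentence, score in pairs:
        if method == "max":      # maximum score over all copies
            combined[sentence] = max(combined.get(sentence, 0.0), score)
        elif method == "sum":    # sum of the scores of all copies
            combined[sentence] = combined.get(sentence, 0.0) + score
        elif method == "count":  # number of copies
            combined[sentence] = combined.get(sentence, 0) + 1
        else:
            raise ValueError("method must be 'max', 'sum', or 'count'")
    # Ties keep the order in which the highlights first appeared.
    ranked = sorted(combined, key=combined.get, reverse=True)
    return ranked[:n]
```

Which method works best is likely to depend on your documents, so it may be worth comparing all three on a small sample.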
In my input document, I repeatedly use the word "X".
It is a very important word, and I use it early and frequently. Yet
Extractor does not recognize it as a key phrase. Why?
There are several possibilities. First, Extractor ignores words with fewer
than three letters. Second, your word "X" might be in the
stop word list. Third, your word "X" might be in the
stop phrase list. You cannot remove a word from the stop word or stop phrase lists
through the API. You will need access to the source code if you wish to
remove words or phrases from the stop word or stop phrase lists. You also cannot modify the
minimum required word length (three letters) through the API. However,
your issue might be addressed by adding "X" to the list of
go phrases. A go phrase will
be found, when it appears in the input document, even if it also appears
in the stop word or stop phrase lists.
In our documents, we have phrases with four and more words.
What does Extractor do? Is there a limit to the number of words in a key phrase?
Extractor is designed to extract key phrases with one, two, or three words. We
have collected thousands of documents with key phrases supplied by the authors,
and authors only create key phrases with four or more words about 5% of the time. When we
try to include phrases with four or more words, we can cover a few more
of the authors' key phrases, but we also introduce a few more errors. Since there
is a net loss, Extractor does not attempt to cover these longer phrases.
There are two things you might try, if you really need to capture these longer
phrases:
- If Extractor outputs a phrase of the form "A B C" and a phrase
of the form "B C D", then you can conjecture that these are parts of a longer
phrase "A B C D", and join them together. For example, "National Research
Council" and "Research Council Canada" would be joined to make
"National Research Council Canada".
- If you activate the highlights feature (key sentences) and
set the highlight feature to mark key phrases in bold, the
bold marking will include phrases of four or more words. You can then extract,
from the highlights, the phrases that are marked in bold,
by writing your own routine to process the output highlights.
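The first suggestion, joining overlapping phrases, might be sketched in Python as follows. The function name and the min_overlap parameter are our own. The sketch assumes that an overlap of at least two words is required, as in the "A B C" and "B C D" example, and it compares words without regard to case.

```python
def join_overlapping(first, second, min_overlap=2):
    """If the last words of first match the first words of second,
    return the joined phrase; otherwise return None."""
    a, b = first.split(), second.split()
    longest = min(len(a), len(b)) - 1  # the overlap must leave a new word
    for size in range(longest, min_overlap - 1, -1):
        tail = [w.lower() for w in a[-size:]]
        head = [w.lower() for w in b[:size]]
        if tail == head:
            return " ".join(a + b[size:])
    return None


print(join_overlapping("National Research Council",
                       "Research Council Canada"))
# → National Research Council Canada
```

Remember that the joined phrase is only a conjecture. You may want to confirm that it actually occurs in the input document before using it.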
I use programming language X. Is there a way to
call the Extractor API from within language X?
The Extractor API is written in ISO/ANSI C. Whatever programming language
you use, it is almost certain that there is a way for your language to
call an external C program. If you are programming in C or C++, you will
have no problems calling Extractor. If you are programming in Java, Perl,
Python, or Visual Basic, we have some experience with calling Extractor
from these languages. Please contact us for help.