RELATED U.S. APPLICATIONS DATA
This application is a continuation to U.S. application Ser. No. 12/757,910, filed Apr. 9, 2010. The application is incorporated herein by reference for all purposes
BACKGROUND OF THE INVENTION
Traditionally, editors of documents, such as web pages, manually selected portions of the documents to augment with additional information. For example, an editor might choose to associate various terms in the document with assorted hyperlinks to other documents. Unfortunately, there are a number of problems with this approach. For example, it can be difficult to determine which terms to highlight and it can also be difficult to determine which destinations should be selected for respective hyperlinks. Another problem is that for certain documents, it is not possible and/or not typical to augment content, even though doing so may benefit readers. As one example, messages posted to message boards typically include relatively few hyperlinks to other content. An additional problem is that the passage of time can potentially render both the selected portions of the document and the corresponding additional information out of date. Attempts to automatically augment document content are also problematic. For example, a publisher might choose to automatically link any occurrences in a document of phrases, such as “diet drug,” to advertising sites. Readers, encountering such links, may erroneously believe that they have been inserted manually and meant to enhance their knowledge of key aspects of the document. When they follow the links, they may become annoyed and/or feel as if they have been tricked into following a link, potentially resulting in a loss of goodwill toward both the publisher and the advertiser.
BRIEF DESCRIPTION OF THE DRAWINGS
Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.
FIG. 1 illustrates an embodiment of an environment in which documents are processed.
FIG. 2A illustrates an embodiment of a portion of a web page as rendered in a browser.
FIG. 2B illustrates an embodiment of a portion of a web page as rendered in a browser.
FIG. 3 illustrates an embodiment of a data processing engine.
FIG. 4 illustrates a mapping between a set of textual representations and a set of concepts.
FIG. 5 illustrates a process for resolving an ambiguity.
FIG. 6 illustrates an updated mapping between a set of textual representations and a set of concepts.
FIG. 7 illustrates an example of a portion of output generated by a document processing engine.
FIG. 8 illustrates an embodiment of a process for determining a mapping between a textual representation in a document and a concept.
FIG. 9 illustrates an embodiment of a process for categorizing a document.
FIG. 10 illustrates an embodiment of a portion of a webpage as rendered in a browser.
FIG. 11 is a chart illustrating the distance between textual representations.
FIG. 12 illustrates an embodiment of a process for including a hyperlink in a document.
FIG. 13 illustrates an embodiment of a portion of a webpage as rendered in a browser.
FIG. 14 illustrates an embodiment of a portion of a webpage as rendered in a browser.
FIG. 15 illustrates an embodiment of a process for delivering an article.
FIG. 16 illustrates an embodiment of a system for creating a hierarchy of concepts from a corpus of documents.
FIG. 17A is a portion of an arc list according to one embodiment.
FIG. 17B is a portion of a vertex list according to one embodiment.
FIG. 17C is a portion of an arc list according to one embodiment.
FIG. 17D is a portion of a subtree preferences list according to one embodiment.
FIG. 18 is a flow chart illustrating an embodiment of a process for creating a hierarchy of concepts from a corpus of documents.
FIG. 19 illustrates an example of a vector of weights according to one embodiment.
FIG. 20 is a flow chart illustrating an embodiment of a process for creating a hierarchy of concepts from a corpus of documents.
FIG. 21 illustrates an example of a portion of a concept hierarchy.
FIG. 22 illustrates an example of a hierarchy of information types according to some embodiments.
FIG. 23 illustrates an example of a system for categorizing a query.
FIG. 24 illustrates an example of a process for categorizing a query.
FIG. 25 illustrates an example of scores determined as part of a process for associating a query with a concept.
FIG. 26 illustrates an example of a process for cleaning concepts.
FIG. 27 illustrates an example of a concept hierarchy and scores associated with a query.
FIG. 28 illustrates an example of a system for categorizing a query.
FIG. 29 illustrates an example of a process for categorizing a query.
FIG. 30 illustrates an example of a portion of a process for categorizing a query.
FIG. 31 illustrates an example of a page that includes dynamically selected components, as rendered in a browser.
FIG. 32 illustrates an example of a system for delivering a page that includes a plurality of modules.
FIG. 33 is a flow chart illustrating an embodiment of a process for delivering a page that includes a plurality of modules.
FIG. 34 is a flow chart illustrating an embodiment of a process for delivering a page that includes a plurality of modules.
FIG. 35A illustrates an example of a page layout.
FIG. 35B illustrates an example of a page layout.
FIG. 35C illustrates an example of a page layout.
FIG. 35D illustrates an example of a page layout.
FIG. 36 illustrates an embodiment of a process for providing information to a module.
The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.
FIG. 1 illustrates an embodiment of an environment in which documents are processed. In the example shown, a user of client 116 (hereinafter “Alice”) uses a web browser to access a variety of sites 118-124. Site 118 hosts a blog that belongs to Alice\'s friend, Joe. Joe uses site 118 to make astronomy-related blog posts and to engage in discussions with readers of his blog via a commenting feature. Site 120 is a medically-themed message board on which users discuss various medical conditions and other topics with one another. Site 122 is a news aggregation service. Visitors to site 122 provide information about their interests and are provided with personalized news feeds. Site 124 belongs to the company for which Alice works, Acme Corporation. Site 124 securely makes available internal documents to users such as Alice that have appropriate credentials.
In the example shown, documents, such as document 102, are provided to document processing system 104 for processing. Examples of documents include blog posts made on site 118, forum messages exchanged on site 120, news articles made available through site 122, the various types of documents served by site 124, and any other text (in formats such as HTML, TXT, PDF, etc.) as applicable.
In various embodiments, for a given document 102, document processing engine 106 produces two types of output—a list of entities 110 and a document vector 112. As used herein, an entity is a pair of items—a textual representation (i.e., a string of text appearing in the document) and a concept associated with the textual representation. Unlike the textual representation (which is literally present in the document), the associated concept need not be literally present in the document. Instead, the concept is present in a taxonomy, such as is stored in database 108.
As one example, suppose a news article describes the saving of a baby from a fire by a dog. An excerpt from the article reads “The small, heroic sheltie saved baby Fred on Tuesday.” When the article is provided to system 104, one example of an entity 110 that is generated is (“sheltie”,“Shetland Sheepdog”). The first portion of the pair (the textual representation) is the fourth word of the excerpt. The second portion of the pair is the associated concept that is included in a taxonomy of concepts—the canonical name of the breed of dog also known as a “sheltie.” Document vector 112 is a ranked list of concepts associated with the document. An example of a document vector 112 for the dog article is: (pets:10, dogs:6, Shetland Sheepdog:4, arson:2) with each concept having an associated score. In various embodiments, the associated scores are normalized between 0 and 1.
The administrators of sites 118-122 (also referred to herein as “publishers”) each communicate with system 104 via portal 114. Through the portal, they configure information about their respective sites and also specify preferences for how the processing of system 104 is performed with respect to their documents. Joe (the owner of site 118) has specified that system 104 should automatically tag the blog posts that he writes with appropriate keywords and should also insert hyperlinks into the posts that lead to informative topical pages. He is too busy to include such links in his posts when he writes them and uses system 104 to improve the experience of his readers. In some embodiments, if advertisements are displayed on the topical pages to which Joe\'s pages link, he is afforded a share of the revenue generated from the advertisements. Tags and links are generated by Joe\'s blog software, in conjunction with an application program interface (API) provided by system 104, when he submits a new entry.
The administrator of site 120 has configured the site to prohibit, for security and spam minimization reasons, message board contributors from including hyperlinks in their messages. When viewers access site 120, however, posts appear to include relevant, informative links. As with Joe\'s blog, many of the links direct users to custom generated topical pages. In addition, for concepts that monetize well (e.g., diet drugs), a small number of links to advertising sites (or sites other than the topical pages) are included. Links can also direct users to other pages within the publisher\'s site or network of sites.
When users of site 122 first visit the site, they provide a list of topics that are of interest to them. Examples include “entrepreneurism” and “dog health.” When new news articles are detected by site 122, they are processed (via an API) by system 104. System 104 provides back to site 122 a document vector 112. As will be described in more detail below, articles are selectively provided to users based on the user\'s interests and the concepts included in the articles\' document vectors.
In the environment shown, Acme Corporation owns a document processing system 126 that provides functionality similar to that of system 104. System 126 is configured to receive as input various internal documents and to categorize and summarize those documents in accordance with the techniques described herein.
The techniques described herein can also be used to process documents without the explicit cooperation of a publisher or other document source. For example, client 130 includes a web browser application that is configured to use a plugin 132 that is in communication with site 104. When a user of client 130 visits a page on website 128, the plugin provides a copy of the page to system 104. System 104 processes the document in accordance with the techniques described herein and provides information to plugin 132 that is used when the browser application renders the page for the user. Plugin 132 can be configured to provide a variety of enhancements to the user\'s viewing experience pages. As one example, the browser can include additional links in the rendered page (similar to the functionality of site 120). The browser can also provide a separate window, frame, or sidebar into which information, such as a summary of the page, key terms in the page, concepts related to the page, and even custom widgets/modules, are displayed, without altering the rendering of the page itself.
In the example shown in FIG. 1, system 104 comprises standard commercially available server hardware (e.g., having a multi-core processor, 4G+ of RAM, and Gigabit network interface adaptors) running a typical server-class operating system (e.g., Linux). In various embodiments, system 104 is implemented across a scalable infrastructure comprising multiple such servers, solid state drives, and other applicable high-performance hardware.
Whenever system 104 is described as performing a task (such as communicating with a client or accessing information in a database), either a single component or a subset of components or all components of system 104 may cooperate to perform the task. Similarly, whenever a component of system 104 is described as performing a task, a subcomponent may perform the task and/or the component may perform the task in conjunction with other components. In various embodiments, portions of system 104 are provided by one or more third parties. As one example, database 108 stores a taxonomy comprising millions of concepts. The taxonomy can be created by system 104 (using techniques described in more detail below) and can also be supplied to system 104 by a separate component, or by a third party. As another example, database 108 also includes various statistical information, such as inverse document frequency information, that can be periodically computed by system 104, supplied by a separate component, or provided by a third party.
FIG. 2A illustrates an embodiment of a portion of a web page as rendered in a browser. In the example shown, Joe is preparing to make a new blog post. He is asked to supply a title for the post in region 202 and to provide the body of the post in region 204. By selecting box 206, Joe is indicating that he would like system 104 to automatically select portions of the body of his post and generate hyperlinks for those portions. Joe may choose to select this box because he is too busy to carefully annotate his post. Another reason Joe may choose to select this box is because he is unsure of which terms he should select to link and/or which destination pages would be best to link to for a given term. By selecting box 208, Joe is indicating that he would like system 104 to automatically tag the post with a few key concepts.
FIG. 2B illustrates an embodiment of a portion of a web page as rendered in a browser. The page shown in FIG. 2B was created as a result of Joe supplying a title 252 and a body of text 256 to the interface shown in FIG. 2A. When Joe selected submit button 210, title 252 and body 256 were transmitted by Joe\'s blog server software to system 104. System 104 processed the received title and body (collectively, a document) and returned to site 118 a set of tags to be included in region 254 and instructions on which phrases in body 256 should have associated hyperlinks, as well as URL information for each such hyperlink.
Using techniques described in more detail below, system 104 was able to determine that Joe\'s post pertains to the Lunar Reconnaissance Orbiter, as well as to cameras, as indicated in region 254. System 104 also determined that a total of ten hyperlinks should be included and that those ten hyperlinks should be distributed with a higher concentration of links toward the top portion of the body and a more sparse distribution of links toward the bottom.
As mentioned above, system 104 allows Joe to configure, via portal 114, a variety of preferences for how system 104 processes his documents. As one example, Joe can specify constraints on where visitors will be directed by the inserted hyperlinks. In the example shown in FIG. 2B, Joe has made the following customization choices: (1) When businesses or organizations mentioned in his articles are selected by system 104, Joe would like visitors to be directed to the canonical websites of those locations. Thus, if a visitor such as Alice were to click on link 258, she would be directed to www.agu.org, the main site of the American Geophysical Union. In various embodiments, this is made possible by the taxonomy stored in database 108, including information such as an associated website for each concept, or for some subset of concepts. The associated website can be scraped and can also be manually included in the taxonomy by an administrator of system 104 and/or by a representative of the business/organization, such as via portal 114. (2) When a phrase selected (that is not a business or organization) has a corresponding entry in Wikipedia, Joe would like visitors to be directed to the appropriate Wikipedia page. Thus, if a visitor were to click on link 260, he would be directed to http://en.wikipedia.org/wiki/Lunar_Reconnaissance_Orbiter, a Wikipedia entry about the orbiter. (3) Finally, for any phrases selected by system 104 which do not match either of the aforementioned situations, Joe would like visitors to be directed to an automatically generated topic page (described in more detail below). When Joe\'s visitors are directed to the automatically generated topic page, Joe will receive a portion of any advertising revenue generated by those visitors as they encounter advertisements on the automatically generated topic page.
Joe has selected three different types of destination URLs because he believes that the customizations he has made will result in the most appealing experience for his visitors. Joe can also leave the URL selection up to system 104 entirely, can specify that only phrases with corresponding Wikipedia pages be linked (even if it results in fewer than ten hyperlinks being inserted), can specify that links to those topic pages generating the most revenue be preferred over other links, etc.
Joe can also customize how and which tags (254) are selected. One purpose of tagging a post is to allow visitors who are interested in one of the tagged subjects to quickly find other blog posts on the site that pertain to that subject by clicking on the appropriate tag. Rather than selecting tags 254 from among all of the potentially millions of concepts stored in database 108, Joe has specified that system 104 should select tags only from those tags already in use on his site. If he chooses, Joe can instead specify that tags be selected from a list of 50 subjects he has previously specified as being acceptable, can specify that he be prompted by site 118 to approve any tags selected by system 104 that have not previously be used on site 118, or other any other appropriate configuration.
FIG. 3 illustrates an embodiment of a data processing engine. Data processing engine 106 is of a modular design and employs a blackboard architecture in which various modules (if included) contribute to computation and refinement of various calculations (such as the computation of vectors 310-314) as applicable. Some of the processing performed by the modules of data processing engine 106 is parallelizable, such as natural language processing and textual representation detection. Further, the processing performed by engine 106 is customizable through the use of configuration file 318 (e.g., allowing documents from different publishers to be processed differently). Additional detail on various aspects of data processing engine 106 will now be provided.
When a document, such as document 102, is received, if applicable, preprocessor 302 converts the document (e.g. from a DOC or PDF file) or otherwise extracts (e.g. from HTML or XML) a plaintext representation of the content of the document. Preprocessor 302 is also configured to handle special characters, such as by converting occurrences of the “&” sign into whitespace or into the word “and.”
Boundary Processing/Position Information
Boundary processor 304 is configured to recognize certain types of boundaries within a document based on the format of the document (e.g., <head>, <body>, <hl>, and <p> HTML tags) and can also parse configuration information supplied by publishers regarding the formatting of documents on their sites. Documents provided to processor 106 by the interface shown in FIG. 2A include two sections—a title section and a body section. In some embodiments, document boundaries are ignored and the processing of boundary processor 304 is omitted. In various embodiments, boundary processor 304 is also configured to store, for each term in the document, the position of the term. As one example, the first word in the document would have a position 0, the second word in the document would have a position 1, and so on. As will be described in more detail below, terms that appear in one section of a document (such as a title) may be scored or otherwise treated differently than terms that appear in another section (such as in the comments). In addition, publishers can use sections to enforce preferences, such as that all terms appearing in a document be used to categorize the document, but that only terms appearing in the main body (and not the title or comments sections) be able to be associated with hyperlinks. Such preferences can be provided by the publisher via configuration 318.
Natural Language Processing
Natural language processor 306 is configured to determine part-of-speech information for each term in the document. In various embodiments, natural language processor 306 uses part-of-speech tags, such as are provided by the Brown corpus, to tag each term in the document. Using the article shown in FIG. 2B as an example, “NASA\'s” would be tagged “NP$,” meaning that it is a possessive proper noun. As will be described in more detail below, in various embodiments, different parts of speech are assigned different scores and those scores can be used in evaluating textual representations.
Textual Representation Detection
Whitelist 320, extracted from the taxonomy stored in database 108, is a list of all of the concepts that are included in the taxonomy. Textual representation detector 308 is configured to perform a greedy match against the document using whitelist 320. Each match is included in a list of candidate textual representations 324. Using the first line of the article shown in FIG. 2B as an example “NASA,” “mission,” “orbit,” and “moon,” would each be included in the list of candidate textual representations 324. Suppose “Lunar” and “Lunar Reconnaissance Orbiter” are both phrases that are included in whitelist 320 but “Lunar Reconnaissance” is not. Because detector 308 is configured to perform a greedy match, “Lunar Reconnaissance Orbiter” will be added to the list of candidate textual representations 324 while the other two terms will not. In various embodiments, detector 308 is configured to perform other types of matches, instead of or in addition to greedy matches. In some embodiments, all matches (e.g. both “Lunar” and “Lunar Reconnaissance Orbiter”) are added to list 324.
Suppose “The American” and “American Pie” are both concepts included in whitelist 320, but that “The American Pie” is not. Also suppose that document 102 includes the string “The American Pie movie is showing at the Downtown Theatre tomorrow.” When performing its greedy match, detector 308 might add to list 324 two entries, “The American” and “Pie,” erroneously omitting “American Pie.” To address this problem, in some embodiments, detector 308 employs a prepositional rule in which, when a match that includes at its start a preposition is detected, the preposition is temporarily ignored and the greedy match continues using the next word in the document. If a match is found, the preposition is discarded and the phrase that does not include it is used. In this example, because “The American” includes a leading preposition, “The” would be temporarily ignored, and a match of “American Pie” would be detected. From the three words, “The American Pie,” only one entry would be added to list 324—“American Pie.”
Without further refinement, the list of candidate textual representations 324 might include virtually every word in document 102. Accordingly, in various embodiments, textual representation detector 308 employs additional logic to refine the list of candidate textual representations. As will be described in more detail below, the candidate list can be refined/pruned both before and after feature vectors for items on the candidate list are populated.
Static and Runtime Blacklists
In various embodiments, textual representation detector 308 is configured to exclude from inclusion in list 324 those textual representations that match a blacklist 316. Stop words (such as “a,” “about,” “again,” and “would”) are one example of terms that can be included in a static blacklist. A publisher can also provide custom blacklists (referred to herein as “runtime” blacklists) that should be considered by engine 106 when processing that particular publisher\'s documents. As one example, a publisher may blacklist the names of competitors. As another example, the publisher may have an agreement with a third-party advertising company that certain words be directed to that advertising company. By employing a blacklist, the publisher can prevent the already-contracted-for words from being considered by engine 106. Publishers can also specify constraints such as requiring that all textual representations belong to one or more verticals (also referred to herein as “top level categories”) specified by the publisher, which will be described in more detail below.
Concepts included in a taxonomy can be used to bias/prune candidate textual representations, as will be described in more detail below. As with the examples described in the previous section, concept-based blacklists can be static (e.g., applied to all documents) or runtime (e.g., used according to a configuration supplied by a publisher or other runtime clue). For example, an administrator of engine 106 can configure as blacklisted concepts “chronology” and “days of the week.” Child topics such as “Monday” and “1997” would be blacklisted as a result. As another example, message board publisher 120 can indicate a preference for health-themed textual representations by specifying the vertical, “Health,” as a whitelisted concept in configuration 318. Publisher 122 can indicate a preference against adult-themed textual representations by specifying the vertical “Adult Entertainment” as a blacklisted concept. Instead of supplying whitelists/blacklists, in some embodiments publishers assign weights to various categories, so that higher weighted categories are given preference over lower weighted categories by engine 106. As one example, publisher 122 could provide the following: “Health(1); Sports(0.5)” indicating a preference for health-related concepts but also indicating that sports concepts should be considered. In yet another embodiment, plugin 132 can be configured to provide a concept “signature” for Alice—a customized list of Alice\'s topical preferences, such as: “Science(1); Animals(0.5); Entertainment(0.4); Travel(0.4); Sports(−1).”
In various embodiments, concept whitelist/blacklist information is passed in at runtime via the provider of document 102 instead of or in addition to being supplied via configuration 318. Whitelist information can also be collected on behalf of a publisher, without requiring the publisher to manually specify category preferences. One way of accomplishing this is as follows. When a publisher initially decides to use the services provided by system 106, system 106 performs the document categorization techniques described herein across the corpus of documents included in the publisher\'s site and collects together the dominant concepts into a concept whitelist.
Regular Expression Patterns
In various embodiments, textual representation detector 308 is configured to exclude from inclusion in list 324 those textual representations that match a regular expression. As one example, as a result of converter/preprocessor 302 manipulating document 102, a term such as “AT&T GSM” may be converted to “AT T GSM.” Suppose “TGSM” is a concept included in whitelist 320. During the greedy match portion, “TGSM” may be erroneously added to candidate list 324. A regular expression pattern that discards matches that begin with a lone “T” or a lone “S” can be used to prevent the erroneous match from being included.
Proper Noun Sequences