Including hyperlinks in a document   



Abstract: Techniques for including a hyperlink in a document are disclosed. A document is received via a communications interface. An entity pair is determined by a processor. The entity pair includes a concept included in a concept taxonomy and a textual representation included in the document. As output, a hyperlink is provided.
Agent: Wal-mart Stores, Inc. - Bentonville, AR, US
USPTO Application #: 20120297278 - Class: 715205 (USPTO) - 11/22/12 - Class 715
Related Terms: Hyperlink   Hyperlinks   Taxonomy   


The Patent Description & Claims data below is from USPTO Patent Application 20120297278, Including hyperlinks in a document.


RELATED U.S. APPLICATIONS DATA

This application is a continuation of U.S. application Ser. No. 12/757,910, filed Apr. 9, 2010. The application is incorporated herein by reference for all purposes.

BACKGROUND OF THE INVENTION

Traditionally, editors of documents, such as web pages, manually selected portions of the documents to augment with additional information. For example, an editor might choose to associate various terms in the document with assorted hyperlinks to other documents. Unfortunately, there are a number of problems with this approach. For example, it can be difficult to determine which terms to highlight and it can also be difficult to determine which destinations should be selected for respective hyperlinks. Another problem is that for certain documents, it is not possible and/or not typical to augment content, even though doing so may benefit readers. As one example, messages posted to message boards typically include relatively few hyperlinks to other content. An additional problem is that the passage of time can potentially render both the selected portions of the document and the corresponding additional information out of date. Attempts to automatically augment document content are also problematic. For example, a publisher might choose to automatically link any occurrences in a document of phrases, such as “diet drug,” to advertising sites. Readers, encountering such links, may erroneously believe that they have been inserted manually and meant to enhance their knowledge of key aspects of the document. When they follow the links, they may become annoyed and/or feel as if they have been tricked into following a link, potentially resulting in a loss of goodwill toward both the publisher and the advertiser.

BRIEF DESCRIPTION OF THE DRAWINGS

Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.

FIG. 1 illustrates an embodiment of an environment in which documents are processed.

FIG. 2A illustrates an embodiment of a portion of a web page as rendered in a browser.

FIG. 2B illustrates an embodiment of a portion of a web page as rendered in a browser.

FIG. 3 illustrates an embodiment of a data processing engine.

FIG. 4 illustrates a mapping between a set of textual representations and a set of concepts.

FIG. 5 illustrates a process for resolving an ambiguity.

FIG. 6 illustrates an updated mapping between a set of textual representations and a set of concepts.

FIG. 7 illustrates an example of a portion of output generated by a document processing engine.

FIG. 8 illustrates an embodiment of a process for determining a mapping between a textual representation in a document and a concept.

FIG. 9 illustrates an embodiment of a process for categorizing a document.

FIG. 10 illustrates an embodiment of a portion of a webpage as rendered in a browser.

FIG. 11 is a chart illustrating the distance between textual representations.

FIG. 12 illustrates an embodiment of a process for including a hyperlink in a document.

FIG. 13 illustrates an embodiment of a portion of a webpage as rendered in a browser.

FIG. 14 illustrates an embodiment of a portion of a webpage as rendered in a browser.

FIG. 15 illustrates an embodiment of a process for delivering an article.

FIG. 16 illustrates an embodiment of a system for creating a hierarchy of concepts from a corpus of documents.

FIG. 17A is a portion of an arc list according to one embodiment.

FIG. 17B is a portion of a vertex list according to one embodiment.

FIG. 17C is a portion of an arc list according to one embodiment.

FIG. 17D is a portion of a subtree preferences list according to one embodiment.

FIG. 18 is a flow chart illustrating an embodiment of a process for creating a hierarchy of concepts from a corpus of documents.

FIG. 19 illustrates an example of a vector of weights according to one embodiment.

FIG. 20 is a flow chart illustrating an embodiment of a process for creating a hierarchy of concepts from a corpus of documents.

FIG. 21 illustrates an example of a portion of a concept hierarchy.

FIG. 22 illustrates an example of a hierarchy of information types according to some embodiments.

FIG. 23 illustrates an example of a system for categorizing a query.

FIG. 24 illustrates an example of a process for categorizing a query.

FIG. 25 illustrates an example of scores determined as part of a process for associating a query with a concept.

FIG. 26 illustrates an example of a process for cleaning concepts.

FIG. 27 illustrates an example of a concept hierarchy and scores associated with a query.

FIG. 28 illustrates an example of a system for categorizing a query.

FIG. 29 illustrates an example of a process for categorizing a query.

FIG. 30 illustrates an example of a portion of a process for categorizing a query.

FIG. 31 illustrates an example of a page that includes dynamically selected components, as rendered in a browser.

FIG. 32 illustrates an example of a system for delivering a page that includes a plurality of modules.

FIG. 33 is a flow chart illustrating an embodiment of a process for delivering a page that includes a plurality of modules.

FIG. 34 is a flow chart illustrating an embodiment of a process for delivering a page that includes a plurality of modules.

FIG. 35A illustrates an example of a page layout.

FIG. 35B illustrates an example of a page layout.

FIG. 35C illustrates an example of a page layout.

FIG. 35D illustrates an example of a page layout.

FIG. 36 illustrates an embodiment of a process for providing information to a module.

DETAILED DESCRIPTION

The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.

A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.

FIG. 1 illustrates an embodiment of an environment in which documents are processed. In the example shown, a user of client 116 (hereinafter “Alice”) uses a web browser to access a variety of sites 118-124. Site 118 hosts a blog that belongs to Alice's friend, Joe. Joe uses site 118 to make astronomy-related blog posts and to engage in discussions with readers of his blog via a commenting feature. Site 120 is a medically-themed message board on which users discuss various medical conditions and other topics with one another. Site 122 is a news aggregation service. Visitors to site 122 provide information about their interests and are provided with personalized news feeds. Site 124 belongs to the company for which Alice works, Acme Corporation. Site 124 securely makes available internal documents to users such as Alice that have appropriate credentials.

In the example shown, documents, such as document 102, are provided to document processing system 104 for processing. Examples of documents include blog posts made on site 118, forum messages exchanged on site 120, news articles made available through site 122, the various types of documents served by site 124, and any other text (in formats such as HTML, TXT, PDF, etc.) as applicable.

In various embodiments, for a given document 102, document processing engine 106 produces two types of output—a list of entities 110 and a document vector 112. As used herein, an entity is a pair of items—a textual representation (i.e., a string of text appearing in the document) and a concept associated with the textual representation. Unlike the textual representation (which is literally present in the document), the associated concept need not be literally present in the document. Instead, the concept is present in a taxonomy, such as is stored in database 108.

As one example, suppose a news article describes the saving of a baby from a fire by a dog. An excerpt from the article reads “The small, heroic sheltie saved baby Fred on Tuesday.” When the article is provided to system 104, one example of an entity 110 that is generated is (“sheltie”,“Shetland Sheepdog”). The first portion of the pair (the textual representation) is the fourth word of the excerpt. The second portion of the pair is the associated concept that is included in a taxonomy of concepts—the canonical name of the breed of dog also known as a “sheltie.” Document vector 112 is a ranked list of concepts associated with the document. An example of a document vector 112 for the dog article is: (pets:10, dogs:6, Shetland Sheepdog:4, arson:2) with each concept having an associated score. In various embodiments, the associated scores are normalized between 0 and 1.
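By way of illustration only, the entity pair and document vector described above might be represented as in the following Python sketch. The names Entity and normalize are assumptions made for this example. Dividing each score by the largest score is only one possible way of normalizing scores between 0 and 1, and is not stated in this description.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    """A (textual representation, concept) pair."""
    textual_representation: str  # string literally present in the document
    concept: str                 # canonical name drawn from the taxonomy


def normalize(vector: dict[str, float]) -> dict[str, float]:
    """Scale scores into 0..1 by dividing by the top score (assumed scheme)."""
    top = max(vector.values())
    return {concept: score / top for concept, score in vector.items()}


# Values taken from the dog article example above.
entity = Entity("sheltie", "Shetland Sheepdog")
document_vector = {"pets": 10, "dogs": 6, "Shetland Sheepdog": 4, "arson": 2}
normalized = normalize(document_vector)
```

In this sketch the ranking of concepts is preserved by normalization, so the normalized vector can still be read as a ranked list.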

The administrators of sites 118-122 (also referred to herein as “publishers”) each communicate with system 104 via portal 114. Through the portal, they configure information about their respective sites and also specify preferences for how the processing of system 104 is performed with respect to their documents. Joe (the owner of site 118) has specified that system 104 should automatically tag the blog posts that he writes with appropriate keywords and should also insert hyperlinks into the posts that lead to informative topical pages. He is too busy to include such links in his posts when he writes them and uses system 104 to improve the experience of his readers. In some embodiments, if advertisements are displayed on the topical pages to which Joe's pages link, he is afforded a share of the revenue generated from the advertisements. Tags and links are generated by Joe's blog software, in conjunction with an application program interface (API) provided by system 104, when he submits a new entry.

The administrator of site 120 has configured the site to prohibit, for security and spam minimization reasons, message board contributors from including hyperlinks in their messages. When viewers access site 120, however, posts appear to include relevant, informative links. As with Joe's blog, many of the links direct users to custom generated topical pages. In addition, for concepts that monetize well (e.g., diet drugs), a small number of links to advertising sites (or sites other than the topical pages) are included. Links can also direct users to other pages within the publisher's site or network of sites.

The administrator of site 120 provides various configuration information to system 104. System 104 makes available to the administrator a snippet of JavaScript code that is embedded in each page of site 120. When visitors to site 120 retrieve content from the site, execution of the embedded JavaScript results in the text of the page being viewed (i.e. a page of message board posts) being provided to system 104, a set of entities determined, and an appropriate set of hyperlinks being included in the page as rendered. The links can be configured to appear as any other links that might be present (i.e., using the same color scheme) but can also be made to appear different from other links. Behavior such as whether following a link should open the new page in the same window or a new window can also be specified. Unlike the approach used by site 118 (in which static links are generated once, at the time the article is created), which terms are linked and the destination pages associated with those links can change over time. Links can also open an overlay on hover, or on click, which displays content and/or advertisements that are relevant to the linked concept. For example, if a famous rock musician is selected for linking, on the click of a user, an overlay can be created that includes music videos associated with that rock musician.

When users of site 122 first visit the site, they provide a list of topics that are of interest to them. Examples include “entrepreneurism” and “dog health.” When new news articles are detected by site 122, they are processed (via an API) by system 104. System 104 provides back to site 122 a document vector 112. As will be described in more detail below, articles are selectively provided to users based on the user's interests and the concepts included in the articles' document vectors.

In the environment shown, Acme Corporation owns a document processing system 126 that provides functionality similar to that of system 104. System 126 is configured to receive as input various internal documents and to categorize and summarize those documents in accordance with the techniques described herein.

The techniques described herein can also be used to process documents without the explicit cooperation of a publisher or other document source. For example, client 130 includes a web browser application that is configured to use a plugin 132 that is in communication with system 104. When a user of client 130 visits a page on website 128, the plugin provides a copy of the page to system 104. System 104 processes the document in accordance with the techniques described herein and provides information to plugin 132 that is used when the browser application renders the page for the user. Plugin 132 can be configured to provide a variety of enhancements to the user's experience of viewing pages. As one example, the browser can include additional links in the rendered page (similar to the functionality of site 120). The browser can also provide a separate window, frame, or sidebar into which information, such as a summary of the page, key terms in the page, concepts related to the page, and even custom widgets/modules, are displayed, without altering the rendering of the page itself.

In the example shown in FIG. 1, system 104 comprises standard commercially available server hardware (e.g., having a multi-core processor, 4G+ of RAM, and Gigabit network interface adaptors) running a typical server-class operating system (e.g., Linux). In various embodiments, system 104 is implemented across a scalable infrastructure comprising multiple such servers, solid state drives, and other applicable high-performance hardware.

Whenever system 104 is described as performing a task (such as communicating with a client or accessing information in a database), either a single component or a subset of components or all components of system 104 may cooperate to perform the task. Similarly, whenever a component of system 104 is described as performing a task, a subcomponent may perform the task and/or the component may perform the task in conjunction with other components. In various embodiments, portions of system 104 are provided by one or more third parties. As one example, database 108 stores a taxonomy comprising millions of concepts. The taxonomy can be created by system 104 (using techniques described in more detail below) and can also be supplied to system 104 by a separate component, or by a third party. As another example, database 108 also includes various statistical information, such as inverse document frequency information, that can be periodically computed by system 104, supplied by a separate component, or provided by a third party.

FIG. 2A illustrates an embodiment of a portion of a web page as rendered in a browser. In the example shown, Joe is preparing to make a new blog post. He is asked to supply a title for the post in region 202 and to provide the body of the post in region 204. By selecting box 206, Joe is indicating that he would like system 104 to automatically select portions of the body of his post and generate hyperlinks for those portions. Joe may choose to select this box because he is too busy to carefully annotate his post. Another reason Joe may choose to select this box is because he is unsure of which terms he should select to link and/or which destination pages would be best to link to for a given term. By selecting box 208, Joe is indicating that he would like system 104 to automatically tag the post with a few key concepts.

FIG. 2B illustrates an embodiment of a portion of a web page as rendered in a browser. The page shown in FIG. 2B was created as a result of Joe supplying a title 252 and a body of text 256 to the interface shown in FIG. 2A. When Joe selected submit button 210, title 252 and body 256 were transmitted by Joe's blog server software to system 104. System 104 processed the received title and body (collectively, a document) and returned to site 118 a set of tags to be included in region 254 and instructions on which phrases in body 256 should have associated hyperlinks, as well as URL information for each such hyperlink.

Using techniques described in more detail below, system 104 was able to determine that Joe's post pertains to the Lunar Reconnaissance Orbiter, as well as to cameras, as indicated in region 254. System 104 also determined that a total of ten hyperlinks should be included and that those ten hyperlinks should be distributed with a higher concentration of links toward the top portion of the body and a more sparse distribution of links toward the bottom.

As mentioned above, system 104 allows Joe to configure, via portal 114, a variety of preferences for how system 104 processes his documents. As one example, Joe can specify constraints on where visitors will be directed by the inserted hyperlinks. In the example shown in FIG. 2B, Joe has made the following customization choices: (1) When businesses or organizations mentioned in his articles are selected by system 104, Joe would like visitors to be directed to the canonical websites of those businesses or organizations. Thus, if a visitor such as Alice were to click on link 258, she would be directed to www.agu.org, the main site of the American Geophysical Union. In various embodiments, this is made possible by the taxonomy stored in database 108, including information such as an associated website for each concept, or for some subset of concepts. The associated website can be scraped and can also be manually included in the taxonomy by an administrator of system 104 and/or by a representative of the business/organization, such as via portal 114. (2) When a phrase selected (that is not a business or organization) has a corresponding entry in Wikipedia, Joe would like visitors to be directed to the appropriate Wikipedia page. Thus, if a visitor were to click on link 260, he would be directed to http://en.wikipedia.org/wiki/Lunar_Reconnaissance_Orbiter, a Wikipedia entry about the orbiter. (3) Finally, for any phrases selected by system 104 which do not match either of the aforementioned situations, Joe would like visitors to be directed to an automatically generated topic page (described in more detail below). When Joe's visitors are directed to the automatically generated topic page, Joe will receive a portion of any advertising revenue generated by those visitors as they encounter advertisements on the automatically generated topic page.
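As a non-limiting illustration, the three destination rules described above could be applied in order as in the following Python sketch. The dictionary keys (is_organization, website, wikipedia_url), the function name, and the topic page address at topics.example.com are all hypothetical and are used only for this example.

```python
from urllib.parse import quote

# Hypothetical base address for automatically generated topic pages.
TOPIC_PAGE_BASE = "https://topics.example.com/"


def choose_destination(concept: dict) -> str:
    """Apply the three example rules in order and return a link target."""
    if concept.get("is_organization"):
        if concept.get("website"):
            return concept["website"]      # rule 1: canonical website
    elif concept.get("wikipedia_url"):
        return concept["wikipedia_url"]    # rule 2: Wikipedia entry
    # Rule 3: anything not handled above goes to a generated topic page.
    return TOPIC_PAGE_BASE + quote(concept["name"])


agu = {"name": "American Geophysical Union",
       "is_organization": True, "website": "www.agu.org"}
print(choose_destination(agu))
```

Other preference sets, such as linking only phrases with Wikipedia pages, would amount to removing or reordering the branches in this sketch.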

Joe has selected three different types of destination URLs because he believes that the customizations he has made will result in the most appealing experience for his visitors. Joe can also leave the URL selection up to system 104 entirely, can specify that only phrases with corresponding Wikipedia pages be linked (even if it results in fewer than ten hyperlinks being inserted), can specify that links to those topic pages generating the most revenue be preferred over other links, etc.

Joe can also customize how and which tags (254) are selected. One purpose of tagging a post is to allow visitors who are interested in one of the tagged subjects to quickly find other blog posts on the site that pertain to that subject by clicking on the appropriate tag. Rather than selecting tags 254 from among all of the potentially millions of concepts stored in database 108, Joe has specified that system 104 should select tags only from those tags already in use on his site. If he chooses, Joe can instead specify that tags be selected from a list of 50 subjects he has previously specified as being acceptable, can specify that he be prompted by site 118 to approve any tags selected by system 104 that have not previously been used on site 118, or can specify any other appropriate configuration.

FIG. 3 illustrates an embodiment of a data processing engine. Data processing engine 106 is of a modular design and employs a blackboard architecture in which various modules (if included) contribute to computation and refinement of various calculations (such as the computation of vectors 310-314) as applicable. Some of the processing performed by the modules of data processing engine 106 is parallelizable, such as natural language processing and textual representation detection. Further, the processing performed by engine 106 is customizable through the use of configuration file 318 (e.g., allowing documents from different publishers to be processed differently). Additional detail on various aspects of data processing engine 106 will now be provided.

Conversion/Preprocessing

When a document, such as document 102, is received, if applicable, preprocessor 302 converts the document (e.g. from a DOC or PDF file) or otherwise extracts (e.g. from HTML or XML) a plaintext representation of the content of the document. Preprocessor 302 is also configured to handle special characters, such as by converting occurrences of the “&” sign into whitespace or into the word “and.”
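The special character handling described above can be sketched in Python as follows. The function name preprocess and its ampersand parameter are assumptions for this example, and the conversion of DOC, PDF, HTML, or XML input to plaintext is not shown.

```python
import re


def preprocess(text: str, ampersand: str = " ") -> str:
    """Replace '&' with whitespace (default) or a word such as 'and'."""
    replaced = text.replace("&", f" {ampersand} ")
    # Collapse the runs of whitespace that the replacement may leave behind.
    return re.sub(r"\s+", " ", replaced).strip()


print(preprocess("AT&T GSM"))            # ampersand becomes whitespace
print(preprocess("Tom & Jerry", "and"))  # ampersand becomes the word "and"
```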

Boundary Processing/Position Information

Boundary processor 304 is configured to recognize certain types of boundaries within a document based on the format of the document (e.g., <head>, <body>, <h1>, and <p> HTML tags) and can also parse configuration information supplied by publishers regarding the formatting of documents on their sites. Documents provided to processor 106 by the interface shown in FIG. 2A include two sections: a title section and a body section. In some embodiments, document boundaries are ignored and the processing of boundary processor 304 is omitted. In various embodiments, boundary processor 304 is also configured to store, for each term in the document, the position of the term. As one example, the first word in the document would have a position 0, the second word in the document would have a position 1, and so on. As will be described in more detail below, terms that appear in one section of a document (such as a title) may be scored or otherwise treated differently than terms that appear in another section (such as in the comments). In addition, publishers can use sections to enforce preferences, such as that all terms appearing in a document be used to categorize the document, but that only terms appearing in the main body (and not the title or comments sections) be able to be associated with hyperlinks. Such preferences can be provided by the publisher via configuration 318.
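As one hedged illustration of the position information described above, the following Python sketch assigns each term a zero-based position and records the section in which it appears. It assumes the sections have already been separated and that terms are delimited by whitespace. The function name and the sample text are invented for this example.

```python
def index_positions(sections: dict[str, str]) -> list[tuple[int, str, str]]:
    """Return (position, section name, term) triples, positions from 0."""
    indexed = []
    position = 0
    for name, text in sections.items():  # sections in document order
        for term in text.split():
            indexed.append((position, name, term))
            position += 1
    return indexed


# Hypothetical two-section document (title and body).
document = {"title": "LRO Launch", "body": "NASA launched LRO"}
for row in index_positions(document):
    print(row)
```

Keeping the section name alongside each position is one way to let later stages score or filter terms by section.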

Natural Language Processing

Natural language processor 306 is configured to determine part-of-speech information for each term in the document. In various embodiments, natural language processor 306 uses part-of-speech tags, such as are provided by the Brown corpus, to tag each term in the document. Using the article shown in FIG. 2B as an example, “NASA's” would be tagged “NP$,” meaning that it is a possessive proper noun. As will be described in more detail below, in various embodiments, different parts of speech are assigned different scores and those scores can be used in evaluating textual representations.

Textual Representation Detection

Whitelist 320, extracted from the taxonomy stored in database 108, is a list of all of the concepts that are included in the taxonomy. Textual representation detector 308 is configured to perform a greedy match against the document using whitelist 320. Each match is included in a list of candidate textual representations 324. Using the first line of the article shown in FIG. 2B as an example “NASA,” “mission,” “orbit,” and “moon,” would each be included in the list of candidate textual representations 324. Suppose “Lunar” and “Lunar Reconnaissance Orbiter” are both phrases that are included in whitelist 320 but “Lunar Reconnaissance” is not. Because detector 308 is configured to perform a greedy match, “Lunar Reconnaissance Orbiter” will be added to the list of candidate textual representations 324 while the other two terms will not. In various embodiments, detector 308 is configured to perform other types of matches, instead of or in addition to greedy matches. In some embodiments, all matches (e.g. both “Lunar” and “Lunar Reconnaissance Orbiter”) are added to list 324.
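The greedy match described above can be sketched in Python as a left-to-right scan that takes the longest whitelisted phrase at each position. This is a minimal sketch under stated assumptions: whitespace tokenization, case-sensitive comparison, and a maximum phrase length of five tokens (max_len) are choices made for this example only.

```python
def greedy_match(tokens: list[str], whitelist: set[str],
                 max_len: int = 5) -> list[str]:
    """Scan left to right, keeping the longest whitelisted phrase found."""
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in whitelist:
                matches.append(phrase)
                i += n  # consume every token of the matched phrase
                break
        else:
            i += 1  # no whitelisted phrase starts at this token
    return matches


whitelist = {"NASA", "Lunar", "Lunar Reconnaissance Orbiter"}
tokens = "NASA launched the Lunar Reconnaissance Orbiter".split()
print(greedy_match(tokens, whitelist))
```

Because the longer phrase is tried first, “Lunar” alone is never added when “Lunar Reconnaissance Orbiter” matches at the same position.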

Leading Prepositions

Suppose “The American” and “American Pie” are both concepts included in whitelist 320, but that “The American Pie” is not. Also suppose that document 102 includes the string “The American Pie movie is showing at the Downtown Theatre tomorrow.” When performing its greedy match, detector 308 might add to list 324 two entries, “The American” and “Pie,” erroneously omitting “American Pie.” To address this problem, in some embodiments, detector 308 employs a prepositional rule in which, when a match that includes at its start a preposition is detected, the preposition is temporarily ignored and the greedy match continues using the next word in the document. If a match is found, the preposition is discarded and the phrase that does not include it is used. In this example, because “The American” includes a leading preposition, “The” would be temporarily ignored, and a match of “American Pie” would be detected. From the three words, “The American Pie,” only one entry would be added to list 324—“American Pie.”
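The rule described above can be illustrated with the following Python sketch. When a greedy match begins with a word from a small set of leading words, the sketch retries the match from the next token and prefers that result if one is found. The contents of LEADING_WORDS, the helper names, and the maximum phrase length are assumptions made for this example.

```python
# Assumed set of leading words that are temporarily ignored.
LEADING_WORDS = {"the", "a", "an", "of", "in"}


def longest_match(tokens, start, whitelist, max_len=5):
    """Return (phrase, token count) for the longest match at start."""
    for n in range(min(max_len, len(tokens) - start), 0, -1):
        phrase = " ".join(tokens[start:start + n])
        if phrase in whitelist:
            return phrase, n
    return None, 0


def match_with_leading_rule(tokens, whitelist):
    matches, i = [], 0
    while i < len(tokens):
        phrase, n = longest_match(tokens, i, whitelist)
        if phrase and tokens[i].lower() in LEADING_WORDS:
            # Temporarily ignore the leading word and retry from the next one.
            alt, alt_n = longest_match(tokens, i + 1, whitelist)
            if alt:
                phrase, n = alt, alt_n + 1  # also skip the discarded word
        if phrase:
            matches.append(phrase)
            i += n
        else:
            i += 1
    return matches


whitelist = {"The American", "American Pie", "Pie"}
tokens = "The American Pie movie is showing".split()
print(match_with_leading_rule(tokens, whitelist))
```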

Without further refinement, the list of candidate textual representations 324 might include virtually every word in document 102. Accordingly, in various embodiments, textual representation detector 308 employs additional logic to refine the list of candidate textual representations. As will be described in more detail below, the candidate list can be refined/pruned both before and after feature vectors for items on the candidate list are populated.

Static and Runtime Blacklists

In various embodiments, textual representation detector 308 is configured to exclude from inclusion in list 324 those textual representations that match a blacklist 316. Stop words (such as “a,” “about,” “again,” and “would”) are one example of terms that can be included in a static blacklist. A publisher can also provide custom blacklists (referred to herein as “runtime” blacklists) that should be considered by engine 106 when processing that particular publisher's documents. As one example, a publisher may blacklist the names of competitors. As another example, the publisher may have an agreement with a third-party advertising company that certain words be directed to that advertising company. By employing a blacklist, the publisher can prevent the already-contracted-for words from being considered by engine 106. Publishers can also specify constraints such as requiring that all textual representations belong to one or more verticals (also referred to herein as “top level categories”) specified by the publisher, which will be described in more detail below.
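A minimal Python sketch of combining a static blacklist with a publisher-supplied runtime blacklist follows. The function name, the case-insensitive comparison, and the competitor name “Acme Rival” are assumptions for this example. The stop words are those listed above.

```python
# Static blacklist applied to every document (stop words from the text).
STATIC_BLACKLIST = {"a", "about", "again", "would"}


def apply_blacklists(candidates, runtime_blacklist=frozenset()):
    """Drop candidates found in the static or the runtime blacklist."""
    blocked = STATIC_BLACKLIST | {term.lower() for term in runtime_blacklist}
    return [c for c in candidates if c.lower() not in blocked]


candidates = ["NASA", "about", "Acme Rival", "moon"]
# Hypothetical publisher blacklist naming a competitor.
print(apply_blacklists(candidates, {"Acme Rival"}))
```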

Concept-Based Blacklists

Concepts included in a taxonomy can be used to bias/prune candidate textual representations, as will be described in more detail below. As with the examples described in the previous section, concept-based blacklists can be static (e.g., applied to all documents) or runtime (e.g., used according to a configuration supplied by a publisher or other runtime clue). For example, an administrator of engine 106 can configure as blacklisted concepts “chronology” and “days of the week.” Child topics such as “Monday” and “1997” would be blacklisted as a result. As another example, message board publisher 120 can indicate a preference for health-themed textual representations by specifying the vertical, “Health,” as a whitelisted concept in configuration 318. Publisher 122 can indicate a preference against adult-themed textual representations by specifying the vertical “Adult Entertainment” as a blacklisted concept. Instead of supplying whitelists/blacklists, in some embodiments publishers assign weights to various categories, so that higher weighted categories are given preference over lower weighted categories by engine 106. As one example, publisher 122 could provide the following: “Health(1); Sports(0.5)” indicating a preference for health-related concepts but also indicating that sports concepts should be considered. In yet another embodiment, plugin 132 can be configured to provide a concept “signature” for Alice—a customized list of Alice's topical preferences, such as: “Science(1); Animals(0.5); Entertainment(0.4); Travel(0.4); Sports(−1).”
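To illustrate how blacklisting a concept can also blacklist its child topics, the following Python sketch walks from a concept up through its ancestors in a small, invented fragment of a taxonomy. The PARENT mapping, the function names, and the rule of using the weight of the nearest weighted ancestor are assumptions for this example and are not taken from the taxonomy in database 108.

```python
# Hypothetical child -> parent links for a tiny fragment of a taxonomy.
PARENT = {
    "Monday": "days of the week",
    "1997": "chronology",
    "influenza": "Health",
    "marathon": "Sports",
}


def ancestors(concept, parent=PARENT):
    """Yield the concept itself, then each ancestor up to the root."""
    while concept is not None:
        yield concept
        concept = parent.get(concept)


def is_blacklisted(concept, blacklisted, parent=PARENT):
    """A concept is blacklisted if it or any ancestor is blacklisted."""
    return any(a in blacklisted for a in ancestors(concept, parent))


def category_weight(concept, weights, parent=PARENT, default=0.0):
    """Return the weight of the nearest weighted ancestor, if any."""
    for a in ancestors(concept, parent):
        if a in weights:
            return weights[a]
    return default


blacklist = {"chronology", "days of the week"}
weights = {"Health": 1.0, "Sports": 0.5}  # "Health(1); Sports(0.5)"
print(is_blacklisted("Monday", blacklist), category_weight("marathon", weights))
```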

In various embodiments, concept whitelist/blacklist information is passed in at runtime via the provider of document 102 instead of or in addition to being supplied via configuration 318. Whitelist information can also be collected on behalf of a publisher, without requiring the publisher to manually specify category preferences. One way of accomplishing this is as follows. When a publisher initially decides to use the services provided by system 106, system 106 performs the document categorization techniques described herein across the corpus of documents included in the publisher's site and collects together the dominant concepts into a concept whitelist.

Regular Expression Patterns

In various embodiments, textual representation detector 308 is configured to exclude from inclusion in list 324 those textual representations that match a regular expression. As one example, as a result of converter/preprocessor 302 manipulating document 102, a term such as “AT&T GSM” may be converted to “AT T GSM.” Suppose “TGSM” is a concept included in whitelist 320. During the greedy match portion, “TGSM” may be erroneously added to candidate list 324. A regular expression pattern that discards matches that begin with a lone “T” or a lone “S” can be used to prevent the erroneous match from being included.
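A minimal sketch of such a filter follows. It assumes, for illustration, that each candidate is kept as the space-separated token span it matched in the preprocessed text (e.g., “T GSM”); the pattern and the function name are hypothetical and not taken from the described system.

```python
import re

# Hypothetical pattern: the candidate starts with a single-letter token
# "T" or "S" followed by whitespace (a "lone" T or S).
LONE_LETTER_PREFIX = re.compile(r"^[TS]\s")

def drop_lone_letter_matches(candidates):
    """Return the candidates that do not begin with a lone 'T' or 'S'."""
    return [c for c in candidates if not LONE_LETTER_PREFIX.match(c)]
```

With this pattern a span such as “T GSM” is discarded, while “GSM” or a word that merely begins with T or S is kept.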

Proper Noun Sequences

In various embodiments, textual representation detector 308 is configured to evaluate proper nouns included in list 324 and remove from the list those proper nouns that have an adjacent proper noun that was not selected. One purpose of this rule is to prevent one person that has a famous last name (but is not that famous person) from being erroneously recognized as the famous person. Suppose an article discusses a chemist named John Mozart and that “Mozart” is added to list 324 as a result of the greedy match. Since “John Mozart” is not included in whitelist 320, it is not included in list 324. Detector 308 is configured to recognize that Mozart was added, has an adjacent proper noun (“John”) and to remove “Mozart” from list 324.
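The rule can be sketched as follows, assuming the document has already been tokenized and tagged so that each token is flagged as a proper noun or not, and that selected candidates are recorded as half-open token spans. These data structures and the function name are assumptions made for the example.

```python
def prune_partial_proper_nouns(is_proper_noun, selected_spans):
    """Drop selected spans that touch a proper noun which was not selected.

    is_proper_noun: list of bools, one per document token.
    selected_spans: list of (start, end) half-open token index pairs.
    """
    covered = set()
    for start, end in selected_spans:
        covered.update(range(start, end))

    kept = []
    for start, end in selected_spans:
        neighbors = (start - 1, end)  # token just before and just after
        has_unselected_neighbor = any(
            0 <= i < len(is_proper_noun) and is_proper_noun[i] and i not in covered
            for i in neighbors
        )
        if not has_unselected_neighbor:
            kept.append((start, end))
    return kept
```

For the tokens “chemist John Mozart wrote,” a span covering only “Mozart” is removed because the adjacent proper noun “John” was not selected.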

Initial Feature Vector Population

Vector populator 326 is configured to populate a feature vector 310 for each candidate textual representation included in list 324. A feature vector comprises a set of various signals associated with the textual representation. The signals can be used in various ways, as will be described in more detail below. Some of the signal information is obtained from analyzing document 102 and other information is obtained from data included in database 108.

One signal, denoted herein as “TitleTF,” indicates the number of times that the term appears in the title section of the document. Using the textual representation, “Lunar Reconnaissance Orbiter,” as shown at 260 in FIG. 2 as an example, that term is not present in the title section of the document and thus has a TitleTF=0. “BodyTF” is a signal that indicates the number of times that the term appears in the body section of the document. The term, “Lunar Reconnaissance Orbiter” has a BodyTF=1 because it is present once in the body section of the document. Another textual representation, “LRO,” also has a TitleTF=0, but has a BodyTF=3. Other term frequency counts can also be used instead of or in addition to TitleTF and BodyTF, as applicable. For example, the term's frequencies with respect to meta tags, bold/strong tags, H3-H6 tags, H1-H2 tags, and anchor classes can all be included in its feature vector. As another example, a CommentTF signal can be used to indicate the number of times a term appears in the comment section of a blog. Arbitrary section frequency counts can also be used, such as Section0TF, Section1TF, Section2TF, etc., indicating the number of times the term appears, respectively, in the 0th, 1st, and 2nd sections of the document. One way that section frequency signal information can be used is to allow words occurring in the comments to be considered when categorizing a document, but also to prevent those words from being selected for automated hyperlinking.
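The per-section counts could be computed along the following lines. This is a sketch under the assumption that the document has already been split into named sections of plain text; the function name and the case-sensitive, whole-word matching are illustrative choices, not details from the description.

```python
import re

def section_term_frequencies(term, sections):
    """Count occurrences of `term` in each named section.

    sections: dict mapping a section name (e.g. 'Title', 'Body') to its text.
    Returns a dict such as {'TitleTF': 0, 'BodyTF': 1}.
    """
    # Whole-word, case-sensitive match (an assumption of this sketch).
    pattern = re.compile(r"\b" + re.escape(term) + r"\b")
    return {name + "TF": len(pattern.findall(text)) for name, text in sections.items()}
```

Additional signals such as CommentTF or Section0TF fall out of the same function by passing sections with those names.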

As mentioned above, a score (the “NLP score”) can be assigned to a textual representation based on its part of speech. As one example, proper nouns are assigned a score of “1,” common nouns are assigned a score of “0.75,” and verbs are assigned a score of “0.” For multi-word textual representations, the NLP score can be computed as the average of each constituent word's score, the sum of each constituent word's score, or in accordance with any other appropriate calculation. The “Case” signal scores the number of capitalized words in the textual representation. In the example of “Lunar Reconnaissance Orbiter,” the Case score is 3 because each component of the term is capitalized. In the example of “Apollo landing sites,” as shown at 262 in FIG. 2, the Case score is 1.

Both the NLP score and the Case score can be used to resolve whether particular textual representations included in a document are proper nouns or common nouns and also to resolve ambiguities, as described in more detail below. As one example, the occurrence of the words “Simply hired” in a document could refer to the author's explanation of how easy it was to be hired at a job and could also refer to the jobs website, www.simplyhired.com. The Case score of “Simply hired” is 1. The Case score of the canonical name of the jobs website, Simply Hired, is 2. As another example, “it's it” in a document could refer to something the author thinks is “it,” but could also refer to It's-It brand ice cream sandwiches. The Case score of “it's it” as written is 0. The Case score of the canonical form of the ice cream sandwich product is 2.
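The two scores can be sketched as below. The part-of-speech tag names (PROPN, NOUN, VERB), the default of 0 for other tags, and the choice to split words on whitespace and hyphens are assumptions of this example; the averaging variant of the NLP score is shown, although a sum is equally permitted by the description.

```python
import re

# Example scores from the description; the tag names are hypothetical.
POS_SCORES = {"PROPN": 1.0, "NOUN": 0.75, "VERB": 0.0}

def nlp_score(pos_tags):
    """Average the per-word part-of-speech scores (unknown tags count as 0)."""
    return sum(POS_SCORES.get(tag, 0.0) for tag in pos_tags) / len(pos_tags)

def case_score(text):
    """Count the words that begin with an uppercase letter."""
    words = [w for w in re.split(r"[\s\-]+", text) if w]
    return sum(1 for w in words if w[0].isupper())
```

Comparing case_score of the text as written against case_score of a concept's canonical name gives the kind of mismatch signal discussed for “Simply hired” and “it's it.”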

The Position signal indicates the relative location of the textual representation in the document. “Lunar Reconnaissance Orbiter” occurs once in document 102, at position 154. In various embodiments, if the textual representation occurs multiple times, the position of each occurrence is included in a list (e.g., Position=100,202,554). In various embodiments, the position of the term can be used to bias various processing. For example, links to terms occurring earlier in the document can be preferred over ones occurring later.

The NumWords signal indicates the number of words included in the textual representation. “Lunar Reconnaissance Orbiter” includes three words, and thus has NumWords=3.

The signals described herein are examples of signals and particular signals can be omitted and/or accompanied by additional signals based on factors such as availability of data/information, preferences, and other implementation choices.

Transforming Candidate Textual Representations into the Taxonomy Space

Mapper 328 is configured to map candidate textual representations to nodes in the taxonomy stored in database 108. As explained above, the whitelist 320 used to identify textual representations is extracted from a taxonomy stored in database 108. Each node in the taxonomy has an associated ID. As one example, the concept “Lunar Reconnaissance Orbiter” has a ConceptID of 2381014. The concept “academic conference” has a ConceptID of 118760. In some cases (such as with “Lunar Reconnaissance Orbiter”), the textual representation unambiguously corresponds to a single node in the taxonomy (i.e., node 2381014). In other cases, the textual representation\'s meaning may be ambiguous. For example, a textual representation of “jaguar” occurring in a document could correspond to the concept “Jaguar Cars Ltd.,” to the concept “Panthera onca,” to the concept “Mac OS X v10.2,” or one of several other concepts. A textual representation of “apple” occurring in a document could correspond to the concept “Malus domestica,” to the concept “Apple Inc.,” or one of several other concepts.

In various embodiments, mapper 328 determines the set of all concepts to which a particular textual representation maps. Each mapping is associated with a mapping vector. Mapping vectors (332) are either of type “unambiguous” or type “ambiguous.” A mapping vector is of type “unambiguous” only if a given textual representation maps to a single concept. Otherwise, it is of type ambiguous. A mapping vector also stores additional information, such as a pointer to the textual representation in the document, the conceptID of the mapped concept, the feature vector of the textual representation, and a strength value that indicates a confidence in the mapping. As will be described in more detail below, in some embodiments the mappings 332 initially created by mapper 328 are pruned through a series of refining actions.

FIG. 4 illustrates a mapping between a set of textual representations (402) and a set of concepts (404). In the example shown, the first textual representation (tr1) unambiguously maps to concept c1. The unambiguous mapping between the two is denoted as “u” along line 406. As one example, tr1 is “Lunar Reconnaissance Orbiter” and c1 is the concept “Lunar Reconnaissance Orbiter.” Suppose tr2 is the acronym, “LRO.” LRO could be short for “Lunar Reconnaissance Orbiter” but could also be short for “large receive offload,” (c3) which is a technique in computer networking for increasing inbound throughput. As shown in FIG. 4, tr2 can thus be mapped both to c1 and c3. The ambiguous nature of the mappings is denoted as “a” along lines 408 and 410.

Mapper 328 sorts a document's textual representations into a set of unambiguous textual representations (e.g., tr1) and a set of ambiguous textual representations (e.g., tr2, tr3, and trn). For each ambiguous textual representation, the mapper determines whether a concept to which it is mapped is also a concept to which a textual representation that is unambiguous is mapped. If so, the ambiguous textual representation is reclassified as an unambiguous textual representation and is mapped solely to the concept to which the unambiguous textual representation is mapped.

FIG. 5 illustrates a process for resolving an ambiguity. In various embodiments, the process shown in FIG. 5 is performed by mapper 328. The process begins at 502 when a set of textual representations is received. At 504, the textual representations are divided into sets based on whether they unambiguously or ambiguously map to a concept. Finally, at 506, an attempt to resolve ambiguities is made. One technique for attempting to resolve an ambiguity is presented in the preceding paragraph.

FIG. 6 illustrates the mapping depicted in FIG. 4 after the processing described in conjunction with FIG. 5 has been performed. As explained above, textual representation tr2 was initially mapped to two concepts: c1 and c3. Since unambiguous textual representation tr1 mapped to concept c1, mapper 328 removed the mapping vector corresponding to 410 and changed the type of the mapping vector corresponding to 408 from ambiguous to unambiguous. Two textual representations that map to the same concept (such as tr1 and tr2 as shown in FIG. 6) are examples of synonyms. Both textual representations (“Lunar Reconnaissance Orbiter” and “LRO”) refer to the concept “Lunar Reconnaissance Orbiter.” In the example shown in FIG. 6, if another textual representation, “tr4,” also mapped to c1, it would also be considered a synonym of tr1 and tr2.
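The split-and-resolve steps of FIG. 5 can be sketched as follows. The dictionary-of-sets representation and the function name are assumptions made for illustration, and the handling of an ambiguous term that shares more than one concept with the unambiguous set (left ambiguous here) is a choice of this sketch that the description does not address.

```python
def resolve_ambiguities(mappings):
    """Split mappings into unambiguous and ambiguous sets, then resolve.

    mappings: dict textual_representation -> set of candidate concept IDs.
    Returns (unambiguous, ambiguous), where unambiguous maps each textual
    representation to a single concept ID.
    """
    # Step 504: textual representations with exactly one candidate concept.
    unambiguous = {tr: next(iter(cs)) for tr, cs in mappings.items() if len(cs) == 1}
    anchored = set(unambiguous.values())

    # Step 506: an ambiguous term sharing a concept with an unambiguous term
    # is reclassified and mapped solely to that concept.
    ambiguous = {}
    for tr, cs in mappings.items():
        if len(cs) == 1:
            continue
        shared = cs & anchored
        if len(shared) == 1:
            unambiguous[tr] = next(iter(shared))
        else:
            ambiguous[tr] = set(cs)
    return unambiguous, ambiguous
```

Applied to the FIG. 4 example, “LRO” (candidates c1 and c3) resolves to c1 because “Lunar Reconnaissance Orbiter” maps unambiguously to c1.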

The process of FIG. 5 can also be used to resolve ambiguities for textual representations which are not synonyms of one another but share related concepts. As one example, suppose “Steve Jobs” is a textual representation included in a document and unambiguously resolves to the concept of businessman, Steven Paul Jobs. The textual representation “apple” is also present in the document, in the sentence, “I would like to buy an apple.” The term “apple” is not synonymous with “Steve Jobs”; however, its potential meaning as a fruit can be disambiguated by the presence of Steve Jobs in the document. One approach for accomplishing this is for mapper 328, when performing portion 506 of the process shown in FIG. 5, to examine the nearest neighbors of concepts in the taxonomy. Another approach is to use the concept blacklist/whitelist signals described in more detail below. Yet another approach is to use a document similarity score described in more detail below.

In some embodiments, for any remaining textual representations in the ambiguous textual representation set (e.g., four meanings of “jaguar”), a mapping between the textual representation and the concept corresponding to each possible meaning is added to the unambiguous set (e.g., four different unambiguous mappings), and the textual representation (“jaguar”) is removed from the ambiguous set. Engine 106 is configured to remember that the meaning of the textual representation was not resolved (i.e., that jaguar could mean one of four things). As will be described in more detail below, pruning of three of the four different unambiguous mapping vectors is performed after a document vector is computed and a document similarity score generated.

Creating a Concept Feature Vector

In addition to the processing described above, vector populator 326 is also configured to populate a set of concept feature vectors 312. One way of accomplishing this is as follows. For each concept remaining after the processing of FIG. 5, vector populator 326 merges the feature vector scores of any textual representations mapped to the respective concept (e.g., by adding the values together) and includes additional information (described in more detail below).

Using the example of the representations “Lunar Reconnaissance Orbiter” and “LRO,” a concept feature vector for “Lunar Reconnaissance Orbiter” is formed by summing the respective feature vectors of the two textual representations and adding additional information. The concept “Lunar Reconnaissance Orbiter” would accordingly have a TitleTF=0+0=0, a BodyTF=1+3=4, and so on.
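The summation can be sketched in a few lines, assuming each feature vector is held as a dictionary of numeric signals (a representation chosen for this example only):

```python
def merge_feature_vectors(vectors):
    """Sum the numeric signals of several textual-representation vectors."""
    merged = {}
    for vector in vectors:
        for signal, value in vector.items():
            merged[signal] = merged.get(signal, 0) + value
    return merged
```

Signals for which addition is not meaningful (for example, a list of positions) would need a different merge rule, which this sketch does not cover.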

Inverse Document Frequency Signal

One additional piece of information that is included in the concept feature vector is the inverse document frequency (“IDF”) of a canonical textual representation associated with the concept. As one example, “JFK,” “John F. Kennedy,” and “Jack Kennedy” all refer to the 35th president of the United States. The canonical textual representation is “John F. Kennedy” and the IDF included in a concept feature vector for the president would be determined using “John F. Kennedy.” The canonical textual representation is stored in the taxonomy in database 108 and is in some embodiments the title of the concept as it appears in a third party corpus such as Wikipedia. In some embodiments the IDF is computed for all textual representations occurring in the document instead of or in addition to the canonical textual representation.

The IDF is a statistical measure that can be used to evaluate how important a word is to a particular document included in a corpus of documents (e.g., the world wide web, documents on an enterprise server, etc). For a given term “i,” one way to compute the IDF of i is as follows:

\[ \mathrm{IDF}_i = \log \frac{|D|}{|\{d : t_i \in d\}|} \]

with |D| being the number of documents in the corpus, and |{d : t_i ∈ d}| being the number of documents in which the term t_i appears.
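A direct transcription of this formula is sketched below. It assumes each document is represented as a set of terms and uses the natural logarithm; the description does not fix the base of the logarithm, and the error raised for an unseen term is a choice of this sketch.

```python
import math

def idf(term, corpus):
    """Inverse document frequency of `term`.

    corpus: list of documents, each represented as a set of terms.
    """
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        raise ValueError("term does not appear in the corpus")
    return math.log(len(corpus) / containing)
```

A term that appears in every document gets an IDF of 0, which matches the pruning of very common terms such as “shopping” described later.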

Number of Homonyms Signal

Another piece of information that can be included in the concept feature vector is the Homonyms signal. This signal indicates the number of homonyms for the concept and can be used to weight against (or toward) the selection of concepts that can easily be confused with other concepts. The number of homonyms associated with a concept is, in some embodiments, included in the taxonomy stored in database 108.

Concept Whitelist/Blacklist Signals

Yet another piece of information that can be included in the concept feature vector is whether or not the concept is present in a concept whitelist (or concept blacklist, as applicable). For example, in configuration 318, publishers can specify concept whitelists (concepts they prefer to bias toward) and concept blacklists (concepts they have a bias against). If the concept is present in the concept whitelist, in some embodiments a Whitelist=1 signal is included in the concept feature vector (and has a “0” value otherwise). If the concept is present in the concept blacklist, in some embodiments a Blacklist=1 signal is included in the concept feature vector (and has a “0” value otherwise). The whitelist/blacklist signals can be used as weights and can also be used to prune concepts.

Linkworthiness, Popularity, and Freshness Signals

“Linkworthiness” is another signal that can be precomputed for a concept in the taxonomy and included in a concept feature vector. One example of a linkworthiness signal is a measure of how frequently the concept is included in a hyperlink in a corpus. As one example, suppose “bottled water” occurs 4,543 times within the corpus of documents that comprise the Wikipedia site. However, the term is linked a single time. Bottled water would accordingly have a linkworthiness score of 1/4,543=0.00022. As another example, suppose “carpe diem” occurs 200 times and is linked to 88 times. Carpe diem would accordingly have a linkworthiness score of 88/200=0.44. A corpus including multiple sites and/or the entire World Wide Web can also be parsed in determining linkworthiness instead of or in addition to Wikipedia. In some embodiments, the documents used to perform the linkworthiness determination are selected based on a pagerank or other measure of their quality. For example, links included in highly rated newspaper sites might be parsed, while links included in domain parked sites would not. The measure of quality can also be factored into the linkworthiness score itself.

For ambiguous concepts, such as “jaguar,” in addition to determining the number of times a concept is linked, the meaning to which it is linked is also examined. For example, suppose that within Wikipedia, “jaguar” appears 500 times. Of those 500 instances, 300 have associated hyperlinks. Of the 300 hyperlinks, 60% direct the viewer to a page about Panthera onca, 30% direct the viewer to a page about the car company, and the remaining 10% of links direct viewers to other (even less common) meanings of the word. In this example, a popularity score can be associated with each of the meanings and used as a signal (described in more detail below), such as the cat meaning having a popularity score of 0.6, the car meaning having a popularity score of 0.3, and so on. In the case where the Wikipedia corpus is used, whether or not a particular ambiguous concept is designated as the default can also be used as a measure of popularity.
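The two ratios described above can be sketched as follows. The function names and the dictionary of per-meaning link counts are assumptions made for the example.

```python
def linkworthiness(linked_count, occurrence_count):
    """Fraction of a term's occurrences in a corpus that are hyperlinked."""
    if occurrence_count == 0:
        return 0.0
    return linked_count / occurrence_count

def popularity(links_per_meaning):
    """Share of hyperlinks that point to each meaning of an ambiguous term.

    links_per_meaning: dict meaning -> number of links to that meaning.
    """
    total = sum(links_per_meaning.values())
    if total == 0:
        return {meaning: 0.0 for meaning in links_per_meaning}
    return {meaning: n / total for meaning, n in links_per_meaning.items()}
```

For the “jaguar” example, 300 links split 180/90/30 yield popularity scores of 0.6, 0.3, and 0.1, and the term itself has a linkworthiness of 300/500 = 0.6.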

The “freshness” of a topic can also be used as a signal. Such information can be gleaned by scraping Twitter feeds, news aggregation sites, and other indicators of current topics, stored in the taxonomy, and included by vector populator 326 in the concept feature vector. One example of a change in a concept's freshness is the concept “cupola.” Prior to the STS-130 shuttle mission, the term rarely appeared in news articles and Twitter messages. The inclusion in the payload of a cupola for the International Space Station, however, resulted in considerably more use of the term and thus its freshness score rose.

In various embodiments, the linkworthiness, popularity, and freshness signals are combined together into a single signal. The values may be binary (e.g., fresh=0 or fresh=1) or any other appropriate value, typically normalized between 0 and 1.

Additional Signals

A capitalization signal can be used to indicate how often a concept is capitalized in documents appearing in a corpus such as the World Wide Web. As one example, the n-gram data made available by Google can be used to estimate the percentage of times a concept is capitalized.

In some embodiments, rules are used to weight various signals on a category basis. For example, if a topic such as “Hired” belongs to the category “Film,” a category-based rule can be used to give higher weight to the Case signal accordingly.

Pruning Concepts

In various embodiments, once vector populator 326 has completed populating concept feature vectors 312, some of the concepts are pruned. For example, concepts having a non-zero TitleTF score and a BodyTF=0, having NLP scores of 0, or having very low IDF scores (e.g., a term such as “shopping”) are dropped. As another example, concepts that are orphans (e.g., nodes in the taxonomy without at least one parent or child) are also dropped.

As explained above, the “Case” score of a textual representation can be used when determining whether the textual representation maps to a particular concept. Suppose “Has Been” is the name of a musical album (a concept) and “has been” appears in a document 102. The Case score of the concept is 2, because the musical album's title is capitalized. The Case score of the textual representation is 0. In some embodiments, the musical album is pruned due to the mismatch in case scores.

As another example, if concept whitelist/blacklist information has been provided to engine 106, the information can be used to resolve ambiguous meanings. For example, suppose medically themed site 120 has specified either the vertical “Health” or a series of lower level concepts such as “nutrition” and “organic foods” in whitelist 320. Also suppose that document 102 includes an ambiguous occurrence of the textual representation “apple” which is mapped by mapper 328 in accordance with the techniques described above to two concepts—a fruit and a computer company. The ambiguity can be resolved (and one of the two concepts pruned) by detecting that the Whitelist signal for the fruit concept has a value of 1 and the Whitelist signal for the computer concept has a value of 0.

In some embodiments, filtering is performed by various components of document processing engine at various stages of processing. For example, in some embodiments orphan concepts are omitted from whitelist 320. As another example, in some embodiments filtering based on scores such as NLP scores and IDF scores occurs prior to the processing described in conjunction with portion 506 of the process shown in FIG. 5.

Category Vectors

Each concept “c” in the taxonomy stored in database 108 has an associated category vector 322. In various embodiments, the category vector is precomputed (i.e., prior to the processing of document 102) and is also stored in database 108. For a particular concept c in the taxonomy, the category vector is a set of categories/concepts that are related to that concept c, along with a weight for each of the included categories/concepts. A variety of techniques can be used to compute the category vector.

One way to populate the category vector is to use the up-lineage of the concept (e.g., parents, grandparents, etc.), and assign a decreasing score based on distance (e.g., parents have a score of 0.9, grandparents have a score of 0.8, etc.). A second way to populate the category vector is to use the down-lineage of the concept (e.g., children). A third way to populate the category vector is to use a predetermined list of concepts designated as being “related” to the concept (e.g., including siblings), or to use the concept lighting techniques described in more detail below.
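The up-lineage approach can be sketched as a breadth-first walk over parent links. The starting score of 0.9 and the step of 0.1 follow the example values above; the parent-lookup dictionary, the function name, and the rule that an ancestor reachable by several paths keeps the score of its shortest path are assumptions of this sketch.

```python
def uplineage_category_vector(concept, parents, start=0.9, step=0.1):
    """Score ancestors by distance: parents 0.9, grandparents 0.8, and so on.

    parents: dict concept -> list of parent concepts (hypothetical structure).
    """
    vector = {}
    frontier, score = parents.get(concept, []), start
    while frontier and score > 0:
        next_frontier = []
        for ancestor in frontier:
            if ancestor not in vector:  # keep the score of the shortest path
                vector[ancestor] = score
                next_frontier.extend(parents.get(ancestor, []))
        # Round to avoid floating-point drift in the decreasing score.
        frontier, score = next_frontier, round(score - step, 10)
    return vector
```

The down-lineage variant is the same walk over child links instead of parent links.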

A fourth way to populate the category vector is to use membership in a subset “K” of a taxonomy “T,” where |K|<<|T|. For example, K can include only verticals and entity classes. Further, elements within K should not have parent-child relationships, meaning that all members of a given k in K should not automatically be members of another k.

Document Vector

Vector populator 326 is configured to populate a document vector 314 for each document 102. In some embodiments this is accomplished by computing the average of all category vectors implicated by the concepts associated with document 102 remaining after the pruning described above. Document vector 314 can thus be denoted as follows:

\[ dv = \frac{1}{n} \sum_{i=1}^{n} cv_i . \]

In some embodiments, the document vector is normalized so that the sum of the components of cvi is 1. Other techniques can also be used to compute a document vector, as applicable. For example, a weight value on an exponent can be included in the computation such that top level concepts (like “health” and “sports”) are favored or disfavored, as indicated by a publisher, over bottom level concepts (like “Sungold Tomato”). As another example, the computation of the document vector can take into account rules such as that concepts that have ambiguous parents be excluded from the document vector, that concepts associated with terms appearing in the title be weighted significantly more than other concepts, etc. Document vector 314 is one example of output that can be provided by engine 106 to various applications described in more detail below.
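The averaging variant can be sketched as below, with category vectors held as dictionaries from category name to weight (an assumed representation). The optional normalization step here rescales the resulting document vector so that its components sum to 1, which is one reading of the normalization mentioned above.

```python
def document_vector(category_vectors, normalize=False):
    """Average a list of category vectors component by component.

    category_vectors: list of dicts mapping category -> weight.
    """
    n = len(category_vectors)
    dv = {}
    for cv in category_vectors:
        for category, weight in cv.items():
            dv[category] = dv.get(category, 0.0) + weight / n
    if normalize:
        total = sum(dv.values())
        if total:
            dv = {category: w / total for category, w in dv.items()}
    return dv
```

A category missing from a given vector contributes 0 to the average, so categories shared by many of the document's concepts end up with the largest weights.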

Document Similarity and Further Disambiguation

In some embodiments, vector populator 326 is configured to use document vector 314 to compute a set of document similarity scores. For a given concept i, the document similarity score is computed as the dot product of the document vector and the concept's category vector: ds_i = dv · cv_i. It provides an indication of how similar the concept vector is to the document vector. Once computed, the document similarity score is included in the concept's feature vector 312. In various embodiments, other similarity scores, such as a site similarity score, can also be computed (e.g., by computing the similarity of a concept over all the documents from a given site) and included in feature vector 312.

The document similarity score can be used to resolve remaining ambiguities. For example, suppose document 102 includes the statement, “Jaguar prices are climbing.” Absent additional information, the textual representation “Jaguar,” could plausibly refer to either an animal or an automobile. By examining the document similarity scores of both the Panthera onca and the Jaguar Cars Ltd. concepts, disambiguation can be performed. For example, if the document is an article about the cost of zoo exhibits, concepts such as “zoo” and “wildlife” and “park” will likely be included in the document vector, while concepts such as “luxury cars” and “high performance engine” will likely not (or will have considerably lower scores). Accordingly, the document similarity score of “Panthera onca” will be considerably higher than the score for “Jaguar Cars Ltd.” and the ambiguity can be finally resolved by pruning the second concept.
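Treating the similarity score as a dot product over sparse vectors, the “Jaguar” disambiguation can be sketched as follows. The dictionary representation and all numeric weights in the usage note are invented for illustration.

```python
def document_similarity(dv, cv):
    """Dot product of a document vector and a concept's category vector.

    Both vectors are dicts mapping category -> weight; missing entries are 0.
    """
    return sum(weight * cv.get(category, 0.0) for category, weight in dv.items())

def prune_by_similarity(dv, candidate_category_vectors):
    """Return the candidate concept whose category vector best matches dv."""
    return max(
        candidate_category_vectors,
        key=lambda concept: document_similarity(dv, candidate_category_vectors[concept]),
    )
```

For a document vector weighted toward “zoo” and “wildlife,” a category vector for Panthera onca that includes those categories scores higher than one for Jaguar Cars Ltd. built around “luxury cars,” so the car concept would be the one pruned.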

In some embodiments, additional information is employed to resolve remaining ambiguities. For example, the textual representation, “Michael Jackson” most frequently refers to the American musician. However, the taxonomy also includes other individuals of note that are also named “Michael Jackson” (e.g., a civil war soldier, a British television executive, etc.). It is possible that a document could be referring to a Michael Jackson that is not the musician. In various embodiments, the popularity of a particular concept is used as one consideration (e.g., with the musician meaning being more popular than the civil war soldier) and the concept's document similarity score is used as another. Based on customizable weights, engine 106 can be configured to disambiguate concepts such as “Michael Jackson” by preferring the popular meaning (and pruning the others), except when the document similarity score overwhelmingly indicates (e.g., having a document similarity score exceeding 0.7) that an alternate meaning should be selected. As another example, the freshness of a topic can be considered.
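One possible reading of this rule is sketched below: the most popular meaning wins unless some other meaning has a document similarity score above a threshold (0.7 in the example above). The tuple layout, the function name, and the tie-breaking among several strong alternates are assumptions of the sketch.

```python
def choose_meaning(candidates, override_threshold=0.7):
    """Pick one meaning of an ambiguous textual representation.

    candidates: list of (concept, popularity, document_similarity) tuples.
    """
    most_popular = max(candidates, key=lambda c: c[1])
    # Alternates whose similarity to the document is overwhelming.
    strong = [
        c for c in candidates
        if c is not most_popular and c[2] > override_threshold
    ]
    if strong:
        return max(strong, key=lambda c: c[2])[0]
    return most_popular[0]
```

A weighted combination of popularity, similarity, and freshness would be an alternative design; the hard threshold is used here only because it mirrors the example in the text.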

Ranking Results

Even after the scoring and pruning actions described above have been performed, for a given document 102, it is possible that hundreds (or more) of textual representations and associated concepts remain as candidates. Typically, only a handful of the top textual representations and/or concepts are needed.

Ranker 330 is configured to rank the concepts remaining in consideration after the above processing/pruning has been performed. One approach is to use a scoring function s that computes a score given a concept feature vector. In various embodiments, what weights to apply to the various signals included in the concept feature vector are empirically determined and then tuned using linear regression. In various embodiments, only a subset of the signals is used (e.g., a combination of the document similarity score and linkworthiness/popularity/freshness signals). For a given document 102, a threshold/cutoff is applied to limit the final list of concepts to an appropriately manageable size for consumption by an application. Concepts having a score above the threshold (and their corresponding textual representations) are provided as output (i.e., “entities”).
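A linear scoring function with a cutoff can be sketched as follows. The signal names (DocSim, Linkworthiness) and the weight values in the usage note are placeholders; the description leaves the actual weights to empirical tuning.

```python
def rank_concepts(feature_vectors, weights, threshold):
    """Score each concept as a weighted sum of its signals and apply a cutoff.

    feature_vectors: dict concept -> dict of signal -> value.
    weights: dict signal -> weight (signals without a weight are ignored).
    Returns a list of (concept, score) above the threshold, best first.
    """
    scored = []
    for concept, signals in feature_vectors.items():
        score = sum(weights.get(name, 0.0) * value for name, value in signals.items())
        if score > threshold:
            scored.append((concept, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For instance, with weights of 0.7 for DocSim and 0.3 for Linkworthiness and a threshold of 0.3, a concept with signals 0.9 and 0.5 scores 0.78 and is kept, while one with signals 0.2 and 0.1 scores 0.17 and is dropped.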

Publishers can, through configuration 318, specify customized rules for the combination function used to calculate final concept scores. For example, publisher 120 can specify as a rule that while all medical concepts should be considered by engine 106 when generating the document vector 314, disease symptoms should not be output as entities. As another example, publisher 120 might choose to weight the values of the Whitelist/Blacklist signals more heavily than publisher 118, who might in turn prefer another signal, such as by preferring concepts with the higher freshness scores, or a monetization signal that measures how well a given concept monetizes. One benefit of using category-based monetization is that an extrapolation can be made as to the monetization of a very specific textual representation based on the concept (or higher level category/vertical) with which it is associated. It may be the case that pharmaceuticals monetize well but names of diseases do not. When a new pharmaceutical is introduced to market, the publisher need not take any action to indicate a preference toward textual representations of the new pharmaceutical as a candidate term. As another example, if specific words are empirically determined to monetize well on a given publisher's website (e.g., “golden retriever,” “collie”), the categorization of those words (e.g., “breeds of dog”) within the taxonomy can be used by engine 106 to bias the selection of other words belonging to the category (e.g., “beagle”) even absent historic data for those other words.

In some embodiments the threshold/cutoff is manually selected, such as by a publisher specifying in configuration 318 that a maximum number of 10 entities be returned. In other embodiments, engine 106 applies a dynamically generated threshold based on factors such as the document length. For example, the publisher can specify a link density, such as that up to 5% of the number of words in a document be included in entities. In some embodiments, the number of textual representations remaining in candidate list 324 is used as a proxy for the document length. Other information, such as click-through rate data, can also be used to determine the cutoff number of entities and also as an additional, site-specific signal that can be stored (e.g., in database 108) and used while processing other documents (e.g., as an additional concept feature vector signal).
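A link-density cutoff could be applied along the following lines. This sketch reads the 5% rule as a budget on the total number of words covered by entities and stops at the first ranked entity that would exceed the budget; both the interpretation and the stopping rule are assumptions, as is the optional hard cap on the number of entities.

```python
def apply_link_density(ranked_entities, document_word_count, density=0.05, max_entities=None):
    """Keep top-ranked entities while their combined word count fits the budget.

    ranked_entities: textual representations, best first.
    """
    budget = int(document_word_count * density)
    selected, used = [], 0
    for text in ranked_entities:
        if max_entities is not None and len(selected) >= max_entities:
            break
        words = len(text.split())
        if used + words > budget:
            break
        selected.append(text)
        used += words
    return selected
```

When the true document length is unavailable, the number of candidates remaining in list 324 could be passed as document_word_count, in line with the proxy mentioned above.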

FIG. 7 illustrates an example of a portion of output generated by a document processing engine. The example shown illustrates the first and twenty-fourth ranked entities determined from the document shown in FIG. 2B. The concept “Lunar Reconnaissance Orbiter” (and corresponding textual representation “Lunar Reconnaissance Orbiter”) has the highest score as indicated in region 702. The concept “academic conference” (and corresponding textual representation “scientific meeting”) has a considerably lower score as indicated in region 704.

Example Process for Detecting an Entity

FIG. 8 illustrates an embodiment of a process for determining a mapping between a textual representation in a document and a concept. In various embodiments, the process shown in FIG. 8 is performed by document processing engine 106. The process begins at 802 when a document is received. As one example, a document is received at 802 when Joe submits a blog post to site 118 and site 118 provides the post to system 106 via an API. At 804, candidate textual representations are identified, such as by textual representation detector 308. At 806, concepts associated with the candidate textual representations are determined, such as by mapper 328. As explained above, various refinements (e.g., disambiguation) and pruning of the candidate textual representations and associated concepts can be performed.

Finally, at 808 pairs of textual representations and associated concepts are provided as output. As one example, at 808, entities 110 are provided to hyperlink generator 134, which provides to site 118 instructions for generating links such as from “American Geophysical Union” to www.agu.org. In various embodiments, the instructions include properly formed HTML. In others, a list comprising the textual representation and a destination URL (but no HTML) is generated by generator 134 and provided to site 118. As another example, at 808, entities 110 are provided to plugin 132 which determines additional information that should be displayed to Alice in a separate window.

In various embodiments, plugin 132 leverages additional information to which it has access, such as cookies, passwords, and other information stored within a browser, when including additional links in the rendered page. For example, suppose Alice is currently viewing a document that mentions a company, Beta Corporation. Engine 106 determines that the textual representation “Beta Corporation” should be linked. Plugin 132 is aware that Alice is “friends,” on a social networking site, with a person who works at Beta Corporation (as gleaned from his profile). Accordingly, rather than linking to Beta Corporation's website, plugin 132 instead decides to direct Alice to the social networking site. As another example, instead of inserting a hyperlink into the document, plugin 132 could provide popup text or another notification informing Alice that her friend works at Beta Corporation.

FIG. 9 illustrates an embodiment of a process for categorizing a document. In various embodiments, the process shown in FIG. 9 is performed by document processing engine 106. The process begins at 902 when a document is received. As one example, a document is received at 902 when Joe submits a blog post to site 118 and site 118 provides the post to system 106 via an API. At 904, entity pairs are determined, such as in accordance with the processing shown at portions 804-808 of the process shown in FIG. 8. Finally, at 906 a categorization of the document is determined. In some embodiments this is document vector 314. The tags shown at 254 in FIG. 2B are an example of output that is generated as a result of the determination made at 906. In various embodiments, the categorization determined at 906 is thresholded prior to output, such as by being limited to the top three categories of the document vector.
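The thresholding step at 906 can be illustrated with a short sketch. It assumes, purely for illustration, that the document vector is available as a mapping from category name to weight; the function name `top_categories` and the sample weights in the usage note are hypothetical.

```python
def top_categories(document_vector, limit=3):
    """Return the `limit` highest-weighted category names, strongest first.

    `document_vector` is assumed to be a dict of category name -> weight.
    """
    ranked = sorted(document_vector.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:limit]]
```

For a made-up vector such as `{"Space": 0.9, "Science": 0.7, "Education": 0.2, "Travel": 0.05}`, the function keeps only “Space,” “Science,” and “Education.”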

Example Embodiment Including Hyperlinks in a Web Forum

FIG. 10 illustrates an embodiment of a portion of a webpage as rendered in a browser. The example shown is a portion of site 120, which provides a forum in which users discuss various medical conditions and other topics with one another. In the example shown, a user, “Fred22,” has recently discovered that he has diabetes and is exchanging messages with other users, such as “JanetQ,” about his diagnosis.

An administrator of site 120 has provided to system 104 configuration information pertaining to site 120. Specifically, the administrator has indicated that site 120 is a health site (e.g., by listing “Health” as a vertical to which it pertains). As mentioned above, users of site 120 are unable to insert hyperlinks into their posts for security reasons. However, the administrator of site 120 would like visitors to have as positive an experience as possible and thus would like them to benefit from the techniques described herein by having various hyperlinks automatically included in forum posts. The administrator has specified, in configuration 318, that any textual representations selected by engine 106 that are associated with pharmaceuticals be hyperlinked to entries in an online pharmaceutical encyclopedia. The administrator has further specified that selected textual representations for which site 120 has informational pages (e.g., basic medical concepts) be hyperlinked to those pages. One way of accomplishing this is to, as a periodic process, crawl all or some subset of the pages on site 120 and categorize them (e.g., by determining their respective document vectors). Pages on site 120 with appropriate document vectors can be used for linking. Finally, the administrator has specified that any textual representations selected by engine 106 which are not covered by the two previous specifications be hyperlinked to topic pages automatically generated for those concepts (e.g., based on techniques described below).
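The three configuration rules above amount to an ordered fallback, which can be sketched as follows. The base URLs, the slug scheme, and the function name `choose_destination` are all hypothetical placeholders; the application does not specify how configuration 318 is encoded or how destination URLs are formed.

```python
from urllib.parse import quote

# Hypothetical placeholder locations; not taken from the application.
ENCYCLOPEDIA_BASE = "https://pharma-encyclopedia.example/entry/"
TOPIC_BASE = "https://topics.example/"


def choose_destination(concept, pharmaceuticals, site_pages):
    """Pick a link destination for a concept by applying the rules in order.

    pharmaceuticals: set of concept names known to be pharmaceuticals.
    site_pages: dict of concept name -> URL of an informational page on the site.
    """
    slug = quote(concept.lower().replace(" ", "-"))
    if concept in pharmaceuticals:   # rule 1: pharmaceutical encyclopedia entry
        return ENCYCLOPEDIA_BASE + slug
    if concept in site_pages:        # rule 2: the site's own informational page
        return site_pages[concept]
    return TOPIC_BASE + slug         # rule 3: automatically generated topic page
```

Because the rules are checked in order, a concept that is both a pharmaceutical and the subject of a site page would be sent to the encyclopedia in this sketch; whether that matches the administrator's intended precedence is an assumption.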

In the example shown, engine 106 has determined that the textual representation “Type 2 Diabetes” should be hyperlinked to a page about diabetes that is also hosted by site 120 (1002). Using the techniques herein, engine 106 is able to disambiguate “T1” to mean the concept “type 1 diabetes” and has also hyperlinked it to the page about diabetes (1004). Engine 106 determined that the textual representation “glucose meter” should be hyperlinked to an automatically generated topic page. Glucose meter is a concept that monetizes well, and so, during the scoring performed by engine 106, textual representation 1006 was selected over other candidate representations. Engine 106 has determined that the textual representation “Metformin” should be hyperlinked to an entry for that pharmaceutical in an online pharmaceutical encyclopedia.

Link Distribution

In the example shown in FIG. 10, the hyperlinks generated by system 104 are concentrated within the first post, at the top of the page. Publishers can specify which regions of the page should be considered for hyperlinking and which should not. Publishers can also indicate their preferences for how the links should be distributed, such as evenly throughout the page, or concentrated in various areas. In some embodiments, information such as click-through information is used by system 104 to automatically determine whether the users of a given site tend to concentrate their clicking activity in particular regions or whether their clicking activities are evenly spread throughout a given page. One way of accomplishing this is for system 104 to record, for each click event, the position of the text being clicked. The positions are discretized (e.g., into the first tenth of the document, the next tenth of the document, and so on) and a determination is made as to what, if any, impact the position of a link has on a visitor's odds of activating the link. The position of the textual representation can be compared against the historical click-through information and used as yet another signal by engine 106 when selecting entities. For example, if visitors tend to click on links appearing at the beginning of an article, textual representations toward the end of the article may not be selected by ranker 330. However, if a textual representation appearing toward the end of an article has sufficiently high scores for other signals, it may nonetheless be selected over earlier appearing textual representations.
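The discretization of click positions into tenths can be sketched as follows. The sketch assumes that each logged event carries a link's relative position in the document (0.0 at the start, 1.0 at the end) together with whether the link was clicked; recording unclicked links as well is an assumption made here so that a rate can be computed, and the function name `click_rate_by_bucket` is hypothetical.

```python
def click_rate_by_bucket(events, buckets=10):
    """Compute the click-through rate for each positional bucket of a document.

    events: iterable of (relative_position, was_clicked) pairs, with
            relative_position in the range [0.0, 1.0].
    Returns a list of `buckets` rates; a bucket with no links shown is None.
    """
    shown = [0] * buckets
    clicked = [0] * buckets
    for position, was_clicked in events:
        # A position of exactly 1.0 is folded into the last bucket.
        index = min(int(position * buckets), buckets - 1)
        shown[index] += 1
        if was_clicked:
            clicked[index] += 1
    return [c / s if s else None for c, s in zip(clicked, shown)]
```

The resulting per-bucket rates could then be looked up for a candidate textual representation's position and supplied as one positional signal among the others.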

FIG. 11 is a chart illustrating the distance between textual representations. In the example shown, textual representations are arranged along one axis and sorted according to position. The first selected textual representation appearing in a document is shown in FIG. 11 as item 0, the second selected textual representation appearing in the document is shown as item 1, and so on. On the y axis is the letter distance between a given link and the next link appearing after it. Link 1104 is relatively far in letter distance from link 1102. Link 1106 is very close to link 1104. In some embodiments, engine 106 is configured to minimize the low values (e.g., link 1106) in the graph and to maximize the high values (e.g., link 1104). The letter distance between candidate links can be used as yet another signal by ranker 330 when selecting entities for output. In some cases, such as with lists of items (e.g., the names of the Seven Dwarves), engine 106 may select all seven textual representations despite all seven names being adjacent to one another. Such lists of items can be determined based on the document similarity score and can also be based on information stored in the taxonomy.
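The letter distance plotted in FIG. 11 can be computed with a few lines of code. The sketch below assumes each selected link is described by its start and end character offsets and measures the gap from the end of one link to the start of the next; the application does not state which endpoints are used, so that choice and the function name `letter_distances` are assumptions.

```python
def letter_distances(link_spans):
    """Return the character gap between each link and the next one after it.

    link_spans: list of (start, end) character offsets, one per selected link.
    The result has one fewer entry than the input, ordered by position.
    """
    ordered = sorted(link_spans)
    return [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]
```

For spans at offsets (0, 5), (100, 110), and (112, 120), the distances are 95 and 2; the small second value corresponds to the kind of closely spaced pair that a ranker might penalize.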

Intents

As explained above, a variety of destination URLs can be selected as the destination of hyperlinks of textual representations based on various configuration preferences and the available inventory of destination pages for a given topic. In some embodiments, additional information such as an intent of a visitor (e.g., “shopping”) or the context of the original page (e.g., “health”) can be used to influence the destination to which the visitor will be directed when activating a link.

Examples of intents include “shopping” and “symptoms.” To fully understand the intent, a companion subject is required, such as “shopping for shoes” or “symptoms of the measles.” Without a companion subject, an intent is unlikely to be a worthwhile entity. However, the combination of an intent with its companion subject is potentially of significant interest. Accordingly, in various embodiments, engine 106 is configured with a list of intents and, as applicable, concepts that are appropriate companions for the respective intents. Textual representations associated with an intent will have an Intent=1 signal set.
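A configured list of intents and their companion concepts might be represented as in the following sketch. The table contents are drawn from the examples above, but the data structure, the function name `intent_signals`, and the returned dictionary shape are hypothetical. How companion subjects feed into scoring is not specified in this passage, so the sketch only reports which configured companions are present in the document.

```python
# Hypothetical configuration: intent -> concepts that are appropriate companions.
INTENT_COMPANIONS = {
    "shopping": {"shoes"},
    "symptoms": {"measles"},
}


def intent_signals(textual_representation, document_concepts):
    """Set Intent=1 for a configured intent and list its companions found in the document."""
    key = textual_representation.lower()
    if key not in INTENT_COMPANIONS:
        return {"Intent": 0, "companions": []}
    present = {c.lower() for c in document_concepts}
    return {"Intent": 1, "companions": sorted(INTENT_COMPANIONS[key] & present)}
```

With this table, the textual representation “shopping” in a document that also mentions shoes yields Intent=1 with “shoes” as the companion, whereas “glucose meter” yields Intent=0.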


