Unsupervised learning tool for feature correction
Related Patent Categories: Data Processing: Database And File Management Or Data Structures, Database Or File Accessing, Query Processing (i.e., Searching)

The Patent Description & Claims data below is from USPTO Patent Application 20070043707, Unsupervised learning tool for feature correction.

CLAIM OF PRIORITY TO APPLICATION FILED IN FOREIGN COUNTRY

[0001] The present application claims priority under 35 USC § 119(a) to an application for patent filed in India on Aug. 17, 2005, the title of that application being "UNSUPERVISED LEARNING TOOL FOR FEATURE CORRECTION," and the application number of that application being 753/KOL/05.

FIELD OF THE INVENTION

[0002] The present invention relates to data processing and, more specifically, to automatically validating and correcting the feature extraction results in search indices by identifying recurrent patterns in HTML/text documents.

BACKGROUND

[0003] Web sites present information on various topics in various formats. A great amount of effort is often required for a user to manually locate and extract useful data from the web sites. Therefore, there is a great need for value-added services that integrate information from multiple sources. For example, such services include customizable web information gathering robots/crawlers, comparison-shopping agents, meta-search engines and news bots, etc.

[0004] To facilitate the development of these information integration systems, good tools are needed for information gathering and extraction. In situations where data has been collected from different web sites, a conventional approach for extracting data from various web pages uses programs called "wrappers" or "extractors" to extract or excerpt data items, or "features," from the contents of the web pages.
[0005] For example, an extractor might attempt to categorize different data items that occur within a particular web page. If the web page comprises an advertisement for an employment opportunity, for example, then the extractor might attempt to locate, within the web page, separate data items that fit into "job title" and "job location" categories. The extractor might attempt to categorize data items on multiple separate web pages in this manner. When the extractor locates a data item that the extractor deems to fit a particular category, the extractor may insert that data item into a search index, and establish an association between that data item and the category that the data item is deemed to fit. When a user later queries a search engine, the search engine may consult the search index to find search results in which the user may be interested. The accuracy and completeness of the contents of the search index strongly influence the relevance and value of the results.

[0006] For a particular web page and a particular category, the extractor might or might not be able to locate, on that page, a data item that fits that category. If the criteria used to identify a data item that fits a particular category are not well adapted to the construction of the page, then the extractor might mistakenly determine that a data item other than the "correct" data item fits the category. For example, the extractor might mistakenly determine that the "job location" data in a web page (rather than the actual "job title" data in that web page) fits into the "job title" category.

[0007] Based on how many of the criteria are satisfied by the data item selected for a category, the extractor might assign, to the selected data item, an indication of how likely it is that the data item actually was the "correct" data item on the page, that is, how likely it is that the data item actually did fit the category. This indication is commonly called a "confidence measure."
Data items that are very likely to be the "correct" data items may be associated with relatively high confidence measures, while data items that are less likely to be the "correct" data items may be associated with lower confidence measures. If the confidence measure for a particular data item is lower than a certain threshold, then the extractor might refrain from inserting the data item into the search index at all.

[0008] After an extractor has automatically populated the search index, the search index may contain some incorrect entries, and may omit some correct entries. One approach for revising the search index involves employing a human being to look through the extracted data items manually, determine which data items have relatively low confidence measures, read the pages from which the low-confidence data items were excerpted, and determine whether any data items in those pages actually do fit the categories at issue. Although human beings are consistent and accurate in some cases, they usually operate slowly, and they can cost a considerable amount of money to train and maintain. Some human beings are less consistent and accurate than others, especially after they have been working uninterrupted for long periods of time. Mistakes happen.

[0009] Other approaches for revising the search index rely on the web pages being formatted in a known way, and, as a result, are inapplicable if the web pages are not formatted in that known way or if the structure of the web pages deviates over time from that known way. For example, some approaches might require the web pages to be HTML documents that conform to a specified scheme. These approaches fail when applied to documents that are not in HTML or which depart from the scheme even to a minor extent, sometimes due to changes in the documents after the extraction process has occurred.
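The extraction flow described in paragraphs [0005] through [0007] can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of the application: it assumes the confidence measure is simply the fraction of a category's criteria that a candidate satisfies, and the names `Criterion`, `confidence`, `populate_index`, the toy criteria, and the 0.5 threshold are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A criterion is any yes/no test applied to a candidate string (hypothetical).
Criterion = Callable[[str], bool]


def confidence(candidate: str, criteria: List[Criterion]) -> float:
    """Fraction of criteria the candidate satisfies (an assumed scoring rule)."""
    if not criteria:
        return 0.0
    return sum(1 for test in criteria if test(candidate)) / len(criteria)


def populate_index(
    candidates: List[str],
    criteria_by_category: Dict[str, List[Criterion]],
    threshold: float = 0.5,  # assumed cutoff; the text names no value
) -> Dict[str, Tuple[str, float]]:
    """For each category, keep the best-scoring candidate if it meets the threshold."""
    index: Dict[str, Tuple[str, float]] = {}
    for category, criteria in criteria_by_category.items():
        scored = [(confidence(c, criteria), c) for c in candidates]
        if not scored:
            continue
        best_score, best = max(scored, key=lambda pair: pair[0])
        # Below the threshold, the item is left out of the index entirely.
        if best_score >= threshold:
            index[category] = (best, best_score)
    return index


# Toy criteria for a job posting page (illustrative only).
criteria_by_category = {
    "job title": [
        lambda s: "engineer" in s.lower(),
        lambda s: "," not in s,
    ],
    "job location": [
        lambda s: "," in s,
        lambda s: s.split(",")[-1].strip().isupper(),
    ],
}

page_items = ["Senior Software Engineer", "Sunnyvale, CA"]
index = populate_index(page_items, criteria_by_category)
```

With criteria this crude, a string that contains no comma already satisfies half of the "job title" tests, which is the kind of miscategorization paragraph [0006] warns about; raising the assumed threshold trades such false entries for omitted ones.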
[0010] The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.

BRIEF DESCRIPTION OF THE DRAWINGS

[0011] The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:

[0012] FIG. 1 is a block diagram illustrating a high-level functional view of the architecture of an example system that employs techniques described herein in order to revise a search index, according to an embodiment of the invention;

[0013] FIG. 2 is an illustration of a sample web page from which data items might be excerpted and categorized, according to an embodiment of the invention;

[0014] FIG. 3 is a flow diagram illustrating a technique for identifying patterns in a fuzzy manner, according to an embodiment of the invention;

[0015] FIG. 4 is a flow diagram illustrating a technique for selecting, from among a plurality of candidate symbol subsequences, a symbol subsequence that represents a set of shared characteristics for a particular data item category and web site, according to an embodiment of the invention; and

[0016] FIG. 5 is a block diagram of a computer system on which embodiments of the invention may be implemented.

DETAILED DESCRIPTION

[0017] In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, that the present invention may be practiced without these specific details.
In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.

Overview

[0018] According to one embodiment of the invention, it is assumed that an extractor has deemed certain data items (or "features") to be "high confidence measure" data items. High confidence measure data items are data items that are considered likely (although not certain) to have been categorized correctly. For example, assume that data items X, Y and Z have been categorized into category Q. The confidence measures that data items X, Y and Z belong to category Q may be 90%, 75% and 25%, respectively. Under these circumstances, data items X and Y may be considered to be high confidence measure data items for category Q.
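The category Q example from paragraph [0018] can be worked through in code. This is a minimal sketch: the text does not state the cutoff that separates high confidence measures from low ones, so the 0.7 value below is an assumption chosen only because it lies between the 75% and 25% figures, and `split_by_confidence` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def split_by_confidence(
    measures: Dict[str, float], cutoff: float
) -> Tuple[List[str], List[str]]:
    """Partition data items into (high, low) confidence lists for one category."""
    high = [item for item, m in measures.items() if m >= cutoff]
    low = [item for item, m in measures.items() if m < cutoff]
    return high, low


# Confidence measures for category Q, taken from the example in the text.
category_q = {"X": 0.90, "Y": 0.75, "Z": 0.25}

# The cutoff is an assumption; the text gives no numeric threshold.
high, low = split_by_confidence(category_q, cutoff=0.7)
print(high, low)
```

Under that assumed cutoff, X and Y land in the high confidence list and Z in the low confidence list, which matches the grouping given in the paragraph.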