Fresh Patents
10/26/06 - USPTO Class 709 | #20060242266

Rules-based extraction of data from web pages

USPTO Application #: 20060242266
Title: Rules-based extraction of data from web pages
Abstract: A rule creation application uses a reference web page, and user input regarding information displayed thereon, to generate a rule for extracting such information from the web page. The rule uses a structured graph representation of the web page, such as the page's Document Object Model (DOM), to extract the information. In addition to being applicable to the reference web page, the rule may be used to extract information from other web pages that have a similar structure.



Agent: Knobbe Martens Olson & Bear LLP - Irvine, CA, US
Inventors: Paula Keezer, Brad Tofel
USPTO Application #: 20060242266 - Class: 709218000 (USPTO)

Related Patent Categories: Electrical Computers And Digital Processing Systems: Multicomputer Data Transferring, Remote Data Accessing, Using Interconnected Networks

Rules-based extraction of data from web pages description/claims


The Patent Description & Claims data below is from USPTO Patent Application 20060242266, Rules-based extraction of data from web pages.




PRIORITY CLAIM

[0001] This application is a division of U.S. patent application Ser. No. 09/794,952, filed Feb. 27, 2001, the disclosure of which is hereby incorporated by reference.

COMPUTER PROGRAM LISTING APPENDIX

[0002] The Computer Program Listing Appendix submitted on duplicate compact discs in parent application Ser. No. 09/794,952, filed Feb. 27, 2001, is incorporated herein by reference. The copyright owner has no objection to the reproduction of this Computer Program Listing Appendix as part of this patent document, but reserves all other copyrights whatsoever.

BACKGROUND OF THE INVENTION

[0003] 1. Field of the Invention

[0004] This invention relates generally to the Internet and the World Wide Web and, more particularly, to methods and systems for identifying items represented on web pages and for providing supplemental information about items represented on web pages.

[0005] 2. Description of the Related Art

[0006] Web pages provide a highly flexible and effective medium for presenting information to people. The information on any particular web page is generally not, however, optimized for substantive analysis by machine or computer.

[0007] One type of substantive analysis of a web page that can be automated is a determination as to what item or items are represented on a web page. An item can be any identifiable thing, such as a product, a service, a job listing, a company, or a person. Prior technology has generally relied upon regular expression matching, which can be unreliable and which may require substantial processing. The present invention seeks to address this problem among others.

SUMMARY OF THE INVENTION

[0008] In a preferred embodiment, the present invention utilizes the Document Object Model (DOM) representation of a sampled web page to create a rule that extracts data from web pages having a similar DOM structure to the sampled web page. The DOM is an object-oriented interface supported by most popular web browsers through which a displayed web page can be accessed and manipulated. The DOM provides a structured graph representation of a web page with nodes that represent each HTML (Hypertext Markup Language) tag.
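
The structured graph representation described above can be illustrated with a minimal sketch. It uses Python's standard html.parser module to build a simplified, DOM-like tree with one node per HTML tag; it is not a browser DOM and is not taken from the Computer Program Listing Appendix. The names Node, TreeBuilder, parse, and outline, and the sample page, are hypothetical, and the sketch assumes every opened tag is explicitly closed.

```python
from html.parser import HTMLParser


class Node:
    """One element in a simplified, DOM-like tree (hypothetical helper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a Node tree; assumes every opened tag is explicitly closed."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # Each start tag becomes a child of the element currently open.
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Closing a tag moves back up to the parent element.
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        # Text is attached to the element that directly contains it.
        self.current.text += data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return one indented line per element node, in document order."""
    lines = []
    for child in node.children:
        lines.append("  " * depth + child.tag)
        lines.extend(outline(child, depth + 1))
    return lines


# Made-up sample page for illustration only.
PAGE = "<html><body><h1>Acme Kettle</h1><p>Price: <b>$25</b></p></body></html>"

if __name__ == "__main__":
    print("\n".join(outline(parse(PAGE))))
```

Printing the outline shows the nesting of html, body, h1, p, and b, which is the structure a rule can navigate instead of scanning the raw markup.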

[0009] In general, within a single domain or web site, when information on different web pages appears to be displayed in a similar structure, a similar structure is actually being used. Pages that have a similar HTML structure will also have a similar DOM. Furthermore, pages that may even appear substantially different may have a similar DOM structure with respect to the nodes of the DOM that are relevant to a representation of an item of interest. Generally, content providers or web retailers benefit from code or template reuse, and as a result, multiple web pages will have a similar HTML/DOM structure. A preferred embodiment utilizes the structural representation of web pages provided by the DOM to provide a powerful tool through which items can be identified on web pages.
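
One way to see that pages built from the same template share a DOM structure is to compare the sets of tag paths they contain. The sketch below is an illustration only: the Jaccard-style similarity measure, the names PathCollector, tag_paths, and similarity, and the three sample pages are assumptions introduced here and are not described in this application.

```python
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Records the tag path (e.g. 'html/body/div') of every element seen."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))

    def handle_endtag(self, tag):
        # Assumes well-formed markup in which every tag is closed.
        if self.stack:
            self.stack.pop()


def tag_paths(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def similarity(html_a, html_b):
    """Share of tag paths common to both pages (1.0 means identical sets)."""
    paths_a, paths_b = tag_paths(html_a), tag_paths(html_b)
    union = paths_a | paths_b
    return len(paths_a & paths_b) / len(union) if union else 1.0


# Two product pages from the same hypothetical template, and an unrelated page.
PAGE_A = "<html><body><div><h1>Kettle</h1><span>$25</span></div></body></html>"
PAGE_B = "<html><body><div><h1>Toaster</h1><span>$40</span></div></body></html>"
PAGE_C = "<html><body><ul><li>About us</li></ul></body></html>"

if __name__ == "__main__":
    print(similarity(PAGE_A, PAGE_B))
    print(similarity(PAGE_A, PAGE_C))
```

The two product pages differ only in their text, so their tag-path sets are identical, while the unrelated page shares only the outermost elements.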

[0010] The DOM is a well-documented utility that has been dealt with at length by the World Wide Web Consortium (www.w3.org). One skilled in the art will be familiar with the DOM and therefore the details of the DOM will not be presented herein. Although the present invention refers to the Document Object Model in particular, it will be apparent to one skilled in the art that other representations of web pages that allow the identification of page elements based upon the structure of a page can be used.

[0011] In one embodiment, a rule is created based upon the DOM or other structured graph representation of a first web page and subsequently applied to a second, structurally similar web page, in order to extract data related to an item represented on the second web page. The item-related data that are extracted preferably include item-identifying data that can be used to identify the item represented on the second web page. The item-identifying data can include any data by which an item is represented on a web page, such as the name of an item. The item-identifying data can then be used to identify the item by matching it to an item within a database of item-identifying data.
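
The create-then-apply sequence in the preceding paragraph can be sketched as follows. The rule format used here, a list of (tag, child index) steps from the document root to the target node, is an assumption chosen for illustration and is not necessarily the format used by the invention. The helper names, the sample pages, and the ITEM_DB contents are likewise hypothetical, and the simplified tree assumes every opened tag is explicitly closed.

```python
from html.parser import HTMLParser


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], ""


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM-like tree from well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find_by_text(node, text):
    """Depth-first search for the first node whose own text equals `text`."""
    if node.text == text:
        return node
    for child in node.children:
        hit = find_by_text(child, text)
        if hit is not None:
            return hit
    return None


def make_rule(root, sample_text):
    """Record the (tag, child index) steps from the root to the sample node."""
    node = find_by_text(root, sample_text)
    if node is None:
        raise ValueError("sample text not found on reference page")
    steps = []
    while node.parent is not None:
        steps.append((node.tag, node.parent.children.index(node)))
        node = node.parent
    return list(reversed(steps))


def apply_rule(root, rule):
    """Follow the recorded steps; return None if the structure differs."""
    node = root
    for tag, index in rule:
        if index >= len(node.children) or node.children[index].tag != tag:
            return None
        node = node.children[index]
    return node.text


# Hypothetical pages and item database, for illustration only.
REFERENCE_PAGE = "<html><body><div><h1>Acme Kettle</h1><p>$25</p></div></body></html>"
SECOND_PAGE = "<html><body><div><h1>Zen Toaster</h1><p>$40</p></div></body></html>"
ITEM_DB = {"acme kettle": "item-001", "zen toaster": "item-002"}

if __name__ == "__main__":
    rule = make_rule(parse(REFERENCE_PAGE), "Acme Kettle")
    name = apply_rule(parse(SECOND_PAGE), rule)
    print(rule, name, ITEM_DB.get(name.lower()))
```

Here the sample text stands in for the user input that marks the item name on the reference page. The resulting rule carries no page-specific text, which is why it can be reused on the second page.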

[0012] In one embodiment, a client application executes on a user computer in conjunction with a web browser. The client application retrieves a rule from a data server based upon the URL of a web page loaded by the web browser. The client application then applies the rule to the web page to extract item-identifying data from the web page. The client application then provides the item-identifying data to the data server. The data server identifies the item by matching the item-identifying data to an item in a database. The data server retrieves supplemental information about the item from the database and supplies the supplemental information to the client application to be displayed on the user computer. The tasks of retrieving and applying the rule may alternatively be performed in whole or in part by a computing device other than an end user's computer, such as a special proxy server, the data server, or a computer system used to crawl and index web pages.
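
The exchange between the client application and the data server can be sketched in a single process, with an in-memory object standing in for the server. Several details are assumptions made for illustration: rules are keyed by hostname (the application says only that the rule is retrieved based upon the URL), a rule is a slash-separated tag path, and the names DataServer, PathExtractor, extract, and client_handle_page, along with the URLs and data, are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class PathExtractor(HTMLParser):
    """Captures the first non-empty text found at a given tag path."""

    def __init__(self, target_path):
        super().__init__()
        self.target = target_path
        self.stack = []
        self.result = None

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Assumes well-formed markup in which every tag is closed.
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if self.result is None and text and "/".join(self.stack) == self.target:
            self.result = text


def extract(html, path):
    parser = PathExtractor(path)
    parser.feed(html)
    parser.close()
    return parser.result


class DataServer:
    """In-memory stand-in for the data server (hypothetical interface)."""

    def __init__(self, rules, items):
        self.rules = rules  # hostname -> tag path rule
        self.items = items  # lower-cased item name -> supplemental information

    def get_rule(self, url):
        return self.rules.get(urlparse(url).netloc)

    def lookup(self, item_name):
        return self.items.get(item_name.lower())


def client_handle_page(url, html, server):
    """Retrieve a rule by URL, apply it, and ask the server about the item."""
    rule = server.get_rule(url)
    if rule is None:
        return None
    item_name = extract(html, rule)
    if item_name is None:
        return None
    return server.lookup(item_name)


# Made-up rule, item data, and page.
SERVER = DataServer(
    rules={"shop.example": "html/body/div/h1"},
    items={"zen toaster": {"item_id": "item-002",
                           "other_retailers": ["retailer-b.example"]}},
)
PAGE = "<html><body><div><h1>Zen Toaster</h1><p>$40</p></div></body></html>"

if __name__ == "__main__":
    print(client_handle_page("http://shop.example/products/42", PAGE, SERVER))
```

Because the rule lookup and the item lookup are separate calls, the same sketch also covers the alternative arrangement in which a proxy or the data server itself applies the rule.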

[0013] In one embodiment, the present invention is used to identify products on web pages of web-based retailers. Retail web sites generally use pages of similar structure to display items for sale. Accordingly, a rule is created based upon one web page of a retailer, and the rule is then applied to identify products on other pages hosted by the retailer. Once a product is identified, supplemental information about the identified product, such as alternative retailers from which the product can be purchased, can be provided to a user (e.g., as web page metadata).

[0014] In one embodiment, a data server is configured to crawl through a web site and apply rules to target pages in order to identify and catalog the representation of products on web pages. In addition, rules can be configured such that the extracted item-related data includes supplemental item information, such as the price of a product, in addition to the item-identifying data. The extracted supplemental item information can be stored in association with an identification of the item in a database. The stored supplemental item information can then be supplied to users, such as in response to subsequent requests for information about the item.
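
A crawl that applies a multi-field rule to each target page might look like the sketch below. It assumes a rule that maps field names to tag paths, so that supplemental information such as price is extracted alongside the item name. A dictionary of pre-fetched pages stands in for a real crawler, and the names FieldExtractor, extract_fields, crawl, RULE, and SITE are hypothetical.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collects the first non-empty text found at each tag path in a rule."""

    def __init__(self, rule):
        super().__init__()
        # Invert the rule so each tag path maps back to its field name.
        self.path_to_field = {path: field for field, path in rule.items()}
        self.stack = []
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Assumes well-formed markup in which every tag is closed.
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        field = self.path_to_field.get("/".join(self.stack))
        text = data.strip()
        if field and text and field not in self.fields:
            self.fields[field] = text


def extract_fields(html, rule):
    parser = FieldExtractor(rule)
    parser.feed(html)
    parser.close()
    return parser.fields


def crawl(site, rule):
    """Apply the rule to every page; catalog pages that yield an item name."""
    catalog = {}
    for url, html in site.items():
        fields = extract_fields(html, rule)
        if "name" in fields:
            catalog[fields["name"]] = {"price": fields.get("price"), "url": url}
    return catalog


# Hypothetical rule and pre-fetched pages standing in for a crawled site.
RULE = {"name": "html/body/div/h1", "price": "html/body/div/p"}
SITE = {
    "http://shop.example/p/1":
        "<html><body><div><h1>Acme Kettle</h1><p>$25</p></div></body></html>",
    "http://shop.example/p/2":
        "<html><body><div><h1>Zen Toaster</h1><p>$40</p></div></body></html>",
    "http://shop.example/about":
        "<html><body><p>About us</p></body></html>",
}

if __name__ == "__main__":
    for name, record in crawl(SITE, RULE).items():
        print(name, record)
```

The page at /about does not match the rule's paths, so it contributes nothing to the catalog. The stored price and URL can later be returned in response to requests about the item.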

[0015] Item data extracted from web pages according to the invention may be used for a variety of other purposes. For example, the collected data may be used to build a database that can be queried by users to locate information about product offerings, auctions, job listings, apartment rentals, or other types of items.

BRIEF DESCRIPTION OF THE DRAWINGS

[0016] FIG. 1 illustrates a system in accordance with one embodiment of the present invention.

[0017] FIG. 2 illustrates a web browser user interface as it would be viewed by a user in accordance with one embodiment of the present invention.

[0018] FIG. 3 illustrates an example web browser user interface as it would be viewed by a tagger (person creating rules) in accordance with one embodiment of the invention.

[0019] FIG. 4 illustrates a general method for creating rules in accordance with one embodiment of the present invention.
