03/26/09 - USPTO Class 707 | #20090083244

Method and system for subject relevant web page filtering based on navigation paths information

USPTO Application #: 20090083244
Title: Method and system for subject relevant web page filtering based on navigation paths information
Abstract: A method and system are provided that use the set of navigation paths of web pages as contextual information for high-accuracy, subject-relevant web page filtering. The method comprises the steps of: obtaining all web pages in one or more web page collections; collecting information on the links among the obtained web pages; extracting, based on the collected links, a set of navigation paths for each of the obtained web pages; and filtering the obtained web pages based on the extracted set of navigation paths to obtain the desired web pages. In some embodiments, the extraction of the navigation paths is preferably performed on the navigation links of the web pages; the method therefore also comprises a step of deleting non-navigation links from the links of the web pages. Compared with the prior art, the present invention makes fuller use of the contextual information of web pages, thereby improving the accuracy and objectivity of web page filtering.



Agent: Sughrue Mion, PLLC - Washington, DC, US
Inventors: Jianqiang LI, Yu ZHAO
USPTO Application #: 20090083244 - Class: 707 4 (USPTO)

The Patent Description & Claims data below is from USPTO Patent Application 20090083244, Method and system for subject relevant web page filtering based on navigation paths information.

FIELD OF THE INVENTION

This invention relates to information retrieval and information extraction, and in particular to web page search and web page mining. More particularly, this invention provides methods and systems that use the set of navigation paths of web pages as contextual information for high-accuracy, subject-relevant web page filtering.

BACKGROUND

With the explosion of electronic information brought about by the Internet, a huge amount of diverse information has accumulated on the Web and continues to grow at a staggering rate. Helping users find useful information in this enormous pool is a challenging task.

Information retrieval (IR) is the science of searching for information in a set of documents. It can be further divided into searching for a piece of information contained in documents, searching for the documents themselves, searching for metadata that describe documents, and searching within databases (whether relational stand-alone databases or hypertext networked databases such as the Internet or intranets) for text, sound, images or data. Information extraction (IE) is a type of information retrieval whose goal is to automatically extract structured or semi-structured information from unstructured machine-readable documents. Originating from these two long-established research disciplines, a web search engine (e.g., Google or Baidu) is a document retrieval system designed specifically to help find information stored on the Web. It allows one to ask for content that meets specific criteria (typically content containing a given word or phrase) and to retrieve a list of items that match those criteria. Recently, a new type of web search engine, the vertical search engine, has become popular on the Web. Using information extraction or web mining technologies, it extracts structured information about a specific topic from a highly refined database or from selected websites, in order to provide more accurate and valuable information to people interested in a particular area.

In all of these information retrieval and information extraction solutions of the Internet era, web page filtering plays an important role, whether in a general or vertical web search engine or in a specific web mining system.

Technically, web page filtering consists mainly of two steps: first, selecting proper and efficient web page features for the specific filtering purpose; and second, modeling filtering mechanisms based on the selected features. In terms of the selected features, current approaches to web page filtering can be roughly classified into four categories: content-based filtering, PageType-based filtering, link-based filtering, and extended anchor based filtering. The four categories are briefly introduced below.

Content-based approach: This approach derives directly from information retrieval research [1]-[2]. It is a query-dependent algorithm, i.e., it assigns a similarity score to each web page whenever a query is submitted. The basic idea is that the words appearing in a web page are used to retrieve the relevant web pages, with higher scores given to web pages that contain the query terms early in the document or in a large or boldfaced font. Based on the Vector Space Model (VSM), the cosine measure can be used to compute the similarity between a web page and the query, and relevant web pages are then filtered according to the similarity scores.
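
The cosine measure mentioned above can be illustrated with a minimal sketch. It assumes plain term-frequency vectors and whitespace tokenization; the function names, the threshold value, and the omission of position or font weighting are illustrative assumptions, not details taken from the cited work.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term-frequency vector (assumed: lowercase, whitespace tokens)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def filter_pages(pages, query, threshold=0.1):
    """Keep pages whose similarity to the query reaches a hypothetical threshold.

    pages: {url: page_text}. Returns (score, url) pairs, best first.
    """
    q = term_vector(query)
    scored = [(cosine_similarity(term_vector(text), q), url)
              for url, text in pages.items()]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)
```

Because the score depends only on the words inside each page, a page that never mentions the query terms is dropped even when every link leading to it does, which is the limitation discussed later in this section.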

PageType-based approach: Most Internet users can recognize the document type to which a particular web page belongs just by casually looking at it. This suggests that a human evaluates a web page based not only on its contents but also on its format and design information. Following this observation, the content of a web page together with its structural characteristics is used in a rule-based classifier for web page type classification. The basic structural characteristics include typical pairs of a tag and strings, the size and number of inline images, the kind and number of links, and URL strings. Based on the internal features (e.g., anchor text, keywords, title, URL, etc.) of similar web pages, a machine learning based method can be adopted for web page classification.

Link-based approach: Since the Web is a collection of pages connected by hyperlinks, the link structure of the collection, in addition to the textual content of the individual pages, contains information that can, and should, be used for web page filtering. Based on an assumed “random surfer” model of a web user's behavior, a link-based method has been proposed for ranking web page importance. It uses the link structure of the Web to calculate a quality ranking for each web page, called the PageRank score, which is computed by weighting each in-link to a page in proportion to the quality of the page containing the in-link. Since the ranking score of a web page is determined solely by the page's location in the Web's graph structure (information external to the web page), it is query independent and can be computed ahead of query time. Finally, the rank values from the content-based and link-based methods are combined to determine the final score measuring the relevance of the web page to the subject.
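
As an illustration of the query-independent ranking described above, the following is a minimal power-iteration sketch of a PageRank-style score. The damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from pages without out-links are common textbook choices assumed here, not values stated in this document.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute a PageRank-style score for each page.

    links: {source_url: [dest_url, ...]}. All parameter defaults are assumptions.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Each out-link passes on an equal share of the page's rank.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page without out-links: spread its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank
```

The scores can be computed once for the whole collection and stored, since nothing in the computation depends on a query.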

Extended anchor based approach: When exploiting the hyperlink structure of the Web for web page filtering, the text appearing on a link, i.e., the anchor text, can also be used for web page ranking. Anchor text can be associated not only with the page that the link is on but also with the page the link points to. In the second case especially, anchor text often provides a more accurate description of a web page than the page itself; it also helps in searching non-text information such as images, programs, and databases, and expands search coverage with fewer downloaded documents. Based on these considerations, an extended anchor based approach to web page filtering has been proposed. First, all the anchor text that appears in the web pages and navigates a web browser from the top home page to each target web page is collected to build the extended anchor list. Then, the keywords appearing in the extended anchor list are used for target web page filtering.
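
The two steps above can be sketched as follows. This is a simplified illustration only: it assumes the link graph is already available as a dictionary, and it collects the anchors along a single shortest click path, whereas the approach described may gather anchors from more than one route. All names are hypothetical.

```python
from collections import deque

def extended_anchor_list(links, home, target):
    """Anchor texts along one shortest click path from home to target.

    links: {source_url: [(anchor_text, dest_url), ...]}.
    Returns None when the target cannot be reached from home.
    """
    queue = deque([(home, [])])
    seen = {home}
    while queue:
        url, anchors = queue.popleft()
        if url == target:
            return anchors
        for anchor, dest in links.get(url, []):
            if dest not in seen:
                seen.add(dest)
                queue.append((dest, anchors + [anchor]))
    return None

def matches(anchors, keywords):
    """True if any keyword occurs as a word in the extended anchor list."""
    words = {w for a in anchors for w in a.lower().split()}
    return any(k.lower() in words for k in keywords)
```

Only anchor text is inspected here; page titles, URLs and domain names along the path are ignored.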

However, the existing web page filtering solutions have disadvantages. First, the information retrieval models adopted by the content-based, PageType-based, and link-based approaches treat each web page as an independent document, i.e., they perform single-page indexing and ranking, which means that a returned page must include all the keywords in a query. They ignore the fact that the internal content of a web page is often not self-contained. Since such solutions index web pages solely on the basis of their internal content, the filtering results generated from such limited content cannot be of satisfactory quality.

Typically, during a user's Web navigation, the contextual information of a specific web page (e.g., its domain, directory, and navigational hyperlinks from other pages to this one) is also in the mind of the user and provides an important indication of the content of the web page. However, in the prior art, this contextual information has not been sufficiently utilized.

The content-based approach handles the Web as a traditional document repository; the special characteristics of the Web and of web pages, such as contextual information, are not exploited for web page filtering. The textual content of a web page alone is insufficient for highly accurate web page filtering.

For the PageType-based approach, although some structural characteristics of a web page are used for web page filtering, the hyperlink information in the Web is not considered. Since the link structure of a collection of hyperlinks reflects people's implicit recommendations about the target web page, it should contribute to improving the quality of the web page filtering results.

The hyperlink information in the Web is used in the link-based and extended anchor based approaches, but it is not exploited to its full potential. For the link-based approach, the assumed random surfer's clicks on links might not in fact be random: users also rely on anchor text to guide their browsing. Therefore, besides the number of in-links and their weighting, the anchor text appearing in the navigation path also provides an important indication of the destination web page. In the extended anchor based approach, however, only the anchor text is considered for web page filtering; the text in the page title, the URL text, and even the domain or host also provide important indications of the content of the web page, but are not taken into account.

SUMMARY OF THE INVENTION

In view of the above deficiencies in the prior art, the present invention provides a web page filtering method and system that can solve the technical problems present in the prior art and improve the quality of web page filtering results.

According to one aspect of the present invention, there is provided a method for web page filtering, which comprises: obtaining all web pages in one or more web page collections; collecting information on the links among the obtained web pages; extracting, based on the collected links, a set of navigation paths for each of the obtained web pages; and filtering the obtained web pages based on the extracted set of navigation paths to obtain the desired web pages. The navigation path is a list combining URLs, anchor texts and web page titles, and the contents and domain names, for the web pages on the path from a top web page to a target web page. In some embodiments, a web page collection can be a domain, a sub-domain or a directory. Preferably, in order to implement more accurate and effective web page filtering, the set of navigation paths can be extracted from only the navigation links instead of all the links among the web pages. Therefore, in some embodiments, the collected set of links needs to be filtered, before or during the extraction of the set of navigation paths, to obtain the navigation links, which are then used to obtain the desired set of navigation paths. Also preferably, the web page filtering can be subject-relevant web page filtering.
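
The extraction and filtering steps of the method can be sketched roughly as follows. This is a hedged illustration, not the claimed implementation: it assumes the pages and links have already been collected into dictionaries, records only URL, anchor text and title for each step (page contents and domain names are omitted), uses simple substring matching as the subject test, and leaves the navigation-link test as a caller-supplied hypothetical predicate `is_navigation`, because this summary does not say how navigation links are recognized.

```python
from typing import NamedTuple

class Step(NamedTuple):
    url: str
    anchor: str  # anchor text of the link followed to reach this page ("" for the top page)
    title: str

def navigation_paths(links, titles, top, target, max_depth=5):
    """All loop-free paths from top to target.

    links: {source_url: [(anchor_text, dest_url), ...]}; titles: {url: title}.
    max_depth is an assumed cut-off to keep the search bounded.
    """
    paths = []

    def walk(url, path, visited):
        if url == target:
            paths.append(path)
            return
        if len(path) > max_depth:
            return
        for anchor, dest in links.get(url, []):
            if dest not in visited:
                step = Step(dest, anchor, titles.get(dest, ""))
                walk(dest, path + [step], visited | {dest})

    walk(top, [Step(top, "", titles.get(top, ""))], {top})
    return paths

def path_text(path):
    """Concatenate the URL, anchor text and title of every step on a path."""
    return " ".join(f"{s.url} {s.anchor} {s.title}" for s in path).lower()

def filter_by_subject(links, titles, top, keywords,
                      is_navigation=lambda src, anchor, dest: True):
    """Select pages for which at least one navigation path mentions a keyword."""
    # Drop non-navigation links first (the predicate is a placeholder assumption).
    nav_links = {src: [(a, d) for a, d in outs if is_navigation(src, a, d)]
                 for src, outs in links.items()}
    selected = []
    for page in titles:
        paths = navigation_paths(nav_links, titles, top, page)
        if any(k.lower() in path_text(p) for p in paths for k in keywords):
            selected.append(page)
    return selected
```

In this sketch a page can be selected even if its own title and URL never mention the subject, as long as some path leading to it does.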

According to another aspect of the present invention, there is provided a system for web page filtering, which comprises: a web page obtaining means for obtaining all web pages in one or more web page collections; a link information collecting means for collecting information on the links among the obtained web pages; a navigation path extracting means for extracting, based on the collected links, a set of navigation paths for each of the obtained web pages; and a web page filtering means for filtering the obtained web pages based on the extracted set of navigation paths to obtain the desired web pages. The navigation path is a list combining URLs, anchor texts and web page titles, and the contents and domain names, for the web pages on the path from a top web page to a target web page. In some embodiments, a web page collection can be a domain, a sub-domain or a directory. Preferably, in order to implement more accurate and effective web page filtering, the navigation path extracting means can extract the set of navigation paths from only the navigation links instead of all the links among the web pages. Therefore, in some embodiments, the collected set of links needs to be filtered, before or during the extraction of the set of navigation paths, to obtain the navigation links, which are then used to obtain the desired set of navigation paths. Also preferably, the web page filtering means can perform subject-relevant web page filtering on the web pages.

According to the present invention, the navigation paths of the web pages are extracted as contextual information for the corresponding web pages and are indexed with each of the web pages to generate an index table. As such, not only the link structure of the web pages but also all the text that may guide the user's navigation in the Web is exploited for high-quality web page filtering.
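
One possible shape for such an index table is an inverted index from terms in the navigation-path text to the pages those paths lead to. The sketch below assumes exactly that; the structure of the index table is not specified in this summary, so the layout, the names, and the all-terms-must-match lookup are illustrative assumptions.

```python
from collections import defaultdict

def build_index(path_texts):
    """Build {term: {page_url, ...}} from the text of each page's navigation paths.

    path_texts: {page_url: [text_of_path_1, text_of_path_2, ...]}.
    """
    index = defaultdict(set)
    for page, texts in path_texts.items():
        for text in texts:
            for term in text.lower().split():
                index[term].add(page)
    return index

def lookup(index, query):
    """Pages whose navigation paths together contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Because the terms of all paths to a page are pooled under that page, a query can be satisfied by words spread across different paths.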

Furthermore, a given web page may have multiple navigation paths, each of which may have been designed by an author who makes his web pages point to this page. If the text appearing in each navigation path is regarded as a kind of summary or statement about the content of the target web page from a specific aspect, the multiple points of view of multiple authors or contexts are reflected in this set of navigation paths, which can guarantee the objectivity of the web page filtering.

Furthermore, each navigation path relates to information that is not restricted to one web page but encompasses a set of related web pages. From an ontological point of view, the hyperlink graph of the Web implies many statements, directly or indirectly, in which the subject is the source page, the predicate is the anchor text, and the object is the destination page pointed to. On this basis, semantic inference functionality can potentially be incorporated into the web page filtering process.
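
The statement view of the hyperlink graph can be made concrete with a small sketch that turns each link into a (subject, predicate, object) triple. The data layout and function names are assumptions for illustration; no inference step is shown, since none is detailed here.

```python
def link_statements(links):
    """Turn each hyperlink into a (source page, anchor text, destination page) triple.

    links: {source_url: [(anchor_text, dest_url), ...]}.
    """
    return [(src, anchor, dest)
            for src, outs in links.items()
            for anchor, dest in outs]

def statements_about(statements, page):
    """What other pages 'say' about a page: (source page, anchor text) pairs."""
    return [(subj, pred) for subj, pred, obj in statements if obj == page]
```

Collecting the statements made about one destination page gathers the descriptions given by different source pages in one place.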



Industry Class:
Data processing: database and file management or data structures
