Estimation and use of access plan statistics -> Monitor Keywords
Fresh Patents
Monitor Patents Patent Organizer File a Provisional Patent Browse Inventors Browse Industry Browse Agents Browse Locations
site info Site News  |  monitor Monitor Keywords  |  monitor archive Monitor Archive  |  organizer Organizer  |  account info Account Info  |  
02/28/08 - USPTO Class 707 |  68 views | #20080052269 | Prev - Next | About this Page  707 rss/xml feed  monitor keywords

Estimation and use of access plan statistics

USPTO Application #: 20080052269
Title: Estimation and use of access plan statistics
Abstract: In processing a query including a selection criterion on one or more attributes of a relation, a prior statistic generated for a prior different selection criterion on the same one or more attributes of the relation, may be revalidated for use in processing the query, based upon a measure of the entropy of the one or more attributes of the relation. In this way, the re-validation of statistics may be performed more efficiently. Furthermore, attribute groups of a relation for which multi-dimensional indexes are to be formed, are identified by evaluating the correlation of attribute values within tuples of the relation and determining that the correlation of attribute values within tuples of the relation exceeds a threshold. (end of abstract)



Agent: Wood, Herron & Evans, L.L.P. (ibm) - Cincinnati, OH, US
Inventors: Abdo Esmail Abdo, Larry Wayne Loen
USPTO Applicaton #: 20080052269 - Class: 707002000 (USPTO)

Related Patent Categories: Data Processing: Database And File Management Or Data Structures, Database Or File Accessing, Access Augmentation Or Optimizing

Estimation and use of access plan statistics description/claims


The Patent Description & Claims data below is from USPTO Patent Application 20080052269, Estimation and use of access plan statistics.

Brief Patent Description - Full Patent Description - Patent Application Claims
  monitor keywords

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application is a divisional of U.S. patent application Ser. No. 10/017,783 filed on Dec. 13, 2001 by Abdo Esmail Abdo et al., the entire disclosure of which is incorporated by reference herein.

FIELD OF THE INVENTION

[0002] This invention generally relates to a database management system performed by computers.

BACKGROUND OF THE INVENTION

[0003] Statistics are frequently accumulated to describe data in a database, to facilitate accesses made to the data. For example, when a query seeks records meeting multiple selection criteria, the process of assembling the results may be made substantially more efficient by applying the selection criteria in an appropriate order. Ordering is important because the process of scanning a database for matching records is time consuming.

[0004] For example, consider a database table (otherwise known as a relation) including columns (otherwise known as attributes) identifying vehicle owners by name and address, and the make, model, model year and other information about their vehicles. An exemplary query into such a relation may seek rows (otherwise known as tuples) identifying the following attribute values: surname "Smith", city name "New York", and vehicle manufacturer "Packard"; that is, seeking New Yorkers named Smith who own Packard vehicles. This query involves forming the intersection or "AND" of the results of three selection criteria, "Smith", "New York", and "Packard". However, two of these three criteria are likely to produce a large number of intermediate results--specifically, the surname "Smith" is popular in the United States, and a large number of individuals reside in "New York". However, "Packard" is a relatively rare vehicle manufacturer. Therefore, this query would be most efficiently performed by selecting those rows having the value "Packard" for the vehicle manufacturer attribute, which will be a relatively small set, and then identifying those rows that also have the surname "Smith" and city "New York". A far less efficient approach would be to identify all persons with the surname "Smith" or addresses in the city "New York", and then selecting from either large set those rows meeting the remaining selection criteria.

[0005] To attempt to optimize query processing, modern database software often generates statistics prior to executing a query, to estimate the likely size of the solution sets that will be generated from each selection criterion in a query. One way of forming these statistics is from indexes such as the encoded vector index (EVI), disclosed U.S. Pat. No. 5,706,495, Chadha et al., Jan. 6, 1998, "Encoded-Vector Indices For Decision Support and Warehousing", which is incorporated by reference. Other forms of indexes are used in other circumstances, as is found to be efficient for the particular type of data in use. Typically, statistics are generated using an index and the specific selection criterion of the query being processed. For example, in the above case a statistic would be generated indicating the approximate number of rows having a surname of "Smith", a city of "New York", and a manufacturer of "Packard". These statistics would ideally show that "Packard" is the most selective criterion and should be used first.

[0006] After collecting the needed statistics for a given query, those statistics may be cached for later re-use. For example, subsequent queries re-using the selection criterion of the city of "New York", could re-use the statistic previously generated for that same criterion. However, in order to re-use statistics, those statistics must be validated for the criterion with which they are to be used. The number of rows having a city name of "Brainerd" might be substantially less than the number having a city name of "New York". Thus, statistics generated for a first selection criterion on a given attribute, cannot be reused for a second selection criterion but rather must be re-validated by re-accessing the associated index. Unfortunately, the time required to access indexes to generate statistics can be a substantial fraction of total query optimization time; thus, re-validation of statistics represents a substantial loss of efficiency in database processing.

[0007] It will be appreciated that indexes may be single-dimensional or multi-dimensional. Thus, in the above example, there may be an index formed over both the surname and city attributes, in which case the query optimizer can generate an accurate statistic for the number of rows that will meet the criteria for both the surname and city set forth in a query. In the absence of a multi-dimensional index of this kind, the query optimizer will need to estimate, from the number of records with the surname "Smith", and the number of records with the city "New York", the number of records that will meet both criteria, which is typically done by assuming that the same proportion of persons in "New York" are named "Smith" as there are persons named "Smith" in the database as a whole. The resulting statistic is only as accurate as the underlying assumption that the distribution of persons named "Smith" is roughly similar for all cities. Another way to state this assumption, is that there is no correlation between the surname and city attributes, or that these attributes are "independent". Unfortunately, an assumption of independence of attributes is often incorrect. For example, surname and city are not likely to be independent (because surnames are often reflective of ethnic heritage, and cities have varied ethnic backgrounds). When there is a correlation between attributes of a database, statistics generated using an assumption of independence of those attributes will not be accurate. In such cases, a multi-dimensional index will prove useful, in that statistics formed using a multi-dimensional index for those attributes, will be more accurate than those formed using separate one-dimensional indexes.

[0008] Highly correlated attributes raise particular difficulties. Specifically, records in the exemplary database discussed above, may include a vehicle manufacturer and vehicle model attribute. These fields will be highly correlated, in that vehicle model names are typically used by a single manufacturer. For example, the model name "Escort" has been used by the manufacturer "Ford" whereas the model name "Camry" has been used by the manufacturer "Toyota". Such strong correlation will lead to dramatic over- or under-estimation of statistics by a query optimizer. A query optimizer generating statistics for a query seeking vehicles manufactured by "Toyota", and named "Camry", will dramatically underestimate the number of results if those attributes are assumed to be uncorrelated. A query optimizer generating statistics for a query seeking vehicles manufactured by "Toyota" or vehicles named "Camry", will dramatically overestimate the number of results if those attributes are assumed to be uncorrelated. Further information on these issues can be found in U.S. Pat. No. 5,995,957, Beavin et al., Nov. 30, 1999, "Query Optimization Through the Use of Multi-column Statistics to Avoid the Problems of Column Correlation", which is incorporated by reference.

[0009] While the above demonstrates the usefulness of multi-dimensional indexes, it is often not practical to form multi-dimensional indexes for every possible combination of attributes that might be used in a query, because the resources consumed in generating and storing the indexes will exceed the efficiencies achieved through their use. For this reason, in the past methods have been used to identify attribute sets that are likely candidates for inclusion in a multi-dimensional index. For example, one known method is to monitor the queries that use multiple attributes from a single relation, to identify attribute pairs that are frequently used together, so that multi-dimensional indexes may be formed on these attributes. For further details on this method, see U.S. Pat. No. 5,899,986, Ziauddin, May 4, 1999, "Methods for Collecting Query Workload Based Statistics on Column Groups Identified by RDBMS Optimizer", which is incorporated herein by reference. Unfortunately, this method may form multi-dimensional indexes on attributes that are independent merely because they are often used together in queries, while not forming indexes on attribute pairs that have high correlation simply because they are used infrequently together. Thus, this method has limited value in increasing the efficiency of queries.

[0010] Accordingly, new ways to use statistics and to identify appropriate attributes for multi-dimensional indexes, are needed in order to continue to provide significant improvements in query performance; otherwise, database users will be hampered in their ability to maximize intelligent information retrieval.

SUMMARY OF THE INVENTION

[0011] In accordance with principles of the present invention, these needs are met through the use of a method for revalidating previously generated statistics generated for an attribute. Specifically, in processing a query including a selection criterion on one or more attributes of a relation, a prior statistic generated for a prior different selection criterion on the same one or more attributes of the relation, may be revalidated for use in processing the query, based upon a measure of the entropy of the one or more attributes of the relation. In this way, the re-validation of statistics may be performed more efficiently.

[0012] In the described embodiment, a measure is generated for the entropy of the one or more attributes of a relation, by collecting a sample of tuples of the relation, and computing the frequencies of different values for the attributes in the sample, and then combining the measured frequencies into a measure of the entropy of the attributes.

[0013] In a second aspect, the invention features a method for identifying an attribute groups of a relation for which multi-dimensional indexes are to be formed, by evaluating the correlation of attribute values within tuples of the relation and determining that the correlation of attribute values within tuples of the relation exceeds a threshold.

[0014] In the described embodiment, a measure for the correlation of an attribute group is formed by comparing, for a common set of tuples, the sum of individual entropies of values of each attribute, to the joint entropy of the values of all attributes. The margin by which the sum of individual entropies of values of each attribute exceeds the joint entropy of the values of all attributes, for the same tuples, represents the information gain or amount of correlation in the group of attributes. The attribute groups that are found to have correlation are then evaluated to identify whether (a) specific attribute subgroup(s) are the primary source of correlation. This is achieved by comparing the information gain for the group of attributes, to the largest information gain of any sub-group of one fewer of the same attributes, over the same set of tuples. The margin by which the information gain of a group of attributes exceeds the largest information gain of any sub-group of one fewer of the same attributes, represents the mutual information gain of that group of attributes. A multi-dimensional index can then be formed for those attribute groups that are determined to have a substantial information gain and a substantial mutual information gain, thus indexing the smallest attribute groups in which correlation is identified. These groups may be found by an exhaustive search of all combinations of attributes of a relation, or alternatively by sampling a set of attribute groups and then evaluating other related groups of those found to have substantial correlation.

[0015] The above and other objects and advantages of the present invention shall be made apparent from the accompanying drawings and the description thereof.

BRIEF DESCRIPTION OF THE DRAWINGS

[0016] The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with a general description of the invention given above, and the detailed description of the embodiments given below, serve to explain the principles of the invention.

[0017] FIG. 1 is a block diagram of a computer system managing a database according to an embodiment of the present invention;

[0018] FIG. 2 is a flow chart of a process for generating entropy measures for individual or joint attributes in relations of a database;

[0019] FIG. 3A is a flow chart of a process of evaluating spaces and subspaces of attributes in a relation to identify those that are recommended for indexing, and FIG. 3B is an illustration of the spaces evaluated by the process of FIG. 3A;

Continue reading about Estimation and use of access plan statistics...
Full patent description for Estimation and use of access plan statistics

Brief Patent Description - Full Patent Description - Patent Application Claims

Click on the above for other options relating to this Estimation and use of access plan statistics patent application.
###
monitor keywords

How KEYWORD MONITOR works... a FREE service from FreshPatents
1. Sign up (takes 30 seconds). 2. Fill in the keywords to be monitored.
3. Each week you receive an email with patent applications related to your keywords.  
Start now! - Receive info on patent apps like Estimation and use of access plan statistics or other areas of interest.
###


Previous Patent Application:
Database management system and method which monitors activity levels and determines appropriate schedule times
Next Patent Application:
Method and apparatus for optimizing queries under parametric aggregation constraints
Industry Class:
Data processing: database and file management or data structures

###

FreshPatents.com Support
Thank you for viewing the Estimation and use of access plan statistics patent info.
IP-related news and info


Results in 0.17851 seconds


Other interesting Feshpatents.com categories:
Daimler Chrysler , DirecTV , Exxonmobil Chemical Company , Goodyear , Intel , Kyocera Wireless , 174
filepatents (1K)

* Protect your Inventions
* US Patent Office filing
patentexpress PATENT INFO