freshpatentsnav7small (2K)

6

views for this patent on FreshPatents.com
updated 06/14/13

    Free Services  

  • MONITOR KEYWORDS
  • Enter keywords & we'll notify you when a new patent matches your request (weekly update).

  • ORGANIZER
  • Save & organize patents so you can view them later.

  • RSS rss
  • Create custom RSS feeds. Track keywords without receiving email.

  • ARCHIVE
  • View the last few months of your Keyword emails.

  • COMPANY PATENTS
  • Patents sorted by company.

System and method for identifying hierarchical heavy hitters in multi-dimensional data   

pdficondownload pdfimage preview


Abstract: A method including receiving a plurality of elements of a data stream, storing a multi-dimensional data structure in a memory, said multi-dimensional data structure storing the plurality of elements as a hierarchy of nodes, each node having a frequency count corresponding to the number of elements stored therein, comparing the frequency count of each node to a threshold value based on a total number of the elements stored in the nodes and identifying each node for which the frequency count is at least as great as the threshold value as a hierarchical heavy hitter (HHH) node and propagating the frequency count of each non-HHH nodes to its corresponding parent nodes. ...


USPTO Applicaton #: #20090292726 - Class: 707103 Y (USPTO) - 11/26/09 - Class 707 

view organizer monitor keywords


The Patent Description & Claims data below is from USPTO Patent Application 20090292726, System and method for identifying hierarchical heavy hitters in multi-dimensional data.

pdficondownload pdf

INCORPORATION BY REFERENCE

The entire disclosure of U.S. patent application Ser. No. 10/802,605, entitled “Method and Apparatus for Identifying Hierarchical Heavy Hitters in a Data Stream” filed Mar. 17, 2004 is incorporated, in its entirety, herein. The entire disclosure of U.S. Provisional Patent Appln. 60/560,666, entitled “Diamond in the Rough: Finding Hierarchical Heavy Hitters in Multi-Dimensional Data” filed Apr. 8, 2004 is incorporated, in its entirety, herein.

BACKGROUND

Aggregation along hierarchies is a critical data summarization technique in a large variety of online applications, including decision support (e.g, online analytical processing (OLAP)), network management (e.g., internet protocol (IP) clustering, denial-of-service (DoS) attack monitoring), text (e.g., on prefixes of strings occurring in the text), and extensible markup language (XML) summarization (i.e., on prefixes of root-to-leaf paths in an XML data tree). In such applications, data is inherently hierarchical and it is desirable to monitor and maintain aggregates of the data at different levels of the hierarchy over time in a dynamic fashion.

A heavy hitter (HH) is an element of a data set having a frequency which is greater than or equal to a user-defined threshold. A conventional algorithm for identifying the HHs in the data set maintains a summary structure which allows the frequencies of the elements to be estimated within a pre-defined error bound. The conventional HH algorithm, however, did not account for any hierarchy in the data set. It is also possible to store information for each node in a hierarchy and calculate HHs based on this information. However, the storing of data for all nodes and the amount of calculation is prohibitive. In addition, this method provides superfluous results. A need exists for identifying hierarchical heavy hitters (“HHHs”) in data sets having multiple dimensions.

SUMMARY

OF THE INVENTION

A method including receiving a plurality of elements of a data stream, storing a multi-dimensional data structure in a memory, said multi-dimensional data structure storing the plurality of elements as a hierarchy of nodes, each node having a frequency count corresponding to the number of elements stored therein, comparing the frequency count of each node to a threshold value based on a total number of the elements stored in the nodes and identifying each node for which the frequency count is at least as great as the threshold value as a hierarchical heavy hitter (HHH) node and propagating the frequency count of each non-HHH nodes to its corresponding parent nodes.

A system which includes a receiving element receiving a plurality of elements of a data stream, a storage element storing a multi-dimensional data structure in a memory, said multi-dimensional data structure storing the plurality of elements as a hierarchy of nodes, each node having a frequency count corresponding to a number of elements stored therein, a comparator element comparing the frequency count of each node to a threshold value based on a total number of the elements stored in the nodes, wherein, when the frequency count is at least as great as the fraction, the node is identified as a hierarchical heavy hitter (HHH) node and a propagation element propagating the frequency count of each non-HHH node to its corresponding parent nodes.

A computer readable storage medium including a set of instructions executable by a processor, the set of instructions operable to receive a plurality of elements of a data stream, store a multi-dimensional data structure in a memory, said multidimensional data structure storing the plurality of elements as a hierarchy of nodes, each node having a frequency count corresponding to a number of elements stored therein, compare the frequency count of each node to a threshold value based on a total number of the elements stored in the plurality of nodes, wherein, when the frequency count is at least as great as the threshold value, the node is identified as a hierarchical heavy hitter (HHH) node and propagate the frequency count of each non-HHH node to its corresponding parent nodes.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows an exemplary two-dimensional (“2-D”) data structure.

FIGS. 2A-B shows an exemplary embodiment of a portion of a data structure for the purpose of demonstrating an exemplary frequency count propagation according to the present invention.

FIG. 3 shows an exemplary method for inserting and compressing data elements in a summary data structure for identifying HHHs in a data structure implementing the overlap case for streaming data according to the present invention.

FIG. 4 shows an exemplary method for identifying HHHs in a data structure implementing the overlap case for streaming data according to the present invention.

DETAILED DESCRIPTION

The present invention may be further understood with reference to the following description and the appended drawings, wherein like elements are referred to with the same reference numerals. The exemplary embodiment of the present invention describes a method for identifying hierarchical heavy hitters (“HHHs”) in a multidimensional data structure. The multidimensional data structure and methods for identifying the HHHs therein will be discussed in detail below.

In the exemplary embodiments, the exemplary hierarchical data is described as data representing IP addresses in IP traffic data. The IP addresses are by their nature hierarchical, i.e., each individual address is arranged into subnets, which are within networks, which are within the IP address space. Therefore the collection of multiple data points based on IP addresses, and the generalization of these IP addresses, will result in a hierarchical data structure. The concept of generalization will be described in greater detail below.

However, those of skill in the art will understand that the use of IP addresses is only exemplary and that the present invention may be applied to any type of data which may be represented hierarchically. Other examples of hierarchical data include data collected based on time (e.g., hour, day, week, etc.) or data collected based on location (e.g., city, county, state, etc.). This type of data may also be stored, arranged and viewed in a hierarchical manner.

The hierarchical data may be static or streamed data and the exemplary embodiments of the present invention may be applied to either static or streamed data. For example, the data collected in the IP traffic scenario may be considered streaming data because new data points are continually being added to the set of data points in the data structure. Thus, determining HHHs may be continuous as the data changes. However, it is also possible to take a snapshot of the data at a particular point in time (static data) and perform the HHH analysis on this static data. An example of static hierarchical data may be sales information which is based on time and location. This information may be collected and stored for analysis at a later time. Again, there are any number of examples of hierarchical data that may be streaming, static or either depending on the data collection methods.

The general purpose of collecting and storing this data is to mine the data to determine patterns and information from the data. For example, if a specific IP address (or range of IP addresses in the hierarchy) is receiving an unusually high amount of traffic, this may indicate a denial of service attack on the network. In another example, a specific region may show a high number of sales at a particular time indicating that additional salespeople should be staffed at these times. These high traffic points or paths will be indicated by identifying HHHs in the data structure.

U.S. patent application Ser. No. 10/802,605, entitled “Method and Apparatus for Identifying Hierarchical Heavy Hitters in a Data Stream” filed Mar. 17, 2004 which is incorporated by reference, in its entirety, herein, describes exemplary methods for identifying HHH\'s in a one-dimensional hierarchical data structure. The exemplary embodiment of the present invention is directed at identifying HHHs in multi-dimensional data structures. These multi-dimensional data structures present problems for identifying HHHs that are not present in a one-dimensional data structure. For example, one-dimensional data structures do not present the issue of common ancestors that multi-dimensional data structures present (e.g., a child node having two parent nodes with one common grandparent node). The exemplary embodiments will provide solutions for the unique issues presented for identifying HHHs in multi-dimensional data structures.

Initially, FIG. 1 shows an exemplary two-dimensional (“2-D”) data structure 1 for which exemplary embodiments of the present invention may be used to determine HHHs. The description of data structure 1 will include terminology and notations that are presented in the formulations that follow. The exemplary 2-D data structure 1 may be used to model two dimensional data associated with IP traffic data. In this example, the data is considered two dimensional because there are two attributes which are being used to populate the data structure, i.e., the source address and the destination address. Those of skill in the art will understand that additional dimensions may be added to the data structure by collecting and storing additional information. For example, if the port numbers associated with the source and destination addresses and a time attribute were collected and stored, a data structure with five (5) dimensions could be created. Thus, even though described with reference to a 2-D data structure, the exemplary embodiments of the present invention may be applied to any multi-dimensional data structure.

A typical 32 bit source and destination IP address is in the form of “xxx.xxx.xxx.x” with each octet (8 bits) of data (e.g., xxx) representing a sub-attribute of the attribute. Thus, in the example of data structure 1, each level of the hierarchy may be considered to correspond to an octet of the IP address, wherein the source address attribute is represented as 1.2.3.4 and the destination address attribute is represented as 5.6.7.8.

The data structure 1 models the collected data as N d-dimensional tuples. A tuple refers to a collection of one or more attributes. As shown in FIG. 1, each node 5-125 of data structure 1 is a tuple. Thus, throughout this description, the terms node and tuple may be used interchangeably to describe a collection of one or more attributes. The maximum depth of the ith dimension is defined as hi. In this example, N is the total number of data points collected (e.g., the number of hits for the particular source/destination nodes in the data structure), d=2 for the two dimensional attribute data (e.g., source address, destination address) and h1=h2=4 since each of the attributes have four sub-attributes.

The generalization of any element on an attribute means that the element is rolled up one level in the hierarchy of that attribute. For example, the generalization of the IP address pair 1.2.3.4, 5.6.7.8 (shown as node 5) on the second attribute is 1.2.3.4, 5.6.7.* (shown as node 10). An element is fully general on an attribute if it cannot be generalized further. In the data structure 1, this generalization is denoted by the symbol *. For example, the pair *, 5.6.7.* (shown as node 95) is fully general on the first attribute, but not the second. The root node 125 is fully general. Thus, the act of generalizing over a defined set of hierarchies generates a hierarchical lattice structure as shown by data structure 1.

Each node in the data structure 1 may be labeled with a vector length d whose ith entry is a non-negative integer that is at most hi, indicating the level of generalization of the node. For example, the pair (1.2.3.4, 5.6.7.8) is at a generalization level [4,4] (node 5), the pair (*, 5.6.7.*) is at [0,3] (node 95) and the pair (1.2.*, 5.*) is at [2,1] (node 85). The parents of any node are those nodes where one attribute has been generalized in one dimension. For example, the parents of a node at level [4,4] (node 5) are at levels [3,4] (node 15) and [4,3] (node 10). A node that has one attribute that is fully generalized will only have a single parent, e.g. node 95 at level [0,3] has only one parent node 100 at level [0,2] because the first attribute is fully generalized. For notation purposes, a parent of any element e may be referred to as par(e).

Two nodes are comparable if every attribute and the specified portion of the label of one node is a prefix of the other on every attribute. For example, a node having level [3,4] is comparable to a node having level [3,2]. In contrast, a node at level [3,4] is not comparable to a node at level [4,3]. A Level(i) is the ith level in the data structure corresponding to the sum of the values in the level label. For example, Level(8)=[4,4] (node 5); Level(5)=[1,4] (node 50), [2,3] (node 45), [3,2] (node 40) and [4,1] (node 35); and Level(0)=[0,0] (node 125). No pair of nodes with a distinct label in a particular level (e.g., Level(5)) are comparable. These nodes are described as forming an anti-chain. Other nodes which are not comparable can also form an anti-chain. For example, consider labels [2,2] and [1,4] with prefixes (1.2.*, 5.6.*) and (1.*, 5.6.7.8), respectively. The total number of levels in any data structure is given by L=1+Σihi. Thus, in the example of data structure 1, L=1+4+4=9.

Finally, a sub-lattice of an element (e) is defined as the set of elements which are related to e under the closure of the parent relation. For example, elements (1.2.3.4,5.6.7.8), (1.2.3.8, 5.6.4.5) and (1.2.3*,5.6.8.*) are all in sub-lattice (1.2.3.*, 5.6.*). Thus, the sub-lattice of a set of elements P is defined as sub-lattice(P)=∪pεPsub-lattice(p). It should be noted that in the above example, element (1.2.3.8,5,6.4.5) is in a sub-lattice of (1.2.3.*,5.6.*) and (1.2.3.*,5.6.8.*) is in a sub-lattice of (1.2.3.*,5.6.*), but (1.2.3.8,5.6.4.5) and (1.2.3.*, 5.6.8.*) are in separate sub-lattices.

As elements are collected and nodes are added to the data structure 1, a frequency count is incremented which represents an occurrence of data at the node. The general problem of finding HHHs is to find all items in the structure whose frequency count exceeds a given fraction φ of the total data points. In a one-dimensional data structure, the propagation of frequency counts is fairly straightforward, i.e., add the count of a rolled up node to its one and only parent. However, in the multi-dimensional case, it is not readily apparent how to compute the frequency counts at various nodes within the data structure 1 because, for example, each node may have two or more parents.

In a first exemplary embodiment, referred to as the overlap case, the frequency count for any child node is passed to all its parents, except where the child node has been identified as an HHH. However, as will be described in greater detail below, there are subtleties to the overlap case which prevents overcounting due to the roll up of frequency counts to both parents, In the overlap case, an HHH is defined as follows: Given a set S of elements e having corresponding frequency counts fe and L=Σihi. An HHH may be defined inductively based on a threshold φ, HHHL contains all heavy hitters eεS such that fe≧└φN┘. The overlap count of an element p at Level(l) in the lattice where l<L is given by f′(p)=Σfe: eεS∩{sub-lattice(p)−Sub-lattice(HHH1+L)}. The set HHHl is defined as the set HHHl+L∪{pεLevel(l)̂f′(p)}≧└φN┘. The HHHs in the overlap case for the set S is the set HHH0.

The methods described herein may be implemented on any computing device which samples and/or processes data in an online or offline state. For example, the computing device may include a central processing unit (CPU), a memory, an input/output (I/O) interface, etc. The I/O interface may be adapted to receive a data stream from a source, such as a network, database, server, etc. The memory may store all or portions of one or more programs and/or data to implement the described methods. In addition, the methods may be implemented in hardware, software, or a combination thereof.

FIG. 3 shows an exemplary method 400 for inserting and compressing data elements in a summary data structure for identifying HHHs in a data structure implementing the overlap case for streaming data. As would be understood by those of skill in the art, streaming data means that new data will be continuously added to the data set. Thus, for a streaming case, it is very important that any methods for determining HHHs have a minimal processing time so that the results are current. In addition, since new data is being continuously added, the method 400 compresses the data to eliminate certain data which may be omitted for the purposes of calculating the set of HHHs. While it is possible to maintain multiple independent data structures and information for every label in a lattice data structure in order to calculate the HHHs for a particular point in the lattice, this becomes very expensive in terms of storage space and computation time.

Thus, the method 400 presents a single data structure that summarizes the whole lattice. This allows for an approximation of the HHHs for the data structure in a single pass (within a defined error amount). The method 400 uses a very small amount of storage space and updates the set of HHHs as the data stream unravels. More specifically, a summary structure T consisting of a set of nodes that correspond to samples from the input stream is maintained. Each node teεT consists of an element e from the lattice and a bounded amount of auxiliary information.

In the 2-D summary data structure example, auxiliary information fe, Δe, ge and me are maintained, where: fe is a lower bound on the total count that is straightforwardly rolled up (directly or indirectly) into e, Δe is the difference between an upper bound on the total count that is straightforwardly rolled up into e and the lower bound fe, ge is an upper bound on the total compensating count, based on counts of rolled up grandchildren of e, and me=max (fd(e)−gd(e)+Δd(e)), over all descendants d(e) of e that have been rolled up into e.

Referring to FIG. 3, the method 400 begins with step 405 where the user supplies an error parameter ε. As described above, the method 400 will take one pass through the summary data structure and approximate the HHHs for the streamed data using a minimal amount of storage space and computation time. The approximation of the HHHs is based on the user supplied error parameter. From the following description and formulations, those of skill in the art will understand that as a user specifies tighter error tolerances, the storage space and computation time requirements may increase. Each user will select an error parameter that suits the particular application. In step 410, the input stream is conceptually divided into buckets of width (w=┌1/ε┐). The current bucket number is defined as bcurrent=└εN┘.

The method will then go through two alternating phases of insertion and compression. The following steps are related to the insertion phase. In step 415, an element is received from the data stream. In step 420 it is determined if the node te exists for the element in the summary data structure T. If the node te exists, the process continues to step 425 where the fe count of the node is updated and the process loops back to step 415 to retrieve the next element in the stream.

If it was determined in step 420 that the node te did not exist, the process continues to step 430 where a new node te is created for the element and the auxiliary information fe; Δe, ge and me values are stored in the newly created node. Specifically, fe=f of the element, ge is set to 0 and Δe=me=bcurrent−1. However, then the two parent elements (if they exist in the data structure) are also used estimate the values of the auxiliary information. Specifically, if the left parent exists and mlpar(e)<me, then Δe=me=mlpar(e). Similarly, if the right parent exists and mrpar(e)<me, then Δe=me=mrpar(e).

This completes the insertion phase of the method 400. The following is exemplary pseudo code for the insertion process:

Insert (e,f): 01 if te exists then fe + = f; 02 else { 03 if (lpar(e) in domain) then Insert (lpar(e), 0); 04 if (rpar(e) in domain) then Insert (rpar(e), 0); 05 create te with (fe = f, ge = 0); 06 Δe = me = b current −1; 07 if (lpar(e) in domain) and (mlpar(e) <me) { 08 Δe = me = m lpar(e); } 09 if (rpar(e) in domain) and (mrpar(e) <me) { 10 Δe = me = mrpar(e) ; }}

The following steps are related to the compression phase of the method 400. In step 435, fringe nodes are identified. A fringe node is one that does not have any descendants. The compression phase of the method is iterative and is carried out for each of the identified fringe node. For each of the identified fringe nodes, in step 440, it is determined whether the upper bound on the total count is larger than the current bucket number, i.e., is fe−ge+Δe, <bcurrent.

If the total count is less than the current bucket number, the fiinge node is deleted as part of the compression step 445. However, since the node is deleted, the auxiliary values of the parent elements also need to be updated in the compression step 445. The updating will be described with reference to the left parent, but the same process will be carried out for the right parent. If the left parent exists, the flpar(e) is updated using the f and g, of the deleted node, i.e. flpar(e)+=fe−ge. Similarly, mlpar(e) is updated in the form mlpar(e)=max(mlpar(e), fe−ge+Δe). Finally, it is determined if the left parent has become a fringe node as a result of the deletion of the originally scanned node. If it has become a fringe node, it will be an analyzed node in the iterative compression phase. As described above, the same process will be carried out for the right parent. In addition, the compression step also reduces the compensating count of the common grandparent (ggpar(e)) by the value fe−ge to account for possible overcounting.

For non-fringe nodes in the summary structure T, the compensating count ge is speculative and is not taken into account for estimating the upper bound on the total count (e.g., upper bound=fe+Δe). However, for fringe nodes of the summary structure, ge is no longer speculative and a tighter upper bound can be obtained using fe-ge+Δe. As described above, it is this tighter upper bound that is used to determine the fringe nodes to be compressed.

FIGS. 2A-2B depict a portion 300 of the 2-D data structure 1 initially shown in FIG. 1. The portion 300 will be used to show an example of propagating frequency counts in the compression phase of the streaming overlap case. The portion 300 of the data structure 1 shows a diamond property that is a region of the lattice corresponding to an inclusion-exclusion principle to prevent overcounting frequency counts. The example shows the principle of having a compensating count ge for the common grandparent depicted at the top of the diamond structure. For the purpose of this example, the node 5 of portion 300 in FIGS. 2A-B will be referred to as a child node, nodes 10 and 15 will be referred to as parent nodes and nodes 20-30 will be referred to as grandparent nodes with node 25 being referred to as the common grandparent node.

As shown in FIG. 2A, the exemplary frequency count [4] of the child node 5. As described above, this frequency count should be propagated to the frequency counts of parent nodes 10 and 15. The initial frequency count [0] (shown in FIG. 2A) of each parent node 10 and 15 becomes [4] (shown in FIG. 2B) after the frequency count [4] from the child node 5 is propagated. It should be noted that the initial frequency count of [0] is only exemplary and may be any value based on the actual monitored data.

However, the frequency count [4] of the child node 5 is also subtracted from the frequency count [0] of the common grandparent node 25. The initial frequency count [0] (shown in FIG. 2A) of the common grandparent node 25 becomes [−4] 25 (shown in FIG. 2B) after the frequency count [4] of the child node 5 is subtracted therefrom. As described above, the [−4] frequency count of the common grandparent node may be considered the compensating count so that when the frequency counts of the parent nodes 10 and 15 are each propagated to the common grandparent node 25, the frequency count will be equal to [4] (−4+4+4=4). Without implementing compensating count, propagation of the frequency count [4] of the child node 5 would result in the frequency count [8] of the common grandparent node 25. This overcounting would lead to erroneous determinations of HHH nodes in the 2-D data structure 1.

This completes the compression phase of the method 400. The following is exemplary pseudo code for the compression process:

Compress: 01 for each te in fringe do { 02 if (fe + Δe ≦ bcurrent) { 03 if (lpar(e) in domain) { 04 fl par (e) + = fe −ge

Download full PDF for full patent description/claims.




You can also Monitor Keywords and Search for tracking patents relating to this System and method for identifying hierarchical heavy hitters in multi-dimensional data patent application.
###
monitor keywords

Other recent patent applications listed under the agent :



Keyword Monitor How KEYWORD MONITOR works... a FREE service from FreshPatents
1. Sign up (takes 30 seconds). 2. Fill in the keywords to be monitored.
3. Each week you receive an email with patent applications related to your keywords.  
Start now! - Receive info on patent apps like System and method for identifying hierarchical heavy hitters in multi-dimensional data or other areas of interest.
###


Previous Patent Application:
Acquisition and presentation of data indicative of an extent of congruence between inferred mental states of authoring users
Next Patent Application:
Acquisition and particular association of data indicative of an inferred mental state of an authoring user
Industry Class:
Data processing: database and file management or data structures

###

FreshPatents.com Support - Terms & Conditions
Thank you for viewing the System and method for identifying hierarchical heavy hitters in multi-dimensional data patent info.
- - - AAPL - Apple, BA - Boeing, GOOG - Google, IBM, JBL - Jabil, KO - Coca Cola, MOT - Motorla

Results in 0.91902 seconds


Other interesting Freshpatents.com categories:
Exxonmobil Chemical Company , Intel , g2