Methods and apparatus for clustering evolving data streams through online and offline components -> Monitor Keywords
Fresh Patents
Monitor Patents Patent Organizer File a Provisional Patent Browse Inventors Browse Industry Browse Agents Browse Locations
site info Site News  |  monitor Monitor Keywords  |  monitor archive Monitor Archive  |  organizer Organizer  |  account info Account Info  |  
09/27/07 - USPTO Class 707 |  80 views | #20070226209 | Prev - Next | About this Page  707 rss/xml feed  monitor keywords

Methods and apparatus for clustering evolving data streams through online and offline components

USPTO Application #: 20070226209
Title: Methods and apparatus for clustering evolving data streams through online and offline components
Abstract: A technique of clustering data of a data stream is provided. Online statistics are first created from the data stream. Offline processing of the online statistics is then performed when offline processing either required or desired. Online statistics may be created through the reception of data points from the data stream and the formation and updating of data groups. Offline processing may be performed by reclustering groups of data points around sampled data points and reporting the newly formed clusters. (end of abstract)



Agent: Ryan, Mason & Lewis, LLP - Locust Valley, NY, US
Inventors: Charu C. Aggarwal, Philip Shi-Lung Yu
USPTO Applicaton #: 20070226209 - Class: 707005000 (USPTO)

Related Patent Categories: Data Processing: Database And File Management Or Data Structures, Database Or File Accessing, Query Processing (i.e., Searching), Query Augmenting And Refining (e.g., Inexact Access)

Methods and apparatus for clustering evolving data streams through online and offline components description/claims


The Patent Description & Claims data below is from USPTO Patent Application 20070226209, Methods and apparatus for clustering evolving data streams through online and offline components.

Brief Patent Description - Full Patent Description - Patent Application Claims
  monitor keywords

CROSS-REFERENCE TO RELATED APPLICATION(S)

[0001] This application is a continuation of U.S. application Ser. No. 10/641,951, filed on Aug. 14, 2003, the disclosure of which is incorporated by reference herein.

FIELD OF THE INVENTION

[0002] The present invention relates to data clustering and, more particularly, to the clustering of evolving data streams through generated online statistics and offline processing.

BACKGROUND OF THE INVENTION

[0003] Recent advances in hardware technology have allowed companies and organizations to automatically and rapidly record transactions of everyday life (e.g., banking, credit card, stock, telecommunications, etc.). The recordation of such processes leads to large amounts of data, which grows at an unlimited rate. The continuous arrival of data is referred to as a data stream. Data streams have been extensively researched in recent years due to their use in a wide range of applications, see, e.g., B. Babcock et al., "Models and Issues in Data Stream Systems," ACM PODS Conference, 2002; P. Domingos et al., "Mining High-Speed Data Streams," ACM SIGKDD Conference, 2000; S. Guha et al., "Clustering Data Streams," IEEE FOCS Conference, 2000; and L. O'Callaghan et al., "Streaming-Data Algorithms For High-Quality Clustering," ICDE Conference, 2002.

[0004] The clustering of a data stream partitions a given set of data points into one or more groups of similar data points. Clustering has been widely researched in the database, data mining and statistics communities, see, e.g., P. Bradley et al., "Scaling Clustering Algorithms to Large Databases," SIGKDD Conference, 1998; S. Guha et al., "CURE: An Efficient Clustering Algorithm for Large Databases," ACM SIGMOD Conference, 1998; R. Ng et al., "Efficient and Effective Clustering Methods for Spatial Data Mining," Very Large Data Bases Conference, 1994; R. Dubes et al., "Algorithms for Clustering Data," Prentice Hall, New Jersey, 1998; and L. Kaufman et al., "Finding Groups in Data--An Introduction to Cluster Analysis," Wiley Series in Probability and Math Sciences, 1990. Clustering has also been studied in the context of the data stream environment, see, e.g., S. Guha et al. and L. O'Callaghan et al.

[0005] Since the clustering of data streams results in the arrival of a large volume of data, it renders most traditional clustering methodologies inefficient. In recent years, one-pass clustering methodologies have been developed for utilization with data streams. However, the results of a simple one-pass clustering methodology provided over a data stream for a few years would be dominated by the outdated history of the stream.

[0006] Other existing methodologies for clustering data streams compute clusters over the entire data stream, see, e.g. L. O'Callaghan et al. These techniques view data stream clustering as a variant of one-pass clustering methodologies. Although such techniques may be useful in many clustering applications, the clustering of data streams requires careful defining in the data stream context. A data stream should be viewed as an infinite process having data that continuously evolves with time. As a result, the underlying clusters may also change considerably with time. The nature of the clusters may vary depending on the moment at which they are computed as well as the time horizon over which they are measured. For example, a user may wish to examine clusters occurring in the last month, year, or decade, each of which may be considerably distinct.

[0007] Data streams inherently impose a one-pass constraint on methodology design. It becomes very difficult to provide flexibility in computing clusters over different kinds of time horizons using conventional methodologies. For example, a direct extension of the stream based k-means methodology (see, e.g., L. O'Callaghan et al.) would require simultaneous maintenance of the intermediate results of clustering methodologies over all possible time horizons. Such a computational burden increases with progression of the data stream and can rapidly become a bottleneck for online implementation. Furthermore, in many cases, an analyst may wish to determine the clusters at a previous moment in time, and compare them to the current clusters. This requires even greater bookkeeping, which can rapidly become unwieldy for fast data streams.

[0008] Since a data stream cannot be revisited over the course of the computation, a clustering methodology needs to maintain a substantial amount of information so that important details are not lost. For example, a continuous version of k-means methodology maintains a number of cluster centers which change or merge as necessary throughout the execution of the methodology, see, e.g., L. O'Callaghan et al. This approach is unpredictable when the characteristics of the stream evolve over time since the k-means approach is highly sensitive to the order of arrival of the data points. For example, once two cluster centers are merged, there is no way to informatively split the clusters when required by the evolution of the stream at a later time.

[0009] A need therefore exists to improve the quality of the clusters when the data evolves considerably over time. A further need exists to provide greater functionality in discovering and exploring clusters over different portions of the stream.

SUMMARY OF THE INVENTION

[0010] The present invention relates to data clustering techniques. More particularly, the present invention relates to clustering evolving data streams through generated online statistics and offline processing.

[0011] For example, in one aspect of the invention, a technique of clustering data of a data stream comprises the following steps. First, online statistics are created from the data stream. Then, offline processing of the online statistics is performed when offline processing is either required or desired.

[0012] Advantageously, the inventive technique efficiently and effectively clusters data streams in a manner that is easily accessible and manageable by a user.

[0013] Another advantageous property of the inventive technique is its flexibility in computing clusters over user defined time periods. Additionally, the inventive technique may provide for the comparison of clusters in a previous moment in time to current clusters of a data stream. A user may discover and explore clusters over different portions of the data stream.

[0014] These and other objects, features, and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0015] FIG. 1 is a block diagram illustrating a hardware implementation suitable for employing methodologies, according to an embodiment of the present invention;

[0016] FIG. 2 is a flow diagram illustrating an online micro-clustering and offline macro-clustering interaction methodology, according to an embodiment of the present invention;

[0017] FIG. 3 is a flow diagram illustrating a micro-cluster maintenance methodology, according to an embodiment of the present invention;

[0018] FIG. 4 is a flow diagram illustrating a higher level cluster creation methodology, according to an embodiment of the present invention; and

[0019] FIG. 5 is a flow diagram illustrating a micro-cluster evolution analysis methodology, according to an embodiment of the present invention.

Continue reading about Methods and apparatus for clustering evolving data streams through online and offline components...
Full patent description for Methods and apparatus for clustering evolving data streams through online and offline components

Brief Patent Description - Full Patent Description - Patent Application Claims

Click on the above for other options relating to this Methods and apparatus for clustering evolving data streams through online and offline components patent application.
###
monitor keywords

How KEYWORD MONITOR works... a FREE service from FreshPatents
1. Sign up (takes 30 seconds). 2. Fill in the keywords to be monitored.
3. Each week you receive an email with patent applications related to your keywords.  
Start now! - Receive info on patent apps like Methods and apparatus for clustering evolving data streams through online and offline components or other areas of interest.
###


Previous Patent Application:
Effort based relevance
Next Patent Application:
System and method for clustering content items from content feeds
Industry Class:
Data processing: database and file management or data structures

###

FreshPatents.com Support
Thank you for viewing the Methods and apparatus for clustering evolving data streams through online and offline components patent info.
IP-related news and info


Results in 0.28104 seconds


Other interesting Feshpatents.com categories:
Electronics: Semiconductor Audio Illumination Connectors Crypto 174
filepatents (1K)

* Protect your Inventions
* US Patent Office filing
patentexpress PATENT INFO