Mass estimation
In data mining, mass estimation is a recently proposed data modelling mechanism that serves as an alternative to density estimation for solving data mining problems. Unlike density estimation, it models the data distribution without using a distance measure. Mass estimation has been successfully applied to data mining tasks such as anomaly detection, clustering, information retrieval, classification and regression. Mass-based methods often perform better than, or at least as well as, state-of-the-art methods, yet run orders of magnitude faster than their distance/density-based counterparts. This is because mass is more fundamental than density (density = mass/volume) and can be estimated more efficiently.
Definition
In the simplest form, mass (or data mass) is the number of instances in a bounded region. Given a function $T(\cdot)$ that defines local regions by partitioning the data space, the mass of an instance $x$ that lies in the region $T(x)$ can be estimated as:
$m(x)=|T(x)|$ Eqn.(1)
The mass base function $m(\cdot)$ depends on the implementation of $T(\cdot)$. Two regions can have the same mass regardless of characteristics such as shape, volume or density. All the instances in a region have the same mass, i.e., the cardinality of the region.
One-dimensional mass estimation
Let $D$ be a set of $n$ instances in $\mathbb{R}$. The data is divided into two non-empty regions by a binary split at $s$ on the real line. Since a split can fall in any of the $n-1$ gaps between consecutive instances, the mass distribution at $x$ is estimated as the sum of $m_i(x)$ weighted by $p(s_i)$ over the $n-1$ possible splits $s_i$:
$mass(x|D) = \sum_{i=1}^{n-1}m_i(x)\times p(s_i)$ Eqn.(2)
where $p(s_i)$ is the probability of selecting $s_i$. If $x_1 < x_2 < \cdots < x_{n-1} < x_n$ are the instances in $D$, then $p(s_i)=\frac{x_{i+1}-x_i}{x_n-x_1} > 0$.
In the example shown in Figure 1:
$mass(x_1|D) = 1\,p(s_1) + 2\,p(s_2) + 3\,p(s_3) + 4\,p(s_4)$
$mass(x_2|D) = 4\,p(s_1) + 2\,p(s_2) + 3\,p(s_3) + 4\,p(s_4)$
$mass(x_3|D) = 4\,p(s_1) + 3\,p(s_2) + 3\,p(s_3) + 4\,p(s_4)$
$mass(x_4|D) = 4\,p(s_1) + 3\,p(s_2) + 2\,p(s_3) + 4\,p(s_4)$
$mass(x_5|D) = 4\,p(s_1) + 3\,p(s_2) + 2\,p(s_3) + 1\,p(s_4)$
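The calculation in Eqns. 1–2 and the example above fit in a few lines of code. The following Python sketch is illustrative only (the function and variable names are not from the original papers); it computes the level-1 mass of a point in one dimension:

```python
import numpy as np

def level1_mass(D, x):
    """Level-1 mass of x (Eqn. 2) for one-dimensional data D: a sketch."""
    xs = np.sort(np.asarray(D, dtype=float))
    n = len(xs)
    total = 0.0
    for i in range(n - 1):                               # n-1 binary splits s_i
        p_si = (xs[i + 1] - xs[i]) / (xs[-1] - xs[0])    # p(s_i)
        m_i = (i + 1) if x <= xs[i] else (n - i - 1)     # mass of the region holding x
        total += m_i * p_si
    return total

# Fringe points receive smaller mass than core points, as in Figure 1:
D = [1.0, 2.0, 3.0, 4.0, 5.0]
print([round(level1_mass(D, v), 2) for v in D])   # [2.5, 3.25, 3.5, 3.25, 2.5]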
Mass estimation using Eqn. 2 is level-1 mass estimation. Level-$h$ mass ($h > 1$) can be estimated recursively as follows:
$mass(x,h|D) = \sum_{i=1}^{|D|-1}mass(x,h-1|\{y|y\in T_i(x)\})\times p(s_i)$ Eqn.(3)
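As a hedged illustration of the recursion in Eqn. 3, the naïve sketch below conditions each level on the region $T_i(x)$ that contains $x$. It assumes distinct values and has exponential cost in $h$, unlike the sample-based ensembles used in practice:

```python
def mass_h(D, x, h):
    """Level-h mass (Eqn. 3): recurse on the region T_i(x) containing x (a sketch)."""
    xs = sorted(D)                            # assumes distinct values
    n = len(xs)
    if n < 2:
        return float(n)                       # degenerate region: its cardinality
    total = 0.0
    for i in range(n - 1):                    # all n-1 binary splits s_i
        p_si = (xs[i + 1] - xs[i]) / (xs[-1] - xs[0])
        region = xs[:i + 1] if x <= xs[i] else xs[i + 1:]
        if h == 1:
            total += len(region) * p_si       # base case: Eqn. 2
        else:
            total += mass_h(region, x, h - 1) * p_si
    return total
```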
Examples of data modelling in terms of mass distribution are shown in Figure 2.
Properties
Mass estimation has the following characteristics:
- A mass distribution stipulates an ordering from core points to fringe points in a data cloud. In addition, this ordering accentuates the fringe points with a concave function: fringe points have markedly smaller mass than points close to the core points.
- Mass estimation is more efficient than density estimation because mass is computed by simple counting and it requires only a small sample through an ensemble approach.
- Mass can be interpreted as a measure of relevance with respect to the concept underlying the data, i.e., core points are highly relevant and fringe points are less relevant.
Mass estimation has two advantages, relating to efficacy and efficiency. First, the concavity property mentioned above ensures that fringe points are ‘stretched’ farther from the core points in mass space, making it easier to separate fringe points from points close to the core. This otherwise hidden property can be exploited by a data mining algorithm to achieve a better result for the intended task. Second, when a problem demands a ranking, mass estimation solves it more efficiently using the ordering derived directly from the data, without distance or other expensive calculations.
Multi-dimensional mass estimation
In multi-dimensional space, $\mathbb{R}^d$, the mass distribution is estimated by constructing multiple random partitions of the data space $T_i(\cdot)$ $(i=1,2,\cdots,t)$, where each $T_i(\cdot)$ partitions the space using a fixed-size sub-sample of data $S_i \subset D$ $(|S_i|=\psi<n)$. The mass distribution for instance $x$, $mass(x)$, is the aggregation of $m_i(x)$ from the $t$ regions $T_i(x)$ in which $x$ lies, one per partition, as shown in Figure 3. $T_i(\cdot)$ can be implemented in different ways.
1. Tree-based implementation
Each $T_i(\cdot)$ is constructed from a random sub-sample $S_i$ by recursively dividing the data space into two regions with a random axis-parallel split on an attribute at each node of the tree. The height $h$ of the tree determines the level of mass estimation. The mass is estimated by aggregating the masses of the leaf nodes $T_i(x)$ into which $x$ falls in each tree $T_i(\cdot)$, as follows:
$mass(x)=\frac{1}{t}\sum_{i=1}^t m_i(x)\times 2^{\ell_i(x)}$ Eqn.(4)
where $\ell_i(x)$ is the path length of the leaf node that $x$ reaches in $T_i(\cdot)$.
If balanced binary trees are constructed such that path lengths of all leaf nodes are the same, Eqn. 4 can be simplified by ignoring the constant path length term as:
$mass(x)=\frac{1}{t}\sum_{i=1}^t m_i(x)$ Eqn.(5)
An example of two-dimensional data modelling using Eqn. 5 is shown in Figure 4.
If trees are constructed such that the mass in every leaf node is 1, Eqn. 4 can be simplified by ignoring the mass term as:
$mass(x)=\frac{1}{t}\sum_{i=1}^t 2^{\ell_i(x)}$ Eqn.(6)
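A minimal sketch of the tree-based implementation follows, assuming the balanced-tree variant of Eqn. 5 (equal path lengths, so the $2^{\ell_i(x)}$ term of Eqn. 4 drops out). Names such as `build_tree` and `leaf_mass` are illustrative, not from a published implementation:

```python
import random

def build_tree(sample, lo, hi, depth):
    """Randomly partition the box [lo, hi] with axis-parallel cuts (one T_i).
    A leaf stores its mass: the number of sub-sample points in its region."""
    if depth == 0:
        return len(sample)
    q = random.randrange(len(lo))                 # pick a random attribute
    cut = random.uniform(lo[q], hi[q])            # pick a random split point
    hi_l, lo_r = list(hi), list(lo)
    hi_l[q] = lo_r[q] = cut
    return (q, cut,
            build_tree([p for p in sample if p[q] < cut], lo, hi_l, depth - 1),
            build_tree([p for p in sample if p[q] >= cut], lo_r, hi, depth - 1))

def leaf_mass(node, x):
    """Follow x down to its leaf T_i(x) and return m_i(x)."""
    while isinstance(node, tuple):
        q, cut, left, right = node
        node = left if x[q] < cut else right
    return node

def mass(x, trees):
    """Eqn. 5: average the leaf masses over t balanced trees of equal height."""
    return sum(leaf_mass(tr, x) for tr in trees) / len(trees)

# t models, each on a random sub-sample of size psi (illustrative usage):
# trees = [build_tree(random.sample(D, psi), lo, hi, h) for _ in range(t)]
```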
2. Nearest-neighbour-based implementation
From a sample of data $S_i \subset D$, local regions are defined by a set of $\psi$ hypercubes ($\Pi_i$) centred at the instances in $S_i$. A hypercube region $H_p$ is centred at $p \in D$ with length $l_p=\frac{\|p-q\|_\infty}{2}$, where $q = \arg\min_{o \in S_i}\|p-o\|_\infty$ is the nearest neighbour of $p$ in $S_i$, and $\Pi_i = \cup_{o \in S_i}H_o$. The mass in each hypercube of $\Pi_i$ is estimated using another random sub-sample of data $M_i \subset D$ $(|M_i|=\Psi;\ \psi<\Psi<n)$.
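The hypercube construction can be sketched as below; treating $l_p$ as the cube's half-width and assigning a query to the cube of its nearest centre in $S_i$ are assumptions made for illustration:

```python
import numpy as np

def linf(a, b):
    """L-infinity distance used to define the hypercubes."""
    return float(np.max(np.abs(np.asarray(a) - np.asarray(b))))

def hypercube_lengths(S):
    """l_p = ||p - q||_inf / 2 for each p in S, q being p's nearest neighbour in S."""
    return [min(linf(p, o) for k, o in enumerate(S) if k != j) / 2.0
            for j, p in enumerate(S)]

def hypercube_mass(x, S, lengths, M):
    """Count instances of the second sub-sample M inside the hypercube holding x.
    Assumes l_p is the cube's half-width; returns 0 if x falls in no cube."""
    j = min(range(len(S)), key=lambda k: linf(x, S[k]))   # nearest centre to x
    if linf(x, S[j]) > lengths[j]:
        return 0
    return sum(1 for m in M if linf(m, S[j]) <= lengths[j])
```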
LiNearN is a mass-based density estimator that uses the nearest-neighbour-based implementation of local regions. The distinguishing characteristic of LiNearN is that both the number of instances and the volume of the regions adapt to the local data distribution.
Applications
The significance of mass estimation in solving different data mining tasks is summarised in the table below. The last three columns compare mass-based methods with state-of-the-art distance- or density-based methods (DBSCAN for clustering, LOF for anomaly detection, Naive Bayes for classification, InstRank and Qsim for information retrieval) in terms of time complexity, space complexity and task-specific performance.
| Task | Interpretation | Time complexity | Space complexity | Task-specific performance |
|---|---|---|---|---|
| Clustering | High mass indicates core regions; low mass indicates noise regions | O(n) vs O(n²) | O(1) vs O(n) | Comparable |
| Anomaly detection | High mass signifies normal instances; low mass signifies anomalies | O(n) vs O(n²) | O(1) vs O(n) | Comparable |
| Classification | Mass is used to estimate the multi-dimensional likelihood in a Bayesian classifier | O(1) vs O(n) | O(1) vs O(n) | Comparable |
| Information retrieval | High (low) mass signifies that a database object is highly (less) relevant to the query | Comparable | Comparable | Better |
Density can also be estimated using mass. The density function at $x \in \mathbb{R}^d$ using the mass base function is estimated as follows:
$\bar{f}_m(x) = \frac{1}{t}\sum_{i=1}^t \frac{m_i(x)}{\psi\times v(T_i(x))}$ Eqn.(7) [Using tree-based implementation]
where $v(\cdot)$ is the volume of the region.
$\bar{f}_l(x) = \frac{1}{t}\sum_{i=1}^t \rho(x|\Pi_i,M_i)$ Eqn.(8) [Using LiNearN]
where $\rho(x|\Pi_i,M_i) = \frac{m(H_{(x|\Pi_i)}|M_i)}{|M_i|\times l(H_{(x|\Pi_i)})}$; $H_{(x|\Pi_i)}$ is the hypercube in $\Pi_i$ into which $x$ falls, $m(H|M)$ is the number of instances from $M$ that fall in the hypercube $H$, and $l(H)$ is the length of the hypercube $H$.
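Under the same assumptions as the tree-based sketch above, Eqn. 7 only requires each leaf to record its region's volume in addition to its mass. The following is an illustrative sketch, not the published DEMass implementation:

```python
import random
import numpy as np

def build_demass_tree(sample, lo, hi, depth):
    """As build_tree above, but each leaf stores (mass, region volume) for Eqn. 7."""
    if depth == 0:
        return (len(sample), float(np.prod(np.subtract(hi, lo))))
    q = random.randrange(len(lo))
    cut = random.uniform(lo[q], hi[q])
    hi_l, lo_r = list(hi), list(lo)
    hi_l[q] = lo_r[q] = cut
    return (q, cut,
            build_demass_tree([p for p in sample if p[q] < cut], lo, hi_l, depth - 1),
            build_demass_tree([p for p in sample if p[q] >= cut], lo_r, hi, depth - 1))

def demass_density(x, trees, psi):
    """Eqn. 7: average m_i(x) / (psi * v(T_i(x))) over the t trees."""
    total = 0.0
    for node in trees:
        while len(node) == 4:                 # internal nodes are 4-tuples
            q, cut, left, right = node
            node = left if x[q] < cut else right
        m, v = node                           # leaf: (mass, volume)
        total += m / (psi * v)
    return total / len(trees)
```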
The density estimator based on mass (DEMass) has been applied to replace the existing density estimators in LOF and DBSCAN, improving the time complexity from O(n²) to O(n) and the space complexity from O(n) to O(1).
Mass estimation can also be used as a method for feature projection. Instead of aggregating the mass from $t$ models, each mass model can be treated as a new feature. Instances in the original $\mathbb{R}^d$ space are projected into the $\mathbb{R}^t$ mass space: each $x \in \mathbb{R}^d$ is projected to $y \in \mathbb{R}^t$ where $y = \langle m_1(x), m_2(x), \cdots, m_t(x)\rangle$. The problem can then be solved in the projected mass space using existing data mining methods.
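For instance, with trees of the form built in the tree-based sketch above, the projection amounts to reading off one leaf mass per model (again an illustrative sketch, not a published API):

```python
def mass_features(x, trees):
    """Project x from R^d into the R^t mass space: y = <m_1(x), ..., m_t(x)>."""
    y = []
    for node in trees:                       # trees as built by build_tree above
        while isinstance(node, tuple):       # descend to the leaf containing x
            q, cut, left, right = node
            node = left if x[q] < cut else right
        y.append(node)                       # leaf mass m_i(x)
    return y

# Any existing learner can then be trained on [mass_features(x, trees) for x in D].
```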
So far, mass has been used directly and indirectly (through density estimation and feature projection) to solve different data mining problems. A summary of algorithms based on 'Mass' and 'DEMass' is provided in the table below.
| Task | DEMass | Mass |
|---|---|---|
| Clustering | DEMass-DBSCAN, LiNearN-Cluster | MassTER |
| Anomaly detection | DEMass-LOF, LiNearN | iForest, SciForest, HS-Trees, ReMass-iForest |
| Classification | DEMass-Bayes | MassBayes |
| Information retrieval | | ReFeat, ReMass-ReFeat |
Relationship with data depth
There is a close relationship between mass and data depth. Mass estimation and data depth both delineate the centrality of a data cloud (as opposed to the compactness delineated by the density measure). The properties common to both measures are:
- The centre of a data cloud has the maximum value of the measure.
- There is an ordering from the centre (having the maximum value) to the fringe points (having the minimum values).
However, there are two key differences.
- Until recently, data depth always modelled a given dataset with one centre, regardless of whether the data is unimodal or multi-modal, whereas mass can model both unimodal and multimodal data by setting h = 1 or h > 1. Local data depth has a parameter (τ) which allows it to model multi-modal as well as unimodal data; however, the performance of local data depth appears to be sensitive to the setting of τ. In contrast, a single setting of h in mass estimation produces good task-specific performance.
- Mass is a simple and straightforward measure, and has efficient estimation methods based on axis-parallel partitions only. Data depth has many different definitions, depending on the construct used to define depth. The constructs include Mahalanobis distance, convex hull, simplicial and halfspace depth, all of which are expensive to compute; this has been the main obstacle to applying data depth to real multi-dimensional applications.
Existing density estimators versus DEMass
"Estimation of densities is a universal problem of statistics (knowing the densities one can solve various problems)" – Vapnik (1998).
Density estimators approximate a probability density function $\hat{f}(x)$ from the observed data $D = \{x^{(1)}, x^{(2)}, \cdots, x^{(n)}\}$, where $x^{(i)} \in \mathbb{R}^d$. The kernel density estimator (KDE) and the k-nearest neighbour density estimator (kNN) are the two widely used density estimators in data mining.
1. Kernel density estimator (KDE)
$\hat f_{KDE}(x) = \frac{1}{n\times b^d}\sum_{i=1}^{n}K\bigg(\frac{dist(x,x^{(i)})}{b}\bigg)$ Eqn.(9)
where $K(\cdot)$ is a kernel function and $b$ is a smoothing parameter called the kernel bandwidth.
2. k-nearest neighbour density estimator (kNN)
$\hat f_{kNN}(x) = \frac{|N(x,k)|}{n\times V_{N(x,k)}}$ Eqn.(10)
where $N(x,k)$ is the set of $k$ nearest neighbours of $x$ in $D$ and $V_{N(x,k)}$ is the volume enclosed by $N(x,k)$.
3. Average distance k-nearest neighbour density estimator
$\hat f_{kNNDist}(x) = \frac{|N(x,k)|}{n\times \sum_{y\in N(x,k)}dist(x,y)}$ Eqn.(11)
4. ϵ-neighbourhood density estimator
$\hat f_{\epsilon}(x) = \frac{|N_\epsilon(x)|}{n\times \epsilon}$ Eqn.(12)
where Nϵ(x) = {y ∈ D|dist(x,y) ≤ ϵ}.
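For contrast with Eqns. 7 and 8, the sketch below implements Eqn. 9 (with a Gaussian kernel, one common choice of K) and Eqn. 12; both need the distance from the query to every instance:

```python
import numpy as np

def f_kde(x, D, b):
    """Eqn. 9 with a Gaussian kernel; D is an (n, d) array."""
    n, d = D.shape
    dist = np.linalg.norm(D - x, axis=1)          # distances to all n instances
    K = np.exp(-0.5 * (dist / b) ** 2) / np.sqrt(2.0 * np.pi)
    return K.sum() / (n * b ** d)

def f_eps(x, D, eps):
    """Eqn. 12: count of instances within eps of x, normalised by n * eps."""
    dist = np.linalg.norm(D - x, axis=1)
    return (dist <= eps).sum() / (len(D) * eps)
```

Evaluating either estimator at all n instances therefore touches every pair of points, which is the source of the quadratic cost noted below.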
All the existing density estimators above (Eqns. 9–12) require a distance function to measure pairwise distances between instances, which leads to quadratic time complexity (O(n²d)) and linear space complexity (O(nd)). Hence, existing distance-based density estimators cannot be applied to large data sets because of their high time and space complexities.
In contrast, the density estimators based on mass (Eqns. 7 and 8) estimate density without a pairwise distance measure. DEMass reduces the time complexity to linear (O(n)) and the space complexity to constant (O(1)).
Source codes
- One-dimensional mass estimation(Mass_1.0.0.zip) [MATLAB]
- Multi-dimensional mass estimation(Mass_1.1.0.zip) [MATLAB]
- DEMass-DBSCAN [JAVA under WEKA platform]
- DEMass-Bayes [JAVA under WEKA platform]
- MassBayes [JAVA under WEKA platform]