Mass estimation
In data mining, mass estimation is a recently proposed data modelling mechanism that serves as an alternative to density estimation for solving data mining problems. Unlike density estimation, it models the data distribution without using a distance measure. Mass estimation has been successfully applied to data mining tasks such as anomaly detection, clustering, information retrieval, classification and regression. Mass-based methods often perform better than, or at least as well as, state-of-the-art methods, yet run orders of magnitude faster than their distance/density-based counterparts. This is because mass is more fundamental than density (density = mass/volume) and can be estimated more efficiently.
Definition
In the simplest form, mass (or data mass) is the number of instances in a bounded region. Given a function $T(\cdot)$ that defines local regions by partitioning the data space, the mass of an instance $x$ that lies in the region $T(x)$ can be estimated as:
$m(x)=|T(x)|$ Eqn.(1)
The mass base function $m(\cdot)$ depends on the implementation of $T(\cdot)$. Two regions can have the same mass regardless of characteristics such as shape, volume or density. All the instances in a region have the same mass, i.e., the cardinality of the region.
One-dimensional mass estimation
Let $D$ be a set of $n$ instances in $\mathbb{R}$. The data is divided into two non-empty regions by a binary split at $s$ on the real line. Since a split can fall in any of the $n-1$ gaps between consecutive instances, the mass distribution at $x$ is estimated as the sum of $m_i(x)$ weighted by $p(s_i)$ over the $n-1$ possible splits $s_i$:
$mass(x|D) = \sum_{i=1}^{n-1}m_i(x)\times p(s_i)$ Eqn.(2)
where $p(s_i)$ is the probability of selecting $s_i$. If $x_1 < x_2 < \cdots < x_{n-1} < x_n$ are the instances in $D$, then $p(s_i)=\frac{x_{i+1}-x_i}{x_n-x_1} > 0$.
In the example shown in Figure 1:
$mass(x_1|D) = 1\,p(s_1) + 2\,p(s_2) + 3\,p(s_3) + 4\,p(s_4)$
$mass(x_2|D) = 4\,p(s_1) + 2\,p(s_2) + 3\,p(s_3) + 4\,p(s_4)$
$mass(x_3|D) = 4\,p(s_1) + 3\,p(s_2) + 3\,p(s_3) + 4\,p(s_4)$
$mass(x_4|D) = 4\,p(s_1) + 3\,p(s_2) + 2\,p(s_3) + 4\,p(s_4)$
$mass(x_5|D) = 4\,p(s_1) + 3\,p(s_2) + 2\,p(s_3) + 1\,p(s_4)$
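The calculation in Eqns. 1–2 and the example above fit in a few lines of code. The following Python sketch is illustrative only (the function and variable names are not from the original papers); it computes the level-1 mass of a point in one dimension:

```python
import numpy as np

def level1_mass(D, x):
    """Level-1 mass of x (Eqn. 2) for one-dimensional data D: a sketch."""
    xs = np.sort(np.asarray(D, dtype=float))
    n = len(xs)
    total = 0.0
    for i in range(n - 1):                               # n-1 binary splits s_i
        p_si = (xs[i + 1] - xs[i]) / (xs[-1] - xs[0])    # p(s_i)
        m_i = (i + 1) if x <= xs[i] else (n - i - 1)     # mass of the region holding x
        total += m_i * p_si
    return total

# Fringe points receive smaller mass than core points, as in Figure 1:
D = [1.0, 2.0, 3.0, 4.0, 5.0]
print([round(level1_mass(D, v), 2) for v in D])   # [2.5, 3.25, 3.5, 3.25, 2.5]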
Mass estimation using Eqn. 2 is level-1 mass estimation. Level-$h$ mass ($h > 1$) can be estimated recursively as follows:
$mass(x,h|D) = \sum_{i=1}^{|D|-1}mass(x,h-1|\{y|y\in T_i(x)\})\times p(s_i)$ Eqn.(3)
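As a hedged illustration of the recursion in Eqn. 3, the naïve sketch below conditions each level on the region $T_i(x)$ that contains $x$. It assumes distinct values and has exponential cost in $h$, unlike the sample-based ensembles used in practice:

```python
def mass_h(D, x, h):
    """Level-h mass (Eqn. 3): recurse on the region T_i(x) containing x (a sketch)."""
    xs = sorted(D)                            # assumes distinct values
    n = len(xs)
    if n < 2:
        return float(n)                       # degenerate region: its cardinality
    total = 0.0
    for i in range(n - 1):                    # all n-1 binary splits s_i
        p_si = (xs[i + 1] - xs[i]) / (xs[-1] - xs[0])
        region = xs[:i + 1] if x <= xs[i] else xs[i + 1:]
        if h == 1:
            total += len(region) * p_si       # base case: Eqn. 2
        else:
            total += mass_h(region, x, h - 1) * p_si
    return total
```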
Examples of data modelling in terms of mass distribution are shown in Figure 2.
Properties
Mass estimation has the following characteristics:
- A mass distribution stipulates an ordering from core points to fringe points in a data cloud. In addition, this ordering accentuates the fringe points with a concave function: fringe points have markedly smaller mass than points close to the core points.
- Mass estimation is more efficient than density estimation because mass is computed by simple counting and it requires only a small sample through an ensemble approach.
- Mass can be interpreted as a measure of relevance with respect to the concept underlying the data, i.e., core points are highly relevant and fringe points are less relevant.
Mass estimation has two advantages, relating to efficacy and efficiency. First, the concavity property mentioned above ensures that fringe points are ‘stretched’ farther from the core points in mass space, making it easier to separate fringe points from points close to the core. This otherwise hidden property can be exploited by a data mining algorithm to achieve a better result for the intended task. Second, when a problem demands a ranking, mass estimation solves it more efficiently using the ordering derived directly from the data, without distance or other expensive calculations.
Multi-dimensional mass estimation
In multi-dimensional space, $\mathbb{R}^d$, the mass distribution is estimated by constructing multiple random partitions of the data space $T_i(\cdot)$ $(i=1,2,\cdots,t)$, where each $T_i(\cdot)$ partitions the space using a fixed-size sub-sample of data $S_i \subset D$ $(|S_i|=\psi<n)$. The mass distribution for instance $x$, $mass(x)$, is the aggregation of $m_i(x)$ from the $t$ regions $T_i(x)$ in which $x$ lies, one per partition, as shown in Figure 3. $T_i(\cdot)$ can be implemented in different ways.
1. Tree-based implementation
Each $T_i(\cdot)$ is constructed from a random sub-sample $S_i$ by recursively dividing the data space into two regions with a random axis-parallel split on an attribute at each node of the tree. The height $h$ of the tree determines the level of mass estimation. The mass is estimated by aggregating the masses of the leaf nodes $T_i(x)$ into which $x$ falls in each tree $T_i(\cdot)$, as follows:
$mass(x)=\frac{1}{t}\sum_{i=1}^t m_i(x)\times 2^{\ell_i(x)}$ Eqn.(4)
where $\ell_i(x)$ is the path length of the leaf node that $x$ reaches in $T_i(\cdot)$.
If balanced binary trees are constructed such that path lengths of all leaf nodes are the same, Eqn. 4 can be simplified by ignoring the constant path length term as:
$mass(x)=\frac{1}{t}\sum_{i=1}^t m_i(x)$ Eqn.(5)
An example of two-dimensional data modelling using Eqn. 5 is shown in Figure 4.
If trees are constructed such that the mass in every leaf node is 1, Eqn. 4 can be simplified by ignoring the mass term as:
$mass(x)=\frac{1}{t}\sum_{i=1}^t 2^{\ell_i(x)}$ Eqn.(6)
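A minimal sketch of the tree-based implementation follows, assuming the balanced-tree variant of Eqn. 5 (equal path lengths, so the $2^{\ell_i(x)}$ term of Eqn. 4 drops out). Names such as `build_tree` and `leaf_mass` are illustrative, not from a published implementation:

```python
import random

def build_tree(sample, lo, hi, depth):
    """Randomly partition the box [lo, hi] with axis-parallel cuts (one T_i).
    A leaf stores its mass: the number of sub-sample points in its region."""
    if depth == 0:
        return len(sample)
    q = random.randrange(len(lo))                 # pick a random attribute
    cut = random.uniform(lo[q], hi[q])            # pick a random split point
    hi_l, lo_r = list(hi), list(lo)
    hi_l[q] = lo_r[q] = cut
    return (q, cut,
            build_tree([p for p in sample if p[q] < cut], lo, hi_l, depth - 1),
            build_tree([p for p in sample if p[q] >= cut], lo_r, hi, depth - 1))

def leaf_mass(node, x):
    """Follow x down to its leaf T_i(x) and return m_i(x)."""
    while isinstance(node, tuple):
        q, cut, left, right = node
        node = left if x[q] < cut else right
    return node

def mass(x, trees):
    """Eqn. 5: average the leaf masses over t balanced trees of equal height."""
    return sum(leaf_mass(tr, x) for tr in trees) / len(trees)

# t models, each on a random sub-sample of size psi (illustrative usage):
# trees = [build_tree(random.sample(D, psi), lo, hi, h) for _ in range(t)]
```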
2. Nearest-neighbour-based implementation
From a sample of data $S_i \subset D$, local regions are defined by a set of $\psi$ hypercubes ($\Pi_i$) centred at the instances in $S_i$. A hypercube region $H_p$ is centred at $p \in D$ with length $l_p=\frac{\|p-q\|_\infty}{2}$, where $q = \arg\min_{o \in S_i}\|p-o\|_\infty$ is the nearest neighbour of $p$ in $S_i$, and $\Pi_i = \cup_{o \in S_i}H_o$. The mass in each hypercube of $\Pi_i$ is estimated using another random sub-sample of data $M_i \subset D$ $(|M_i|=\Psi;\ \psi<\Psi<n)$.
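The hypercube construction can be sketched as below; treating $l_p$ as the cube's half-width and assigning a query to the cube of its nearest centre in $S_i$ are assumptions made for illustration:

```python
import numpy as np

def linf(a, b):
    """L-infinity distance used to define the hypercubes."""
    return float(np.max(np.abs(np.asarray(a) - np.asarray(b))))

def hypercube_lengths(S):
    """l_p = ||p - q||_inf / 2 for each p in S, q being p's nearest neighbour in S."""
    return [min(linf(p, o) for k, o in enumerate(S) if k != j) / 2.0
            for j, p in enumerate(S)]

def hypercube_mass(x, S, lengths, M):
    """Count instances of the second sub-sample M inside the hypercube holding x.
    Assumes l_p is the cube's half-width; returns 0 if x falls in no cube."""
    j = min(range(len(S)), key=lambda k: linf(x, S[k]))   # nearest centre to x
    if linf(x, S[j]) > lengths[j]:
        return 0
    return sum(1 for m in M if linf(m, S[j]) <= lengths[j])
```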
LiNearN is a mass-based density estimator that uses the nearest-neighbour-based implementation of local regions. The distinguishing characteristic of LiNearN is that both the number of instances and the volume of the regions adapt to the local data distribution.
Applications
The significance of mass estimation in solving different data mining tasks is summarised in the table below. The last three columns compare mass-based methods with state-of-the-art distance- or density-based methods (DBSCAN for clustering, LOF for anomaly detection, Naive Bayes for classification, InstRank and Qsim for information retrieval) in terms of time complexity, space complexity and task-specific performance.
| Task | Interpretation | Time complexity | Space complexity | Task-specific performance |
|---|---|---|---|---|
| Clustering | High mass indicates core regions; low mass indicates noise regions | O(n) vs O(n²) | O(1) vs O(n) | Comparable |
| Anomaly detection | High mass signifies normal instances; low mass signifies anomalies | O(n) vs O(n²) | O(1) vs O(n) | Comparable |
| Classification | Mass is used to estimate the multi-dimensional likelihood in a Bayesian classifier | O(1) vs O(n) | O(1) vs O(n) | Comparable |
| Information retrieval | High (low) mass signifies that a database object is highly (less) relevant to the query | Comparable | Comparable | Better |
Density can also be estimated using mass. The density function at $x \in \mathbb{R}^d$ using the mass base function is estimated as follows:
$\bar{f}_m(x) = \frac{1}{t}\sum_{i=1}^t \frac{m_i(x)}{\psi\times v(T_i(x))}$ Eqn.(7) [Using tree-based implementation]
where $v(\cdot)$ is the volume of the region.
$\bar{f}_l(x) = \frac{1}{t}\sum_{i=1}^t \rho(x|\Pi_i,M_i)$ Eqn.(8) [Using LiNearN]
where $\rho(x|\Pi_i,M_i) = \frac{m(H_{(x|\Pi_i)}|M_i)}{|M_i|\times l(H_{(x|\Pi_i)})}$; $H_{(x|\Pi_i)}$ is the hypercube in $\Pi_i$ into which $x$ falls, $m(H|M)$ is the number of instances from $M$ that fall in the hypercube $H$, and $l(H)$ is the length of the hypercube $H$.
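Under the same assumptions as the tree-based sketch above, Eqn. 7 only requires each leaf to record its region's volume in addition to its mass. The following is an illustrative sketch, not the published DEMass implementation:

```python
import random
import numpy as np

def build_demass_tree(sample, lo, hi, depth):
    """As build_tree above, but each leaf stores (mass, region volume) for Eqn. 7."""
    if depth == 0:
        return (len(sample), float(np.prod(np.subtract(hi, lo))))
    q = random.randrange(len(lo))
    cut = random.uniform(lo[q], hi[q])
    hi_l, lo_r = list(hi), list(lo)
    hi_l[q] = lo_r[q] = cut
    return (q, cut,
            build_demass_tree([p for p in sample if p[q] < cut], lo, hi_l, depth - 1),
            build_demass_tree([p for p in sample if p[q] >= cut], lo_r, hi, depth - 1))

def demass_density(x, trees, psi):
    """Eqn. 7: average m_i(x) / (psi * v(T_i(x))) over the t trees."""
    total = 0.0
    for node in trees:
        while len(node) == 4:                 # internal nodes are 4-tuples
            q, cut, left, right = node
            node = left if x[q] < cut else right
        m, v = node                           # leaf: (mass, volume)
        total += m / (psi * v)
    return total / len(trees)
```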
The density estimator based on mass (DEMass) has been applied to replace the existing density estimators in LOF and DBSCAN, improving the time complexity from O(n²) to O(n) and the space complexity from O(n) to O(1).
Mass estimation can also be used as a method for feature projection. Instead of aggregating the mass from $t$ models, each mass model can be treated as a new feature. Instances in the original $\mathbb{R}^d$ space are projected into the $\mathbb{R}^t$ mass space: each $x \in \mathbb{R}^d$ is projected to $y \in \mathbb{R}^t$ where $y = \langle m_1(x), m_2(x), \cdots, m_t(x)\rangle$. The problem can then be solved in the projected mass space using existing data mining methods.
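For instance, with trees of the form built in the tree-based sketch above, the projection amounts to reading off one leaf mass per model (again an illustrative sketch, not a published API):

```python
def mass_features(x, trees):
    """Project x from R^d into the R^t mass space: y = <m_1(x), ..., m_t(x)>."""
    y = []
    for node in trees:                       # trees as built by build_tree above
        while isinstance(node, tuple):       # descend to the leaf containing x
            q, cut, left, right = node
            node = left if x[q] < cut else right
        y.append(node)                       # leaf mass m_i(x)
    return y

# Any existing learner can then be trained on [mass_features(x, trees) for x in D].
```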
So far, mass has been used directly and indirectly (through density estimation and feature projection) to solve different data mining problems. A summary of algorithms based on 'Mass' and 'DEMass' is provided in the table below.
| Task | DEMass | Mass |
|---|---|---|
| Clustering | DEMass-DBSCAN, LiNearN-Cluster | MassTER |
| Anomaly detection | DEMass-LOF, LiNearN | iForest, SciForest, HS-Trees, ReMass-iForest |
| Classification | DEMass-Bayes | MassBayes |
| Information retrieval | | ReFeat, ReMass-ReFeat |
Relationship with data depth
There is a close relationship between mass and data depth. Mass estimation and data depth both delineate the centrality of a data cloud (as opposed to the compactness delineated by the density measure). The properties common to both measures are:
- The centre of a data cloud has the maximum value of the measure.
- There is an ordering from the centre (having the maximum value) to the fringe points (having the minimum values).
However, there are two key differences.
- Until recently, data depth always modelled a given dataset with one centre, regardless of whether the data is unimodal or multi-modal, whereas mass can model both unimodal and multimodal data by setting h = 1 or h > 1. Local data depth has a parameter (τ) which allows it to model multi-modal as well as unimodal data; however, the performance of local data depth appears to be sensitive to the setting of τ. In contrast, a single setting of h in mass estimation produces good task-specific performance.
- Mass is a simple and straightforward measure, and has efficient estimation methods based on axis-parallel partitions only. Data depth has many different definitions, depending on the construct used to define depth. The constructs include Mahalanobis distance, convex hull, simplicial and halfspace depth, all of which are expensive to compute; this has been the main obstacle to applying data depth to real multi-dimensional applications.
Existing density estimators versus DEMass
"Estimation of densities is a universal problem of statistics (knowing the densities one can solve various problems)" – Vapnik (1998).
Density estimators approximate a probability density function $\hat{f}(x)$ from the observed data $D = \{x^{(1)}, x^{(2)}, \cdots, x^{(n)}\}$, where $x^{(i)} \in \mathbb{R}^d$. The kernel density estimator (KDE) and the k-nearest neighbour density estimator (kNN) are the two widely used density estimators in data mining.
1. Kernel density estimator (KDE)
$\hat f_{KDE}(x) = \frac{1}{n\times b^d}\sum_{i=1}^{n}K\bigg(\frac{dist(x,x^{(i)})}{b}\bigg)$ Eqn.(9)
where $K(\cdot)$ is a kernel function and $b$ is a smoothing parameter called the kernel bandwidth.
2. k-nearest neighbour density estimator (kNN)
$\hat f_{kNN}(x) = \frac{|N(x,k)|}{n\times V_{N(x,k)}}$ Eqn.(10)
where $N(x,k)$ is the set of $k$ nearest neighbours of $x$ in $D$ and $V_{N(x,k)}$ is the volume enclosed by $N(x,k)$.
3. Average distance k-nearest neighbour density estimator
$\hat f_{kNNDist}(x) = \frac{|N(x,k)|}{n\times \sum_{y\in N(x,k)}dist(x,y)}$ Eqn.(11)
4. ϵ-neighbourhood density estimator
$\hat f_{\epsilon}(x) = \frac{|N_\epsilon(x)|}{n\times \epsilon}$ Eqn.(12)
where Nϵ(x) = {y ∈ D|dist(x,y) ≤ ϵ}.
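For contrast with Eqns. 7 and 8, the sketch below implements Eqn. 9 (with a Gaussian kernel, one common choice of K) and Eqn. 12; both need the distance from the query to every instance:

```python
import numpy as np

def f_kde(x, D, b):
    """Eqn. 9 with a Gaussian kernel; D is an (n, d) array."""
    n, d = D.shape
    dist = np.linalg.norm(D - x, axis=1)          # distances to all n instances
    K = np.exp(-0.5 * (dist / b) ** 2) / np.sqrt(2.0 * np.pi)
    return K.sum() / (n * b ** d)

def f_eps(x, D, eps):
    """Eqn. 12: count of instances within eps of x, normalised by n * eps."""
    dist = np.linalg.norm(D - x, axis=1)
    return (dist <= eps).sum() / (len(D) * eps)
```

Evaluating either estimator at all n instances therefore touches every pair of points, which is the source of the quadratic cost noted below.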
All the existing density estimators above (Eqns. 9–12) require a distance function to measure pairwise distances between instances, which leads to quadratic time complexity (O(n²d)) and linear space complexity (O(nd)). Hence, existing distance-based density estimators cannot be applied to large data sets because of their high time and space complexities.
In contrast, the density estimators based on mass (Eqns. 7 and 8) estimate density without a pairwise distance measure. DEMass reduces the time complexity to linear (O(n)) and the space complexity to constant (O(1)).
Source codes
- One-dimensional mass estimation(Mass_1.0.0.zip) [MATLAB]
- Multi-dimensional mass estimation(Mass_1.1.0.zip) [MATLAB]
- DEMass-DBSCAN [JAVA under WEKA platform]
- DEMass-Bayes [JAVA under WEKA platform]
- MassBayes [JAVA under WEKA platform]