MomentSketch

MomentSketch is a moment-based quantile approximation algorithm that provides accurate quantile estimates using only a small number of statistical moments.

Overview

MomentSketch offers the following features:

Compact Representation: Stores only a fixed number of statistical moments regardless of data size
Maximum Entropy Optimization: Uses maximum entropy optimization to estimate the underlying distribution
Mergeable: Multiple sketches can be combined without loss of accuracy
Summary Statistics: Provides comprehensive statistics including mean, variance, skewness, and kurtosis
Compression Support: Optional data compression for handling widely distributed values

Main Class

class QuantileFlow.MomentSketch(num_moments: int = 20, compress_values: bool = False)[source]

Bases: object

MomentSketch implementation for quantile approximation using the moment-based approach.

This implementation uses power sums, Chebyshev moment conversion, and maximum entropy optimization to estimate the probability distribution of data and compute quantiles. It supports merging sketches from distributed sources and provides accurate quantile estimates with a compact representation.

Reference:: “Space- and Computationally-Efficient Set Similarity via Locality Sensitive Sketching” by Anshumali Shrivastava

__init__(num_moments: int = 20, compress_values: bool = False)[source]

Initialize MomentSketch.

Parameters:

num_moments – Number of moments to track (default 20). Higher values increase accuracy at the cost of computation.
compress_values – Whether to compress values using arcsinh transformation (default False). Useful for handling widely distributed data with extreme values.

classmethod from_dict(data: Dict) → MomentSketch[source]

Create a sketch from a dictionary.

Parameters:: data – Dictionary representation of a sketch.
Returns:: New MomentSketch instance.

insert(value: int | float) → None[source]

Insert a single value into the sketch.

Parameters:: value – The value to insert.

insert_batch(values: List[float] | ndarray) → None[source]

Insert multiple values into the sketch.

Parameters:: values – Array or list of values to insert.

interquartile_range() → float[source]

Get the interquartile range (IQR).

Returns:: Estimated IQR (difference between 75th and 25th percentiles).

median() → float[source]

Get the median value (50th percentile).

Returns:: Estimated median value.

merge(other: MomentSketch) → None[source]

Merge another MomentSketch into this one.

Parameters:: other – Another MomentSketch instance to merge.
Raises:: ValueError – If the sketches are incompatible (different compression settings).

percentile(p: float) → float[source]

Get the p-th percentile value.

Parameters:: p – Percentile between 0 and 100 (e.g., 75 for 75th percentile).
Returns:: Estimated value at the requested percentile.
Raises:: ValueError – If p is not between 0 and 100.

plot_distribution(figsize=(10, 6))[source]

Plot the estimated probability distribution.

Parameters:: figsize – Figure size (width, height) in inches.
Returns:: Matplotlib figure object.

quantile(fraction: float) → float[source]

Get the value at a given quantile.

Parameters:: fraction – Quantile fraction between 0 and 1 (e.g., 0.5 for median).
Returns:: Estimated value at the requested quantile.
Raises:: ValueError – If fraction is not between 0 and 1.

quantiles(fractions: List[float]) → List[float][source]

Get values at multiple quantiles.

Parameters:: fractions – List of quantile fractions between 0 and 1.
Returns:: List of estimated values at the requested quantiles.
Raises:: ValueError – If any fraction is not between 0 and 1.

summary_statistics() → Dict[str, float][source]

Get summary statistics.

Returns:: Dictionary containing min, q1, median, q3, max, count, and mean.

to_dict() → Dict[source]

Convert sketch to a dictionary for serialization.

Returns:: Dictionary representation of the sketch.

Optimizer

The optimizer module handles the maximum entropy optimization required for quantile estimation:

Optimization algorithms for the MomentSketch implementation.

This module provides optimization algorithms used to solve the maximum entropy problem in the MomentSketch implementation. It includes:

BaseOptimizer: An abstract base class defining the common interface for optimizers
NewtonOptimizer: An implementation of damped Newton’s method for convex optimization

The optimizers handle numerical stability issues that can arise in the maximum entropy optimization problem, including matrix conditioning problems and numerical precision issues.

class QuantileFlow.momentsketch.optimizer.BaseOptimizer[source]

Bases: object

Base class for optimization algorithms

get_function()[source]: Get the function being optimized

get_iteration_count()[source]: Get the number of steps taken

is_converged()[source]: Check if optimization has converged

set_max_iterations(max_iterations)[source]: Set maximum iterations

set_verbose(flag)[source]: Set verbose output flag

solve(initial_point, gradient_tolerance)[source]: Solve the optimization problem

class QuantileFlow.momentsketch.optimizer.NewtonOptimizer(objective_function)[source]

Bases: BaseOptimizer

Minimizes a convex function using damped Newton’s

__init__(objective_function)[source]

Initialize the optimizer

Parameters:: objective_function – FunctionWithHessian to optimize

get_backtracking_count()[source]: Get the number of damped steps

get_function()[source]: Get the function being optimized

get_iteration_count()[source]: Get the number of steps taken

is_converged()[source]: Check if optimization has converged

set_max_iterations(max_iterations)[source]: Set maximum iterations

set_verbose(flag)[source]: Set verbose output flag

solve(initial_point, grad_tolerance)[source]

Solve the optimization problem

Parameters:

initial_point – Initial point
grad_tolerance – Grad tolerance for convergence

Returns:

Optimal point

Utilities

class QuantileFlow.momentsketch.utils.MaxEntropy(target_moments, grid_size)[source]

Bases: object

Maximum entropy loss function for optimization

compute_all(point, precision=1e-10)[source]: Compute weights, moments, gradient, and hessian

get_gradient()[source]: Get the gradient

get_hessian()[source]: Get the Hessian matrix

get_value()[source]: Get the function value

set_lambda(new_lambda)[source]: Set lambda values

class QuantileFlow.momentsketch.utils.QuadraticFunction(dimension)[source]

Bases: object

Simple quadratic function for testing optimization algorithms

__init__(dimension)[source]

Initialize with dimension

Parameters:: dimension – Dimension of the quadratic function

compute_all(point, precision)[source]: Compute function value, gradient, and Hessian

compute_only_value(point, precision)[source]: Compute only the function value

dim()[source]: Get function dimension

get_gradient()[source]: Get function gradient

get_hessian()[source]: Get function Hessian

get_value()[source]: Get function value

class QuantileFlow.momentsketch.utils.Util[source]

Bases: object

static calculate_entropy(probabilities: List[float]) → float[source]

Compute the entropy of a probability distribution

Parameters:: probabilities (List[float]) – A list of probabilities
Returns:: The entropy value
Return type:: float

static calculate_mean(values: List[float]) → float[source]: Calculate the mean of an array

static calculate_powers(base: float, num_powers: int) → List[float][source]

Calculate the powers of base up to num_powers - 1

Parameters:

base (float) – Base value.
num_powers (int) – Number of power terms to compute

Returns:

A list of powers

Return type:

List[float]

static get_binomial_coefficients(max_degree: int) → List[List[int]][source]

Compute a binomial coefficient table up to max_degree

Parameters:: max_degree (int) – max value for which to compute binomials
Returns:: 2D list of binomial coeffs
Return type:: List[List[int]]

static get_cheby_coefficients(max_degree: int) → List[List[int]][source]

Compute Chebyshev polynomial coefficients up to degree max_degree

Parameters:: max_degree (int) – max degree
Returns:: 2D list of coeffs
Return type:: List[List[int]]

static get_mse(errors: List[float]) → float[source]

Compute the Mean Squared Error (MSE) of an error array

Parameters:: errors (List[float]) – A list of error values
Returns:: The MSE value
Return type:: float

static power_sums_to_cheby_moments(min_val: float, max_val: float, power_sums: List[float]) → List[float][source]

Convert power sums to Chebyshev moments.

Parameters:

min_val (float) – Minimum observed value.
max_val (float) – Maximum observed value.
power_sums (List[float]) – Original power sums.

Returns:

Chebyshev moments.

Return type:

List[float]

static power_sums_to_normalized_moments(power_sums: List[float], min_val: float, max_val: float) → List[float][source]

Convert power sums to normalized moments

Parameters:

power_sums (List[float]) – Original powers
min_val (float) – Min observed val
max_val (float) – Max observed val

Returns:

Norm moments

Return type:

List[float]

static shift_power_sums(power_sums: List[float], scaling_factor: float, center: float) → List[float][source]

Calculate shifted and scaled power sums for vals

Parameters:

power_sums (List[float]) – Original powers
scaling_factor (float) – Scaling fact
center (float) – Translation val

Returns:

Shifted and scaled powers

Return type:

List[float]

Mathematical Background

MomentSketch works by maintaining a fixed number of statistical moments (typically 10-20) of the data distribution. These moments are used to reconstruct an approximation of the original distribution using maximum entropy optimization.

The key steps in the algorithm are:

Moment Collection: Compute and store statistical moments (mean, variance, etc.) from the data stream
Distribution Estimation: Solve a maximum entropy optimization problem to find a distribution matching the stored moments
Quantile Estimation: Query the estimated distribution to find approximate quantiles

Maximum Entropy Optimization

The maximum entropy principle seeks the probability distribution that maximizes entropy while satisfying constraints (in this case, matching the observed moments). This approach provides the least biased estimate possible based on the limited information available.

Mathematically, we solve:

\[\max_{p} -\int p(x) \log p(x) dx\]

Subject to:

\[\int x^k p(x) dx = \mu_k \quad \text{for} \quad k = 1, 2, \ldots, n\]

Where \(\mu_k\) are the observed moments and \(p(x)\) is the probability density function.

Usage Examples

For basic usage, see the Getting Started guide.

For more advanced examples, see the Examples page.