MomentSketch
MomentSketch is a moment-based quantile approximation algorithm that provides accurate quantile estimates using only a small number of statistical moments.
Overview
MomentSketch offers the following features:
Compact Representation: Stores only a fixed number of statistical moments regardless of data size
Maximum Entropy Optimization: Uses maximum entropy optimization to estimate the underlying distribution
Mergeable: Multiple sketches can be combined without loss of accuracy
Summary Statistics: Provides comprehensive statistics including mean, variance, skewness, and kurtosis
Compression Support: Optional data compression for handling widely distributed values
Main Class
- class QuantileFlow.MomentSketch(num_moments: int = 20, compress_values: bool = False)[source]
Bases:
objectMomentSketch implementation for quantile approximation using the moment-based approach.
This implementation uses power sums, Chebyshev moment conversion, and maximum entropy optimization to estimate the probability distribution of data and compute quantiles. It supports merging sketches from distributed sources and provides accurate quantile estimates with a compact representation.
- Reference:
“Space- and Computationally-Efficient Set Similarity via Locality Sensitive Sketching” by Anshumali Shrivastava
- __init__(num_moments: int = 20, compress_values: bool = False)[source]
Initialize MomentSketch.
- Parameters:
num_moments – Number of moments to track (default 20). Higher values increase accuracy at the cost of computation.
compress_values – Whether to compress values using arcsinh transformation (default False). Useful for handling widely distributed data with extreme values.
- classmethod from_dict(data: Dict) MomentSketch[source]
Create a sketch from a dictionary.
- Parameters:
data – Dictionary representation of a sketch.
- Returns:
New MomentSketch instance.
- insert(value: int | float) None[source]
Insert a single value into the sketch.
- Parameters:
value – The value to insert.
- insert_batch(values: List[float] | ndarray) None[source]
Insert multiple values into the sketch.
- Parameters:
values – Array or list of values to insert.
- interquartile_range() float[source]
Get the interquartile range (IQR).
- Returns:
Estimated IQR (difference between 75th and 25th percentiles).
- median() float[source]
Get the median value (50th percentile).
- Returns:
Estimated median value.
- merge(other: MomentSketch) None[source]
Merge another MomentSketch into this one.
- Parameters:
other – Another MomentSketch instance to merge.
- Raises:
ValueError – If the sketches are incompatible (different compression settings).
- percentile(p: float) float[source]
Get the p-th percentile value.
- Parameters:
p – Percentile between 0 and 100 (e.g., 75 for 75th percentile).
- Returns:
Estimated value at the requested percentile.
- Raises:
ValueError – If p is not between 0 and 100.
- plot_distribution(figsize=(10, 6))[source]
Plot the estimated probability distribution.
- Parameters:
figsize – Figure size (width, height) in inches.
- Returns:
Matplotlib figure object.
- quantile(fraction: float) float[source]
Get the value at a given quantile.
- Parameters:
fraction – Quantile fraction between 0 and 1 (e.g., 0.5 for median).
- Returns:
Estimated value at the requested quantile.
- Raises:
ValueError – If fraction is not between 0 and 1.
- quantiles(fractions: List[float]) List[float][source]
Get values at multiple quantiles.
- Parameters:
fractions – List of quantile fractions between 0 and 1.
- Returns:
List of estimated values at the requested quantiles.
- Raises:
ValueError – If any fraction is not between 0 and 1.
- summary_statistics() Dict[str, float][source]
Get summary statistics.
- Returns:
Dictionary containing min, q1, median, q3, max, count, and mean.
- to_dict() Dict[source]
Convert sketch to a dictionary for serialization.
- Returns:
Dictionary representation of the sketch.
Optimizer
The optimizer module handles the maximum entropy optimization required for quantile estimation:
Optimization algorithms for the MomentSketch implementation.
This module provides optimization algorithms used to solve the maximum entropy problem in the MomentSketch implementation. It includes:
BaseOptimizer: An abstract base class defining the common interface for optimizers
NewtonOptimizer: An implementation of damped Newton’s method for convex optimization
The optimizers handle numerical stability issues that can arise in the maximum entropy optimization problem, including matrix conditioning problems and numerical precision issues.
- class QuantileFlow.momentsketch.optimizer.BaseOptimizer[source]
Bases:
objectBase class for optimization algorithms
- get_function()[source]
Get the function being optimized
- get_iteration_count()[source]
Get the number of steps taken
- is_converged()[source]
Check if optimization has converged
- set_max_iterations(max_iterations)[source]
Set maximum iterations
- set_verbose(flag)[source]
Set verbose output flag
- solve(initial_point, gradient_tolerance)[source]
Solve the optimization problem
- class QuantileFlow.momentsketch.optimizer.NewtonOptimizer(objective_function)[source]
Bases:
BaseOptimizerMinimizes a convex function using damped Newton’s
- __init__(objective_function)[source]
Initialize the optimizer
- Parameters:
objective_function – FunctionWithHessian to optimize
- get_backtracking_count()[source]
Get the number of damped steps
- get_function()[source]
Get the function being optimized
- get_iteration_count()[source]
Get the number of steps taken
- is_converged()[source]
Check if optimization has converged
- set_max_iterations(max_iterations)[source]
Set maximum iterations
- set_verbose(flag)[source]
Set verbose output flag
- solve(initial_point, grad_tolerance)[source]
Solve the optimization problem
- Parameters:
initial_point – Initial point
grad_tolerance – Grad tolerance for convergence
- Returns:
Optimal point
Utilities
- class QuantileFlow.momentsketch.utils.MaxEntropy(target_moments, grid_size)[source]
Bases:
objectMaximum entropy loss function for optimization
- compute_all(point, precision=1e-10)[source]
Compute weights, moments, gradient, and hessian
- get_gradient()[source]
Get the gradient
- get_hessian()[source]
Get the Hessian matrix
- get_value()[source]
Get the function value
- set_lambda(new_lambda)[source]
Set lambda values
- class QuantileFlow.momentsketch.utils.QuadraticFunction(dimension)[source]
Bases:
objectSimple quadratic function for testing optimization algorithms
- __init__(dimension)[source]
Initialize with dimension
- Parameters:
dimension – Dimension of the quadratic function
- compute_all(point, precision)[source]
Compute function value, gradient, and Hessian
- compute_only_value(point, precision)[source]
Compute only the function value
- dim()[source]
Get function dimension
- get_gradient()[source]
Get function gradient
- get_hessian()[source]
Get function Hessian
- get_value()[source]
Get function value
- class QuantileFlow.momentsketch.utils.Util[source]
Bases:
object- static calculate_entropy(probabilities: List[float]) float[source]
Compute the entropy of a probability distribution
- Parameters:
probabilities (List[float]) – A list of probabilities
- Returns:
The entropy value
- Return type:
float
- static calculate_mean(values: List[float]) float[source]
Calculate the mean of an array
- static calculate_powers(base: float, num_powers: int) List[float][source]
Calculate the powers of base up to num_powers - 1
- Parameters:
base (float) – Base value.
num_powers (int) – Number of power terms to compute
- Returns:
A list of powers
- Return type:
List[float]
- static get_binomial_coefficients(max_degree: int) List[List[int]][source]
Compute a binomial coefficient table up to max_degree
- Parameters:
max_degree (int) – max value for which to compute binomials
- Returns:
2D list of binomial coeffs
- Return type:
List[List[int]]
- static get_cheby_coefficients(max_degree: int) List[List[int]][source]
Compute Chebyshev polynomial coefficients up to degree max_degree
- Parameters:
max_degree (int) – max degree
- Returns:
2D list of coeffs
- Return type:
List[List[int]]
- static get_mse(errors: List[float]) float[source]
Compute the Mean Squared Error (MSE) of an error array
- Parameters:
errors (List[float]) – A list of error values
- Returns:
The MSE value
- Return type:
float
- static power_sums_to_cheby_moments(min_val: float, max_val: float, power_sums: List[float]) List[float][source]
Convert power sums to Chebyshev moments.
- Parameters:
min_val (float) – Minimum observed value.
max_val (float) – Maximum observed value.
power_sums (List[float]) – Original power sums.
- Returns:
Chebyshev moments.
- Return type:
List[float]
- static power_sums_to_normalized_moments(power_sums: List[float], min_val: float, max_val: float) List[float][source]
Convert power sums to normalized moments
- Parameters:
power_sums (List[float]) – Original powers
min_val (float) – Min observed val
max_val (float) – Max observed val
- Returns:
Norm moments
- Return type:
List[float]
- static shift_power_sums(power_sums: List[float], scaling_factor: float, center: float) List[float][source]
Calculate shifted and scaled power sums for vals
- Parameters:
power_sums (List[float]) – Original powers
scaling_factor (float) – Scaling fact
center (float) – Translation val
- Returns:
Shifted and scaled powers
- Return type:
List[float]
Mathematical Background
MomentSketch works by maintaining a fixed number of statistical moments (typically 10-20) of the data distribution. These moments are used to reconstruct an approximation of the original distribution using maximum entropy optimization.
The key steps in the algorithm are:
Moment Collection: Compute and store statistical moments (mean, variance, etc.) from the data stream
Distribution Estimation: Solve a maximum entropy optimization problem to find a distribution matching the stored moments
Quantile Estimation: Query the estimated distribution to find approximate quantiles
Maximum Entropy Optimization
The maximum entropy principle seeks the probability distribution that maximizes entropy while satisfying constraints (in this case, matching the observed moments). This approach provides the least biased estimate possible based on the limited information available.
Mathematically, we solve:
Subject to:
Where \(\mu_k\) are the observed moments and \(p(x)\) is the probability density function.
Usage Examples
For basic usage, see the Getting Started guide.
For more advanced examples, see the Examples page.