Get trending papers in your email inbox once a day!
Get trending papers in your email inbox!
SubscribeTopological Quantum Compilation Using Mixed-Integer Programming
We introduce the Mixed-Integer Quadratically Constrained Quadratic Programming framework for the quantum compilation problem and apply it in the context of topological quantum computing. In this setting, quantum gates are realized by sequences of elementary braids of quasiparticles with exotic fractional statistics in certain two-dimensional topological condensed matter systems, described by effective topological quantum field theories. We specifically focus on a non-semisimple version of topological field theory, which provides a foundation for an extended theory of Ising anyons and which has recently been shown by Iulianelli et al., Nature Communications {\bf 16}, 6408 (2025), to permit universal quantum computation. While the proofs of this pioneering result are existential in nature, the mixed integer programming provides an approach to explicitly construct quantum gates in topological systems. We demonstrate this by focusing specifically on the entangling controlled-NOT operation, and its local equivalence class, using braiding operations in the non-semisimple Ising system. This illustrates the utility of the Mixed-Integer Quadratically Constrained Quadratic Programming for topological quantum compilation.
Fractional WKB Approximation
Wentzel, Kramers, Brillouin (WKB) approximation for fractional systems is investigated in this paper using the fractional calculus. In the fractional case the wave function is constructed such that the phase factor is the same as the Hamilton's principle function "S". To demonstrate our proposed approach two examples are investigated in details.
On the matrices in B-spline collocation methods for Riesz fractional equations and their spectral properties
In this work, we focus on a fractional differential equation in Riesz form discretized by a polynomial B-spline collocation method. For an arbitrary polynomial degree p, we show that the resulting coefficient matrices possess a Toeplitz-like structure. We investigate their spectral properties via their symbol and we prove that, like for second order differential problems, also in this case the given matrices are ill-conditioned both in the low and high frequencies for large p. More precisely, in the fractional scenario the symbol has a single zero at 0 of order α, with α the fractional derivative order that ranges from 1 to 2, and it presents an exponential decay to zero at π for increasing p that becomes faster as α approaches 1. This translates in a mitigated conditioning in the low frequencies and in a deterioration in the high frequencies when compared to second order problems. Furthermore, the derivation of the symbol reveals another similarity of our problem with a classical diffusion problem. Since the entries of the coefficient matrices are defined as evaluations of fractional derivatives of the B-spline basis at the collocation points, we are able to express the central entries of the coefficient matrix as inner products of two fractional derivatives of cardinal B-splines. Finally, we perform a numerical study of the approximation behavior of polynomial B-spline collocation. This study suggests that, in line with non-fractional diffusion problems, the approximation order for smooth solutions in the fractional case is p+2-α for even p, and p+1-α for odd p.
Scaling limit of a long-range random walk in time-correlated random environment
This paper concerns a long-range random walk in random environment in dimension 1+1, where the environmental disorder is independent in space but has long-range correlations in time. We prove that two types of rescaled partition functions converge weakly to the Stratonovich solution and the It\^o-Skorohod solution respectively of a fractional stochastic heat equation with multiplicative Gaussian noise which is white in space and colored in time.
Fractal Calibration for long-tailed object detection
Real-world datasets follow an imbalanced distribution, which poses significant challenges in rare-category object detection. Recent studies tackle this problem by developing re-weighting and re-sampling methods, that utilise the class frequencies of the dataset. However, these techniques focus solely on the frequency statistics and ignore the distribution of the classes in image space, missing important information. In contrast to them, we propose FRActal CALibration (FRACAL): a novel post-calibration method for long-tailed object detection. FRACAL devises a logit adjustment method that utilises the fractal dimension to estimate how uniformly classes are distributed in image space. During inference, it uses the fractal dimension to inversely downweight the probabilities of uniformly spaced class predictions achieving balance in two axes: between frequent and rare categories, and between uniformly spaced and sparsely spaced classes. FRACAL is a post-processing method and it does not require any training, also it can be combined with many off-the-shelf models such as one-stage sigmoid detectors and two-stage instance segmentation models. FRACAL boosts the rare class performance by up to 8.6% and surpasses all previous methods on LVIS dataset, while showing good generalisation to other datasets such as COCO, V3Det and OpenImages. We provide the code at https://github.com/kostas1515/FRACAL.
Kernel Density Estimators in Large Dimensions
This paper studies Kernel density estimation for a high-dimensional distribution rho(x). Traditional approaches have focused on the limit of large number of data points n and fixed dimension d. We analyze instead the regime where both the number n of data points y_i and their dimensionality d grow with a fixed ratio alpha=(log n)/d. Our study reveals three distinct statistical regimes for the kernel-based estimate of the density hat rho_h^{D}(x)=1{n h^d}sum_{i=1}^n Kleft(x-y_i{h}right), depending on the bandwidth h: a classical regime for large bandwidth where the Central Limit Theorem (CLT) holds, which is akin to the one found in traditional approaches. Below a certain value of the bandwidth, h_{CLT}(alpha), we find that the CLT breaks down. The statistics of hat rho_h^{D}(x) for a fixed x drawn from rho(x) is given by a heavy-tailed distribution (an alpha-stable distribution). In particular below a value h_G(alpha), we find that hat rho_h^{D}(x) is governed by extreme value statistics: only a few points in the database matter and give the dominant contribution to the density estimator. We provide a detailed analysis for high-dimensional multivariate Gaussian data. We show that the optimal bandwidth threshold based on Kullback-Leibler divergence lies in the new statistical regime identified in this paper. Our findings reveal limitations of classical approaches, show the relevance of these new statistical regimes, and offer new insights for Kernel density estimation in high-dimensional settings.
The probabilistic world
Physics is based on probabilities as fundamental entities of a mathematical description. Expectation values of observables are computed according to the classical statistical rule. The overall probability distribution for one world covers all times. The quantum formalism arises once one focuses on the evolution of the time-local probabilistic information. Wave functions or the density matrix allow the formulation of a general linear evolution law for classical statistics. The quantum formalism for classical statistics is a powerful tool which allows us to implement for generalized Ising models the momentum observable with the associated Fourier representation. The association of operators to observables permits the computation of expectation values in terms of the density matrix by the usual quantum rule. We show that probabilistic cellular automata are quantum systems in a formulation with discrete time steps and real wave functions. With a complex structure the evolution operator for automata can be expressed in terms of a Hamiltonian involving fermionic creation and annihilation operators. The time-local probabilistic information amounts to a subsystem of the overall probabilistic system which is correlated with its environment consisting of the past and future. Such subsystems typically involve probabilistic observables for which only a probability distribution for their possible measurement values is available. Incomplete statistics does not permit to compute classical correlation functions for arbitrary subsystem-observables. Bell's inequalities are not generally applicable.
Fractional divergence-measure fields, Leibniz rule and Gauss-Green formula
Given alphain(0,1] and pin[1,+infty], we define the space DM^{alpha,p}(mathbb R^n) of L^p vector fields whose alpha-divergence is a finite Radon measure, extending the theory of divergence-measure vector fields to the distributional fractional setting. Our main results concern the absolute continuity properties of the alpha-divergence-measure with respect to the Hausdorff measure and fractional analogues of the Leibniz rule and the Gauss-Green formula. The sharpness of our results is discussed via some explicit examples.
Linear statistics for Coulomb gases: higher order cumulants
We consider N classical particles interacting via the Coulomb potential in spatial dimension d and in the presence of an external trap, at equilibrium at inverse temperature beta. In the large N limit, the particles are confined within a droplet of finite size. We study smooth linear statistics, i.e. the fluctuations of sums of the form {cal L}_N = sum_{i=1}^N f({bf x}_i), where {bf x}_i's are the positions of the particles and where f({bf x}_i) is a sufficiently regular function. There exists at present standard results for the first and second moments of {cal L}_N in the large N limit, as well as associated Central Limit Theorems in general dimension and for a wide class of confining potentials. Here we obtain explicit expressions for the higher order cumulants of {cal L}_N at large N, when the function f({bf x})=f(|{bf x}|) and the confining potential are both rotationnally invariant. A remarkable feature of our results is that these higher cumulants depend only on the value of f'(|{bf x}|) and its higher order derivatives evaluated exactly at the boundary of the droplet, which in this case is a d-dimensional sphere. In the particular two-dimensional case d=2 at the special value beta=2, a connection to the Ginibre ensemble allows us to derive these results in an alternative way using the tools of determinantal point processes. Finally we also obtain the large deviation form of the full probability distribution function of {cal L}_N.
Unleashing the Potential of Fractional Calculus in Graph Neural Networks with FROND
We introduce the FRactional-Order graph Neural Dynamical network (FROND), a new continuous graph neural network (GNN) framework. Unlike traditional continuous GNNs that rely on integer-order differential equations, FROND employs the Caputo fractional derivative to leverage the non-local properties of fractional calculus. This approach enables the capture of long-term dependencies in feature updates, moving beyond the Markovian update mechanisms in conventional integer-order models and offering enhanced capabilities in graph representation learning. We offer an interpretation of the node feature updating process in FROND from a non-Markovian random walk perspective when the feature updating is particularly governed by a diffusion process. We demonstrate analytically that oversmoothing can be mitigated in this setting. Experimentally, we validate the FROND framework by comparing the fractional adaptations of various established integer-order continuous GNNs, demonstrating their consistently improved performance and underscoring the framework's potential as an effective extension to enhance traditional continuous GNNs. The code is available at https://github.com/zknus/ICLR2024-FROND.
Evidence of Nonlinear Signatures in Solar Wind Proton Density at the L1 Lagrange point
The solar wind is a medium characterized by strong turbulence and significant field fluctuations on various scales. Recent observations have revealed that magnetic turbulence exhibits a self-similar behavior. Similarly, high-resolution measurements of the proton density have shown comparable characteristics, prompting several studies into the multifractal properties of these density fluctuations. In this work, we show that low-resolution observations of the solar wind proton density over time, recorded by various spacecraft at Lagrange point L1, also exhibit non-linear and multifractal structures. The novelty of our study lies in the fact that this is the first systematic analysis of solar wind proton density using low-resolution (hourly) data collected by multiple spacecraft at the L1 Lagrange point over a span of 17 years. Furthermore, we interpret our results within the framework of non-extensive statistical mechanics, which appears to be consistent with the observed nonlinear behavior. Based on the data, we successfully validate the q-triplet predicted by non-extensive statistical theory. To the best of our knowledge, this represents the most rigorous and systematic validation to date of the q-triplet in the solar wind.
Unification of popular artificial neural network activation functions
We present a unified representation of the most popular neural network activation functions. Adopting Mittag-Leffler functions of fractional calculus, we propose a flexible and compact functional form that is able to interpolate between various activation functions and mitigate common problems in training neural networks such as vanishing and exploding gradients. The presented gated representation extends the scope of fixed-shape activation functions to their adaptive counterparts whose shape can be learnt from the training data. The derivatives of the proposed functional form can also be expressed in terms of Mittag-Leffler functions making it a suitable candidate for gradient-based backpropagation algorithms. By training multiple neural networks of different complexities on various datasets with different sizes, we demonstrate that adopting a unified gated representation of activation functions offers a promising and affordable alternative to individual built-in implementations of activation functions in conventional machine learning frameworks.
Fractal Generative Models
Modularization is a cornerstone of computer science, abstracting complex functions into atomic building blocks. In this paper, we introduce a new level of modularization by abstracting generative models into atomic generative modules. Analogous to fractals in mathematics, our method constructs a new type of generative model by recursively invoking atomic generative modules, resulting in self-similar fractal architectures that we call fractal generative models. As a running example, we instantiate our fractal framework using autoregressive models as the atomic generative modules and examine it on the challenging task of pixel-by-pixel image generation, demonstrating strong performance in both likelihood estimation and generation quality. We hope this work could open a new paradigm in generative modeling and provide a fertile ground for future research. Code is available at https://github.com/LTH14/fractalgen.
Generalized Kernel Thinning
The kernel thinning (KT) algorithm of Dwivedi and Mackey (2021) compresses a probability distribution more effectively than independent sampling by targeting a reproducing kernel Hilbert space (RKHS) and leveraging a less smooth square-root kernel. Here we provide four improvements. First, we show that KT applied directly to the target RKHS yields tighter, dimension-free guarantees for any kernel, any distribution, and any fixed function in the RKHS. Second, we show that, for analytic kernels like Gaussian, inverse multiquadric, and sinc, target KT admits maximum mean discrepancy (MMD) guarantees comparable to or better than those of square-root KT without making explicit use of a square-root kernel. Third, we prove that KT with a fractional power kernel yields better-than-Monte-Carlo MMD guarantees for non-smooth kernels, like Laplace and Mat\'ern, that do not have square-roots. Fourth, we establish that KT applied to a sum of the target and power kernels (a procedure we call KT+) simultaneously inherits the improved MMD guarantees of power KT and the tighter individual function guarantees of target KT. In our experiments with target KT and KT+, we witness significant improvements in integration error even in 100 dimensions and when compressing challenging differential equation posteriors.
Variational Inference for SDEs Driven by Fractional Noise
We present a novel variational framework for performing inference in (neural) stochastic differential equations (SDEs) driven by Markov-approximate fractional Brownian motion (fBM). SDEs offer a versatile tool for modeling real-world continuous-time dynamic systems with inherent noise and randomness. Combining SDEs with the powerful inference capabilities of variational methods, enables the learning of representative function distributions through stochastic gradient descent. However, conventional SDEs typically assume the underlying noise to follow a Brownian motion (BM), which hinders their ability to capture long-term dependencies. In contrast, fractional Brownian motion (fBM) extends BM to encompass non-Markovian dynamics, but existing methods for inferring fBM parameters are either computationally demanding or statistically inefficient. In this paper, building upon the Markov approximation of fBM, we derive the evidence lower bound essential for efficient variational inference of posterior path measures, drawing from the well-established field of stochastic analysis. Additionally, we provide a closed-form expression to determine optimal approximation coefficients. Furthermore, we propose the use of neural networks to learn the drift, diffusion and control terms within our variational posterior, leading to the variational training of neural-SDEs. In this framework, we also optimize the Hurst index, governing the nature of our fractional noise. Beyond validation on synthetic data, we contribute a novel architecture for variational latent video prediction,-an approach that, to the best of our knowledge, enables the first variational neural-SDE application to video perception.
A Test for Jumps in Metric-Space Conditional Means
Standard methods for detecting discontinuities in conditional means are not applicable to outcomes that are complex, non-Euclidean objects like distributions, networks, or covariance matrices. This article develops a nonparametric test for jumps in conditional means when outcomes lie in a non-Euclidean metric space. Using local Fr\'echet regressionx2014which generalizes standard regression to metric-space valued datax2014the method estimates a mean path on either side of a candidate cutoff, extending existing k-sample tests to a flexible regression setting. Key theoretical contributions include a central limit theorem for the local estimator of the conditional Fr\'echet variance and the asymptotic validity and consistency of the proposed test. Simulations confirm nominal size control and robust power in finite samples. Two applications demonstrate the method's value by revealing effects invisible to scalar-based tests. First, I detect a sharp change in work-from-home compositions at Washington State's income threshold for non-compete enforceability during COVID-19, highlighting remote work's role as a bargaining margin. Second, I find that countries restructure their input-output networks after losing preferential US trade access. These findings underscore that analyzing regression functions within their native metric spaces can reveal structural discontinuities that scalar summaries would miss.
Time-Fractional Approach to the Electrochemical Impedance: The Displacement Current
We establish, in general terms, the conditions to be satisfied by a time-fractional approach formulation of the Poisson-Nernst-Planck model in order to guarantee that the total current across the sample be solenoidal, as required by the Maxwell equation. Only in this case the electric impedance of a cell can be determined as the ratio between the applied difference of potential and the current across the cell. We show that in the case of anomalous diffusion, the model predicts for the electric impedance of the cell a constant phase element behaviour in the low frequency region. In the parametric curve of the reactance versus the resistance, the slope coincides with the order of the fractional time derivative.
Chaos as an interpretable benchmark for forecasting and data-driven modelling
The striking fractal geometry of strange attractors underscores the generative nature of chaos: like probability distributions, chaotic systems can be repeatedly measured to produce arbitrarily-detailed information about the underlying attractor. Chaotic systems thus pose a unique challenge to modern statistical learning techniques, while retaining quantifiable mathematical properties that make them controllable and interpretable as benchmarks. Here, we present a growing database currently comprising 131 known chaotic dynamical systems spanning fields such as astrophysics, climatology, and biochemistry. Each system is paired with precomputed multivariate and univariate time series. Our dataset has comparable scale to existing static time series databases; however, our systems can be re-integrated to produce additional datasets of arbitrary length and granularity. Our dataset is annotated with known mathematical properties of each system, and we perform feature analysis to broadly categorize the diverse dynamics present across the collection. Chaotic systems inherently challenge forecasting models, and across extensive benchmarks we correlate forecasting performance with the degree of chaos present. We also exploit the unique generative properties of our dataset in several proof-of-concept experiments: surrogate transfer learning to improve time series classification, importance sampling to accelerate model training, and benchmarking symbolic regression algorithms.
Central limit theorems under non-stationarity via relative weak convergence
Statistical inference for non-stationary data is hindered by the failure of classical central limit theorems (CLTs), not least because there is no fixed Gaussian limit to converge to. To resolve this, we introduce relative weak convergence, an extension of weak convergence that compares a statistic or process to a sequence of evolving processes. Relative weak convergence retains the essential consequences of classical weak convergence and coincides with it under stationarity. Crucially, it applies in general non-stationary settings where classical weak convergence fails. We establish concrete relative CLTs for random vectors and empirical processes, along with sequential, weighted, and bootstrap variants, that parallel the state-of-the-art in stationary settings. Our framework and results offer simple, plug-in replacements for classical CLTs whenever stationarity is untenable, as illustrated by applications in nonparametric trend estimation and hypothesis testing.
Discrete Galerkin Method for Fractional Integro-Differential Equations
In this paper, we develop a fully discrete Galerkin method for solving initial value fractional integro-differential equations(FIDEs). We consider Generalized Jacobi polynomials(GJPs) with indexes corresponding to the number of homogeneous initial conditions as natural basis functions for the approximate solution. The fractional derivatives are used in the Caputo sense. The numerical solvability of algebraic system obtained from implementation of proposed method for a special case of FIDEs is investigated. We also provide a suitable convergence analysis to approximate solutions under a more general regularity assumption on the exact solution.
Preserving Statistical Validity in Adaptive Data Analysis
A great deal of effort has been devoted to reducing the risk of spurious scientific discoveries, from the use of sophisticated validation techniques, to deep statistical methods for controlling the false discovery rate in multiple hypothesis testing. However, there is a fundamental disconnect between the theoretical results and the practice of data analysis: the theory of statistical inference assumes a fixed collection of hypotheses to be tested, or learning algorithms to be applied, selected non-adaptively before the data are gathered, whereas in practice data is shared and reused with hypotheses and new analyses being generated on the basis of data exploration and the outcomes of previous analyses. In this work we initiate a principled study of how to guarantee the validity of statistical inference in adaptive data analysis. As an instance of this problem, we propose and investigate the question of estimating the expectations of m adaptively chosen functions on an unknown distribution given n random samples. We show that, surprisingly, there is a way to estimate an exponential in n number of expectations accurately even if the functions are chosen adaptively. This gives an exponential improvement over standard empirical estimators that are limited to a linear number of estimates. Our result follows from a general technique that counter-intuitively involves actively perturbing and coordinating the estimates, using techniques developed for privacy preservation. We give additional applications of this technique to our question.
The Slepian model based independent interval approximation of persistency and zero-level exceedance distributions
In physics and engineering literature, the distribution of the excursion-above-zero time distribution (exceedance distribution) for a stationary Gaussian process has been approximated by a stationary switching process with independently distributed switching times. The approach matched the covariance of the clipped Gaussian process with the one for the stationary switching process and the distribution of the latter was used as the so-called independent interval approximation (IIA). The approach successfully assessed the persistency exponent for many physically important processes but left an unanswered question when such an approach leads to a mathematically meaningful and proper exceedance distribution. Here we address this question by proposing an alternative matching of the expected values of the clipped Slepian process and the corresponding switched process initiated at the origin. The method has allowed resolving the mathematical correctness of the matching method for a large subclass of the Gaussian processes with monotonic covariance, for which we provide a sufficient condition for the validity of the IIA. Within this class, the IIA produces a valid distribution for the excursion time and is represented in an explicit stochastic form that connects directly to the covariance of the underlying Gaussian process. We compare the excursion level distributions as well as the corresponding persistency exponents obtained through the IIA method with numerically computed exact distributions, and the simulated distribution for several important Gaussian models. We also argue that for stationary Gaussian processes with a non-monotonic covariance, the IIA fails and should not be used.
The SIML method without microstructure noise
The SIML (abbreviation of Separating Information Maximal Likelihood) method, has been introduced by N. Kunitomo and S. Sato and their collaborators to estimate the integrated volatility of high-frequency data that is assumed to be an It\^o process but with so-called microstructure noise. The SIML estimator turned out to share many properties with the estimator introduced by P. Malliavin and M.E. Mancino. The present paper establishes the consistency and the asymptotic normality under a general sampling scheme but without microstructure noise. Specifically, a fast convergence shown for Malliavin--Mancino estimator by E. Clement and A. Gloter is also established for the SIML estimator.
Some Properties of Large Excursions of a Stationary Gaussian Process
The present work investigates two properties of level crossings of a stationary Gaussian process X(t) with autocorrelation function R_X(tau). We show firstly that if R_X(tau) admits finite second and fourth derivatives at the origin, the length of up-excursions above a large negative level -gamma is asymptotically exponential as -gamma to -infty. Secondly, assuming that R_X(tau) admits a finite second derivative at the origin and some defined properties, we derive the mean number of crossings as well as the length of successive excursions above two subsequent large levels. The asymptotic results are shown to be effective even for moderate values of crossing level. An application of the developed results is proposed to derive the probability of successive excursions above adjacent levels during a time window.
An OFDM Signal Identification Method for Wireless Communications Systems
Distinction of OFDM signals from single carrier signals is highly important for adaptive receiver algorithms and signal identification applications. OFDM signals exhibit Gaussian characteristics in time domain and fourth order cumulants of Gaussian distributed signals vanish in contrary to the cumulants of other signals. Thus fourth order cumulants can be utilized for OFDM signal identification. In this paper, first, formulations of the estimates of the fourth order cumulants for OFDM signals are provided. Then it is shown these estimates are affected significantly from the wireless channel impairments, frequency offset, phase offset and sampling mismatch. To overcome these problems, a general chi-square constant false alarm rate Gaussianity test which employs estimates of cumulants and their covariances is adapted to the specific case of wireless OFDM signals. Estimation of the covariance matrix of the fourth order cumulants are greatly simplified peculiar to the OFDM signals. A measurement setup is developed to analyze the performance of the identification method and for comparison purposes. A parametric measurement analysis is provided depending on modulation order, signal to noise ratio, number of symbols, and degree of freedom of the underlying test. The proposed method outperforms statistical tests which are based on fixed thresholds or empirical values, while a priori information requirement and complexity of the proposed method are lower than the coherent identification techniques.
Concentrating solutions of the fractional (p,q)-Choquard equation with exponential growth
This article deals with the following fractional (p,q)-Choquard equation with exponential growth of the form: $varepsilon^{ps}(-Delta)_{p}^{s}u+varepsilon^{qs}(-Delta)_q^su+ Z(x)(|u|^{p-2}u+|u|^{q-2}u)=varepsilon^{mu-N}[|x|^{-mu}*F(u)]f(u) in R^N, where s\in (0,1), \varepsilon>0 is a parameter, 2\leq p=N{s}<q, and 0<\mu<N. The nonlinear function f has an exponential growth at infinity and the continuous potential function Z satisfies suitable natural conditions. With the help of the Ljusternik-Schnirelmann category theory and variational methods, the multiplicity and concentration of positive solutions are obtained for \varepsilon>0$ small enough. In a certain sense, we generalize some previously known results.
Faster Rates of Convergence to Stationary Points in Differentially Private Optimization
We study the problem of approximating stationary points of Lipschitz and smooth functions under (varepsilon,delta)-differential privacy (DP) in both the finite-sum and stochastic settings. A point w is called an alpha-stationary point of a function F:R^drightarrowR if |nabla F(w)|leq alpha. We provide a new efficient algorithm that finds an Obig(big[sqrt{d}{nvarepsilon}big]^{2/3}big)-stationary point in the finite-sum setting, where n is the number of samples. This improves on the previous best rate of Obig(big[sqrt{d}{nvarepsilon}big]^{1/2}big). We also give a new construction that improves over the existing rates in the stochastic optimization setting, where the goal is to find approximate stationary points of the population risk. Our construction finds a Obig(1{n^{1/3}} + big[sqrt{d}{nvarepsilon}big]^{1/2}big)-stationary point of the population risk in time linear in n. Furthermore, under the additional assumption of convexity, we completely characterize the sample complexity of finding stationary points of the population risk (up to polylog factors) and show that the optimal rate on population stationarity is tilde Thetabig(1{n}+sqrt{d}{nvarepsilon}big). Finally, we show that our methods can be used to provide dimension-independent rates of Obig(1{n}+minbig(big[sqrt{rank}{nvarepsilon}big]^{2/3},1{(nvarepsilon)^{2/5}}big)big) on population stationarity for Generalized Linear Models (GLM), where rank is the rank of the design matrix, which improves upon the previous best known rate.
Geometry of Sample Spaces
In statistics, independent, identically distributed random samples do not carry a natural ordering, and their statistics are typically invariant with respect to permutations of their order. Thus, an n-sample in a space M can be considered as an element of the quotient space of M^n modulo the permutation group. The present paper takes this definition of sample space and the related concept of orbit types as a starting point for developing a geometric perspective on statistics. We aim at deriving a general mathematical setting for studying the behavior of empirical and population means in spaces ranging from smooth Riemannian manifolds to general stratified spaces. We fully describe the orbifold and path-metric structure of the sample space when M is a manifold or path-metric space, respectively. These results are non-trivial even when M is Euclidean. We show that the infinite sample space exists in a Gromov-Hausdorff type sense and coincides with the Wasserstein space of probability distributions on M. We exhibit Fr\'echet means and k-means as metric projections onto 1-skeleta or k-skeleta in Wasserstein space, and we define a new and more general notion of polymeans. This geometric characterization via metric projections applies equally to sample and population means, and we use it to establish asymptotic properties of polymeans such as consistency and asymptotic normality.
Practical randomness amplification and privatisation with implementations on quantum computers
We present an end-to-end and practical randomness amplification and privatisation protocol based on Bell tests. This allows the building of device-independent random number generators which output (near-)perfectly unbiased and private numbers, even if using an uncharacterised quantum device potentially built by an adversary. Our generation rates are linear in the repetition rate of the quantum device and the classical randomness post-processing has quasi-linear complexity - making it efficient on a standard personal laptop. The statistical analysis is also tailored for real-world quantum devices. Our protocol is then showcased on several different quantum computers. Although not purposely built for the task, we show that quantum computers can run faithful Bell tests by adding minimal assumptions. In this semi-device-independent manner, our protocol generates (near-)perfectly unbiased and private random numbers on today's quantum computers.
Block occurrences in the binary expansion
The binary sum-of-digits function s returns the number of ones in the binary expansion of a nonnegative integer. Cusick's Hamming weight conjecture states that, for all integers tgeq 0, the set of nonnegative integers n such that s(n+t)geq s(n) has asymptotic density strictly larger than 1/2. We are concerned with the block-additive function r returning the number of (overlapping) occurrences of the block 11 in the binary expansion of n. The main result of this paper is a central limit-type theorem for the difference r(n+t)-r(n): the corresponding probability function is uniformly close to a Gaussian, where the uniform error tends to 0 as the number of blocks of ones in the binary expansion of t tends to infty.
Analysis on Riemann Hypothesis with Cross Entropy Optimization and Reasoning
In this paper, we present a novel framework for the analysis of Riemann Hypothesis [27], which is composed of three key components: a) probabilistic modeling with cross entropy optimization and reasoning; b) the application of the law of large numbers; c) the application of mathematical inductions. The analysis is mainly conducted by virtue of probabilistic modeling of cross entropy optimization and reasoning with rare event simulation techniques. The application of the law of large numbers [2, 3, 6] and the application of mathematical inductions make the analysis of Riemann Hypothesis self-contained and complete to make sure that the whole complex plane is covered as conjectured in Riemann Hypothesis. We also discuss the method of enhanced top-p sampling with large language models (LLMs) for reasoning, where next token prediction is not just based on the estimated probabilities of each possible token in the current round but also based on accumulated path probabilities among multiple top-k chain of thoughts (CoTs) paths. The probabilistic modeling of cross entropy optimization and reasoning may suit well with the analysis of Riemann Hypothesis as Riemann Zeta functions are inherently dealing with the sums of infinite components of a complex number series. We hope that our analysis in this paper could shed some light on some of the insights of Riemann Hypothesis. The framework and techniques presented in this paper, coupled with recent developments with chain of thought (CoT) or diagram of thought (DoT) reasoning in large language models (LLMs) with reinforcement learning (RL) [1, 7, 18, 21, 24, 34, 39-41], could pave the way for eventual proof of Riemann Hypothesis [27].
A Tale of Two Structures: Do LLMs Capture the Fractal Complexity of Language?
Language exhibits a fractal structure in its information-theoretic complexity (i.e. bits per token), with self-similarity across scales and long-range dependence (LRD). In this work, we investigate whether large language models (LLMs) can replicate such fractal characteristics and identify conditions-such as temperature setting and prompting method-under which they may fail. Moreover, we find that the fractal parameters observed in natural language are contained within a narrow range, whereas those of LLMs' output vary widely, suggesting that fractal parameters might prove helpful in detecting a non-trivial portion of LLM-generated texts. Notably, these findings, and many others reported in this work, are robust to the choice of the architecture; e.g. Gemini 1.0 Pro, Mistral-7B and Gemma-2B. We also release a dataset comprising of over 240,000 articles generated by various LLMs (both pretrained and instruction-tuned) with different decoding temperatures and prompting methods, along with their corresponding human-generated texts. We hope that this work highlights the complex interplay between fractal properties, prompting, and statistical mimicry in LLMs, offering insights for generating, evaluating and detecting synthetic texts.
Exploring Quantum Spacetime with Topological Data Analysis
In a novel application of the tools of topological data analysis (TDA) to nonperturbative quantum gravity, we introduce a new class of observables that allows us to assess whether quantum spacetime really resembles a ``quantum foam" near the Planck scale. The key idea is to investigate the Betti numbers of coarse-grained path integral histories, regularized in terms of dynamical triangulations, as a function of the coarse-graining scale. In two dimensions our analysis exhibits the well-known fractal structure of Euclidean quantum gravity.
Regression Discontinuity Design with Distribution-Valued Outcomes
This article introduces Regression Discontinuity Design (RDD) with Distribution-Valued Outcomes (R3D), extending the standard RDD framework to settings where the outcome is a distribution rather than a scalar. Such settings arise when treatment is assigned at a higher level of aggregation than the outcome-for example, when a subsidy is allocated based on a firm-level revenue cutoff while the outcome of interest is the distribution of employee wages within the firm. Since standard RDD methods cannot accommodate such two-level randomness, I propose a novel approach based on random distributions. The target estimand is a "local average quantile treatment effect", which averages across random quantiles. To estimate this target, I introduce two related approaches: one that extends local polynomial regression to random quantiles and another based on local Fr\'echet regression, a form of functional regression. For both estimators, I establish asymptotic normality and develop uniform, debiased confidence bands together with a data-driven bandwidth selection procedure. Simulations validate these theoretical properties and show existing methods to be biased and inconsistent in this setting. I then apply the proposed methods to study the effects of gubernatorial party control on within-state income distributions in the US, using a close-election design. The results suggest a classic equality-efficiency tradeoff under Democratic governorship, driven by reductions in income at the top of the distribution.
Disintegration and Bayesian Inversion via String Diagrams
The notions of disintegration and Bayesian inversion are fundamental in conditional probability theory. They produce channels, as conditional probabilities, from a joint state, or from an already given channel (in opposite direction). These notions exist in the literature, in concrete situations, but are presented here in abstract graphical formulations. The resulting abstract descriptions are used for proving basic results in conditional probability theory. The existence of disintegration and Bayesian inversion is discussed for discrete probability, and also for measure-theoretic probability --- via standard Borel spaces and via likelihoods. Finally, the usefulness of disintegration and Bayesian inversion is illustrated in several examples.
User-defined Event Sampling and Uncertainty Quantification in Diffusion Models for Physical Dynamical Systems
Diffusion models are a class of probabilistic generative models that have been widely used as a prior for image processing tasks like text conditional generation and inpainting. We demonstrate that these models can be adapted to make predictions and provide uncertainty quantification for chaotic dynamical systems. In these applications, diffusion models can implicitly represent knowledge about outliers and extreme events; however, querying that knowledge through conditional sampling or measuring probabilities is surprisingly difficult. Existing methods for conditional sampling at inference time seek mainly to enforce the constraints, which is insufficient to match the statistics of the distribution or compute the probability of the chosen events. To achieve these ends, optimally one would use the conditional score function, but its computation is typically intractable. In this work, we develop a probabilistic approximation scheme for the conditional score function which provably converges to the true distribution as the noise level decreases. With this scheme we are able to sample conditionally on nonlinear userdefined events at inference time, and matches data statistics even when sampling from the tails of the distribution.
On almost sure limit theorems for heavy-tailed products of long-range dependent linear processes
Marcinkiewicz strong law of large numbers, {n^{-frac1p}}sum_{k=1}^{n} (d_{k}- d)rightarrow 0 almost surely with pin(1,2), are developed for products d_k=prod_{r=1}^s x_k^{(r)}, where the x_k^{(r)} = sum_{l=-infty}^{infty}c_{k-l}^{(r)}xi_l^{(r)} are two-sided linear processes with coefficients {c_l^{(r)}}_{lin Z} and i.i.d. zero-mean innovations {xi_l^{(r)}}_{lin Z}. The decay of the coefficients c_l^{(r)} as |l|toinfty, can be slow enough for {x_k^{(r)}} to have long memory while {d_k} can have heavy tails. The long-range dependence and heavy tails for {d_k} are handled simultaneously and a decoupling property shows the convergence rate is dictated by the worst of long-range dependence and heavy tails, but not their combination. The Marcinkiewicz strong law of large numbers is also extended to the multivariate linear process case.
A New Rejection Sampling Approach to k-means++ With Improved Trade-Offs
The k-means++ seeding algorithm (Arthur & Vassilvitskii, 2007) is widely used in practice for the k-means clustering problem where the goal is to cluster a dataset X subset R ^d into k clusters. The popularity of this algorithm is due to its simplicity and provable guarantee of being O(log k) competitive with the optimal solution in expectation. However, its running time is O(|X|kd), making it expensive for large datasets. In this work, we present a simple and effective rejection sampling based approach for speeding up k-means++. Our first method runs in time O(nnz (X) + beta k^2d) while still being O(log k ) competitive in expectation. Here, beta is a parameter which is the ratio of the variance of the dataset to the optimal k-means cost in expectation and O hides logarithmic factors in k and |X|. Our second method presents a new trade-off between computational cost and solution quality. It incurs an additional scale-invariant factor of k^{-Omega( m/beta)} Var (X) in addition to the O(log k) guarantee of k-means++ improving upon a result of (Bachem et al, 2016a) who get an additional factor of m^{-1}Var(X) while still running in time O(nnz(X) + mk^2d). We perform extensive empirical evaluations to validate our theoretical results and to show the effectiveness of our approach on real datasets.
Sharp Noisy Binary Search with Monotonic Probabilities
We revisit the noisy binary search model of Karp and Kleinberg, in which we have n coins with unknown probabilities p_i that we can flip. The coins are sorted by increasing p_i, and we would like to find where the probability crosses (to within varepsilon) of a target value tau. This generalized the fixed-noise model of Burnashev and Zigangirov , in which p_i = 1{2} pm varepsilon, to a setting where coins near the target may be indistinguishable from it. Karp and Kleinberg showed that Theta(1{varepsilon^2} log n) samples are necessary and sufficient for this task. We produce a practical algorithm by solving two theoretical challenges: high-probability behavior and sharp constants. We give an algorithm that succeeds with probability 1-delta from \[ 1{C_{\tau, \varepsilon}} \cdot \left(\lg n + O(\log^{2/3} n \log^{1/3} 1{\delta} + \log 1{\delta})\right) \] samples, where C_{tau, varepsilon} is the optimal such constant achievable. For delta > n^{-o(1)} this is within 1 + o(1) of optimal, and for delta ll 1 it is the first bound within constant factors of optimal.
Estimation Beyond Data Reweighting: Kernel Method of Moments
Moment restrictions and their conditional counterparts emerge in many areas of machine learning and statistics ranging from causal inference to reinforcement learning. Estimators for these tasks, generally called methods of moments, include the prominent generalized method of moments (GMM) which has recently gained attention in causal inference. GMM is a special case of the broader family of empirical likelihood estimators which are based on approximating a population distribution by means of minimizing a varphi-divergence to an empirical distribution. However, the use of varphi-divergences effectively limits the candidate distributions to reweightings of the data samples. We lift this long-standing limitation and provide a method of moments that goes beyond data reweighting. This is achieved by defining an empirical likelihood estimator based on maximum mean discrepancy which we term the kernel method of moments (KMM). We provide a variant of our estimator for conditional moment restrictions and show that it is asymptotically first-order optimal for such problems. Finally, we show that our method achieves competitive performance on several conditional moment restriction tasks.
Denotational validation of higher-order Bayesian inference
We present a modular semantic account of Bayesian inference algorithms for probabilistic programming languages, as used in data science and machine learning. Sophisticated inference algorithms are often explained in terms of composition of smaller parts. However, neither their theoretical justification nor their implementation reflects this modularity. We show how to conceptualise and analyse such inference algorithms as manipulating intermediate representations of probabilistic programs using higher-order functions and inductive types, and their denotational semantics. Semantic accounts of continuous distributions use measurable spaces. However, our use of higher-order functions presents a substantial technical difficulty: it is impossible to define a measurable space structure over the collection of measurable functions between arbitrary measurable spaces that is compatible with standard operations on those functions, such as function application. We overcome this difficulty using quasi-Borel spaces, a recently proposed mathematical structure that supports both function spaces and continuous distributions. We define a class of semantic structures for representing probabilistic programs, and semantic validity criteria for transformations of these representations in terms of distribution preservation. We develop a collection of building blocks for composing representations. We use these building blocks to validate common inference algorithms such as Sequential Monte Carlo and Markov Chain Monte Carlo. To emphasize the connection between the semantic manipulation and its traditional measure theoretic origins, we use Kock's synthetic measure theory. We demonstrate its usefulness by proving a quasi-Borel counterpart to the Metropolis-Hastings-Green theorem.
Lattice models of random advection and diffusion and their statistics
We study in detail a one-dimensional lattice model of a continuum, conserved field (mass) that is transferred deterministically between neighbouring random sites. The model falls in a wider class of lattice models capturing the joint effect of random advection and diffusion and encompassing as specific cases, some models studied in the literature, like the Kang-Redner, Kipnis-Marchioro-Presutti, Takayasu-Taguchi, etc. The motivation for our setup comes from a straightforward interpretation as advection of particles in one-dimensional turbulence, but it is also related to a problem of synchronization of dynamical systems driven by common noise. For finite lattices, we study both the coalescence of an initially spread field (interpreted as roughening), and the statistical steady-state properties. We distinguish two main size-dependent regimes, depending on the strength of the diffusion term and on the lattice size. Using numerical simulations and mean-field approach, we study the statistics of the field. For weak diffusion, we unveil a characteristic hierarchical structure of the field. We also connect the model and the iterated function systems concept.
Analytical confidence intervals for the number of different objects in data streams
This paper develops a new mathematical-statistical approach to analyze a class of Flajolet-Martin algorithms (FMa), and provides analytical confidence intervals for the number F0 of distinct elements in a stream, based on Chernoff bounds. The class of FMa has reached a significant popularity in bigdata stream learning, and the attention of the literature has mainly been based on algorithmic aspects, basically complexity optimality, while the statistical analysis of these class of algorithms has been often faced heuristically. The analysis provided here shows deep connections with mathematical special functions and with extreme value theory. The latter connection may help in explaining heuristic considerations, while the first opens many numerical issues, faced at the end of the present paper. Finally, the algorithms are tested on an anonymized real data stream and MonteCarlo simulations are provided to support our analytical choice in this context.
Intensity statistics inside an open wave-chaotic cavity with broken time-reversal invariance
Using the supersymmetric method of random matrix theory within the Heidelberg approach framework we provide statistical description of stationary intensity sampled in locations inside an open wave-chaotic cavity, assuming that the time-reversal invariance inside the cavity is fully broken. In particular, we show that when incoming waves are fed via a finite number M of open channels the probability density {cal P}(I) for the single-point intensity I decays as a power law for large intensities: {cal P}(I)sim I^{-(M+2)}, provided there is no internal losses. This behaviour is in marked difference with the Rayleigh law {cal P}(I)sim exp(-I/I) which turns out to be valid only in the limit Mto infty. We also find the joint probability density of intensities I_1, ldots, I_L in L>1 observation points, and then extract the corresponding statistics for the maximal intensity in the observation pattern. For Lto infty the resulting limiting extreme value statistics (EVS) turns out to be different from the classical EVS distributions.
A Convenient Category for Higher-Order Probability Theory
Higher-order probabilistic programming languages allow programmers to write sophisticated models in machine learning and statistics in a succinct and structured way, but step outside the standard measure-theoretic formalization of probability theory. Programs may use both higher-order functions and continuous distributions, or even define a probability distribution on functions. But standard probability theory does not handle higher-order functions well: the category of measurable spaces is not cartesian closed. Here we introduce quasi-Borel spaces. We show that these spaces: form a new formalization of probability theory replacing measurable spaces; form a cartesian closed category and so support higher-order functions; form a well-pointed category and so support good proof principles for equational reasoning; and support continuous probability distributions. We demonstrate the use of quasi-Borel spaces for higher-order functions and probability by: showing that a well-known construction of probability theory involving random functions gains a cleaner expression; and generalizing de Finetti's theorem, that is a crucial theorem in probability theory, to quasi-Borel spaces.
Generalized Polya's theorem on connected locally compact Abelian groups of dimension 1
According to the generalized Polya theorem, the Gaussian distribution on the real line is characterized by the property of equidistribution of a monomial and a linear form of independent identically distributed random variables. We give a complete description of a-adic solenoids for which an analog of this theorem is true. The proof of the main theorem is reduced to solving some functional equation in the class of continuous positive definite functions on the character group of an a-adic solenoid
Blackbox Model Provenance via Palimpsestic Membership Inference
Suppose Alice trains an open-weight language model and Bob uses a blackbox derivative of Alice's model to produce text. Can Alice prove that Bob is using her model, either by querying Bob's derivative model (query setting) or from the text alone (observational setting)? We formulate this question as an independence testing problem--in which the null hypothesis is that Bob's model or text is independent of Alice's randomized training run--and investigate it through the lens of palimpsestic memorization in language models: models are more likely to memorize data seen later in training, so we can test whether Bob is using Alice's model using test statistics that capture correlation between Bob's model or text and the ordering of training examples in Alice's training run. If Alice has randomly shuffled her training data, then any significant correlation amounts to exactly quantifiable statistical evidence against the null hypothesis, regardless of the composition of Alice's training data. In the query setting, we directly estimate (via prompting) the likelihood Bob's model gives to Alice's training examples and order; we correlate the likelihoods of over 40 fine-tunes of various Pythia and OLMo base models ranging from 1B to 12B parameters with the base model's training data order, achieving a p-value on the order of at most 1e-8 in all but six cases. In the observational setting, we try two approaches based on estimating 1) the likelihood of Bob's text overlapping with spans of Alice's training examples and 2) the likelihood of Bob's text with respect to different versions of Alice's model we obtain by repeating the last phase (e.g., 1%) of her training run on reshuffled data. The second approach can reliably distinguish Bob's text from as little as a few hundred tokens; the first does not involve any retraining but requires many more tokens (several hundred thousand) to achieve high power.
Sampling Multimodal Distributions with the Vanilla Score: Benefits of Data-Based Initialization
There is a long history, as well as a recent explosion of interest, in statistical and generative modeling approaches based on score functions -- derivatives of the log-likelihood of a distribution. In seminal works, Hyv\"arinen proposed vanilla score matching as a way to learn distributions from data by computing an estimate of the score function of the underlying ground truth, and established connections between this method and established techniques like Contrastive Divergence and Pseudolikelihood estimation. It is by now well-known that vanilla score matching has significant difficulties learning multimodal distributions. Although there are various ways to overcome this difficulty, the following question has remained unanswered -- is there a natural way to sample multimodal distributions using just the vanilla score? Inspired by a long line of related experimental works, we prove that the Langevin diffusion with early stopping, initialized at the empirical distribution, and run on a score function estimated from data successfully generates natural multimodal distributions (mixtures of log-concave distributions).
SGMM: Stochastic Approximation to Generalized Method of Moments
We introduce a new class of algorithms, Stochastic Generalized Method of Moments (SGMM), for estimation and inference on (overidentified) moment restriction models. Our SGMM is a novel stochastic approximation alternative to the popular Hansen (1982) (offline) GMM, and offers fast and scalable implementation with the ability to handle streaming datasets in real time. We establish the almost sure convergence, and the (functional) central limit theorem for the inefficient online 2SLS and the efficient SGMM. Moreover, we propose online versions of the Durbin-Wu-Hausman and Sargan-Hansen tests that can be seamlessly integrated within the SGMM framework. Extensive Monte Carlo simulations show that as the sample size increases, the SGMM matches the standard (offline) GMM in terms of estimation accuracy and gains over computational efficiency, indicating its practical value for both large-scale and online datasets. We demonstrate the efficacy of our approach by a proof of concept using two well known empirical examples with large sample sizes.
Quantifying Network Similarity using Graph Cumulants
How might one test the hypothesis that networks were sampled from the same distribution? Here, we compare two statistical tests that use subgraph counts to address this question. The first uses the empirical subgraph densities themselves as estimates of those of the underlying distribution. The second test uses a new approach that converts these subgraph densities into estimates of the graph cumulants of the distribution (without any increase in computational complexity). We demonstrate -- via theory, simulation, and application to real data -- the superior statistical power of using graph cumulants. In summary, when analyzing data using subgraph/motif densities, we suggest using the corresponding graph cumulants instead.
Cluster-Specific Predictions with Multi-Task Gaussian Processes
A model involving Gaussian processes (GPs) is introduced to simultaneously handle multi-task learning, clustering, and prediction for multiple functional data. This procedure acts as a model-based clustering method for functional data as well as a learning step for subsequent predictions for new tasks. The model is instantiated as a mixture of multi-task GPs with common mean processes. A variational EM algorithm is derived for dealing with the optimisation of the hyper-parameters along with the hyper-posteriors' estimation of latent variables and processes. We establish explicit formulas for integrating the mean processes and the latent clustering variables within a predictive distribution, accounting for uncertainty on both aspects. This distribution is defined as a mixture of cluster-specific GP predictions, which enhances the performances when dealing with group-structured data. The model handles irregular grid of observations and offers different hypotheses on the covariance structure for sharing additional information across tasks. The performances on both clustering and prediction tasks are assessed through various simulated scenarios and real datasets. The overall algorithm, called MagmaClust, is publicly available as an R package.
Deep Neural Network Initialization with Sparsity Inducing Activations
Inducing and leveraging sparse activations during training and inference is a promising avenue for improving the computational efficiency of deep networks, which is increasingly important as network sizes continue to grow and their application becomes more widespread. Here we use the large width Gaussian process limit to analyze the behaviour, at random initialization, of nonlinear activations that induce sparsity in the hidden outputs. A previously unreported form of training instability is proven for arguably two of the most natural candidates for hidden layer sparsification; those being a shifted ReLU (phi(x)=max(0, x-tau) for tauge 0) and soft thresholding (phi(x)=0 for |x|letau and x-sign(x)tau for |x|>tau). We show that this instability is overcome by clipping the nonlinear activation magnitude, at a level prescribed by the shape of the associated Gaussian process variance map. Numerical experiments verify the theory and show that the proposed magnitude clipped sparsifying activations can be trained with training and test fractional sparsity as high as 85\% while retaining close to full accuracy.
Probabilistic Generating Circuits
Generating functions, which are widely used in combinatorics and probability theory, encode function values into the coefficients of a polynomial. In this paper, we explore their use as a tractable probabilistic model, and propose probabilistic generating circuits (PGCs) for their efficient representation. PGCs are strictly more expressive efficient than many existing tractable probabilistic models, including determinantal point processes (DPPs), probabilistic circuits (PCs) such as sum-product networks, and tractable graphical models. We contend that PGCs are not just a theoretical framework that unifies vastly different existing models, but also show great potential in modeling realistic data. We exhibit a simple class of PGCs that are not trivially subsumed by simple combinations of PCs and DPPs, and obtain competitive performance on a suite of density estimation benchmarks. We also highlight PGCs' connection to the theory of strongly Rayleigh distributions.
Phase Transitions in the Detection of Correlated Databases
We study the problem of detecting the correlation between two Gaussian databases XinR^{ntimes d} and Y^{ntimes d}, each composed of n users with d features. This problem is relevant in the analysis of social media, computational biology, etc. We formulate this as a hypothesis testing problem: under the null hypothesis, these two databases are statistically independent. Under the alternative, however, there exists an unknown permutation sigma over the set of n users (or, row permutation), such that X is rho-correlated with Y^sigma, a permuted version of Y. We determine sharp thresholds at which optimal testing exhibits a phase transition, depending on the asymptotic regime of n and d. Specifically, we prove that if rho^2dto0, as dtoinfty, then weak detection (performing slightly better than random guessing) is statistically impossible, irrespectively of the value of n. This compliments the performance of a simple test that thresholds the sum all entries of X^TY. Furthermore, when d is fixed, we prove that strong detection (vanishing error probability) is impossible for any rho<rho^star, where rho^star is an explicit function of d, while weak detection is again impossible as long as rho^2dto0. These results close significant gaps in current recent related studies.
Do logarithmic proximity measures outperform plain ones in graph clustering?
We consider a number of graph kernels and proximity measures including commute time kernel, regularized Laplacian kernel, heat kernel, exponential diffusion kernel (also called "communicability"), etc., and the corresponding distances as applied to clustering nodes in random graphs and several well-known datasets. The model of generating random graphs involves edge probabilities for the pairs of nodes that belong to the same class or different predefined classes of nodes. It turns out that in most cases, logarithmic measures (i.e., measures resulting after taking logarithm of the proximities) perform better while distinguishing underlying classes than the "plain" measures. A comparison in terms of reject curves of inter-class and intra-class distances confirms this conclusion. A similar conclusion can be made for several well-known datasets. A possible origin of this effect is that most kernels have a multiplicative nature, while the nature of distances used in cluster algorithms is an additive one (cf. the triangle inequality). The logarithmic transformation is a tool to transform the first nature to the second one. Moreover, some distances corresponding to the logarithmic measures possess a meaningful cutpoint additivity property. In our experiments, the leader is usually the logarithmic Communicability measure. However, we indicate some more complicated cases in which other measures, typically, Communicability and plain Walk, can be the winners.
Approximation of the truncated Zeta distribution and Zipf's law
Zipf's law appears in many application areas but does not have a closed form expression, which may make its use cumbersome. Since it coincides with the truncated version of the Zeta distribution, in this paper we propose three approximate closed form expressions for the truncated Zeta distribution, which may be employed for Zipf's law as well. The three approximations are based on the replacement of the sum occurring in Zipf's law with an integral, and are named respectively the integral approximation, the average integral approximation, and the trapezoidal approximation. While the first one is shown to be of little use, the trapezoidal approximation exhibits an error which is typically lower than 1\%, but is as low as 0.1\% for the range of values of the Zipf parameter below 1.
Probing small-scale power spectrum with gravitational-wave diffractive lensing
We develop a novel way to probe subgalactic-scale matter distribution with diffractive lensing on gravitational waves. Five-year observations from Einstein Telescope and DECIGO are expected to probe k= 10^5sim 10^8 ,{rm Mpc}^{-1} down to P(k) = 10^{-16} sim 10^{-14} ,{rm Mpc}^3 level. These results can be interpreted in terms of primordial black holes in the range M_{rm PBH} gtrsim 10^{-3}M_odot down to f_{rm PBH} = 10^{-6} level, or QCD axion minihalos in the range m_a = 10^{-3} sim 10^{-12} ,{rm eV}. A key result of the paper is the approximate relation between the scale k and the gravitational wave frequency f, derived in an ensemble of `multi-lensing' events. This relation enables direct measurement of the power spectrum at specific scales, with sensitivities characterized by model-independent kernels delta P(k). Additionally, we delineate the statistical properties of `multi-lensing' based on the `Fresnel number' N_F. When N_F cal O(1), the statistical significance can be approximately calculated by Variance of lensing effects, which is directly related to the power spectrum among other moments of matter distribution.
Differentially Private Distributed Bayesian Linear Regression with MCMC
We propose a novel Bayesian inference framework for distributed differentially private linear regression. We consider a distributed setting where multiple parties hold parts of the data and share certain summary statistics of their portions in privacy-preserving noise. We develop a novel generative statistical model for privately shared statistics, which exploits a useful distributional relation between the summary statistics of linear regression. Bayesian estimation of the regression coefficients is conducted mainly using Markov chain Monte Carlo algorithms, while we also provide a fast version to perform Bayesian estimation in one iteration. The proposed methods have computational advantages over their competitors. We provide numerical results on both real and simulated data, which demonstrate that the proposed algorithms provide well-rounded estimation and prediction.
Causal Inference by String Diagram Surgery
Extracting causal relationships from observed correlations is a growing area in probabilistic reasoning, originating with the seminal work of Pearl and others from the early 1990s. This paper develops a new, categorically oriented view based on a clear distinction between syntax (string diagrams) and semantics (stochastic matrices), connected via interpretations as structure-preserving functors. A key notion in the identification of causal effects is that of an intervention, whereby a variable is forcefully set to a particular value independent of any prior propensities. We represent the effect of such an intervention as an endofunctor which performs `string diagram surgery' within the syntactic category of string diagrams. This diagram surgery in turn yields a new, interventional distribution via the interpretation functor. While in general there is no way to compute interventional distributions purely from observed data, we show that this is possible in certain special cases using a calculational tool called comb disintegration. We demonstrate the use of this technique on a well-known toy example, where we predict the causal effect of smoking on cancer in the presence of a confounding common cause. After developing this specific example, we show this technique provides simple sufficient conditions for computing interventions which apply to a wide variety of situations considered in the causal inference literature.
MINDE: Mutual Information Neural Diffusion Estimation
In this work we present a new method for the estimation of Mutual Information (MI) between random variables. Our approach is based on an original interpretation of the Girsanov theorem, which allows us to use score-based diffusion models to estimate the Kullback Leibler divergence between two densities as a difference between their score functions. As a by-product, our method also enables the estimation of the entropy of random variables. Armed with such building blocks, we present a general recipe to measure MI, which unfolds in two directions: one uses conditional diffusion process, whereas the other uses joint diffusion processes that allow simultaneous modelling of two random variables. Our results, which derive from a thorough experimental protocol over all the variants of our approach, indicate that our method is more accurate than the main alternatives from the literature, especially for challenging distributions. Furthermore, our methods pass MI self-consistency tests, including data processing and additivity under independence, which instead are a pain-point of existing methods.
Two-parameter superposable S-curves
Straight line equation y=mx with slope m, when singularly perturbed as ay^3+y=mx with a positive parameter a, results in S-shaped curves or S-curves on a real plane. As arightarrow 0, we get back y=mx which is a cumulative distribution function of a continuous uniform distribution that describes the occurrence of every event in an interval to be equally probable. As arightarrowinfty, the derivative of y has finite support only at y=0 resembling a degenerate distribution. Based on these arguments, in this work, we propose that these S-curves can represent maximum entropy uniform distribution to a zero entropy single value. We also argue that these S-curves are superposable as they are only parametrically nonlinear but fundamentally linear. So far, the superposed forms have been used to capture the patterns of natural systems such as nonlinear dynamics of biological growth and kinetics of enzyme reactions. Here, we attempt to use the S-curve and its superposed form as statistical models. We fit the models on a classical dataset containing flower measurements of iris plants and analyze their usefulness in pattern recognition. Based on these models, we claim that any non-uniform pattern can be represented as a singular perturbation to uniform distribution. However, our parametric estimation procedure have some limitations such as sensitivity to initial conditions depending on the data at hand.
Statistical Inference and A/B Testing for First-Price Pacing Equilibria
We initiate the study of statistical inference and A/B testing for first-price pacing equilibria (FPPE). The FPPE model captures the dynamics resulting from large-scale first-price auction markets where buyers use pacing-based budget management. Such markets arise in the context of internet advertising, where budgets are prevalent. We propose a statistical framework for the FPPE model, in which a limit FPPE with a continuum of items models the long-run steady-state behavior of the auction platform, and an observable FPPE consisting of a finite number of items provides the data to estimate primitives of the limit FPPE, such as revenue, Nash social welfare (a fair metric of efficiency), and other parameters of interest. We develop central limit theorems and asymptotically valid confidence intervals. Furthermore, we establish the asymptotic local minimax optimality of our estimators. We then show that the theory can be used for conducting statistically valid A/B testing on auction platforms. Numerical simulations verify our central limit theorems, and empirical coverage rates for our confidence intervals agree with our theory.
Improved Analysis of Sparse Linear Regression in Local Differential Privacy Model
In this paper, we revisit the problem of sparse linear regression in the local differential privacy (LDP) model. Existing research in the non-interactive and sequentially local models has focused on obtaining the lower bounds for the case where the underlying parameter is 1-sparse, and extending such bounds to the more general k-sparse case has proven to be challenging. Moreover, it is unclear whether efficient non-interactive LDP (NLDP) algorithms exist. To address these issues, we first consider the problem in the epsilon non-interactive LDP model and provide a lower bound of Omega(sqrt{dklog d}{nepsilon}) on the ell_2-norm estimation error for sub-Gaussian data, where n is the sample size and d is the dimension of the space. We propose an innovative NLDP algorithm, the very first of its kind for the problem. As a remarkable outcome, this algorithm also yields a novel and highly efficient estimator as a valuable by-product. Our algorithm achieves an upper bound of O({dsqrt{k}{nepsilon}}) for the estimation error when the data is sub-Gaussian, which can be further improved by a factor of O(d) if the server has additional public but unlabeled data. For the sequentially interactive LDP model, we show a similar lower bound of Omega({sqrt{dk}{nepsilon}}). As for the upper bound, we rectify a previous method and show that it is possible to achieve a bound of O(ksqrt{d}{nepsilon}). Our findings reveal fundamental differences between the non-private case, central DP model, and local DP model in the sparse linear regression problem.
Concentration of Measure for Distributions Generated via Diffusion Models
We show via a combination of mathematical arguments and empirical evidence that data distributions sampled from diffusion models satisfy a Concentration of Measure Property saying that any Lipschitz 1-dimensional projection of a random vector is not too far from its mean with high probability. This implies that such models are quite restrictive and gives an explanation for a fact previously observed in the literature that conventional diffusion models cannot capture "heavy-tailed" data (i.e. data x for which the norm |x|_2 does not possess a sub-Gaussian tail) well. We then proceed to train a generalized linear model using stochastic gradient descent (SGD) on the diffusion-generated data for a multiclass classification task and observe empirically that a Gaussian universality result holds for the test error. In other words, the test error depends only on the first and second order statistics of the diffusion-generated data in the linear setting. Results of such forms are desirable because they allow one to assume the data itself is Gaussian for analyzing performance of the trained classifier. Finally, we note that current approaches to proving universality do not apply to this case as the covariance matrices of the data tend to have vanishing minimum singular values for the diffusion-generated data, while the current proofs assume that this is not the case (see Subsection 3.4 for more details). This leaves extending previous mathematical universality results as an intriguing open question.
A New Bound on the Cumulant Generating Function of Dirichlet Processes
In this paper, we introduce a novel approach for bounding the cumulant generating function (CGF) of a Dirichlet process (DP) X sim DP(αν_0), using superadditivity. In particular, our key technical contribution is the demonstration of the superadditivity of αmapsto log E_{X sim DP(αν_0)}[exp( E_X[αf])], where E_X[f] = int f dX. This result, combined with Fekete's lemma and Varadhan's integral lemma, converts the known asymptotic large deviation principle into a practical upper bound on the CGF logE_{Xsim DP(αν_0)}{exp(E_{X}{[f]})} for any α> 0. The bound is given by the convex conjugate of the scaled reversed Kullback-Leibler divergence αKL(ν_0Vert cdot). This new bound provides particularly effective confidence regions for sums of independent DPs, making it applicable across various fields.
Sharp Deviations Bounds for Dirichlet Weighted Sums with Application to analysis of Bayesian algorithms
In this work, we derive sharp non-asymptotic deviation bounds for weighted sums of Dirichlet random variables. These bounds are based on a novel integral representation of the density of a weighted Dirichlet sum. This representation allows us to obtain a Gaussian-like approximation for the sum distribution using geometry and complex analysis methods. Our results generalize similar bounds for the Beta distribution obtained in the seminal paper Alfers and Dinges [1984]. Additionally, our results can be considered a sharp non-asymptotic version of the inverse of Sanov's theorem studied by Ganesh and O'Connell [1999] in the Bayesian setting. Based on these results, we derive new deviation bounds for the Dirichlet process posterior means with application to Bayesian bootstrap. Finally, we apply our estimates to the analysis of the Multinomial Thompson Sampling (TS) algorithm in multi-armed bandits and significantly sharpen the existing regret bounds by making them independent of the size of the arms distribution support.
Chain of Log-Concave Markov Chains
We introduce a theoretical framework for sampling from unnormalized densities based on a smoothing scheme that uses an isotropic Gaussian kernel with a single fixed noise scale. We prove one can decompose sampling from a density (minimal assumptions made on the density) into a sequence of sampling from log-concave conditional densities via accumulation of noisy measurements with equal noise levels. Our construction is unique in that it keeps track of a history of samples, making it non-Markovian as a whole, but it is lightweight algorithmically as the history only shows up in the form of a running empirical mean of samples. Our sampling algorithm generalizes walk-jump sampling (Saremi & Hyv\"arinen, 2019). The "walk" phase becomes a (non-Markovian) chain of (log-concave) Markov chains. The "jump" from the accumulated measurements is obtained by empirical Bayes. We study our sampling algorithm quantitatively using the 2-Wasserstein metric and compare it with various Langevin MCMC algorithms. We also report a remarkable capacity of our algorithm to "tunnel" between modes of a distribution.
Computable Stochastic Processes
The aim of this paper is to present an elementary computable theory of probability, random variables and stochastic processes. The probability theory is baed on existing approaches using valuations and lower integrals. Various approaches to random variables are discussed, including the approach based on completions in a Polish space. We apply the theory to the study of stochastic dynamical systems in discrete-time, and give a brief exposition of the Wiener process as a foundation for stochastic differential equations. The theory is based within the framework of type-two effectivity, so has an explicit direct link with Turing computation, and is expressed in a system of computable types and operations, so has a clean mathematical description.
Continuous-Time Functional Diffusion Processes
We introduce Functional Diffusion Processes (FDPs), which generalize score-based diffusion models to infinite-dimensional function spaces. FDPs require a new mathematical framework to describe the forward and backward dynamics, and several extensions to derive practical training objectives. These include infinite-dimensional versions of Girsanov theorem, in order to be able to compute an ELBO, and of the sampling theorem, in order to guarantee that functional evaluations in a countable set of points are equivalent to infinite-dimensional functions. We use FDPs to build a new breed of generative models in function spaces, which do not require specialized network architectures, and that can work with any kind of continuous data. Our results on real data show that FDPs achieve high-quality image generation, using a simple MLP architecture with orders of magnitude fewer parameters than existing diffusion models.
Fractal Patterns May Unravel the Intelligence in Next-Token Prediction
We study the fractal structure of language, aiming to provide a precise formalism for quantifying properties that may have been previously suspected but not formally shown. We establish that language is: (1) self-similar, exhibiting complexities at all levels of granularity, with no particular characteristic context length, and (2) long-range dependent (LRD), with a Hurst parameter of approximately H=0.70. Based on these findings, we argue that short-term patterns/dependencies in language, such as in paragraphs, mirror the patterns/dependencies over larger scopes, like entire documents. This may shed some light on how next-token prediction can lead to a comprehension of the structure of text at multiple levels of granularity, from words and clauses to broader contexts and intents. We also demonstrate that fractal parameters improve upon perplexity-based bits-per-byte (BPB) in predicting downstream performance. We hope these findings offer a fresh perspective on language and the mechanisms underlying the success of LLMs.
Learning Mixtures of Gaussians with Censored Data
We study the problem of learning mixtures of Gaussians with censored data. Statistical learning with censored data is a classical problem, with numerous practical applications, however, finite-sample guarantees for even simple latent variable models such as Gaussian mixtures are missing. Formally, we are given censored data from a mixture of univariate Gaussians $sum_{i=1}^k w_i N(mu_i,sigma^2), i.e. the sample is observed only if it lies inside a set S. The goal is to learn the weights w_i and the means \mu_i. We propose an algorithm that takes only 1{\varepsilon^{O(k)}} samples to estimate the weights w_i and the means \mu_i within \varepsilon$ error.
Self-Distillation for Gaussian Process Regression and Classification
We propose two approaches to extend the notion of knowledge distillation to Gaussian Process Regression (GPR) and Gaussian Process Classification (GPC); data-centric and distribution-centric. The data-centric approach resembles most current distillation techniques for machine learning, and refits a model on deterministic predictions from the teacher, while the distribution-centric approach, re-uses the full probabilistic posterior for the next iteration. By analyzing the properties of these approaches, we show that the data-centric approach for GPR closely relates to known results for self-distillation of kernel ridge regression and that the distribution-centric approach for GPR corresponds to ordinary GPR with a very particular choice of hyperparameters. Furthermore, we demonstrate that the distribution-centric approach for GPC approximately corresponds to data duplication and a particular scaling of the covariance and that the data-centric approach for GPC requires redefining the model from a Binomial likelihood to a continuous Bernoulli likelihood to be well-specified. To the best of our knowledge, our proposed approaches are the first to formulate knowledge distillation specifically for Gaussian Process models.
Algorithm-assisted discovery of an intrinsic order among mathematical constants
In recent decades, a growing number of discoveries in fields of mathematics have been assisted by computer algorithms, primarily for exploring large parameter spaces that humans would take too long to investigate. As computers and algorithms become more powerful, an intriguing possibility arises - the interplay between human intuition and computer algorithms can lead to discoveries of novel mathematical concepts that would otherwise remain elusive. To realize this perspective, we have developed a massively parallel computer algorithm that discovers an unprecedented number of continued fraction formulas for fundamental mathematical constants. The sheer number of formulas discovered by the algorithm unveils a novel mathematical structure that we call the conservative matrix field. Such matrix fields (1) unify thousands of existing formulas, (2) generate infinitely many new formulas, and most importantly, (3) lead to unexpected relations between different mathematical constants, including multiple integer values of the Riemann zeta function. Conservative matrix fields also enable new mathematical proofs of irrationality. In particular, we can use them to generalize the celebrated proof by Ap\'ery for the irrationality of zeta(3). Utilizing thousands of personal computers worldwide, our computer-supported research strategy demonstrates the power of experimental mathematics, highlighting the prospects of large-scale computational approaches to tackle longstanding open problems and discover unexpected connections across diverse fields of science.
Sample Complexity Bounds for Learning High-dimensional Simplices in Noisy Regimes
In this paper, we find a sample complexity bound for learning a simplex from noisy samples. Assume a dataset of size n is given which includes i.i.d. samples drawn from a uniform distribution over an unknown simplex in R^K, where samples are assumed to be corrupted by a multi-variate additive Gaussian noise of an arbitrary magnitude. We prove the existence of an algorithm that with high probability outputs a simplex having a ell_2 distance of at most varepsilon from the true simplex (for any varepsilon>0). Also, we theoretically show that in order to achieve this bound, it is sufficient to have ngeleft(K^2/varepsilon^2right)e^{Omegaleft(K/SNR^2right)} samples, where SNR stands for the signal-to-noise ratio. This result solves an important open problem and shows as long as SNRgeOmegaleft(K^{1/2}right), the sample complexity of the noisy regime has the same order to that of the noiseless case. Our proofs are a combination of the so-called sample compression technique in ashtiani2018nearly, mathematical tools from high-dimensional geometry, and Fourier analysis. In particular, we have proposed a general Fourier-based technique for recovery of a more general class of distribution families from additive Gaussian noise, which can be further used in a variety of other related problems.
A structural equation formulation for general quasi-periodic Gaussian processes
This paper introduces a structural equation formulation that gives rise to a new family of quasi-periodic Gaussian processes, useful to process a broad class of natural and physiological signals. The proposed formulation simplifies generation and forecasting, and provides hyperparameter estimates, which we exploit in a convergent and consistent iterative estimation algorithm. A bootstrap approach for standard error estimation and confidence intervals is also provided. We demonstrate the computational and scaling benefits of the proposed approach on a broad class of problems, including water level tidal analysis, CO_{2} emission data, and sunspot numbers data. By leveraging the structural equations, our method reduces the cost of likelihood evaluations and predictions from O(k^2 p^2) to O(p^2), significantly improving scalability.
A synthetic approach to Markov kernels, conditional independence and theorems on sufficient statistics
We develop Markov categories as a framework for synthetic probability and statistics, following work of Golubtsov as well as Cho and Jacobs. This means that we treat the following concepts in purely abstract categorical terms: conditioning and disintegration; various versions of conditional independence and its standard properties; conditional products; almost surely; sufficient statistics; versions of theorems on sufficient statistics due to Fisher--Neyman, Basu, and Bahadur. Besides the conceptual clarity offered by our categorical setup, its main advantage is that it provides a uniform treatment of various types of probability theory, including discrete probability theory, measure-theoretic probability with general measurable spaces, Gaussian probability, stochastic processes of either of these kinds, and many others.
Private Statistical Estimation of Many Quantiles
This work studies the estimation of many statistical quantiles under differential privacy. More precisely, given a distribution and access to i.i.d. samples from it, we study the estimation of the inverse of its cumulative distribution function (the quantile function) at specific points. For instance, this task is of key importance in private data generation. We present two different approaches. The first one consists in privately estimating the empirical quantiles of the samples and using this result as an estimator of the quantiles of the distribution. In particular, we study the statistical properties of the recently published algorithm introduced by Kaplan et al. 2022 that privately estimates the quantiles recursively. The second approach is to use techniques of density estimation in order to uniformly estimate the quantile function on an interval. In particular, we show that there is a tradeoff between the two methods. When we want to estimate many quantiles, it is better to estimate the density rather than estimating the quantile function at specific points.
A Note on Shumailov et al. (2024): `AI Models Collapse When Trained on Recursively Generated Data'
The study conducted by Shumailov et al. (2024) demonstrates that repeatedly training a generative model on synthetic data leads to model collapse. This finding has generated considerable interest and debate, particularly given that current models have nearly exhausted the available data. In this work, we investigate the effects of fitting a distribution (through Kernel Density Estimation, or KDE) or a model to the data, followed by repeated sampling from it. Our objective is to develop a theoretical understanding of the phenomenon observed by Shumailov et al. (2024). Our results indicate that the outcomes reported are a statistical phenomenon and may be unavoidable.
Search for dark matter subhalos among unassociated Fermi-LAT sources in presence of dataset shift
We search for dark matter (DM) annihilating subhalos of the Milky Way halo among the Fermi Large Area Telescope (LAT) unassociated sources. We construct, for the first time, a statistical model of the unassociated sources at latitudes above 10 degrees. The latter is built as a combination of both DM annihilation subhalos as well as Galactic and extragalactic astrophysical components. The astrophysical components are constructed based on distributions of associated sources, while the distribution of DM subhalos is derived from Monte Carlo simulations. In this model we take into account the differences in the distributions of associated and unassociated sources including both covariate and prior probability shifts (both being forms of ``dataset shifts''). Previous searches of DM subhalos were based on classify-and-count strategies, while the approach adopted in this work is based on quantification learning, which allows one to determine a well-defined statistical interpretation of the contribution of a population of DM subhalos to the unassociated Fermi-LAT sources. In the bb annihilation channel and for a range of DM masses from 10 GeV to 1 TeV, we don't find a significant contribution from DM subhalos and derive a statistical 95% confidence upper limit on the DM annihilation cross section in this channel. While the derived limits are consistent with previous classify-and-count approaches, our generative statistical model opens new avenues for population studies of Fermi-LAT sources and, more generally, for searches of anomalies on top of backgrounds in presence of statistical and systematic uncertainties.
Bitcoin Price Predictive Modeling Using Expert Correction
The paper studies the linear model for Bitcoin price which includes regression features based on Bitcoin currency statistics, mining processes, Google search trends, Wikipedia pages visits. The pattern of deviation of regression model prediction from real prices is simpler comparing to price time series. It is assumed that this pattern can be predicted by an experienced expert. In such a way, using the combination of the regression model and expert correction, one can receive better results than with either regression model or expert opinion only. It is shown that Bayesian approach makes it possible to utilize the probabilistic approach using distributions with fat tails and take into account the outliers in Bitcoin price time series.
Untangling Gaussian Mixtures
Tangles were originally introduced as a concept to formalize regions of high connectivity in graphs. In recent years, they have also been discovered as a link between structural graph theory and data science: when interpreting similarity in data sets as connectivity between points, finding clusters in the data essentially amounts to finding tangles in the underlying graphs. This paper further explores the potential of tangles in data sets as a means for a formal study of clusters. Real-world data often follow a normal distribution. Accounting for this, we develop a quantitative theory of tangles in data sets drawn from Gaussian mixtures. To this end, we equip the data with a graph structure that models similarity between the points and allows us to apply tangle theory to the data. We provide explicit conditions under which tangles associated with the marginal Gaussian distributions exist asymptotically almost surely. This can be considered as a sufficient formal criterion for the separabability of clusters in the data.
Feature Shift Detection: Localizing Which Features Have Shifted via Conditional Distribution Tests
While previous distribution shift detection approaches can identify if a shift has occurred, these approaches cannot localize which specific features have caused a distribution shift -- a critical step in diagnosing or fixing any underlying issue. For example, in military sensor networks, users will want to detect when one or more of the sensors has been compromised, and critically, they will want to know which specific sensors might be compromised. Thus, we first define a formalization of this problem as multiple conditional distribution hypothesis tests and propose both non-parametric and parametric statistical tests. For both efficiency and flexibility, we then propose to use a test statistic based on the density model score function (i.e. gradient with respect to the input) -- which can easily compute test statistics for all dimensions in a single forward and backward pass. Any density model could be used for computing the necessary statistics including deep density models such as normalizing flows or autoregressive models. We additionally develop methods for identifying when and where a shift occurs in multivariate time-series data and show results for multiple scenarios using realistic attack models on both simulated and real world data.
Exploiting locality in high-dimensional factorial hidden Markov models
We propose algorithms for approximate filtering and smoothing in high-dimensional Factorial hidden Markov models. The approximation involves discarding, in a principled way, likelihood factors according to a notion of locality in a factor graph associated with the emission distribution. This allows the exponential-in-dimension cost of exact filtering and smoothing to be avoided. We prove that the approximation accuracy, measured in a local total variation norm, is "dimension-free" in the sense that as the overall dimension of the model increases the error bounds we derive do not necessarily degrade. A key step in the analysis is to quantify the error introduced by localizing the likelihood function in a Bayes' rule update. The factorial structure of the likelihood function which we exploit arises naturally when data have known spatial or network structure. We demonstrate the new algorithms on synthetic examples and a London Underground passenger flow problem, where the factor graph is effectively given by the train network.
On Clustered Statistical MIMO Millimeter Wave Channel Simulation
The use of mmWave frequencies is one of the key strategies to achieve the fascinating 1000x increase in the capacity of future 5G wireless systems. While for traditional sub-6 GHz cellular frequencies several well-developed statistical channel models are available for system simulation, similar tools are not available for mmWave frequencies, thus preventing a fair comparison of independently developed transmission and reception schemes. In this paper we provide a simple albeit accurate statistical procedure for the generation of a clustered MIMO channel model operating at mmWaves, for both the cases of slowly and rapidly time-varying channels. Matlab scripts for channel generation are also provided, along with an example of their use.
Parallel Learning by Multitasking Neural Networks
A modern challenge of Artificial Intelligence is learning multiple patterns at once (i.e.parallel learning). While this can not be accomplished by standard Hebbian associative neural networks, in this paper we show how the Multitasking Hebbian Network (a variation on theme of the Hopfield model working on sparse data-sets) is naturally able to perform this complex task. We focus on systems processing in parallel a finite (up to logarithmic growth in the size of the network) amount of patterns, mirroring the low-storage level of standard associative neural networks at work with pattern recognition. For mild dilution in the patterns, the network handles them hierarchically, distributing the amplitudes of their signals as power-laws w.r.t. their information content (hierarchical regime), while, for strong dilution, all the signals pertaining to all the patterns are raised with the same strength (parallel regime). Further, confined to the low-storage setting (i.e., far from the spin glass limit), the presence of a teacher neither alters the multitasking performances nor changes the thresholds for learning: the latter are the same whatever the training protocol is supervised or unsupervised. Results obtained through statistical mechanics, signal-to-noise technique and Monte Carlo simulations are overall in perfect agreement and carry interesting insights on multiple learning at once: for instance, whenever the cost-function of the model is minimized in parallel on several patterns (in its description via Statistical Mechanics), the same happens to the standard sum-squared error Loss function (typically used in Machine Learning).
Mean-field Chaos Diffusion Models
In this paper, we introduce a new class of score-based generative models (SGMs) designed to handle high-cardinality data distributions by leveraging concepts from mean-field theory. We present mean-field chaos diffusion models (MF-CDMs), which address the curse of dimensionality inherent in high-cardinality data by utilizing the propagation of chaos property of interacting particles. By treating high-cardinality data as a large stochastic system of interacting particles, we develop a novel score-matching method for infinite-dimensional chaotic particle systems and propose an approximation scheme that employs a subdivision strategy for efficient training. Our theoretical and empirical results demonstrate the scalability and effectiveness of MF-CDMs for managing large high-cardinality data structures, such as 3D point clouds.
Volatility Modeling of Stocks from Selected Sectors of the Indian Economy Using GARCH
Volatility clustering is an important characteristic that has a significant effect on the behavior of stock markets. However, designing robust models for accurate prediction of future volatilities of stock prices is a very challenging research problem. We present several volatility models based on generalized autoregressive conditional heteroscedasticity (GARCH) framework for modeling the volatility of ten stocks listed in the national stock exchange (NSE) of India. The stocks are selected from the auto sector and the banking sector of the Indian economy, and they have a significant impact on the sectoral index of their respective sectors in the NSE. The historical stock price records from Jan 1, 2010, to Apr 30, 2021, are scraped from the Yahoo Finance website using the DataReader API of the Pandas module in the Python programming language. The GARCH modules are built and fine-tuned on the training data and then tested on the out-of-sample data to evaluate the performance of the models. The analysis of the results shows that asymmetric GARCH models yield more accurate forecasts on the future volatility of stocks.
Nonparametric Deconvolution Models
We describe nonparametric deconvolution models (NDMs), a family of Bayesian nonparametric models for collections of data in which each observation is the average over the features from heterogeneous particles. For example, these types of data are found in elections, where we observe precinct-level vote tallies (observations) of individual citizens' votes (particles) across each of the candidates or ballot measures (features), where each voter is part of a specific voter cohort or demographic (factor). Like the hierarchical Dirichlet process, NDMs rely on two tiers of Dirichlet processes to explain the data with an unknown number of latent factors; each observation is modeled as a weighted average of these latent factors. Unlike existing models, NDMs recover how factor distributions vary locally for each observation. This uniquely allows NDMs both to deconvolve each observation into its constituent factors, and also to describe how the factor distributions specific to each observation vary across observations and deviate from the corresponding global factors. We present variational inference techniques for this family of models and study its performance on simulated data and voting data from California. We show that including local factors improves estimates of global factors and provides a novel scaffold for exploring data.
An information theoretic necessary condition for perfect reconstruction
A new information theoretic condition is presented for reconstructing a discrete random variable X based on the knowledge of a set of discrete functions of X. The reconstruction condition is derived from Shannon's 1953 lattice theory with two entropic metrics of Shannon and Rajski. Because such a theoretical material is relatively unknown and appears quite dispersed in different references, we first provide a synthetic description (with complete proofs) of its concepts, such as total, common and complementary informations. Definitions and properties of the two entropic metrics are also fully detailed and shown compatible with the lattice structure. A new geometric interpretation of such a lattice structure is then investigated that leads to a necessary (and sometimes sufficient) condition for reconstructing the discrete random variable X given a set { X_1,ldots,X_{n} } of elements in the lattice generated by X. Finally, this condition is illustrated in five specific examples of perfect reconstruction problems: reconstruction of a symmetric random variable from the knowledge of its sign and absolute value, reconstruction of a word from a set of linear combinations, reconstruction of an integer from its prime signature (fundamental theorem of arithmetic) and from its remainders modulo a set of coprime integers (Chinese remainder theorem), and reconstruction of the sorting permutation of a list from a minimal set of pairwise comparisons.
Uniform approximation in classical weak convergence theory
A common statistical task lies in showing asymptotic normality of certain statistics. In many of these situations, classical textbook results on weak convergence theory suffice for the problem at hand. However, there are quite some scenarios where stronger results are needed in order to establish an asymptotic normal approximation uniformly over a family of probability measures. In this note we collect some results in this direction. We restrict ourselves to weak convergence in mathbb R^d with continuous limit measures.
The fractional chromatic number of double cones over graphs
Assume n, m are positive integers and G is a graph. Let P_{n,m} be the graph obtained from the path with vertices {-m, -(m-1), ldots, 0, ldots, n} by adding a loop at vertex 0. The double cone Delta_{n,m}(G) over a graph G is obtained from the direct product G times P_{n,m} by identifying V(G) times {n} into a single vertex (star, n), identifying V(G) times {-m} into a single vertex (star, -m), and adding an edge connecting (star, -m) and (star, n). This paper determines the fractional chromatic number of Delta_{n,m}(G). In particular, if n < m or n=m is even, then chi_f(Delta_{n,m}(G)) = chi_f(Delta_n(G)), where Delta_n(G) is the nth cone over G. If n=m is odd, then chi_f(Delta_{n,m}(G)) > chi_f(Delta_n(G)). The chromatic number of Delta_{n,m}(G) is also discussed.
Convergence of local times of stochastic processes associated with resistance forms
In this paper, it is shown that if a sequence of resistance metric spaces equipped with measures converges with respect to the local Gromov-Hausdorff-vague topology, and certain non-explosion and metric-entropy conditions are satisfied, then the associated stochastic processes and their local times also converge. The metric-entropy condition can be checked by applying volume estimates of balls. Whilst similar results have been proved previously, the approach of this article is more widely applicable. Indeed, we recover various known conclusions for scaling limits of some deterministic self-similar fractal graphs, critical Galton-Watson trees, the critical Erdos-R\'enyi random graph and the configuration model (in the latter two cases, we prove for the first time the convergence of the models with respect to the resistance metric and also, for the configuration model, we overcome an error in the existing proof of local time convergence). Moreover, we derive new ones for scaling limits of uniform spanning trees and random recursive fractals. The metric-entropy condition also implies convergence of associated Gaussian processes.
Real-valued continued fraction of straight lines
In an unbounded plane, straight lines are used extensively for mathematical analysis. They are tools of convenience. However, those with high slope values become unbounded at a faster rate than the independent variable. So, straight lines, in this work, are made to be bounded by introducing a parametric nonlinear term that is positive. The straight lines are transformed into bounded nonlinear curves that become unbounded at a much slower rate than the independent variable. This transforming equation can be expressed as a continued fraction of straight lines. The continued fraction is real-valued and converges to the solutions of the transforming equation. Following Euler's method, the continued fraction has been reduced into an infinite series. The usefulness of the bounding nature of continued fraction is demonstrated by solving the problem of image classification. Parameters estimated on the Fashion-MNIST dataset of greyscale images using continued fraction of regression lines have less variance, converge quickly and are more accurate than the linear counterpart. Moreover, this multi-dimensional parametric estimation problem can be expressed on xy- plane using the parameters of the continued fraction and patterns emerge on planar plots.
Causal de Finetti: On the Identification of Invariant Causal Structure in Exchangeable Data
Learning causal structure from observational data often assumes that we observe independent and identically distributed (i.\,i.\,d) data. The traditional approach aims to find a graphical representation that encodes the same set of conditional independence relationships as those present in the observed distribution. It is known that under i.\,i.\,d assumption, even with infinite data, there is a limit to how fine-grained a causal structure we can identify. To overcome this limitation, recent work has explored using data originating from different, related environments to learn richer causal structure. These approaches implicitly rely on the independent causal mechanisms (ICM) principle, which postulates that the mechanism giving rise to an effect given its causes and the mechanism which generates the causes do not inform or influence each other. Thus, components of the causal model can independently change from environment to environment. Despite its wide application in machine learning and causal inference, there is a lack of statistical formalization of the ICM principle and how it enables identification of richer causal structures from grouped data. Here we present new causal de Finetti theorems which offer a first statistical formalization of ICM principle and show how causal structure identification is possible from exchangeable data. Our work provides theoretical justification for a broad range of techniques leveraging multi-environment data to learn causal structure.
Contribution of the Extreme Term in the Sum of Samples with Regularly Varying Tail
For a sequence of random variables (X_1, X_2, ldots, X_n), n geq 1, that are independent and identically distributed with a regularly varying tail with index -alpha, alpha geq 0, we show that the contribution of the maximum term M_n triangleq max(X_1,ldots,X_n) in the sum S_n triangleq X_1 + cdots +X_n, as n to infty, decreases monotonically with alpha in stochastic ordering sense.
Nonparametric extensions of randomized response for private confidence sets
This work derives methods for performing nonparametric, nonasymptotic statistical inference for population means under the constraint of local differential privacy (LDP). Given bounded observations (X_1, dots, X_n) with mean mu^star that are privatized into (Z_1, dots, Z_n), we present confidence intervals (CI) and time-uniform confidence sequences (CS) for mu^star when only given access to the privatized data. To achieve this, we introduce a nonparametric and sequentially interactive generalization of Warner's famous ``randomized response'' mechanism, satisfying LDP for arbitrary bounded random variables, and then provide CIs and CSs for their means given access to the resulting privatized observations. For example, our results yield private analogues of Hoeffding's inequality in both fixed-time and time-uniform regimes. We extend these Hoeffding-type CSs to capture time-varying (non-stationary) means, and conclude by illustrating how these methods can be used to conduct private online A/B tests.
Tight High Probability Bounds for Linear Stochastic Approximation with Fixed Stepsize
This paper provides a non-asymptotic analysis of linear stochastic approximation (LSA) algorithms with fixed stepsize. This family of methods arises in many machine learning tasks and is used to obtain approximate solutions of a linear system Atheta = b for which A and b can only be accessed through random estimates {({bf A}_n, {bf b}_n): n in N^*}. Our analysis is based on new results regarding moments and high probability bounds for products of matrices which are shown to be tight. We derive high probability bounds on the performance of LSA under weaker conditions on the sequence {({bf A}_n, {bf b}_n): n in N^*} than previous works. However, in contrast, we establish polynomial concentration bounds with order depending on the stepsize. We show that our conclusions cannot be improved without additional assumptions on the sequence of random matrices {{bf A}_n: n in N^*}, and in particular that no Gaussian or exponential high probability bounds can hold. Finally, we pay a particular attention to establishing bounds with sharp order with respect to the number of iterations and the stepsize and whose leading terms contain the covariance matrices appearing in the central limit theorems.
Sketched Ridgeless Linear Regression: The Role of Downsampling
Overparametrization often helps improve the generalization performance. This paper proposes a dual view of overparametrization suggesting that downsampling may also help generalize. Motivated by this dual view, we characterize two out-of-sample prediction risks of the sketched ridgeless least square estimator in the proportional regime masymp n asymp p, where m is the sketching size, n the sample size, and p the feature dimensionality. Our results reveal the statistical role of downsampling. Specifically, downsampling does not always hurt the generalization performance, and may actually help improve it in some cases. We identify the optimal sketching sizes that minimize the out-of-sample prediction risks, and find that the optimally sketched estimator has stabler risk curves that eliminates the peaks of those for the full-sample estimator. We then propose a practical procedure to empirically identify the optimal sketching size. Finally, we extend our results to cover central limit theorems and misspecified models. Numerical studies strongly support our theory.
A kernel Stein test of goodness of fit for sequential models
We propose a goodness-of-fit measure for probability densities modeling observations with varying dimensionality, such as text documents of differing lengths or variable-length sequences. The proposed measure is an instance of the kernel Stein discrepancy (KSD), which has been used to construct goodness-of-fit tests for unnormalized densities. The KSD is defined by its Stein operator: current operators used in testing apply to fixed-dimensional spaces. As our main contribution, we extend the KSD to the variable-dimension setting by identifying appropriate Stein operators, and propose a novel KSD goodness-of-fit test. As with the previous variants, the proposed KSD does not require the density to be normalized, allowing the evaluation of a large class of models. Our test is shown to perform well in practice on discrete sequential data benchmarks.
Template estimation in computational anatomy: Fréchet means in top and quotient spaces are not consistent
In this article, we study the consistency of the template estimation with the Fr\'echet mean in quotient spaces. The Fr\'echet mean in quotient spaces is often used when the observations are deformed or transformed by a group action. We show that in most cases this estimator is actually inconsistent. We exhibit a sufficient condition for this inconsistency, which amounts to the folding of the distribution of the noisy template when it is projected to the quotient space. This condition appears to be fulfilled as soon as the support of the noise is large enough. To quantify this inconsistency we provide lower and upper bounds of the bias as a function of the variability (the noise level). This shows that the consistency bias cannot be neglected when the variability increases.
On the impossibility of discovering a formula for primes using AI
The present work explores the theoretical limits of Machine Learning (ML) within the framework of Kolmogorov's theory of Algorithmic Probability, which clarifies the notion of entropy as Expected Kolmogorov Complexity and formalizes other fundamental concepts such as Occam's razor via Levin's Universal Distribution. As a fundamental application, we develop Maximum Entropy methods that allow us to derive the Erdos--Kac Law in Probabilistic Number Theory, and establish the impossibility of discovering a formula for primes using Machine Learning via the Prime Coding Theorem.
Subtractive Mixture Models via Squaring: Representation and Learning
Mixture models are traditionally represented and learned by adding several distributions as components. Allowing mixtures to subtract probability mass or density can drastically reduce the number of components needed to model complex distributions. However, learning such subtractive mixtures while ensuring they still encode a non-negative function is challenging. We investigate how to learn and perform inference on deep subtractive mixtures by squaring them. We do this in the framework of probabilistic circuits, which enable us to represent tensorized mixtures and generalize several other subtractive models. We theoretically prove that the class of squared circuits allowing subtractions can be exponentially more expressive than traditional additive mixtures; and, we empirically show this increased expressiveness on a series of real-world distribution estimation tasks.
Copula Conformal Prediction for Multi-step Time Series Forecasting
Accurate uncertainty measurement is a key step to building robust and reliable machine learning systems. Conformal prediction is a distribution-free uncertainty quantification algorithm popular for its ease of implementation, statistical coverage guarantees, and versatility for underlying forecasters. However, existing conformal prediction algorithms for time series are limited to single-step prediction without considering the temporal dependency. In this paper, we propose a Copula Conformal Prediction algorithm for multivariate, multi-step Time Series forecasting, CopulaCPTS. We prove that CopulaCPTS has finite sample validity guarantee. On several synthetic and real-world multivariate time series datasets, we show that CopulaCPTS produces more calibrated and sharp confidence intervals for multi-step prediction tasks than existing techniques.
Conditionally Strongly Log-Concave Generative Models
There is a growing gap between the impressive results of deep image generative models and classical algorithms that offer theoretical guarantees. The former suffer from mode collapse or memorization issues, limiting their application to scientific data. The latter require restrictive assumptions such as log-concavity to escape the curse of dimensionality. We partially bridge this gap by introducing conditionally strongly log-concave (CSLC) models, which factorize the data distribution into a product of conditional probability distributions that are strongly log-concave. This factorization is obtained with orthogonal projectors adapted to the data distribution. It leads to efficient parameter estimation and sampling algorithms, with theoretical guarantees, although the data distribution is not globally log-concave. We show that several challenging multiscale processes are conditionally log-concave using wavelet packet orthogonal projectors. Numerical results are shown for physical fields such as the varphi^4 model and weak lensing convergence maps with higher resolution than in previous works.
Introducing Quantum Computing into Statistical Physics: Random Walks and the Ising Model with Qiskit
Quantum computing offers a powerful new perspective on probabilistic and collective behaviors traditionally taught in statistical physics. This paper presents two classroom-ready modules that integrate quantum computing into the undergraduate curriculum using Qiskit: the quantum random walk and the Ising model. Both modules allow students to simulate and contrast classical and quantum systems, deepening their understanding of concepts such as superposition, interference, and statistical distributions. We outline the quantum circuits involved, provide sample code and student activities, and discuss how each example can be used to enhance student engagement with statistical physics. These modules are suitable for integration into courses in statistical mechanics, modern physics, or as part of an introductory unit on quantum computing.
Federated Causal Discovery from Heterogeneous Data
Conventional causal discovery methods rely on centralized data, which is inconsistent with the decentralized nature of data in many real-world situations. This discrepancy has motivated the development of federated causal discovery (FCD) approaches. However, existing FCD methods may be limited by their potentially restrictive assumptions of identifiable functional causal models or homogeneous data distributions, narrowing their applicability in diverse scenarios. In this paper, we propose a novel FCD method attempting to accommodate arbitrary causal models and heterogeneous data. We first utilize a surrogate variable corresponding to the client index to account for the data heterogeneity across different clients. We then develop a federated conditional independence test (FCIT) for causal skeleton discovery and establish a federated independent change principle (FICP) to determine causal directions. These approaches involve constructing summary statistics as a proxy of the raw data to protect data privacy. Owing to the nonparametric properties, FCIT and FICP make no assumption about particular functional forms, thereby facilitating the handling of arbitrary causal models. We conduct extensive experiments on synthetic and real datasets to show the efficacy of our method. The code is available at https://github.com/lokali/FedCDH.git.
Categorical Stochastic Processes and Likelihood
In this work we take a Category Theoretic perspective on the relationship between probabilistic modeling and function approximation. We begin by defining two extensions of function composition to stochastic process subordination: one based on the co-Kleisli category under the comonad (Omega x -) and one based on the parameterization of a category with a Lawvere theory. We show how these extensions relate to the category Stoch and other Markov Categories. Next, we apply the Para construction to extend stochastic processes to parameterized statistical models and we define a way to compose the likelihood functions of these models. We conclude with a demonstration of how the Maximum Likelihood Estimation procedure defines an identity-on-objects functor from the category of statistical models to the category of Learners. Code to accompany this paper can be found at https://github.com/dshieble/Categorical_Stochastic_Processes_and_Likelihood
Talagrand's convolution conjecture up to loglog via perturbed reverse heat
We prove that under the heat semigroup (P_τ) on the Boolean hypercube, any nonnegative function f: {-1,1}^n to R_+ exhibits a uniform tail bound that is better than that by Markov's inequality. Specifically, for any η> e^3 and τ> 0, align* P_{X \sim μ}\left( P_τf(X) > η\int f dμ\right) \leq c_τ \log \log η{η\log η}, align* where μ is the uniform measure on the Boolean hypercube {-1,1}^n and c_τ is a constant that only depends on τ. This resolves Talagrand's convolution conjecture up to a dimension-free loglog η factor. Its proof relies on properties of the reverse heat process on the Boolean hypercube and a coupling construction based on carefully engineered perturbations of this reverse heat process.
Scaling Properties of Avalanche Activity in the Two-Dimensional Abelian Sandpile Model
We study the scaling properties of avalanche activity in the two-dimensional Abelian sandpile model. Instead of the conventional avalanche size distribution, we analyze the site activity distribution, which measures how often a site participates in avalanches when grains are added across the lattice. Using numerical simulations for system sizes up to \(L = 160\), averaged over \(10^4\) configurations, we determine the probability distribution \(P(A, L)\) of site activities. The results show that \(P(A, L)\) follows a finite-size scaling form \[ P(A, L) \sim L^{-2} F\Big(A{L^2}\Big). \] For small values \(A \ll L^2\) the scaling function behaves as \[ F(u) \sim u^{-1/2}, \quad corresponding to \quad P(A) \sim 1{L}, \] while for large activities \(A \sim O(L^2)\) the distribution decays as \[ F(u) \sim \exp\big(-c_3 u - c_4 u^2\big). \] The crossover between these two regimes occurs at \[ A^* \sim 0.1 \, L^2, \] marking the threshold between typical and highly excitable sites. This characterization of local avalanche activity provides complementary information to the usual avalanche size statistics, highlighting how local regions serve as frequent conduits for critical dynamics. These results may help connect sandpile models to real-world self-organized critical systems where only partial local activity can be observed.
Dimensionality Reduction for General KDE Mode Finding
Finding the mode of a high dimensional probability distribution D is a fundamental algorithmic problem in statistics and data analysis. There has been particular interest in efficient methods for solving the problem when D is represented as a mixture model or kernel density estimate, although few algorithmic results with worst-case approximation and runtime guarantees are known. In this work, we significantly generalize a result of (LeeLiMusco:2021) on mode approximation for Gaussian mixture models. We develop randomized dimensionality reduction methods for mixtures involving a broader class of kernels, including the popular logistic, sigmoid, and generalized Gaussian kernels. As in Lee et al.'s work, our dimensionality reduction results yield quasi-polynomial algorithms for mode finding with multiplicative accuracy (1-epsilon) for any epsilon > 0. Moreover, when combined with gradient descent, they yield efficient practical heuristics for the problem. In addition to our positive results, we prove a hardness result for box kernels, showing that there is no polynomial time algorithm for finding the mode of a kernel density estimate, unless P = NP. Obtaining similar hardness results for kernels used in practice (like Gaussian or logistic kernels) is an interesting future direction.
Embedded Pilot-Aided Channel Estimation for OTFS in Delay-Doppler Channels
Orthogonal time frequency space (OTFS) modulation was shown to provide significant error performance advantages over orthogonal frequency division multiplexing (OFDM) in delay--Doppler channels. In order to detect OTFS modulated data, the channel impulse response needs to be known at the receiver. In this paper, we propose embedded pilot-aided channel estimation schemes for OTFS. In each OTFS frame, we arrange pilot, guard, and data symbols in the delay--Doppler plane to suitably avoid interference between pilot and data symbols at the receiver. We develop such symbol arrangements for OTFS over multipath channels with integer and fractional Doppler shifts, respectively. At the receiver, channel estimation is performed based on a threshold method and the estimated channel information is used for data detection via a message passing (MP) algorithm. Thanks to our specific embedded symbol arrangements, both channel estimation and data detection are performed within the same OTFS frame with a minimum overhead. We compare by simulations the error performance of OTFS using the proposed channel estimation and OTFS with ideally known channel information and observe only a marginal performance loss. We also demonstrate that the proposed channel estimation in OTFS significantly outperforms OFDM with known channel information. Finally, we present extensions of the proposed schemes to MIMO and multi-user uplink/downlink.
Effective Probabilistic Time Series Forecasting with Fourier Adaptive Noise-Separated Diffusion
We propose the Fourier Adaptive Lite Diffusion Architecture (FALDA), a novel probabilistic framework for time series forecasting. First, we introduce the Diffusion Model for Residual Regression (DMRR) framework, which unifies diffusion-based probabilistic regression methods. Within this framework, FALDA leverages Fourier-based decomposition to incorporate a component-specific architecture, enabling tailored modeling of individual temporal components. A conditional diffusion model is utilized to estimate the future noise term, while our proposed lightweight denoiser, DEMA (Decomposition MLP with AdaLN), conditions on the historical noise term to enhance denoising performance. Through mathematical analysis and empirical validation, we demonstrate that FALDA effectively reduces epistemic uncertainty, allowing probabilistic learning to primarily focus on aleatoric uncertainty. Experiments on six real-world benchmarks demonstrate that FALDA consistently outperforms existing probabilistic forecasting approaches across most datasets for long-term time series forecasting while achieving enhanced computational efficiency without compromising accuracy. Notably, FALDA also achieves superior overall performance compared to state-of-the-art (SOTA) point forecasting approaches, with improvements of up to 9%.
Continued Fractions and Probability Estimations in the Shor Algorithm -- A Detailed and Self-Contained Treatise
The algorithm of Shor for prime factorization is a hybrid algorithm consisting of a quantum part and a classical part. The main focus of the classical part is a continued fraction analysis. The presentation of this is often short, pointing to text books on number theory. In this contribution, we present the relevant results and proofs from the theory of continued fractions in detail (even in more detail than in text books) filling the gap to allow a complete comprehension of the algorithm of Shor. Similarly, we provide a detailed computation of the estimation of the probability that convergents will provide the period required for determining a prime factor.
The boundary of neural network trainability is fractal
Some fractals -- for instance those associated with the Mandelbrot and quadratic Julia sets -- are computed by iterating a function, and identifying the boundary between hyperparameters for which the resulting series diverges or remains bounded. Neural network training similarly involves iterating an update function (e.g. repeated steps of gradient descent), can result in convergent or divergent behavior, and can be extremely sensitive to small changes in hyperparameters. Motivated by these similarities, we experimentally examine the boundary between neural network hyperparameters that lead to stable and divergent training. We find that this boundary is fractal over more than ten decades of scale in all tested configurations.
Weighted least-squares approximation with determinantal point processes and generalized volume sampling
We consider the problem of approximating a function from L^2 by an element of a given m-dimensional space V_m, associated with some feature map varphi, using evaluations of the function at random points x_1,dots,x_n. After recalling some results on optimal weighted least-squares using independent and identically distributed points, we consider weighted least-squares using projection determinantal point processes (DPP) or volume sampling. These distributions introduce dependence between the points that promotes diversity in the selected features varphi(x_i). We first provide a generalized version of volume-rescaled sampling yielding quasi-optimality results in expectation with a number of samples n = O(mlog(m)), that means that the expected L^2 error is bounded by a constant times the best approximation error in L^2. Also, further assuming that the function is in some normed vector space H continuously embedded in L^2, we further prove that the approximation is almost surely bounded by the best approximation error measured in the H-norm. This includes the cases of functions from L^infty or reproducing kernel Hilbert spaces. Finally, we present an alternative strategy consisting in using independent repetitions of projection DPP (or volume sampling), yielding similar error bounds as with i.i.d. or volume sampling, but in practice with a much lower number of samples. Numerical experiments illustrate the performance of the different strategies.
Improved Analysis of Score-based Generative Modeling: User-Friendly Bounds under Minimal Smoothness Assumptions
We give an improved theoretical analysis of score-based generative modeling. Under a score estimate with small L^2 error (averaged across timesteps), we provide efficient convergence guarantees for any data distribution with second-order moment, by either employing early stopping or assuming smoothness condition on the score function of the data distribution. Our result does not rely on any log-concavity or functional inequality assumption and has a logarithmic dependence on the smoothness. In particular, we show that under only a finite second moment condition, approximating the following in reverse KL divergence in epsilon-accuracy can be done in tilde Oleft(d log (1/delta){epsilon}right) steps: 1) the variance-delta Gaussian perturbation of any data distribution; 2) data distributions with 1/delta-smooth score functions. Our analysis also provides a quantitative comparison between different discrete approximations and may guide the choice of discretization points in practice.
Infinite products and zero-one laws in categorical probability
Markov categories are a recent category-theoretic approach to the foundations of probability and statistics. Here we develop this approach further by treating infinite products and the Kolmogorov extension theorem. This is relevant for all aspects of probability theory in which infinitely many random variables appear at a time. These infinite tensor products bigotimes_{i in J} X_i come in two versions: a weaker but more general one for families of objects (X_i)_{i in J} in semicartesian symmetric monoidal categories, and a stronger but more specific one for families of objects in Markov categories. As a first application, we state and prove versions of the zero-one laws of Kolmogorov and Hewitt-Savage for Markov categories. This gives general versions of these results which can be instantiated not only in measure-theoretic probability, where they specialize to the standard ones in the setting of standard Borel spaces, but also in other contexts.
