Statistics
- [1] arXiv:2405.08825 [pdf, ps, html, other]
Title: Thermodynamic limit in learning period three
Comments: 26 pages, 19 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Adaptation and Self-Organizing Systems (nlin.AO); Chaotic Dynamics (nlin.CD)
A continuous one-dimensional map with period three includes all periods. This raises the following question: Can we obtain any type of periodic orbit solely by learning three data points? We consider learning period three with random neural networks and report the universal property associated with it. We first show that the trained networks have a thermodynamic limit that depends on the choice of target data and network settings. Our analysis reveals that almost all learned periods are unstable and each network has its characteristic attractors (which can even be untrained ones). Here, we propose the concept of characteristic bifurcation expressing embeddable attractors intrinsic to the network, in which the target data points and the scale of the network weights function as bifurcation parameters. In conclusion, learning period three generates various attractors through characteristic bifurcation due to the stability change in latently existing numerous unstable periods of the system.
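As a small illustration of what "period three" means for a one-dimensional map, the sketch below iterates the logistic map inside its well-known stable period-three window. The map, the parameter value `r = 3.83`, and the function names are assumptions chosen for illustration; this is not the random neural network setup studied in the paper.

```python
def logistic(x, r=3.83):
    """One step of the logistic map x -> r * x * (1 - x)."""
    return r * x * (1.0 - x)


def settle(x0=0.5, burn_in=3000):
    """Iterate long enough for the orbit to reach its attractor."""
    x = x0
    for _ in range(burn_in):
        x = logistic(x)
    return x


# Three consecutive points on the attractor, plus the point after them.
x0 = settle()
x1 = logistic(x0)
x2 = logistic(x1)
x3 = logistic(x2)

orbit = (x0, x1, x2)
# The orbit repeats after exactly three steps and is not a fixed point.
returns_after_three = abs(x3 - x0) < 1e-9
is_not_fixed_point = abs(x1 - x0) > 0.1
```

For this parameter value the three points are visited in a fixed cyclic order, which is the kind of three-point target the abstract refers to.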
- [2] arXiv:2405.08841 [pdf, ps, other]
Title: Best practices for estimating and reporting epidemiological delay distributions of infectious diseases using public health surveillance and healthcare data
Authors: Kelly Charniga (IP), Sang Woo Park, Andrei R Akhmetzhanov (NTU), Anne Cori, Jonathan Dushoff, Sebastian Funk (LSHTM), Katelyn M Gostic (CDC), Natalie M Linton, Adrian Lison, Christopher E Overton (UKHSA), Juliet R C Pulliam (CDC), Thomas Ward (UKHSA), Simon Cauchemez (IP), Sam Abbott (LSHTM)
Subjects: Methodology (stat.ME)
Epidemiological delays, such as incubation periods, serial intervals, and hospital lengths of stay, are among key quantities in infectious disease epidemiology that inform public health policy and clinical practice. This information is used to inform mathematical and statistical models, which in turn can inform control strategies. There are three main challenges that make delay distributions difficult to estimate. First, the data are commonly censored (e.g., symptom onset may only be reported by date instead of the exact time of day). Second, delays are often right truncated when being estimated in real time (not all events that have occurred have been observed yet). Third, during a rapidly growing or declining outbreak, overrepresentation or underrepresentation, respectively, of recently infected cases in the data can lead to bias in estimates. Studies that estimate delays rarely address all these factors and sometimes report several estimates using different combinations of adjustments, which can lead to conflicting answers and confusion about which estimates are most accurate. In this work, we formulate a checklist of best practices for estimating and reporting epidemiological delays with a focus on the incubation period and serial interval. We also propose strategies for handling common biases and identify areas where more work is needed. Our recommendations can help improve the robustness and utility of reported estimates and provide guidance for the evaluation of estimates for downstream use in transmission models or other analyses.
- [3] arXiv:2405.08853 [pdf, ps, html, other]
Title: Evaluating the Uncertainty in Mean Residual Times: Estimators Based on Residence Times from Discrete Time Processes
Subjects: Methodology (stat.ME); Data Analysis, Statistics and Probability (physics.data-an)
In this work, we propose estimators for the uncertainty in mean residual times that require, for their evaluation, statistically independent individual residence times obtained from a discrete time process. We examine their performance through numerical experiments involving well-known probability distributions, and an application example using molecular dynamics simulation results, from an aqueous NaCl solution, is provided. These computationally inexpensive estimators, capable of achieving very accurate outcomes, serve as useful tools for assessing and reporting uncertainties in mean residual times across a wide range of simulations.
- [4] arXiv:2405.08907 [pdf, ps, html, other]
Title: Properties of stationary cyclical processes
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
The paper investigates the theoretical properties of zero-mean stationary time series with cyclical components, admitting the representation $y_t=\alpha_t \cos \lambda t + \beta_t \sin \lambda t$, with $\lambda \in (0,\pi]$ and $[\alpha_t\,\, \beta_t]$ following some bivariate process. We diagnose that in the extant literature on cyclic time series, a prevalent assumption of Gaussianity for $[\alpha_t\,\, \beta_t]$ imposes inadvertently a severe restriction on the amplitude of the process. Moreover, it is shown that other common distributions may suffer from either similar defects or fail to guarantee the stationarity of $y_t$. To address both of the issues, we propose to introduce a direct stochastic modulation of the amplitude and phase shift in an almost periodic function. We prove that this novel approach may lead, in general, to a stationary (up to any order) time series, and specifically, to a zero-mean stationary time series featuring cyclicity, with a pseudo-cyclical autocovariance function that may even decay at a very slow rate. The proposed process fills an important gap in this type of models and allows for flexible modeling of amplitude and phase shift.
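The link between the coefficients $[\alpha_t\,\, \beta_t]$ and the amplitude can be made explicit with the standard amplitude-phase identity. The symbols $R_t$, $\varphi_t$ and $\sigma$ below are notation introduced here for illustration, and the independent equal-variance Gaussian case is only one simple reading of the restriction the abstract mentions; the paper's own argument may differ.

```latex
\begin{align*}
y_t &= \alpha_t \cos \lambda t + \beta_t \sin \lambda t
     = R_t \cos(\lambda t - \varphi_t), \\
R_t &= \sqrt{\alpha_t^2 + \beta_t^2}, \qquad
\alpha_t = R_t \cos \varphi_t, \qquad
\beta_t = R_t \sin \varphi_t .
\end{align*}
% Assumption for illustration: alpha_t and beta_t independent N(0, sigma^2).
% Then the amplitude R_t is Rayleigh distributed with scale sigma:
\begin{align*}
f_{R_t}(r) &= \frac{r}{\sigma^2} \exp\!\left(-\frac{r^2}{2\sigma^2}\right),
\quad r \ge 0, \qquad
\mathbb{E}[R_t] = \sigma \sqrt{\pi/2}.
\end{align*}
```

In this special case the whole amplitude distribution is pinned down by the single scale $\sigma$, which shows how a distributional assumption on the coefficients constrains the amplitude.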
- [5] arXiv:2405.08912 [pdf, ps, html, other]
Title: High dimensional test for functional covariates
Comments: 35 pages, 4 figures, 4 tables
Subjects: Methodology (stat.ME)
As medical devices become more complex, they routinely collect extensive and complicated data. While classical regressions typically examine the relationship between an outcome and a vector of predictors, it becomes imperative to identify the relationship with predictors possessing functional structures. In this article, we introduce a novel inference procedure for examining the relationship between outcomes and large-scale functional predictors. We target testing the linear hypothesis on the functional parameters under the generalized functional linear regression framework, where the number of the functional parameters grows with the sample size. We develop the estimation procedure for the high dimensional generalized functional linear model incorporating B-spline functional approximation and amenable regularization. Furthermore, we construct a procedure that is able to test the local alternative hypothesis on the linear combinations of the functional parameters. We establish the statistical guarantees in terms of non-asymptotic convergence of the parameter estimation and the oracle property and asymptotic normality of the estimators. Moreover, we derive the asymptotic distribution of the test statistic. We carry out intensive simulations and illustrate with a new dataset from an Alzheimer's disease magnetoencephalography study.
- [6] arXiv:2405.08975 [pdf, ps, html, other]
Title: A distribution-free valid p-value for finite samples of bounded random variables
Comments: -
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
We build a valid p-value based on a concentration inequality for bounded random variables introduced by Pelekis, Ramon and Wang. The motivation behind this work is the calibration of predictive algorithms in a distribution-free setting. The super-uniform p-value is tighter than Hoeffding and Bentkus alternatives in certain regions. Even though we are motivated by a calibration setting in a machine learning context, the ideas presented in this work are also relevant in classical statistical inference. Furthermore, we compare the power of a collection of valid p-values for bounded losses, which are presented in previous literature.
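For orientation, the Hoeffding alternative mentioned above can be sketched in a few lines. This is the classical Hoeffding-based p-value for losses in [0, 1], not the Pelekis-Ramon-Wang construction of the paper; the function name, the null hypothesis "mean loss exceeds `alpha`", and the example data are assumptions made for illustration.

```python
import math


def hoeffding_p_value(losses, alpha):
    """Super-uniform p-value for H0: E[loss] > alpha, with losses in [0, 1].

    By Hoeffding's inequality, P(mean <= mu - t) <= exp(-2 * n * t**2),
    so exp(-2 * n * max(alpha - mean, 0)**2) is a valid p-value under H0.
    """
    n = len(losses)
    mean_loss = sum(losses) / n
    gap = max(alpha - mean_loss, 0.0)
    return math.exp(-2.0 * n * gap ** 2)


# Hypothetical calibration sample: 100 losses that are all zero.
p_example = hoeffding_p_value([0.0] * 100, alpha=0.1)
```

A small p-value is evidence that the mean loss is at most `alpha`; tighter concentration inequalities give smaller p-values for the same data, which is the comparison the abstract describes.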
- [7] arXiv:2405.08999 [pdf, ps, html, other]
Title: Robust Approximate Sampling via Stochastic Gradient Barker Dynamics
Journal-ref: Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, AISTATS'24, volume 238, 2024, page 2107-2115
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Stochastic Gradient (SG) Markov Chain Monte Carlo (MCMC) algorithms are popular algorithms for Bayesian sampling in the presence of large datasets. However, they come with few theoretical guarantees, and assessing their empirical performance is non-trivial. In this context, it is crucial to develop algorithms that are robust to the choice of hyperparameters and to gradient heterogeneity since, in practice, both the choice of step-size and the behaviour of target gradients induce hard-to-control biases in the invariant distribution. In this work we introduce the stochastic gradient Barker dynamics (SGBD) algorithm, extending the recently developed Barker MCMC scheme, a robust alternative to Langevin-based sampling algorithms, to the stochastic gradient framework. We characterize the impact of stochastic gradients on the Barker transition mechanism and develop a bias-corrected version that, under suitable assumptions, eliminates the error due to the gradient noise in the proposal. We illustrate the performance on a number of high-dimensional examples, showing that SGBD is more robust to hyperparameter tuning and to irregular behavior of the target gradients compared to the popular stochastic gradient Langevin dynamics algorithm.
- [8] arXiv:2405.09003 [pdf, ps, other]
Title: Nonparametric Inference on Dose-Response Curves Without the Positivity Condition
Comments: 74 pages (23 pages for the main paper), 4 figures
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Applications (stat.AP)
Existing statistical methods in causal inference often rely on the assumption that every individual has some chance of receiving any treatment level regardless of its associated covariates, which is known as the positivity condition. This assumption could be violated in observational studies with continuous treatments. In this paper, we present a novel integral estimator of the causal effects with continuous treatments (i.e., dose-response curves) without requiring the positivity condition. Our approach involves estimating the derivative function of the treatment effect on each observed data sample and integrating it to the treatment level of interest so as to address the bias resulting from the lack of positivity condition. The validity of our approach relies on an alternative weaker assumption that can be satisfied by additive confounding models. We provide a fast and reliable numerical recipe for computing our estimator in practice and derive its related asymptotic theory. To conduct valid inference on the dose-response curve and its derivative, we propose using the nonparametric bootstrap and establish its consistency. The practical performances of our proposed estimators are validated through simulation studies and an analysis of the effect of air pollution exposure (PM$_{2.5}$) on cardiovascular mortality rates.
- [9] arXiv:2405.09080 [pdf, ps, html, other]
Title: Causal Inference for a Hidden Treatment
Subjects: Methodology (stat.ME)
In many empirical settings, directly observing a treatment variable may be infeasible although an error-prone surrogate measurement of the latter will often be available. Causal inference based solely on the observed surrogate measurement of the hidden treatment may be particularly challenging without an additional assumption or auxiliary data. To address this issue, we propose a method that carefully incorporates the surrogate measurement together with a proxy of the hidden treatment to identify its causal effect on any scale for which identification would in principle be feasible had contrary to fact the treatment been observed error-free. Beyond identification, we provide general semiparametric theory for causal effects identified using our approach, and we derive a large class of semiparametric estimators with an appealing multiple robustness property. A significant obstacle to our approach is the estimation of nuisance functions involving the hidden treatment, which prevents the direct application of standard machine learning algorithms. To resolve this, we introduce a novel semiparametric EM algorithm, thus adding a practical dimension to our theoretical contributions. This methodology can be adapted to analyze a large class of causal parameters in the proposed hidden treatment model, including the population average treatment effect, the effect of treatment on the treated, quantile treatment effects, and causal effects under marginal structural models. We examine the finite-sample performance of our method using simulations and an application which aims to estimate the causal effect of Alzheimer's disease on hippocampal volume using data from the Alzheimer's Disease Neuroimaging Initiative.
- [10] arXiv:2405.09149 [pdf, ps, html, other]
Title: Exploring uniformity and maximum entropy distribution on torus through intrinsic geometry: Application to protein-chemistry
Comments: arXiv admin note: text overlap with arXiv:2304.01599
Subjects: Methodology (stat.ME)
A generic family of distributions defined on the surface of a curved torus is introduced using its area element. The area uniformity and the maximum entropy distribution are identified using the trigonometric moments of the proposed family. A marginal distribution is obtained as a three-parameter modification of the von Mises distribution that encompasses the von Mises, Cardioid, and Uniform distributions as special cases. The proposed family of the marginal distribution exhibits both symmetric and asymmetric, unimodal or bimodal shapes, contingent upon the parameters. Furthermore, we scrutinize a two-parameter symmetric submodel, examining its moments, measure of variation, Kullback-Leibler divergence, and maximum likelihood estimation, among other properties. In addition, we introduce a modified acceptance-rejection sampling with a thin envelope obtained from the upper-Riemann-sum of a circular density, achieving a high rate of acceptance. This proposed sampling scheme will accelerate empirical studies for large-scale simulations by reducing the processing time. Furthermore, we extend the Uniform, Wrapped Cauchy, and Kato-Jones distributions to the surface of the curved torus and implement the proposed bivariate toroidal distribution for different groups of protein data, namely, $\alpha$-helix, $\beta$-sheet, and their mixture. A marginal of this proposed distribution is fitted to the wind direction data.
- [11] arXiv:2405.09167 [pdf, ps, html, other]
Title: Emperical Study on the Effect of Multi-Sampling in the Prediction Step of the Particle Filter
Authors: G. Kitagawa (The Institute of Statistical Mathematics and The Graduate University for Advanced Study)
Comments: 15 pages, 5 tables, 10 figures
Subjects: Computation (stat.CO)
Particle filters are applicable to a wide range of nonlinear, non-Gaussian state-space models and have already been applied to a variety of problems. However, there is a problem in the calculation of smoothed distributions, where particles gradually degenerate and accuracy is reduced. The purpose of this paper is to consider the possibility of generating multiple particles in the prediction step of the particle filter and to empirically verify the effect using real data.
- [12] arXiv:2405.09196 [pdf, ps, other]
Title: Harnessing pattern-by-pattern linear classifiers for prediction with missing data
Subjects: Statistics Theory (math.ST)
Missing values have been thoroughly analyzed in the context of linear models, where the final aim is to build coefficient estimates. However, estimating coefficients does not directly solve the problem of prediction with missing entries: a way to handle empty components must be designed. Major approaches to deal with prediction with missing values are empirically driven and can be decomposed into two families: imputation (filling in empty fields) and pattern-by-pattern prediction, where a predictor is built on each missing pattern. Unfortunately, most simple imputation techniques used in practice (such as constant imputation) are not consistent when combined with linear models. In this paper, we focus on the more flexible pattern-by-pattern approaches and study their predictive performances on Missing Completely At Random (MCAR) data. We first show that a pattern-by-pattern logistic regression model is intrinsically ill-defined, implying that even classical logistic regression is impossible to apply to missing data. We then analyze the perceptron model and show how the linear separability property extends to partially-observed inputs. Finally, we use the Linear Discriminant Analysis to prove that pattern-by-pattern LDA is consistent in a high-dimensional regime. We refine our analysis to more complex MNAR data.
- [13] arXiv:2405.09331 [pdf, ps, html, other]
Title: Multi-Source Conformal Inference Under Distribution Shift
Comments: Accepted to ICML 2024, 39 pages, 13 figures
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
Recent years have experienced increasing utilization of complex machine learning models across multiple sources of data to inform more generalizable decision-making. However, distribution shifts across data sources and privacy concerns related to sharing individual-level data, coupled with a lack of uncertainty quantification from machine learning predictions, make it challenging to achieve valid inferences in multi-source environments. In this paper, we consider the problem of obtaining distribution-free prediction intervals for a target population, leveraging multiple potentially biased data sources. We derive the efficient influence functions for the quantiles of unobserved outcomes in the target and source populations, and show that one can incorporate machine learning prediction algorithms in the estimation of nuisance functions while still achieving parametric rates of convergence to nominal coverage probabilities. Moreover, when conditional outcome invariance is violated, we propose a data-adaptive strategy to upweight informative data sources for efficiency gain and downweight non-informative data sources for bias reduction. We highlight the robustness and efficiency of our proposals for a variety of conformal scores and data-generating mechanisms via extensive synthetic experiments. Hospital length of stay prediction intervals for pediatric patients undergoing a high-risk cardiac surgical procedure between 2016-2022 in the U.S. illustrate the utility of our methodology.
- [14] arXiv:2405.09362 [pdf, ps, html, other]
Title: On the Saturation Effect of Kernel Ridge Regression
Comments: ICLR 2023; Minor errors are corrected in this version
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The saturation effect refers to the phenomenon that kernel ridge regression (KRR) fails to achieve the information-theoretic lower bound when the smoothness of the underlying truth function exceeds a certain level. The saturation effect has been widely observed in practice, and a saturation lower bound of KRR has been conjectured for decades. In this paper, we provide a proof of this long-standing conjecture.
- [15] arXiv:2405.09455 [pdf, ps, html, other]
Title: Efficient pooling designs and screening performance in group testing for two type defectives
Subjects: Computation (stat.CO); Information Theory (cs.IT)
Group testing is used when we want to find a few defectives among a large number of items. Testing n items one by one requires n tests, but if the ratio of defectives is small, group testing is an efficient way to reduce the number of tests. Much research has been devoted to group testing for a single type of defective. In this paper, we consider the case where two types of defectives, A and B, exist. For two types of defectives, we develop a belief propagation algorithm to compute the marginal posterior probability of defectives. Furthermore, we construct several kinds of collections of pools in order to test for A and B. Using our belief propagation algorithm, we evaluate the performance of group testing by conducting simulations.
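The saving from pooling can be illustrated with the classical two-stage (Dorfman) scheme for a single type of defective: test each pool once, then retest every item in a positive pool. This is a textbook baseline, not the belief propagation procedure of the paper; the function name and the assumptions (independent defectives with a common prevalence, pool size dividing the number of items, error-free tests) are introduced here for illustration.

```python
def dorfman_expected_tests(n_items, prevalence, pool_size):
    """Expected number of tests under two-stage Dorfman pooling.

    Assumes independent defectives with a common prevalence, perfect tests,
    and that pool_size divides n_items.
    """
    n_pools = n_items / pool_size
    # A pool is positive if at least one of its items is defective.
    prob_pool_positive = 1.0 - (1.0 - prevalence) ** pool_size
    # One test per pool, plus pool_size retests for each positive pool.
    return n_pools + n_pools * prob_pool_positive * pool_size


# Hypothetical numbers: 1000 items, pools of 10.
rare = dorfman_expected_tests(1000, prevalence=0.01, pool_size=10)
common = dorfman_expected_tests(1000, prevalence=0.5, pool_size=10)
```

With 1% prevalence the expected count is roughly a fifth of the 1000 individual tests, while at 50% prevalence pooling costs more than testing one by one, which matches the abstract's condition that the ratio of defectives be small.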
- [16] arXiv:2405.09485 [pdf, ps, html, other]
Title: Predicting Future Change-points in Time Series
Comments: 37 pages, 4 figures
Subjects: Methodology (stat.ME)
Change-point detection and estimation procedures have been widely developed in the literature. However, commonly used approaches in change-point analysis have mainly focused on detecting change-points within an entire time series (off-line methods), or quickest detection of change-points in sequentially observed data (on-line methods). Both classes of methods are concerned with change-points that have already occurred. The arguably more important question of when future change-points may occur remains largely unexplored. In this paper, we develop a novel statistical model that describes the mechanism of change-point occurrence. Specifically, the model assumes a latent process in the form of a random walk driven by non-negative innovations, and an observed process which behaves differently when the latent process belongs to different regimes. By construction, an occurrence of a change-point is equivalent to the latent process hitting a regime threshold. Therefore, by predicting when the latent process will hit the next regime threshold, future change-points can be forecasted. The probabilistic properties of the model such as stationarity and ergodicity are established. A composite likelihood-based approach is developed for parameter estimation and model selection. Moreover, we construct the predictor and prediction interval for future change points based on the estimated model.
- [17] arXiv:2405.09493 [pdf, ps, html, other]
Title: Constrained Learning for Causal Inference and Semiparametric Statistics
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Causal estimation (e.g. of the average treatment effect) requires estimating complex nuisance parameters (e.g. outcome models). To adjust for errors in nuisance parameter estimation, we present a novel correction method that solves for the best plug-in estimator under the constraint that the first-order error of the estimator with respect to the nuisance parameter estimate is zero. Our constrained learning framework provides a unifying perspective to prominent first-order correction approaches including debiasing (a.k.a. augmented inverse probability weighting) and targeting (a.k.a. targeted maximum likelihood estimation). Our semiparametric inference approach, which we call the "C-Learner", can be implemented with modern machine learning methods such as neural networks and tree ensembles, and enjoys standard guarantees like semiparametric efficiency and double robustness. Empirically, we demonstrate our approach on several datasets, including those with text features that require fine-tuning language models. We observe the C-Learner matches or outperforms other asymptotically optimal estimators, with better performance in settings with less estimated overlap.
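For reference, the augmented inverse probability weighting estimator named above has the following standard form for the average treatment effect. The notation ($A_i$ for binary treatment, $\hat\mu_a$ for fitted outcome models, $\hat e$ for the fitted propensity score) is introduced here and is not taken from the paper; the C-Learner itself is a different, constrained plug-in construction.

```latex
\[
\hat\psi_{\mathrm{AIPW}}
= \frac{1}{n} \sum_{i=1}^{n} \left[
    \hat\mu_1(X_i) - \hat\mu_0(X_i)
    + \frac{A_i \,\bigl(Y_i - \hat\mu_1(X_i)\bigr)}{\hat e(X_i)}
    - \frac{(1 - A_i)\,\bigl(Y_i - \hat\mu_0(X_i)\bigr)}{1 - \hat e(X_i)}
  \right]
\]
```

The first two terms are the plug-in estimate and the last two are the first-order correction; the correction divides by $\hat e(X_i)$ and $1 - \hat e(X_i)$, which is why low estimated overlap can make it unstable.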
- [18] arXiv:2405.09510 [pdf, ps, html, other]
Title: The Instrumental Variable Model with Categorical Instrument, Treatment and Outcome
Subjects: Statistics Theory (math.ST)
Instrumental variable models are central to the inference of causal effects in many settings. We consider the instrumental variable model with discrete variables where the instrument (Z), exposure (X) and outcome (Y) take Q, K, and M levels respectively. We assume that the instrument is randomized and that there is no direct effect of Z on Y so that Y(x,z) = Y(x). We first provide a simple characterization of the set of joint distributions of the potential outcomes P(Y(x=1), ..., Y(x=K)) compatible with a given observed distribution P(X, Y | Z). We then discuss the variation (in)dependence property of the marginal probability distribution of the potential outcomes P(Y(x=1)), ..., P(Y(x=K)) which has direct implications for partial identification of average causal effect contrasts such as E[Y(x=i) - Y(x=j)]. We also include simulation results on the volume of the observed distributions not compatible with the IV model as K and Q change.
- [19] arXiv:2405.09511 [pdf, ps, html, other]
Title: Stability via resampling: statistical problems beyond the real line
Subjects: Statistics Theory (math.ST)
Model averaging techniques based on resampling methods (such as bootstrapping or subsampling) have been utilized across many areas of statistics, often with the explicit goal of promoting stability in the resulting output. We provide a general, finite-sample theoretical result guaranteeing the stability of bagging when applied to algorithms that return outputs in a general space, so that the output is not necessarily real-valued: for example, an algorithm that estimates a vector of weights or a density function. We empirically assess the stability of bagging on synthetic and real-world data for a range of problem settings, including causal inference, nonparametric regression, and Bayesian model selection.
- [20] arXiv:2405.09516 [pdf, ps, html, other]
Title: Generalization Bounds for Causal Regression: Insights, Guarantees and Sensitivity Analysis
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Many algorithms have been recently proposed for causal machine learning. Yet, there is little to no theory on their quality, especially considering finite samples. In this work, we propose a theory based on generalization bounds that provides such guarantees. By introducing a novel change-of-measure inequality, we are able to tightly bound the model loss in terms of the deviation of the treatment propensities over the population, which we show can be empirically limited. Our theory is fully rigorous and holds even in the face of hidden confounding and violations of positivity. We demonstrate our bounds on semi-synthetic and real data, showcasing their remarkable tightness and practical utility.
- [21] arXiv:2405.09523 [pdf, ps, html, other]
Title: On Semi-supervised Estimation of Discrete Distributions under f-divergences
Comments: Full version. Presented in ISIT-24. arXiv admin note: text overlap with arXiv:2305.07955
Subjects: Statistics Theory (math.ST); Information Theory (cs.IT)
We study the problem of estimating the joint probability mass function (pmf) over two random variables. In particular, the estimation is based on the observation of $m$ samples containing both variables and $n$ samples missing one fixed variable. We adopt the minimax framework with $l^p_p$ loss functions. Recent work established that univariate minimax estimator combinations achieve minimax risk with the optimal first-order constant for $p \ge 2$ in the regime $m = o(n)$; questions remained for $p \le 2$ and various $f$-divergences. In our study, we affirm that these composite estimators are indeed minimax optimal for $l^p_p$ loss functions, specifically for the range $1 \le p \le 2$, including the critical $l_1$ loss. Additionally, we ascertain their optimality for a suite of $f$-divergences, such as KL, $\chi^2$, Squared Hellinger, and Le Cam divergences.
- [22] arXiv:2405.09536 [pdf, ps, html, other]
Title: Wasserstein Gradient Boosting: A General Framework with Applications to Posterior Regression
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
Gradient boosting is a sequential ensemble method that fits a new base learner to the gradient of the remaining loss at each step. We propose a novel family of gradient boosting, Wasserstein gradient boosting, which fits a new base learner to an exactly or approximately available Wasserstein gradient of a loss functional on the space of probability distributions. Wasserstein gradient boosting returns a set of particles that approximates a target probability distribution assigned at each input. In probabilistic prediction, a parametric probability distribution is often specified on the space of output variables, and a point estimate of the output-distribution parameter is produced for each input by a model. Our main application of Wasserstein gradient boosting is a novel distributional estimate of the output-distribution parameter, which approximates the posterior distribution over the output-distribution parameter determined pointwise at each data point. We empirically demonstrate the superior performance of the probabilistic prediction by Wasserstein gradient boosting in comparison with various existing methods.
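The first sentence of the abstract describes ordinary gradient boosting, which can be sketched as follows for squared loss with one-split regression stumps on a one-dimensional input. This is a minimal illustration of the classical procedure only, not of the Wasserstein variant; the function names, the stump base learner, the learning rate, and the toy data are all assumptions.

```python
import numpy as np


def fit_stump(x, residual):
    """Least-squares regression stump (single split) on 1-D inputs."""
    best = None
    # Candidate thresholds: every distinct x value except the largest,
    # so that both sides of the split are non-empty.
    for threshold in np.unique(x)[:-1]:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return lambda z: np.where(z <= threshold, left_value, right_value)


def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    """Fit each new stump to the negative gradient of the squared loss."""
    prediction = np.full_like(y, y.mean(), dtype=float)
    learners = []
    for _ in range(n_rounds):
        # For loss 0.5 * (y - f)^2 the negative gradient is the residual.
        negative_gradient = y - prediction
        stump = fit_stump(x, negative_gradient)
        prediction = prediction + learning_rate * stump(x)
        learners.append(stump)
    return prediction, learners


# Toy data: a noiseless sine curve.
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2.0 * np.pi * x)
fitted, learners = gradient_boost(x, y)
mse_start = float(np.mean((y - y.mean()) ** 2))
mse_end = float(np.mean((y - fitted) ** 2))
```

Each round moves the prediction a small step along the fitted negative gradient, so the training error shrinks as rounds accumulate; the Wasserstein version described in the abstract replaces this gradient with one defined on a space of probability distributions.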
- [23] arXiv:2405.09541 [pdf, ps, html, other]
Title: Spectral complexity of deep neural networks
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
It is well-known that randomly initialized, push-forward, fully-connected neural networks weakly converge to isotropic Gaussian processes, in the limit where the width of all layers goes to infinity. In this paper, we propose to use the angular power spectrum of the limiting field to characterize the complexity of the network architecture. In particular, we define sequences of random variables associated with the angular power spectrum, and provide a full characterization of the network complexity in terms of the asymptotic distribution of these sequences as the depth diverges. On this basis, we classify neural networks as low-disorder, sparse, or high-disorder; we show how this classification highlights a number of distinct features for standard activation functions, and in particular, sparsity properties of ReLU networks. Our theoretical results are also validated by numerical simulations.
New submissions for Thursday, 16 May 2024 (showing 23 of 23 entries)
- [24] arXiv:2405.08886 (cross-list from cs.LG) [pdf, ps, html, other]
Title: The Pitfalls and Promise of Conformal Inference Under Adversarial Attacks
Comments: ICML2024
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In safety-critical applications such as medical imaging and autonomous driving, where decisions have profound implications for patient health and road safety, it is imperative to maintain both high adversarial robustness to protect against potential adversarial attacks and reliable uncertainty quantification in decision-making. With extensive research focused on enhancing adversarial robustness through various forms of adversarial training (AT), a notable knowledge gap remains concerning the uncertainty inherent in adversarially trained models. To address this gap, this study investigates the uncertainty of deep learning models by examining the performance of conformal prediction (CP) in the context of standard adversarial attacks within the adversarial defense community. It is first unveiled that existing CP methods do not produce informative prediction sets under the commonly used $l_{\infty}$-norm bounded attack if the model is not adversarially trained, which underpins the importance of adversarial training for CP. Our paper next demonstrates that the prediction set size (PSS) of CP using adversarially trained models with AT variants is often worse than using standard AT, inspiring us to research into CP-efficient AT for improved PSS. We propose to optimize a Beta-weighting loss with an entropy minimization regularizer during AT to improve CP-efficiency, where the Beta-weighting loss is shown to be an upper bound of PSS at the population level by our theoretical analysis. Moreover, our empirical study on four image classification datasets across three popular AT baselines validates the effectiveness of the proposed Uncertainty-Reducing AT (AT-UR).
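To make "prediction set size" concrete, the sketch below shows split conformal prediction for classification with the common score of one minus the predicted class probability. It is a generic illustration, not the AT-UR method; the function names, the choice of score, and the numbers are assumptions.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Finite-sample conformal quantile of the calibration scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: include every label.
        return math.inf
    return sorted(cal_scores)[k - 1]


def prediction_set(class_probs, threshold):
    """Labels whose score (1 - predicted probability) is within the threshold."""
    return [label for label, p in enumerate(class_probs) if 1 - p <= threshold]


# Hypothetical calibration scores: 1 - probability assigned to the true label.
cal_scores = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
threshold = conformal_threshold(cal_scores, alpha=0.25)

# Hypothetical softmax output for one test input with three classes.
labels = prediction_set([0.7, 0.2, 0.1], threshold)
set_size = len(labels)
```

Here the set contains two of the three labels. A model whose calibration scores are larger, for instance under attack, yields a larger threshold and therefore larger, less informative sets.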
- [25] arXiv:2405.08917 (cross-list from cs.LG) [pdf, ps, html, other]
Title: Feature Importance and Explainability in Quantum Machine Learning
Comments: Amended final year project. 23 pages
Subjects: Machine Learning (cs.LG); Quantum Algebra (math.QA); Machine Learning (stat.ML)
Many Machine Learning (ML) models are referred to as black box models, providing no real insights into why a prediction is made. Feature importance and explainability are important for increasing transparency and trust in ML models, particularly in settings such as healthcare and finance. Quantum computing's unique capabilities, such as leveraging quantum mechanical phenomena like superposition, can be combined with ML techniques to create the field of Quantum Machine Learning (QML), and feature importance and explainability techniques may be applied to QML models. This article explores feature importance and explainability insights in QML compared to classical ML models. Utilizing the widely recognized Iris dataset, classical ML algorithms such as SVM and Random Forests are compared against hybrid quantum counterparts implemented via IBM's Qiskit platform: the Variational Quantum Classifier (VQC) and Quantum Support Vector Classifier (QSVC). This article aims to provide a comparison of the insights generated in ML by employing permutation and leave-one-out feature importance methods, alongside ALE (Accumulated Local Effects) and SHAP (SHapley Additive exPlanations) explainers.
- [26] arXiv:2405.08920 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Neural Collapse Meets Differential Privacy: Curious Behaviors of NoisyGD with Near-perfect Representation Learning
Comments: To appear in ICML 2024
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
A recent study by De et al. (2022) has reported that large-scale representation learning through pre-training on a public dataset significantly enhances differentially private (DP) learning in downstream tasks, despite the high dimensionality of the feature space. To theoretically explain this phenomenon, we consider the setting of a layer-peeled model in representation learning, which results in interesting phenomena related to learned features in deep learning and transfer learning, known as Neural Collapse (NC).
Within the framework of NC, we establish an error bound indicating that the misclassification error is independent of dimension when the distance between actual features and the ideal ones is smaller than a threshold. Additionally, the quality of the features in the last layer is empirically evaluated under different pre-trained models within the framework of NC, showing that a more powerful transformer leads to a better feature representation. Furthermore, we reveal that DP fine-tuning is less robust compared to fine-tuning without DP, particularly in the presence of perturbations. These observations are supported by both theoretical analyses and experimental evaluation. Moreover, to enhance the robustness of DP fine-tuning, we suggest several strategies, such as feature normalization or employing dimension reduction methods like Principal Component Analysis (PCA). Empirically, we demonstrate a significant improvement in testing accuracy by conducting PCA on the last-layer features.
- [27] arXiv:2405.08940 (cross-list from math.DS) [pdf, ps, html, other]
-
Title: Dynamical systems and complex networks: A Koopman operator perspective
Subjects: Dynamical Systems (math.DS); Machine Learning (stat.ML)
The Koopman operator has entered and transformed many research areas in recent years. Although the underlying concept – representing highly nonlinear dynamical systems by infinite-dimensional linear operators – has been known for a long time, the availability of large data sets and efficient machine learning algorithms for estimating the Koopman operator from data makes this framework extremely powerful and popular. Koopman operator theory allows us to gain insights into the characteristic global properties of a system without requiring detailed mathematical models. We will show how these methods can also be used to analyze complex networks and highlight relationships between Koopman operators and graph Laplacians.
- [28] arXiv:2405.08971 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Computation-Aware Kalman Filtering and Smoothing
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Kalman filtering and smoothing are the foundational mechanisms for efficient inference in Gauss-Markov models. However, their time and memory complexities scale prohibitively with the size of the state space. This is particularly problematic in spatiotemporal regression problems, where the state dimension scales with the number of spatial observations. Existing approximate frameworks leverage low-rank approximations of the covariance matrix. Since they do not model the error introduced by the computational approximation, their predictive uncertainty estimates can be overly optimistic. In this work, we propose a probabilistic numerical method for inference in high-dimensional Gauss-Markov models which mitigates these scaling issues. Our matrix-free iterative algorithm leverages GPU acceleration and crucially enables a tunable trade-off between computational cost and predictive uncertainty. Finally, we demonstrate the scalability of our method on a large-scale climate dataset.
- [29] arXiv:2405.08973 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: An adaptive approach to Bayesian Optimization with switching costs
Authors: Stefan Pricopie, Richard Allmendinger, Manuel Lopez-Ibanez, Clyde Fare, Matt Benatan, Joshua Knowles
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We investigate modifications to Bayesian Optimization for a resource-constrained setting of sequential experimental design where changes to certain design variables of the search space incur a switching cost. This models the scenario where there is a trade-off between evaluating more while maintaining the same setup, or switching and restricting the number of possible evaluations due to the incurred cost. We adapt two process-constrained batch algorithms to this sequential problem formulation, and propose two new methods: one cost-aware and one cost-ignorant. We validate and compare the algorithms using a set of 7 scalable test functions in different dimensionalities and switching-cost settings, for 30 total configurations. Our proposed cost-aware hyperparameter-free algorithm yields comparable results to tuned process-constrained algorithms in all settings we considered, suggesting some degree of robustness to varying landscape features and cost trade-offs. This method starts to outperform the other algorithms as the switching cost increases. Our work broadens out from other recent Bayesian Optimization studies in resource-constrained settings that consider a batch setting only. While the contributions of this work are relevant to the general class of resource-constrained problems, they are particularly relevant to problems where adaptability to varying resource availability is of high importance.
- [30] arXiv:2405.09076 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Enhancing Airline Customer Satisfaction: A Machine Learning and Causal Analysis Approach
Authors: Tejas Mirthipati (Georgia Institute Of Technology)
Comments: 7 pages, 19 figures
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
This study explores the enhancement of customer satisfaction in the airline industry, a critical factor for retaining customers and building brand reputation, which are vital for revenue growth. Utilizing a combination of machine learning and causal inference methods, we examine the specific impact of service improvements on customer satisfaction, with a focus on the online boarding pass experience. Through detailed data analysis involving several predictive and causal models, we demonstrate that improvements in the digital aspects of customer service significantly elevate overall customer satisfaction. This paper highlights how airlines can strategically leverage these insights to make data-driven decisions that enhance customer experiences and, consequently, their market competitiveness.
- [31] arXiv:2405.09327 (cross-list from q-bio.PE) [pdf, ps, other]
-
Title: Leveraging graphical model techniques to study evolution on phylogenetic networks
Subjects: Populations and Evolution (q-bio.PE); Computation (stat.CO)
The evolution of molecular and phenotypic traits is commonly modelled using Markov processes along a rooted phylogeny. This phylogeny can be a tree, or a network if it includes reticulations, representing events such as hybridization or admixture. Computing the likelihood of data observed at the leaves is costly as the size and complexity of the phylogeny grow. Efficient algorithms exist for trees, but cannot be applied to networks. We show that a vast array of models for trait evolution along phylogenetic networks can be reformulated as graphical models, for which efficient belief propagation algorithms exist. We provide a brief review of belief propagation on general graphical models, then focus on linear Gaussian models for continuous traits. We show how belief propagation techniques can be applied for exact or approximate (but more scalable) likelihood and gradient calculations, and prove novel results for efficient parameter inference of some models. We highlight the possible fruitful interactions between graphical models and phylogenetic methods. For example, approximate likelihood approaches have the potential to greatly reduce computational costs for phylogenies with reticulations.
- [32] arXiv:2405.09360 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: The Unfairness of $\varepsilon$-Fairness
Subjects: Machine Learning (cs.LG); Theoretical Economics (econ.TH); Mathematical Finance (q-fin.MF); Machine Learning (stat.ML)
Fairness in decision-making processes is often quantified using probabilistic metrics. However, these metrics may not fully capture the real-world consequences of unfairness. In this article, we adopt a utility-based approach to more accurately measure the real-world impacts of decision-making processes. In particular, we show that if the concept of $\varepsilon$-fairness is employed, it can lead to outcomes that are maximally unfair in the real-world context. Additionally, we address the common issue of unavailable data on false negatives by proposing a reduced setting that still captures essential fairness considerations. We illustrate our findings with two real-world examples: college admissions and credit risk assessment. Our analysis reveals that while traditional probability-based evaluations might suggest fairness, a utility-based approach uncovers the necessary actions to truly achieve equality. For instance, in the college admission case, we find that enhancing completion rates is crucial for ensuring fairness. In summary, this paper highlights the importance of considering the real-world context when evaluating fairness.
- [33] arXiv:2405.09500 (cross-list from econ.TH) [pdf, ps, html, other]
-
Title: Identifying Heterogeneous Decision Rules From Choices When Menus Are Unobserved
Subjects: Theoretical Economics (econ.TH); Econometrics (econ.EM); Statistics Theory (math.ST)
Given only aggregate choice data and limited information about how menus are distributed across the population, we describe what can be inferred robustly about the distribution of preferences (or more general decision rules). We strengthen and generalize existing results on such identification and provide an alternative analytical approach to study the problem. We show further that our model and results are applicable, after suitable reinterpretation, to other contexts. One application is to the robust identification of the distribution of updating rules given only the population distribution of beliefs and limited information about heterogeneous information sources.
Cross submissions for Thursday, 16 May 2024 (showing 10 of 10 entries)
- [34] arXiv:2209.07295 (replaced) [pdf, ps, html, other]
-
Title: A new set of tools for goodness-of-fit validation
Comments: 35 pages, 10 figures, submitted to the Electronic Journal of Statistics
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
We introduce two new tools to assess the validity of statistical distributions. These tools are based on components derived from a new statistical quantity, the comparison curve. The first tool is a graphical representation of these components on a bar plot (B plot), which can provide a detailed appraisal of the validity of the statistical model, in particular when supplemented by acceptance regions related to the model. The knowledge gained from this representation can sometimes suggest an existing goodness-of-fit test to supplement this visual assessment with a control of the type I error. Otherwise, an adaptive test may be preferable and the second tool is the combination of these components to produce a powerful $\chi^2$-type goodness-of-fit test. Because the number of these components can be large, we introduce a new selection rule to decide, in a data-driven fashion, on their proper number to take into consideration. In a simulation, our goodness-of-fit tests are seen to be powerwise competitive with the best solutions that have been recommended in the context of a fully specified model as well as when some parameters must be estimated. Practical examples show how to use these tools to derive principled information about where the model departs from the data.
- [35] arXiv:2212.01792 (replaced) [pdf, ps, html, other]
-
Title: Classification by sparse generalized additive models
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Methodology (stat.ME)
We consider (nonparametric) sparse (generalized) additive models (SpAM) for classification. The design of a SpAM classifier is based on minimizing the logistic loss with sparse group Lasso/Slope-type penalties on the coefficients of univariate additive components' expansions in orthonormal series (e.g., Fourier or wavelets). The resulting classifier is inherently adaptive to the unknown sparsity and smoothness. We show that under a certain sparse group restricted eigenvalue condition it is nearly-minimax (up to log-factors) simultaneously across the entire range of analytic, Sobolev and Besov classes. The performance of the proposed classifier is illustrated on simulated and real-data examples.
- [36] arXiv:2212.11304 (replaced) [pdf, ps, html, other]
-
Title: Powerful Partial Conjunction Hypothesis Testing via Conditioning
Subjects: Methodology (stat.ME)
A Partial Conjunction Hypothesis (PCH) test combines information across a set of base hypotheses to determine whether some subset is non-null. PCH tests arise in a diverse array of fields, but standard PCH testing methods can be highly conservative, leading to low power especially in low signal settings commonly encountered in applications. In this paper, we introduce the conditional PCH (cPCH) test, a new method for testing a single PCH that directly corrects the conservativeness of standard approaches by conditioning on certain order statistics of the base p-values. Under distributional assumptions commonly encountered in PCH testing, the cPCH test is valid and produces nearly uniformly distributed p-values under the null (i.e., cPCH p-values are only very slightly conservative). We demonstrate that the cPCH test matches or outperforms existing single PCH tests with particular power gains in low signal settings, maintains Type I error control even under model misspecification, and can be used to outperform state-of-the-art multiple PCH testing procedures in certain settings, particularly when side information is present. Finally, we illustrate an application of the cPCH test through a replicability analysis across DNA microarray studies.
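For readers unfamiliar with the "standard PCH testing methods" that the abstract describes as conservative, the following is a minimal sketch of one classical combining rule, the Bonferroni-style partial conjunction p-value built from the r-th smallest base p-value. It is not the cPCH test of the paper; the function name `bonferroni_pch_pvalue` is hypothetical and chosen only for illustration.

```python
def bonferroni_pch_pvalue(pvalues, r):
    """Bonferroni-style partial conjunction p-value.

    Tests H0: fewer than r of the n base hypotheses are non-null,
    using only the r-th smallest base p-value (an order statistic).
    This is a classical baseline, not the cPCH test from the abstract.
    """
    n = len(pvalues)
    if not 1 <= r <= n:
        raise ValueError("r must satisfy 1 <= r <= len(pvalues)")
    p_sorted = sorted(pvalues)
    # Scale the r-th smallest p-value by the number of remaining hypotheses.
    return min(1.0, (n - r + 1) * p_sorted[r - 1])


if __name__ == "__main__":
    base = [0.01, 0.02, 0.5]
    for r in (1, 2, 3):
        print(r, bonferroni_pch_pvalue(base, r))
```

With r = 1 this reduces to the ordinary Bonferroni global test, and with r = n it returns the largest base p-value, which illustrates why such rules lose power when signals are weak.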
- [37] arXiv:2305.13152 (replaced) [pdf, ps, html, other]
-
Title: Covariate-informed reconstruction of partially observed functional data via factor models
Subjects: Statistics Theory (math.ST)
This paper studies linear reconstruction of partially observed functional data which are recorded on a discrete grid. We propose a novel estimation approach based on approximate factor models with increasing rank taking into account potential covariate information. Whereas alternative reconstruction procedures commonly involve some preliminary smoothing, our method separates the signal from noise and reconstructs missing fragments at once. We establish uniform convergence rates of our estimator and introduce a new method for constructing simultaneous prediction bands for the missing trajectories. A simulation study examines the performance of the proposed methods in finite samples. Finally, a real data application of temperature curves demonstrates that our theory provides a simple and effective method to recover missing fragments.
- [38] arXiv:2306.08321 (replaced) [pdf, ps, html, other]
-
Title: Nonparametric regression using over-parameterized shallow ReLU neural networks
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
It is shown that over-parameterized neural networks can achieve minimax optimal rates of convergence (up to logarithmic factors) for learning functions from certain smooth function classes, if the weights are suitably constrained or regularized. Specifically, we consider the nonparametric regression of estimating an unknown $d$-variate function by using shallow ReLU neural networks. It is assumed that the regression function is from the Hölder space with smoothness $\alpha<(d+3)/2$ or a variation space corresponding to shallow neural networks, which can be viewed as an infinitely wide neural network. In this setting, we prove that least squares estimators based on shallow neural networks with certain norm constraints on the weights are minimax optimal, if the network width is sufficiently large. As a byproduct, we derive a new size-independent bound for the local Rademacher complexity of shallow ReLU neural networks, which may be of independent interest.
- [39] arXiv:2307.09546 (replaced) [pdf, ps, other]
-
Title: Spatio-temporal quasi-experimental methods for rare disease outcomes: The impact of reformulated gasoline on childhood hematologic cancer
Subjects: Applications (stat.AP)
Although some pollutants emitted in vehicle exhaust, such as benzene, are known to cause leukemia in adults with high exposure levels, less is known about the relationship between traffic-related air pollution (TRAP) and childhood hematologic cancer. In the 1990s, the US EPA enacted the reformulated gasoline program in select areas of the US, which drastically reduced ambient TRAP in affected areas. This created an ideal quasi-experiment to study the effects of TRAP on childhood hematologic cancers. However, existing methods for quasi-experimental analyses can perform poorly when outcomes are rare and unstable, as with childhood cancer incidence. We develop Bayesian spatio-temporal matrix completion methods to conduct causal inference in quasi-experimental settings with rare outcomes. Selective information sharing across space and time enables stable estimation, and the Bayesian approach facilitates uncertainty quantification. We evaluate the methods through simulations and apply them to estimate the causal effects of TRAP on childhood leukemia and lymphoma.
- [40] arXiv:2308.01156 (replaced) [pdf, ps, html, other]
-
Title: A new adaptive local polynomial density estimation procedure on complicated domains
Comments: 43 pages, 4 figures
Subjects: Statistics Theory (math.ST); Probability (math.PR); Applications (stat.AP); Methodology (stat.ME)
This paper presents a novel approach for pointwise estimation of multivariate density functions on known domains of arbitrary dimensions using nonparametric local polynomial estimators. Our method is highly flexible, as it applies to both simple domains, such as open connected sets, and more complicated domains that are not star-shaped around the point of estimation. This enables us to handle domains with sharp concavities, holes, and local pinches, such as polynomial sectors. Additionally, we introduce a data-driven selection rule based on the general ideas of Goldenshluger and Lepski. Our results demonstrate that the local polynomial estimators are minimax under an $L^2$ risk across a wide range of Hölder-type functional classes. In the adaptive case, we provide oracle inequalities and explicitly determine the convergence rate of our statistical procedure. Simulations on polynomial sectors show that our oracle estimates outperform those of the most popular alternative method, found in the sparr package for the R software. Our statistical procedure is implemented in an online R package which is readily accessible.
- [41] arXiv:2310.02877 (replaced) [pdf, ps, html, other]
-
Title: Stationarity without mean reversion in improper Gaussian processes
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The behavior of a GP regression depends on the choice of covariance function. Stationary covariance functions are preferred in machine learning applications. However, (non-periodic) stationary covariance functions are always mean reverting and can therefore exhibit pathological behavior when applied to data that does not relax to a fixed global mean value. In this paper we show that it is possible to use improper GP priors with infinite variance to define processes that are stationary but not mean reverting. To this end, we use non-positive kernels that can only be defined in this limit regime. The resulting posterior distributions can be computed analytically and involve a simple correction of the usual formulas. The main contribution of the paper is the introduction of a large family of smooth non-reverting covariance functions that closely resemble the kernels commonly used in the GP literature (e.g. squared exponential and Matérn class). By analyzing both synthetic and real data, we demonstrate that these non-positive kernels solve some known pathologies of mean reverting GP regression while retaining most of the favorable properties of ordinary smooth stationary kernels.
- [42] arXiv:2310.14448 (replaced) [pdf, ps, html, other]
-
Title: Semiparametrically Efficient Score for the Survival Odds Ratio
Subjects: Methodology (stat.ME)
We consider a general proportional odds model for survival data under binary treatment, where the functional form of the covariates is left unspecified. We derive the efficient score for the conditional survival odds ratio given the covariates using modern semiparametric theory. The efficient score may be useful in the development of doubly robust estimators, although computational challenges remain.
- [43] arXiv:2402.00183 (replaced) [pdf, ps, html, other]
-
Title: A review of regularised estimation methods and cross-validation in spatiotemporal statistics
Subjects: Methodology (stat.ME); Computation (stat.CO); Other Statistics (stat.OT)
This review article focuses on regularised estimation procedures applicable to geostatistical and spatial econometric models. These methods are particularly relevant in the case of big geospatial data for dimensionality reduction or model selection. To structure the review, we initially consider the most general case of multivariate spatiotemporal processes (i.e., $g > 1$ dimensions of the spatial domain, a one-dimensional temporal domain, and $q \geq 1$ random variables). Then, the idea of regularised/penalised estimation procedures and different choices of shrinkage targets are discussed. Finally, guided by the elements of a mixed-effects model setup, which allows for a variety of spatiotemporal models, we show different regularisation procedures and how they can be used for the analysis of geo-referenced data, e.g. for selection of relevant regressors, dimensionality reduction of the covariance matrices, detection of conditionally independent locations, or the estimation of a full spatial interaction matrix.
- [44] arXiv:2402.01493 (replaced) [pdf, ps, html, other]
-
Title: Sliced-Wasserstein Estimation with Spherical Harmonics as Control Variates
Comments: Accepted to ICML 2024
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The Sliced-Wasserstein (SW) distance between probability measures is defined as the average of the Wasserstein distances resulting for the associated one-dimensional projections. As a consequence, the SW distance can be written as an integral with respect to the uniform measure on the sphere and the Monte Carlo framework can be employed for calculating the SW distance. Spherical harmonics are polynomials on the sphere that form an orthonormal basis of the set of square-integrable functions on the sphere. Putting these two facts together, a new Monte Carlo method, hereby referred to as Spherical Harmonics Control Variates (SHCV), is proposed for approximating the SW distance using spherical harmonics as control variates. The resulting approach is shown to have good theoretical properties, e.g., a no-error property for Gaussian measures under a certain form of linear dependency between the variables. Moreover, an improved rate of convergence, compared to Monte Carlo, is established for general measures. The convergence analysis relies on the Lipschitz property associated to the SW integrand. Several numerical experiments demonstrate the superior performance of SHCV against state-of-the-art methods for SW distance computation.
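The plain Monte Carlo baseline mentioned in this abstract can be sketched as follows: draw directions uniformly on the sphere, project both samples, and average the one-dimensional Wasserstein distances. This is only the baseline estimator, not the SHCV method; the function names and the assumption of two equal-size samples are illustrative choices, not taken from the paper.

```python
import math
import random


def wasserstein_1d_pow(u, v, p=2):
    """W_p^p between two equal-size 1-D empirical measures (sorted matching)."""
    u, v = sorted(u), sorted(v)
    return sum(abs(a - b) ** p for a, b in zip(u, v)) / len(u)


def random_direction(d, rng):
    """Uniform direction on the unit sphere: normalize a standard Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]


def sliced_wasserstein(X, Y, n_proj=200, p=2, seed=0):
    """Monte Carlo estimate of SW_p between equal-size samples X and Y."""
    rng = random.Random(seed)
    d = len(X[0])
    total = 0.0
    for _ in range(n_proj):
        theta = random_direction(d, rng)
        # Project each sample onto the random direction theta.
        px = [sum(t * x for t, x in zip(theta, row)) for row in X]
        py = [sum(t * y for t, y in zip(theta, row)) for row in Y]
        total += wasserstein_1d_pow(px, py, p)
    return (total / n_proj) ** (1.0 / p)


if __name__ == "__main__":
    X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    Y = [[x + 3.0, y + 4.0] for x, y in X]  # X translated by (3, 4)
    print(sliced_wasserstein(X, Y, n_proj=2000))
```

For a pure translation by a vector s in dimension d, the exact value of SW_2 is |s| / sqrt(d), which gives a convenient sanity check on the estimator's Monte Carlo error.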
- [45] arXiv:2402.09623 (replaced) [pdf, ps, html, other]
-
Title: Conformalized Adaptive Forecasting of Heterogeneous Trajectories
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
This paper presents a new conformal method for generating simultaneous forecasting bands guaranteed to cover the entire path of a new random trajectory with sufficiently high probability. Prompted by the need for dependable uncertainty estimates in motion planning applications where the behavior of diverse objects may be more or less unpredictable, we blend different techniques from online conformal prediction of single and multiple time series, as well as ideas for addressing heteroscedasticity in regression. This solution is both principled, providing precise finite-sample guarantees, and effective, often leading to more informative predictions than prior methods.
- [46] arXiv:2404.17222 (replaced) [pdf, ps, other]
-
Title: Asymptotic analysis for covariance parameter estimation of Gaussian processes with functional inputs
Subjects: Statistics Theory (math.ST)
We consider covariance parameter estimation for Gaussian processes with functional inputs. From an increasing-domain asymptotics perspective, we prove the asymptotic consistency and normality of the maximum likelihood estimator. We extend these theoretical guarantees to encompass scenarios accounting for approximation errors in the inputs, which allows robustness of practical implementations relying on conventional sampling methods or projections onto a functional basis. Loosely speaking, both consistency and normality hold when the approximation error becomes negligible, a condition that is often achieved as the number of samples or basis functions becomes large. These latter asymptotic properties are illustrated through analytical examples, including one that covers the case of non-randomly perturbed grids, as well as several numerical illustrations.
- [47] arXiv:2405.00884 (replaced) [pdf, ps, html, other]
-
Title: What's So Hard about the Monty Hall Problem?
Subjects: Other Statistics (stat.OT)
The Monty Hall problem is notorious for its deceptive simplicity. Although today it is widely used as a provocative thought experiment to introduce Bayesian thinking to students of probability, in the not so distant past it was rejected by established mathematicians. This essay provides some historical background to the problem and explains why it is considered so counter-intuitive to many. It is argued that the main barrier to understanding the problem is the back-grounding of the concept of dependence in probability theory as it is commonly taught. To demonstrate this, a Bayesian solution is provided and augmented with a probabilistic graphical model (PGM) inspired by the work of Pearl (1988, 1998). Although the Bayesian approach produces the correct answer, without a representation of the dependency structure of events implied by the problem, the salient fact that motivates the problem's solution remains hidden.
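The Bayesian solution referred to in this abstract can be reproduced in a few lines. The sketch below is not taken from the essay; it assumes the usual rules (three doors, the host knows where the car is, never opens the player's door or the car's door, and chooses uniformly when two doors are available). The function name `monty_hall_posterior` is hypothetical.

```python
from fractions import Fraction


def monty_hall_posterior(pick=0, opened=2):
    """Posterior over the car's door, given the player's pick and the opened door.

    Assumes three doors labelled 0, 1, 2 and the standard host behaviour:
    the host never opens the picked door or the car's door, and picks
    uniformly at random when two doors are allowed.
    """
    if pick == opened:
        raise ValueError("the host never opens the player's door")
    prior = Fraction(1, 3)
    unnormalized = {}
    for car in range(3):
        if car == opened:
            likelihood = Fraction(0)      # host never reveals the car
        elif car == pick:
            likelihood = Fraction(1, 2)   # host free to open either other door
        else:
            likelihood = Fraction(1)      # host forced to open this door
        unnormalized[car] = prior * likelihood
    total = sum(unnormalized.values())
    return {car: weight / total for car, weight in unnormalized.items()}


if __name__ == "__main__":
    print(monty_hall_posterior(pick=0, opened=2))
```

The likelihood term is where the dependence lives: the host's choice depends on both the player's pick and the car's location, and that dependence is what moves the posterior for the unopened door from 1/3 to 2/3.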
- [48] arXiv:2405.05389 (replaced) [pdf, ps, html, other]
-
Title: On foundation of generative statistics with F-entropy: a gradient-based approach
Comments: 29 pages
Subjects: Methodology (stat.ME)
This paper explores the interplay between statistics and generative artificial intelligence. Generative statistics, an integral part of the latter, aims to construct models that can generate efficiently and meaningfully new data across the whole of the (usually high dimensional) sample space, e.g. a new photo. Within it, the gradient-based approach is a current favourite that exploits effectively, for the above purpose, the information contained in the observed sample, e.g. an old photo. However, often there are missing data in the observed sample, e.g. missing bits in the old photo. To handle this situation, we have proposed a gradient-based algorithm for generative modelling. More importantly, our paper underpins rigorously this powerful approach by introducing a new F-entropy that is related to Fisher's divergence. (The F-entropy is also of independent interest.) The underpinning has enabled the gradient-based approach to expand its scope. For example, it can now provide a tool for generative model selection. Possible future projects include discrete data and Bayesian variational inference.
- [49] arXiv:2202.04294 (replaced) [pdf, ps, html, other]
-
Title: Optimal Clustering with Bandit Feedback
Comments: 54 pages, 4 figures
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
This paper considers the problem of online clustering with bandit feedback. A set of arms (or items) can be partitioned into various groups that are unknown. Within each group, the observations associated to each of the arms follow the same distribution with the same mean vector. At each time step, the agent queries or pulls an arm and obtains an independent observation from the distribution it is associated to. Subsequent pulls depend on previous ones as well as the previously obtained samples. The agent's task is to uncover the underlying partition of the arms with the least number of arm pulls and with a probability of error not exceeding a prescribed constant $\delta$. The problem proposed finds numerous applications from clustering of variants of viruses to online market segmentation. We present an instance-dependent information-theoretic lower bound on the expected sample complexity for this task, and design a computationally efficient and asymptotically optimal algorithm, namely Bandit Online Clustering (BOC). The algorithm includes a novel stopping rule for adaptive sequential testing that circumvents the need to exactly solve any NP-hard weighted clustering problem as its subroutines. We show through extensive simulations on synthetic and real-world datasets that BOC's performance matches the lower bound asymptotically, and significantly outperforms a non-adaptive baseline algorithm.
- [50] arXiv:2209.03935 (replaced) [pdf, ps, other]
-
Title: Generative Adversarial Networks Applied to Synthetic Financial Scenarios Generation
Journal-ref: Physica A: Statistical Mechanics and its Applications, 2023, 623, pp.128899
Subjects: Computational Finance (q-fin.CP); Applications (stat.AP)
The finance industry is producing an increasing number of datasets that investment professionals can consider to be influential on the price of financial assets. These datasets were initially mainly limited to exchange data, namely price, capitalization and volume. Their coverage has now considerably expanded to include, for example, macroeconomic data, supply and demand of commodities, balance sheet data and, more recently, extra-financial data such as ESG scores. This broadening of the factors retained as influential constitutes a serious challenge for statistical modeling. Indeed, the instability of the correlations between these factors makes it practically impossible to identify the joint laws needed to construct scenarios. Fortunately, spectacular advances in the Deep Learning field in recent years have given rise to GANs. GANs are a type of generative machine learning model that produces new data samples with the same characteristics as a training data distribution in an unsupervised way, avoiding data assumptions and human-induced biases. In this work, we explore the use of GANs for synthetic financial scenario generation. This pilot study is the result of a collaboration between Fujitsu and Advestis and it will be followed by a thorough exploration of the use cases that can benefit from the proposed solution. We propose a GAN-based algorithm that allows the replication of multivariate data representing several properties (including, but not limited to, price, market capitalization, ESG score, controversy score, ...) of a set of stocks. This approach differs from examples in the financial literature, which are mainly focused on the reproduction of temporal asset price scenarios. We also propose several metrics to evaluate the quality of the data generated by the GANs.
This approach is well suited to the generation of scenarios, the time direction simply arising as a subsequent (possibly conditioned) generation of data points drawn from the learned distribution. Our method will allow the simulation of high-dimensional scenarios (compared to the $\lesssim10$ features currently employed in most recent use cases) where network complexity is reduced thanks to carefully performed feature engineering and selection. Complete results will be presented in a forthcoming study.
- [51] arXiv:2302.10886 (replaced) [pdf, ps, other]
-
Title: Some Fundamental Aspects about Lipschitz Continuity of Neural Networks
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Lipschitz continuity is a crucial functional property of any predictive model that naturally governs its robustness, generalisation, and adversarial vulnerability. Contrary to other works that focus on obtaining tighter bounds and developing different practical strategies to enforce certain Lipschitz properties, we aim to thoroughly examine and characterise the Lipschitz behaviour of Neural Networks. Thus, we carry out an empirical investigation in a range of different settings (namely, architectures, datasets, label noise, and more) by exhausting the limits of the simplest and the most general lower and upper bounds. As a highlight of this investigation, we showcase a remarkable fidelity of the lower Lipschitz bound, identify a striking Double Descent trend in both upper and lower bounds to the Lipschitz constant, and explain the intriguing effects of label noise on function smoothness and generalisation.
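As a rough illustration of what a simple lower and upper Lipschitz bound can look like, the sketch below uses a small random ReLU network. It is an assumption that these are comparable to the bounds studied in the paper: the upper bound here is the product of layer spectral norms, and the lower bound is the largest input-Jacobian norm over a set of sample points. The helper names (`init_mlp`, `input_jacobian`, `lipschitz_upper`, `lipschitz_lower`) are hypothetical.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Random weights and zero biases for a ReLU MLP with the given layer sizes."""
    return [(rng.normal(0, 1 / np.sqrt(m), size=(n, m)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def input_jacobian(params, x):
    """Jacobian of the network output with respect to the input x."""
    jac = np.eye(x.size)
    h = x
    for i, (W, b) in enumerate(params):
        z = W @ h + b
        jac = W @ jac
        if i < len(params) - 1:               # ReLU on hidden layers only
            mask = (z > 0).astype(float)
            jac = mask[:, None] * jac
            h = np.maximum(z, 0.0)
        else:
            h = z
    return jac

def lipschitz_upper(params):
    """Product of layer spectral norms (valid because ReLU is 1-Lipschitz)."""
    return float(np.prod([np.linalg.norm(W, 2) for W, _ in params]))

def lipschitz_lower(params, X):
    """Largest Jacobian spectral norm over the sample points in X."""
    return float(max(np.linalg.norm(input_jacobian(params, x), 2) for x in X))

rng = np.random.default_rng(0)
params = init_mlp([10, 32, 32, 1], rng)       # illustrative architecture (assumption)
X = rng.normal(size=(200, 10))                # illustrative sample points (assumption)
lower = lipschitz_lower(params, X)
upper = lipschitz_upper(params)
print(f"lower bound {lower:.3f} <= upper bound {upper:.3f}")
```

The gap between the two numbers gives a sense of how loose the product-of-norms bound can be on a given network.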
- [52] arXiv:2307.11127 (replaced) [pdf, ps, html, other]
-
Title: Asymptotically Unbiased Synthetic Control Methods by Distribution Matching
Comments: This study was presented at the Workshop on Counterfactuals in Minds and Machines at the International Conference on Machine Learning in July 2023 and at the International Conference on Econometrics and Statistics in August 2023
Subjects: Econometrics (econ.EM); Machine Learning (cs.LG); Methodology (stat.ME)
Synthetic Control Methods (SCMs) have become an essential tool for comparative case studies. The fundamental idea of SCMs is to estimate the counterfactual outcomes of a treated unit using a weighted sum of the observed outcomes of untreated units. The accuracy of the synthetic control (SC) is critical for evaluating the treatment effect of a policy intervention; therefore, the estimation of SC weights has been the focus of extensive research. In this study, we first point out that existing SCMs suffer from an endogeneity problem, the correlation between the outcomes of untreated units and the error term of the synthetic control, which yields a bias in the treatment effect estimator. We then propose a novel SCM based on density matching, assuming that the density of outcomes of the treated unit can be approximated by a weighted average of the joint density of untreated units (i.e., a mixture model). Based on this assumption, we estimate SC weights by matching the moments of treated outcomes with the weighted sum of moments of untreated outcomes. Our proposed method has three advantages over existing methods: first, our estimator is asymptotically unbiased under the assumption of the mixture model; second, due to the asymptotic unbiasedness, we can reduce the mean squared error in counterfactual predictions; third, our method generates full densities of the treatment effect, not merely expected values, which broadens the applicability of SCMs. We provide experimental results to demonstrate the effectiveness of our proposed method.
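The moment-matching idea can be sketched as a constrained least-squares problem: stack a few empirical moments of each untreated unit into a matrix, and look for simplex weights whose weighted sum reproduces the treated unit's moments. The code below is a minimal sketch under stated assumptions, not the authors' estimator: the simulated Gaussian data, the choice of moment orders, and the projected-gradient solver (`fit_weights`, `project_to_simplex`) are all illustrative.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, v.size + 1)
    rho = idx[u - css / idx > 0][-1]
    theta = css[rho - 1] / rho
    return np.maximum(v - theta, 0.0)

def moment_matrix(outcomes, orders):
    """Rows are moment orders, columns are units; outcomes has shape (T, J)."""
    return np.stack([np.mean(outcomes ** m, axis=0) for m in orders])

def loss(M, m0, w):
    """Squared distance between weighted untreated moments and treated moments."""
    return 0.5 * float(np.sum((M @ w - m0) ** 2))

def fit_weights(M, m0, n_iter=5000):
    """Projected gradient descent on loss(M, m0, w) over the simplex."""
    w = np.full(M.shape[1], 1.0 / M.shape[1])
    step = 1.0 / np.linalg.norm(M, 2) ** 2    # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        w = project_to_simplex(w - step * M.T @ (M @ w - m0))
    return w

# Simulated pre-treatment data (assumption): the treated unit is a mixture of
# the untreated units' outcome distributions with weights w_true.
rng = np.random.default_rng(0)
T, means = 400, np.array([-2.0, 0.0, 1.0, 3.0])
w_true = np.array([0.5, 0.3, 0.2, 0.0])
untreated = rng.normal(means, 1.0, size=(T, means.size))
component = rng.choice(means.size, size=T, p=w_true)
treated = rng.normal(means[component], 1.0)

orders = (1, 2, 3)
M = moment_matrix(untreated, orders)
m0 = moment_matrix(treated[:, None], orders)[:, 0]
w_hat = fit_weights(M, m0)
print("estimated weights:", np.round(w_hat, 3))
```

With only a few hundred periods and three moments the recovered weights are noisy, so the sketch should be read as showing the structure of the optimisation rather than its statistical accuracy.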
- [53] arXiv:2308.14555 (replaced) [pdf, ps, other]
-
Title: Kernel Limit of Recurrent Neural Networks Trained on Ergodic Data Sequences
Comments: Major revision for lemma 7.1
Subjects: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
Mathematical methods are developed to characterize the asymptotics of recurrent neural networks (RNN) as the number of hidden units, data samples in the sequence, hidden state updates, and training steps simultaneously grow to infinity. In the case of an RNN with a simplified weight matrix, we prove the convergence of the RNN to the solution of an infinite-dimensional ODE coupled with the fixed point of a random algebraic equation. The analysis requires addressing several challenges which are unique to RNNs. In typical mean-field applications (e.g., feedforward neural networks), discrete updates are of magnitude $\mathcal{O}(\frac{1}{N})$ and the number of updates is $\mathcal{O}(N)$. Therefore, the system can be represented as an Euler approximation of an appropriate ODE/PDE, which it will converge to as $N \rightarrow \infty$. However, the RNN hidden layer updates are $\mathcal{O}(1)$. Therefore, RNNs cannot be represented as a discretization of an ODE/PDE and standard mean-field techniques cannot be applied. Instead, we develop a fixed point analysis for the evolution of the RNN memory states, with convergence estimates in terms of the number of update steps and the number of hidden units. The RNN hidden layer is studied as a function in a Sobolev space, whose evolution is governed by the data sequence (a Markov chain), the parameter updates, and its dependence on the RNN hidden layer at the previous time step. Due to the strong correlation between updates, a Poisson equation must be used to bound the fluctuations of the RNN around its limit equation. These mathematical methods give rise to the neural tangent kernel (NTK) limits for RNNs trained on data sequences as the number of data samples and size of the neural network grow to infinity.
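The mean-field scaling mentioned in the abstract, with $\mathcal{O}(N)$ updates each of magnitude $\mathcal{O}(\frac{1}{N})$, can be illustrated on a toy problem. The sketch below is an assumption-laden example rather than anything from the paper: it applies $N$ updates of size $1/N$ to the scalar ODE $dx/dt = -x$ on $[0, 1]$ and compares the result with the exact solution $e^{-1}$.

```python
import numpy as np

def euler_terminal_state(n_units, x0=1.0):
    """Apply n_units updates of size 1/n_units to dx/dt = -x, starting from x0."""
    x = x0
    for _ in range(n_units):
        x = x + (1.0 / n_units) * (-x)
    return x

exact = np.exp(-1.0)
errors = {n: abs(euler_terminal_state(n) - exact) for n in (10, 100, 1000)}
for n, err in errors.items():
    print(f"N = {n:5d}  |Euler - exact| = {err:.2e}")
```

The error shrinks roughly in proportion to $1/N$, which is the sense in which the discrete system converges to its ODE limit. Updates of magnitude $\mathcal{O}(1)$, as in the RNN hidden layer, do not fit this pattern.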
- [54] arXiv:2310.09031 (replaced) [pdf, ps, html, other]
-
Title: MINDE: Mutual Information Neural Diffusion Estimation
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In this work we present a new method for the estimation of Mutual Information (MI) between random variables. Our approach is based on an original interpretation of the Girsanov theorem, which allows us to use score-based diffusion models to estimate the Kullback-Leibler divergence between two densities as a difference between their score functions. As a by-product, our method also enables the estimation of the entropy of random variables. Armed with such building blocks, we present a general recipe to measure MI, which unfolds in two directions: one uses a conditional diffusion process, whereas the other uses joint diffusion processes that allow simultaneous modelling of two random variables. Our results, which derive from a thorough experimental protocol over all the variants of our approach, indicate that our method is more accurate than the main alternatives from the literature, especially for challenging distributions. Furthermore, our methods pass MI self-consistency tests, including data processing and additivity under independence, which are a pain point for existing methods.
- [55] arXiv:2312.05134 (replaced) [pdf, ps, html, other]
-
Title: Optimal Multi-Distribution Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Multi-distribution learning (MDL), which seeks to learn a shared model that minimizes the worst-case risk across $k$ distinct data distributions, has emerged as a unified framework in response to the evolving demand for robustness, fairness, multi-group collaboration, etc. Achieving data-efficient MDL necessitates adaptive sampling, also called on-demand sampling, throughout the learning process. However, there exist substantial gaps between the state-of-the-art upper and lower bounds on the optimal sample complexity. Focusing on a hypothesis class of Vapnik-Chervonenkis (VC) dimension $d$, we propose a novel algorithm that yields an $\varepsilon$-optimal randomized hypothesis with a sample complexity on the order of $(d+k)/\varepsilon^2$ (modulo some logarithmic factor), matching the best-known lower bound. Our algorithmic ideas and theory are further extended to accommodate Rademacher classes. The proposed algorithms are oracle-efficient: they access the hypothesis class solely through an empirical risk minimization oracle.
Additionally, we establish the necessity of randomization, revealing a large sample size barrier when only deterministic hypotheses are permitted. These findings resolve three open problems presented in COLT 2023 (i.e., Problems 1, 3 and 4 of Awasthi et al., 2023).
- [56] arXiv:2312.08174 (replaced) [pdf, ps, html, other]
-
Title: Double Machine Learning for Static Panel Models with Fixed Effects
Subjects: Econometrics (econ.EM); Machine Learning (cs.LG); Machine Learning (stat.ML)
Recent advances in causal inference have seen the development of methods which make use of the predictive power of machine learning algorithms. In this paper, we use double machine learning (DML) (Chernozhukov et al., 2018) to approximate high-dimensional and non-linear nuisance functions of the confounders to make inferences about the effects of policy interventions from panel data. We propose new estimators by adapting correlated random effects, within-group and first-difference estimation for linear models to an extension of Robinson (1988)'s partially linear regression model to static panel data models with individual fixed effects and unspecified non-linear confounder effects. Using Monte Carlo simulations, we compare the relative performance of different machine learning algorithms and find that conventional least squares estimators perform well when the data generating process is mildly non-linear and smooth, but there are substantial performance gains with DML in terms of bias reduction when the true effect of the regressors is non-linear and discontinuous. However, inference based on individual learners can lead to badly biased inference. Finally, we provide an illustrative example of DML for observational panel data showing the impact of the introduction of the minimum wage on voting behavior in the UK.
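The core DML step for a partially linear model can be sketched on cross-sectional data: residualise both the outcome and the treatment on the confounder using out-of-fold predictions, then regress one residual on the other. The code below is a simplified sketch, not the panel estimators proposed in the paper: it ignores fixed effects, uses a polynomial least-squares fit as a stand-in for a machine learning learner, and runs on simulated data. All function names are hypothetical.

```python
import numpy as np

def poly_features(x, degree=5):
    """Polynomial basis 1, x, ..., x^degree for a one-dimensional confounder."""
    return np.vander(x, degree + 1, increasing=True)

def cross_fit_residuals(target, x, n_folds=2, seed=0):
    """Out-of-fold residuals target - E[target | x], fitted by least squares."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(x.size), n_folds)
    resid = np.empty_like(target)
    for held_out in folds:
        train = np.setdiff1d(np.arange(x.size), held_out)
        coef, *_ = np.linalg.lstsq(poly_features(x[train]), target[train], rcond=None)
        resid[held_out] = target[held_out] - poly_features(x[held_out]) @ coef
    return resid

def dml_plr(y, d, x):
    """Residual-on-residual estimate of theta in y = theta * d + g(x) + u."""
    y_res = cross_fit_residuals(y, x)
    d_res = cross_fit_residuals(d, x)         # same seed, so the same fold split
    return float(d_res @ y_res / (d_res @ d_res))

# Simulated data (assumption): non-linear confounding through x.
rng = np.random.default_rng(1)
n, theta = 2000, 0.5
x = rng.uniform(-2.0, 2.0, size=n)
d = x**2 + rng.normal(size=n)
y = theta * d + np.sin(x) + x**2 + rng.normal(size=n)

naive = float(np.polyfit(d, y, 1)[0])         # OLS of y on d, ignoring x
theta_hat = dml_plr(y, d, x)
print(f"naive OLS: {naive:.3f}   DML: {theta_hat:.3f}   truth: {theta}")
```

In this simulation the naive slope absorbs part of the confounder effect, while the cross-fitted estimate stays close to the true coefficient.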
- [57] arXiv:2402.01454 (replaced) [pdf, ps, html, other]
-
Title: Integrating Large Language Models in Causal Discovery: A Statistical Causal Approach
Masayuki Takayama, Tadahisa Okuda, Thong Pham, Tatsuyoshi Ikenoue, Shingo Fukuma, Shohei Shimizu, Akiyoshi Sannai
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
In practical statistical causal discovery (SCD), embedding domain expert knowledge as constraints into the algorithm is widely accepted as significant for creating consistent, meaningful causal models, despite the recognized challenges in systematic acquisition of the background knowledge. To overcome these challenges, this paper proposes a novel methodology for causal inference, in which SCD methods and knowledge based causal inference (KBCI) with a large language model (LLM) are synthesized through "statistical causal prompting (SCP)" for LLMs and prior knowledge augmentation for SCD. Experiments have revealed that GPT-4 can cause the output of the LLM-KBCI and the SCD result with prior knowledge from LLM-KBCI to approach the ground truth, and that the SCD result can be further improved if GPT-4 undergoes SCP. Furthermore, by using an unpublished real-world dataset, we have demonstrated that the background knowledge provided by the LLM can improve SCD on this dataset, even if this dataset has never been included in the training data of the LLM. The proposed approach can thus address challenges such as dataset biases and limitations, illustrating the potential of LLMs to improve data-driven causal inference across diverse scientific domains.
- [58] arXiv:2402.02746 (replaced) [pdf, ps, html, other]
-
Title: Standard Gaussian Process Can Be Excellent for High-Dimensional Bayesian Optimization
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
There has been a long-standing and widespread belief that Bayesian Optimization (BO) with standard Gaussian process (GP), referred to as standard BO, is ineffective in high-dimensional optimization problems. While this belief sounds reasonable, strong empirical evidence is lacking. In this paper, we systematically investigated BO with standard GP regression across a variety of synthetic and real-world benchmark problems for high-dimensional optimization. We found that, surprisingly, when using Matérn kernels and Upper Confidence Bound (UCB), standard BO consistently achieves top-tier performance, often outperforming other BO methods specifically designed for high-dimensional optimization. Contrary to the stereotype, we found that standard GP equipped with Matérn kernels can serve as a capable surrogate for learning high-dimensional functions. Without strong structural assumptions, BO with standard GP not only excels in high-dimensional optimization but also is robust in accommodating various structures within target functions. Furthermore, with standard GP, achieving promising optimization performance is possible via maximum a posteriori (MAP) estimation with diffuse priors or merely maximum likelihood estimation, eliminating the need for expensive Markov-Chain Monte Carlo (MCMC) sampling that might be required by more complex surrogate models. In parallel, we also investigated and analyzed alternative popular settings in running standard BO, which, however, often fail in high-dimensional optimization. This might be linked to the few failure cases reported in the literature. We thus advocate for a re-evaluation and in-depth study of the potential of standard BO in addressing high-dimensional problems.
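One step of standard BO with a Matérn kernel and UCB can be sketched with plain NumPy. The code below is a minimal sketch under stated assumptions: a zero-mean GP with a Matérn-5/2 kernel, fixed hyperparameters instead of MAP or maximum likelihood estimates, a toy 20-dimensional objective, and random candidate points instead of a proper acquisition optimiser. The function names and all numerical settings are illustrative, not taken from the paper.

```python
import numpy as np

def matern52(A, B, lengthscale=1.0, variance=1.0):
    """Matern-5/2 kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    r = np.sqrt(np.maximum(sq, 0.0)) / lengthscale
    return variance * (1.0 + np.sqrt(5.0) * r + 5.0 * r**2 / 3.0) * np.exp(-np.sqrt(5.0) * r)

def gp_posterior(X, y, X_new, noise=1e-6, **kern):
    """Posterior mean and variance of a zero-mean GP at the rows of X_new."""
    K = matern52(X, X, **kern) + noise * np.eye(len(X))
    K_s = matern52(X, X_new, **kern)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, K_s)
    mean = K_s.T @ alpha
    var = kern.get("variance", 1.0) - np.sum(v**2, axis=0)
    return mean, np.maximum(var, 0.0)         # clip tiny negative values from round-off

def ucb(mean, var, beta=2.0):
    """Upper Confidence Bound acquisition: mean plus beta standard deviations."""
    return mean + beta * np.sqrt(var)

def objective(X):
    """Toy objective to maximise (assumption): peak at the centre of the unit cube."""
    return -np.sum((X - 0.5) ** 2, axis=1)

rng = np.random.default_rng(0)
dim = 20
X = rng.uniform(size=(30, dim))               # points evaluated so far
y = objective(X)
candidates = rng.uniform(size=(500, dim))     # random candidates (assumption)

mean, var = gp_posterior(X, y, candidates)
scores = ucb(mean, var)
best = int(np.argmax(scores))
print("next point to evaluate has UCB score", round(float(scores[best]), 3))
```

A full BO loop would evaluate the objective at `candidates[best]`, append the result to `X` and `y`, and repeat. The hyperparameters would normally be re-estimated at each step.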
- [59] arXiv:2403.08493 (replaced) [pdf, ps, other]
-
Title: Rumor Forwarding Prediction Model Based on Uncertain Time Series
Comments: 11 pages, 3 figures
Subjects: Social and Information Networks (cs.SI); Applications (stat.AP)
The rapid spread of rumors in social media is mainly caused by individual retweets. This paper applies uncertain time series analysis (UTSA) to analyze rumor retweeting behavior on Weibo. First, the rumor forwarding is modeled using uncertain time series, including order selection, parameter estimation, residual analysis, uncertainty hypothesis testing and forecast, and the validity of using uncertain time series analysis is further supported by analyzing the characteristics of the residual plot. The experimental results show that the uncertain time series can better predict the next stage of rumor forwarding. The results of the study have important practical significance for rumor management and the management of social media information dissemination.
- [60] arXiv:2404.17358 (replaced) [pdf, ps, html, other]
-
Title: Adversarial Consistency and the Uniqueness of the Adversarial Bayes Classifier
Comments: 18 pages, v2: fixed typos
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Adversarial training is a common technique for learning robust classifiers. Prior work showed that convex surrogate losses are not statistically consistent in the adversarial context; in other words, a minimizing sequence of the adversarial surrogate risk will not necessarily minimize the adversarial classification error. We connect the consistency of adversarial surrogate losses to properties of minimizers of the adversarial classification risk, known as adversarial Bayes classifiers. Specifically, under reasonable distributional assumptions, a convex loss is statistically consistent for adversarial learning if and only if the adversarial Bayes classifier satisfies a certain notion of uniqueness.
- [61] arXiv:2405.08498 (replaced) [pdf, ps, html, other]
-
Title: Learning Decision Policies with Instrumental Variables through Double Machine Learning
Comments: Accepted at ICML 2024
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
A common issue in learning decision-making policies in data-rich settings is spurious correlations in the offline dataset, which can be caused by hidden confounders. Instrumental variable (IV) regression, which utilises a key unconfounded variable known as the instrument, is a standard technique for learning causal relationships between confounded action, outcome, and context variables. Most recent IV regression algorithms use a two-stage approach, where a deep neural network (DNN) estimator learnt in the first stage is directly plugged into the second stage, in which another DNN is used to estimate the causal effect. Naively plugging in the estimator can cause heavy bias in the second stage, especially when regularisation bias is present in the first stage estimator. We propose DML-IV, a non-linear IV regression method that reduces the bias in two-stage IV regressions and effectively learns high-performing policies. We derive a novel learning objective to reduce bias and design the DML-IV algorithm following the double/debiased machine learning (DML) framework. The learnt DML-IV estimator has a strong convergence rate and $O(N^{-1/2})$ suboptimality guarantees that match those when the dataset is unconfounded. DML-IV outperforms state-of-the-art IV regression methods on IV regression benchmarks and learns high-performing policies in the presence of instruments.
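The two-stage idea behind IV regression can be illustrated with its simplest linear form, two-stage least squares. The sketch below is not DML-IV and uses no neural networks: it is a linear stand-in on simulated data, with hypothetical helper names, meant only to show how an instrument removes the bias that a hidden confounder induces in a naive regression.

```python
import numpy as np

def ols_slope(x, y):
    """Slope of the least-squares line y ~ a + b * x."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / (xc @ xc))

def two_stage_least_squares(z, a, y):
    """Stage 1: predict action a from instrument z. Stage 2: regress y on that prediction."""
    a_hat = a.mean() + ols_slope(z, a) * (z - z.mean())
    return ols_slope(a_hat, y)

# Simulated data (assumption): u confounds both the action and the outcome,
# while z affects the outcome only through the action.
rng = np.random.default_rng(0)
n, true_effect = 5000, 2.0
z = rng.normal(size=n)                        # instrument
u = rng.normal(size=n)                        # hidden confounder
a = z + u + 0.5 * rng.normal(size=n)          # confounded action
y = true_effect * a + 3.0 * u + rng.normal(size=n)

naive = ols_slope(a, y)
iv = two_stage_least_squares(z, a, y)
print(f"naive OLS: {naive:.3f}   2SLS: {iv:.3f}   truth: {true_effect}")
```

The naive slope is pulled away from the true effect by the confounder, while the two-stage estimate stays close to it. The bias discussed in the abstract arises when the first stage is a regularised non-linear learner rather than an exact linear fit.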